SPECIALIST Lexicon

N-gram Set by Split, Group, Filter, Combine and Sort (SGFCS) Algorithm

This page describes how n-grams (n = 1-5) are generated from MEDLINE using the split-and-combine algorithm. Because the n-gram set is too large for the Java HashMap limit, the n-gram retrieval process can be split (by

I. Basic N-gram Set
N-grams (uniGram, biGram, triGram, fourthGram, fifthGram) are retrieved from MEDLINE as follows:

II. Split (MEDLINE):

Split the full set of input MEDLINE files into S portions.

For the 2014 release, there are 746 files (PmidTiAbS14n0001.txt - PmidTiAbS14n0746.txt).
The program splits the files into the requested number of portions automatically.
The output file is: nGram.out.N.heap.MAX_CL.sS.C
- N: n-gram order
- MAX_CL: maximum character length (any n-gram longer than MAX_CL is filtered out)
- S: total number of splits
- C: current split number
For example: nGram.out.5.heap.50.s20.15
- 5-gram (N = 5) with a maximum length of 50 characters (MAX_CL), split into 20 portions (S), current portion 15 (C, ~ PmidTiAbS14n0519.txt - PmidTiAbS14n0555.txt)
- S = 20: 746/20 = 37.3, so 37 files per portion
- C = 15: first file = 37 x (15-1) + 1 = 519; last file = 37 x 15 = 555
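
The arithmetic in the example above can be checked with a short Python sketch. It is only an illustration, not the program's actual code: the function names are hypothetical, and the use of integer division for the portion size is an assumption based on 746/20 giving 37. The page does not say how the files left over after 20 x 37 = 740 are assigned, so the sketch does not handle them.

```python
def split_file_range(total_files: int, splits: int, current: int) -> tuple:
    """Return the (first, last) 1-based file numbers covered by portion `current`."""
    per_portion = total_files // splits      # assumption: integer division, 746 // 20 = 37
    first = per_portion * (current - 1) + 1  # 37 x (15-1) + 1
    last = per_portion * current             # 37 x 15
    return first, last


def medline_file_name(number: int) -> str:
    """Build a 2014 input file name such as PmidTiAbS14n0519.txt."""
    return f"PmidTiAbS14n{number:04d}.txt"


def split_output_name(n: int, max_cl: int, splits: int, current: int) -> str:
    """Build the split output file name following the pattern in the example."""
    return f"nGram.out.{n}.heap.{max_cl}.s{splits}.{current}"


first, last = split_file_range(746, 20, 15)
print(medline_file_name(first), "-", medline_file_name(last))
print(split_output_name(5, 50, 20, 15))
```

For N = 5, MAX_CL = 50, S = 20 and C = 15, this reproduces the file range 519 - 555 and the output name nGram.out.5.heap.50.s20.15 given above.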

III. Group (by alphabetic order):

Group the n-grams from all split files by a specified range of starting characters. Once the n-grams are grouped (sorted and combined alphabetically), the groups are independent of one another. The characters are in the following (ASCII) order:

NO, ... 0-9, ... >, ?, @

A, B, C, ..., X, Y, Z

[, \, ], ^, _, `

a, b, c, ..., x, y, z

{, |, }, ... NO
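
The order listed above is the same as ASCII code order: digits come before uppercase letters, which come before lowercase letters, with punctuation in between. As a small illustration (not part of the original program), Python's built-in sort compares strings by code point and so reproduces this order. The sample characters are chosen only for the demonstration.

```python
# Sample characters taken from the different ranges listed above.
samples = ["a", "Z", "0", "@", "[", "{", "A", "z", "9", "`"]

# sorted() compares str values by Unicode code point, which equals
# ASCII order for these characters.
ordered = sorted(samples)

for ch in ordered:
    print(ch, ord(ch))
```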

The program allows users to specify the range of starting and ending characters.
The output file is: nGram.out.N.heap.MAX_CL.sS.gG.SC-EC
- N: n-gram order
- MAX_CL: maximum character length (any n-gram longer than MAX_CL is filtered out)
- S: total number of splits
- G: group serial number
- SC: starting character (included)
- EC: ending character (not included)
For example: nGram.out.5.heap.50.s20.g05.b-d
- 5-gram with a maximum length of 50 characters, 20 splits, group no. 5, containing all n-grams that start with b or c (the ending character d is not included).
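
The half-open character range in the example above can be sketched in Python as follows. This is a hedged illustration rather than the program's implementation: the function names and sample n-grams are hypothetical, and the two-digit zero padding of the group number is an assumption inferred from "g05".

```python
def in_group(ngram: str, start_char: str, end_char: str) -> bool:
    """True if the n-gram's first character lies in [start_char, end_char)."""
    return bool(ngram) and start_char <= ngram[0] < end_char


def group_output_name(n, max_cl, splits, group_no, start_char, end_char):
    """Build the group output file name following the pattern in the example."""
    # Assumption: the group number is zero-padded to two digits, as in "g05".
    return f"nGram.out.{n}.heap.{max_cl}.s{splits}.g{group_no:02d}.{start_char}-{end_char}"


# Made-up sample n-grams, for illustration only.
ngrams = ["a total of", "both groups were", "cells in the",
          "data suggest that", "Both groups"]

# Keep n-grams starting with b or c; d is the excluded end character.
selected = [g for g in ngrams if in_group(g, "b", "d")]
print(selected)
print(group_output_name(5, 50, 20, 5, "b", "d"))
```

Note that "Both groups" is not selected: uppercase B sorts before lowercase b in the character order listed earlier, so it falls in a different group.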

IV. Filter (by WC), Combine, then Sort:

Combine the n-grams and filter them by word count (WC); the n-grams filtered out this way make up most of the higher-order n-grams. Then sort the remaining n-grams by WC, document count (DC), and alphabetical order.
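
A minimal Python sketch of this step is shown below. It rests on several assumptions that the page does not confirm: each group is represented as a mapping from n-gram to a (WC, DC) pair, the filter keeps n-grams whose WC is at least a hypothetical threshold `min_wc`, and the sort is descending by WC, then descending by DC, then ascending alphabetically. The counts are made up for illustration.

```python
def filter_combine_sort(groups, min_wc):
    """Filter each group by WC, combine the survivors, and sort them.

    groups: iterable of dicts mapping n-gram -> (wc, dc)  (assumed layout)
    min_wc: hypothetical minimum word count an n-gram needs to be kept
    """
    combined = []
    for group in groups:
        for ngram, (wc, dc) in group.items():
            if wc >= min_wc:                 # filter by WC
                combined.append((ngram, wc, dc))
    # Assumed order: WC descending, DC descending, then alphabetical.
    combined.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return combined


# Made-up counts, for illustration only.
group_a = {"a total of": (120, 95), "a rare case": (3, 3)}
group_b = {"both groups were": (120, 80), "by means of": (45, 40)}

result = filter_combine_sort([group_a, group_b], min_wc=30)
for ngram, wc, dc in result:
    print(wc, dc, ngram)
```

Because the groups from step III cover disjoint character ranges, the same n-gram never appears in two groups, so combining them here is a plain concatenation with no counts to merge.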

V. Example:
