
The SPECIALIST Lexicon

N-gram Set by Split, Group, Filter, Combine and Sort (SGFCS) Algorithm

This page describes how n-grams (n = 1-5) are generated from MEDLINE using a split-and-combine algorithm. Because the n-grams are too numerous to fit within the limits of a Java HashMap, the n-gram retrieval process can be split (by

I. Basic N-gram Set
The n-grams (uniGram, biGram, triGram, fourthGram, fifthGram) are retrieved from MEDLINE as follows:
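
As a rough illustration of what a single retrieval pass computes, the sketch below counts n-grams with a sliding window over whitespace-separated tokens. It is not the Lexicon program itself: the function name `count_ngrams`, the whitespace tokenization, and the reading of WC and DC (used in step IV) as word count and document count are all assumptions made for this example.

```python
from collections import Counter


def count_ngrams(documents, n, max_cl):
    """Count n-grams over a list of document strings.

    Returns (wc, dc), where wc counts every occurrence of an n-gram and
    dc counts the number of documents that contain it.
    ASSUMPTION: WC = word count, DC = document count.
    ASSUMPTION: tokens are separated by whitespace.
    """
    wc = Counter()
    dc = Counter()
    for text in documents:
        tokens = text.split()
        # Slide a window of n tokens across the document.
        grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        # Drop n-grams longer than max_cl characters (the MAX_CL filter).
        grams = [g for g in grams if len(g) <= max_cl]
        wc.update(grams)        # every occurrence counts
        dc.update(set(grams))   # each document counts once per n-gram
    return wc, dc


docs = ["blood pressure was high", "high blood pressure"]
wc, dc = count_ngrams(docs, n=2, max_cl=50)
print(wc["blood pressure"], dc["blood pressure"])
```

Keeping all counts in one in-memory map is what stops scaling for the full MEDLINE input, which is the motivation for the split step that follows.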

II. Split (MEDLINE):

Split the total input MEDLINE files into N portions.

  • For 2014 release, there are 746 files (PmidTiAbS14n0001.txt - PmidTiAbS14n0746.txt)
  • The program can automatically split the input into N portions
  • The output file is: nGram.out.N.heap.MAX_CL.sS.C
    • N: n-gram
    • MAX_CL: maximum character length (all n-grams longer than MAX_CL are filtered out)
    • S: number of splits
    • C: current split number

    For example: nGram.out.5.heap.50.s20.15
    • 5-gram (N = 5) with a maximum of 50 characters (MAX_CL), split into 20 portions (S), current portion 15 (C, ~ PmidTiAbS14n0519.txt - PmidTiAbS14n0555.txt)
    • S = 20: 746/20 = 37 files per portion (rounded down)
    • C = 15: first file is 37 x (15-1) + 1 = 519; last file is 37 x 15 = 555
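
The arithmetic in the example above can be checked with a short sketch. The function names `portion_file_range` and `split_output_name` are hypothetical, and the code only reproduces the calculation shown; how the actual program distributes the files left over after integer division (746 is not a multiple of 20) is not stated here, so the sketch does not attempt it.

```python
from __future__ import annotations


def portion_file_range(total_files: int, splits: int, current: int) -> tuple[int, int]:
    """Return the (first, last) 1-based file numbers for portion `current`.

    Mirrors the worked example only; leftover files are NOT handled.
    """
    if not 1 <= current <= splits:
        raise ValueError("current must be between 1 and splits")
    per_portion = total_files // splits        # 746 // 20 = 37
    first = per_portion * (current - 1) + 1    # 37 x (15-1) + 1 = 519
    last = per_portion * current               # 37 x 15 = 555
    return first, last


def split_output_name(n: int, max_cl: int, splits: int, current: int) -> str:
    """Build the split output file name: nGram.out.N.heap.MAX_CL.sS.C"""
    return f"nGram.out.{n}.heap.{max_cl}.s{splits}.{current}"


first, last = portion_file_range(746, 20, 15)
print(f"PmidTiAbS14n{first:04d}.txt - PmidTiAbS14n{last:04d}.txt")
print(split_output_name(5, 50, 20, 15))
```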

III. Group (by alphabetic order):

Group all split n-gram files by a specified range of starting characters. Because n-grams in different groups never overlap, each group can be sorted and combined alphabetically on its own. The characters are in the following order:

NO, ..., 0-9, ..., >, ?, @, A, B, C, ..., X, Y, Z, [, \, ], ^, _, `, a, b, c, ..., x, y, z, {, |, }, ..., NO
  • The program allows users to specify the range of starting and ending characters
  • The output file is: nGram.out.N.heap.MAX_CL.sN.gS.SC-EC
    • MAX_CL: maximum character length (all n-grams longer than MAX_CL are filtered out)
    • N: n-gram
    • S: serial number
    • SC: starting character (included)
    • EC: ending character (not included)

    For example: nGram.out.5.heap.50.s20.g05.b-d
    • 5-gram with a maximum of 50 characters, 20 splits, group no. 5, containing all n-grams that start with b or c (the ending character d is not included).
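
A minimal sketch of the grouping test, assuming the character order listed above is plain ASCII (code point) order, which is how Python compares strings. The names `in_group` and `group_output_name` are hypothetical and are not taken from the actual program.

```python
def in_group(ngram: str, start_char: str, end_char: str) -> bool:
    """True if the n-gram's first character is in [start_char, end_char).

    The starting character is included and the ending character is not.
    """
    return bool(ngram) and start_char <= ngram[0] < end_char


def group_output_name(n, max_cl, splits, serial, start_char, end_char):
    """Build a group file name in the form of the example (s20.g05.b-d)."""
    return f"nGram.out.{n}.heap.{max_cl}.s{splits}.g{serial:02d}.{start_char}-{end_char}"


ngrams = ["acute", "blood pressure", "cell line", "dna repair", "Brain", "2 mg"]
group = sorted(g for g in ngrams if in_group(g, "b", "d"))
print(group)
print(group_output_name(5, 50, 20, 5, "b", "d"))
```

Note that the comparison is case-sensitive: under this ordering "Brain" falls in the uppercase range and would belong to a different group than "blood pressure".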

IV. Filter (by WC), Combine, then Sort:

Combine the n-grams and filter them by WC (the filtered-out n-grams make up the largest share of the higher-order n-grams), then sort the rest by WC, DC, and alphabetical order.
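
The step above might be sketched as follows. Several details are assumptions rather than facts about the actual program: the function name `combine_filter_sort`, the (n-gram, WC, DC) record layout, the `min_wc` threshold, applying the filter after the counts are combined, and sorting WC and DC in descending order.

```python
from collections import defaultdict


def combine_filter_sort(portions, min_wc):
    """Combine per-portion counts, drop low-WC n-grams, and sort the rest.

    `portions` is a list of lists of (ngram, wc, dc) records, one list per
    split file in a group. Counts can be summed because each input file
    belongs to exactly one split portion.
    ASSUMPTION: n-grams with a combined WC below min_wc are filtered out.
    ASSUMPTION: sort by WC descending, DC descending, then alphabetically.
    """
    totals = defaultdict(lambda: [0, 0])  # ngram -> [WC, DC]
    for portion in portions:
        for ngram, wc, dc in portion:
            totals[ngram][0] += wc
            totals[ngram][1] += dc
    kept = [(ng, wc, dc) for ng, (wc, dc) in totals.items() if wc >= min_wc]
    kept.sort(key=lambda rec: (-rec[1], -rec[2], rec[0]))
    return kept


portion_a = [("blood pressure", 3, 2), ("cell line", 1, 1)]
portion_b = [("blood pressure", 2, 2), ("dna repair", 5, 3), ("rare term", 1, 1)]
for record in combine_filter_sort([portion_a, portion_b], min_wc=2):
    print(record)
```

Because each group covers a disjoint range of starting characters, this step can be run on one group at a time, which keeps the in-memory map small.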

V. Example: