The SPECIALIST Lexicon

N-Gram Set Basic from MEDLINE

This page describes the basic steps for generating the MEDLINE N-gram set.

I. Basic N-gram Set Procedures
N-grams (uniGram, biGram, triGram, fourthGram, fifthGram) are generated from MEDLINE as follows:

  • All TI (titles) and AB (abstracts) are retrieved from MEDLINE, 2014.
  • TI and AB are tokenized into sentences.
  • Sentences are then tokenized into words, using spaces as word boundaries.
    • A warning message is sent out if the input sentence is illegal.
  • N-grams are generated for N = 1 to 5, which covers more than 99% of words.
  • N-grams are generated with WC (word count) and DC (document count):
    • Generate N-grams for one value of N at a time.
    • Use a HashMap to store the information for each N-gram:

      Key (N-gram)    N-Gram Obj (DC|WC|PMID)
      • PMID is a local variable used to calculate the document count. When the PMID of the current document differs from the saved value, the N-gram is occurring in a new document.
    • The final format is:

      DC    WC    N-gram
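The counting steps above can be sketched as follows. This is a minimal illustration, not the code used to build the released set: a Python dict stands in for the HashMap, the names `count_ngrams` and `format_rows` are hypothetical, and the `|` separator in the output rows is an assumption, since the page only lists the field order. The sketch also assumes that all sentences of a document are processed together, which is what makes the saved-PMID comparison a valid test for a new document.

```python
def count_ngrams(docs, n):
    """Count n-grams over docs, an iterable of (pmid, sentences) pairs.

    Returns a dict mapping each n-gram to [DC, WC, last PMID seen].
    All sentences of one document must arrive together for DC to be correct.
    """
    table = {}
    for pmid, sentences in docs:
        for sentence in sentences:
            # Spaces are the word boundaries; empty tokens from repeated spaces are dropped.
            words = [w for w in sentence.split(" ") if w]
            for i in range(len(words) - n + 1):
                ngram = " ".join(words[i:i + n])
                entry = table.get(ngram)
                if entry is None:
                    # First occurrence: one document, one word count.
                    table[ngram] = [1, 1, pmid]
                else:
                    entry[1] += 1              # WC grows on every occurrence
                    if entry[2] != pmid:       # a different PMID means a new document
                        entry[0] += 1
                        entry[2] = pmid
    return table


def format_rows(table, sep="|"):
    """Render rows in DC, WC, N-gram order (the separator is an assumption)."""
    return [f"{dc}{sep}{wc}{sep}{ngram}"
            for ngram, (dc, wc, _pmid) in sorted(table.items())]


if __name__ == "__main__":
    docs = [("100", ["heart attack risk", "heart attack"]),
            ("200", ["heart attack"])]
    for row in format_rows(count_ngrams(docs, 2)):
        print(row)
```

Because every distinct N-gram keeps an entry in the table for the whole run, memory use grows with the number of distinct N-grams.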

II. Issues & Solutions
    There are too many N-gram terms to fit in computer memory (48 GB) when N = 3, 4, 5 for the 2014 release:
    • Limited hardware resources (the Linux machine used has only 48 GB of memory)
    • Many illegal words exist for N >= 3
    • Some N-gram terms are very long (even ones with very low frequency)
      • The max. length of an N-gram is set to 50 to cover above 99% of words.

    • For the reasons above, retrieving biGrams (and above) requires a huge amount of memory if all N-gram terms are kept in memory, and performance is very poor. In our case (MEDLINE 2014), it takes only about 25 min. to retrieve uniGrams, but 4 days to retrieve biGrams.
    • We tried an embedded database (HSqlDb) instead of keeping the N-grams in an in-memory HashTable. It took 3.5 days just to complete uniGram.
    • We also tried a database server (JavaDb) instead of keeping the N-grams in an in-memory HashTable. It ran for more than 7 days without completing uniGram.

    • Use a Prediction Filter
    • Use the Split, Group, Filter, & Combine algorithm
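This page names the Split, Group, Filter, & Combine algorithm without describing its steps. Purely as an illustration of how a split-and-combine approach can bound memory use, and not as the authors' method, the sketch below counts one chunk of documents at a time, writes each partial result to a sorted temporary file, and then merges the files while summing the counts. Every function name is hypothetical. The filter shown is the 50-length cap mentioned above, assumed here to be measured in characters; the actual filter may differ. Chunks are cut at document boundaries so that each document is counted in exactly one chunk and the per-chunk document counts can simply be added.

```python
import heapq
import itertools
import os
import tempfile

MAX_NGRAM_LEN = 50  # length cap from the text; assumed to be in characters


def count_chunk(docs, n):
    """Count one chunk of (pmid, sentences) pairs; returns {ngram: [dc, wc]}."""
    counts = {}
    for _pmid, sentences in docs:
        seen = set()  # n-grams already counted once for this document
        for sentence in sentences:
            words = sentence.split()
            for i in range(len(words) - n + 1):
                ngram = " ".join(words[i:i + n])
                if len(ngram) > MAX_NGRAM_LEN:  # filter: drop over-long n-grams
                    continue
                entry = counts.setdefault(ngram, [0, 0])
                entry[1] += 1
                if ngram not in seen:
                    seen.add(ngram)
                    entry[0] += 1
    return counts


def write_sorted(counts, path):
    """Write one chunk's counts as tab-separated lines sorted by n-gram."""
    with open(path, "w", encoding="utf-8") as f:
        for ngram in sorted(counts):
            dc, wc = counts[ngram]
            f.write(f"{ngram}\t{dc}\t{wc}\n")


def key_of(line):
    return line.split("\t", 1)[0]


def combine(paths):
    """Merge sorted chunk files, summing DC and WC for each n-gram."""
    files = [open(p, encoding="utf-8") for p in paths]
    try:
        merged = heapq.merge(*files, key=key_of)
        for ngram, group in itertools.groupby(merged, key=key_of):
            dc = wc = 0
            for line in group:
                _, d, w = line.rstrip("\n").split("\t")
                dc += int(d)
                wc += int(w)
            yield dc, wc, ngram
    finally:
        for f in files:
            f.close()


def ngrams_split_combine(docs, n, chunk_size):
    """Split docs into chunks, count each chunk, then combine the results."""
    docs = iter(docs)
    with tempfile.TemporaryDirectory() as workdir:
        paths = []
        while True:
            chunk = list(itertools.islice(docs, chunk_size))
            if not chunk:
                break
            path = os.path.join(workdir, f"chunk{len(paths)}.tsv")
            write_sorted(count_chunk(chunk, n), path)
            paths.append(path)
        return list(combine(paths))
```

Only one chunk's table is held in memory while counting, and the merge reads the chunk files sequentially. The final `list(...)` is for demonstration; a large run would stream the combined rows to a file instead.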