The SPECIALIST Lexicon

N-Gram Set Basic from MEDLINE

This page describes the basic steps for generating the MEDLINE N-gram set.

I. Basic N-gram Set Procedures
N-grams (uniGram, biGram, triGram, fourthGram, fifthGram) are retrieved from MEDLINE as follows:

  • All TI (titles) and AB (abstracts) are retrieved from MEDLINE, 2014.
  • TI and AB are tokenized into sentences.
  • Sentences are then tokenized into words (using spaces as word boundaries).
    • A warning message is issued if the input sentence is illegal.
  • N-grams are generated for N = 1 to 5, which covers more than 99% of words.
  • N-grams are generated with WC (word count) and DC (document count):
    • Generate one N-gram at a time.
    • Use a HashMap to store the information for each N-gram:

      Key (N-gram)    N-Gram Obj (DC|WC|PMID)
      • PMID is a local variable used to calculate the document count. When the PMID differs from the saved value, the N-gram occurs in a new document.
    • The final format is:

      DC|WC|N-gram
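The counting steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: it uses a Python dict in place of the HashMap, and the function names (`ngrams`, `count_ngrams`, `format_lines`), the `(pmid, sentence)` input shape, and the `|` field delimiter in the output are assumptions made for the example. It also assumes that all sentences of one document arrive together, which is what makes the saved-PMID comparison a valid document counter.

```python
def ngrams(words, n):
    """Yield every run of n consecutive words, joined by a single space."""
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def count_ngrams(sentences, n):
    """Count n-grams over (pmid, sentence) pairs grouped by PMID.

    Returns a dict mapping each n-gram to [DC, WC, last PMID seen].
    """
    table = {}
    for pmid, sentence in sentences:
        words = sentence.split()  # whitespace as the word boundary
        for gram in ngrams(words, n):
            entry = table.get(gram)
            if entry is None:
                table[gram] = [1, 1, pmid]  # first occurrence
            else:
                entry[1] += 1  # word count grows on every occurrence
                if entry[2] != pmid:
                    # A different PMID than the saved one: new document.
                    entry[0] += 1
                    entry[2] = pmid
    return table


def format_lines(table):
    """Render the table as DC|WC|N-gram lines (delimiter is assumed)."""
    return [f"{dc}|{wc}|{gram}" for gram, (dc, wc, _) in sorted(table.items())]
```

For example, `count_ngrams([("1", "heart attack risk"), ("1", "heart attack"), ("2", "heart attack")], 2)` counts "heart attack" three times across two documents.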

II. Issues & Solutions
There are too many N-gram terms to fit in computer memory (48 GB) when N = 3, 4, or 5 for the 2014 release:
    • Hardware resources are limited (our Linux machine has only 48 GB of memory).
    • Many illegal words exist for N >= 3.
    • Some n-gram terms are very long (even those with very low frequency).
      • The maximum N-gram length is set to 50 to cover more than 99% of words.

    • For these reasons, retrieving biGrams (and above) requires a huge amount of memory if all n-gram terms are kept in memory, and performance is very poor. In our case (MEDLINE 2014), retrieving uniGrams takes only about 25 minutes, but retrieving biGrams takes 4 days.
    • We tried an embedded database (HSqlDb) instead of keeping the n-grams in a hash table in memory. It took 3.5 days just to complete the uniGrams.
    • We also tried a database server (JavaDb) instead of keeping the n-grams in a hash table in memory. It ran for more than 7 days without completing the uniGrams.
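The database-backed variant mentioned above can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in for HSqlDb or JavaDb, so it is not the setup that was actually measured; the table layout and the names `open_store`, `add_occurrence`, and `counts` are hypothetical. What it shows is that every single n-gram occurrence turns into a database lookup plus a write, where the in-memory approach needs only a hash lookup.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a store with one row per n-gram (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ngram ("
        "term TEXT PRIMARY KEY, dc INTEGER, wc INTEGER, pmid TEXT)"
    )
    return conn


def add_occurrence(conn, term, pmid):
    """Record one occurrence of term in the document identified by pmid."""
    row = conn.execute(
        "SELECT pmid FROM ngram WHERE term = ?", (term,)
    ).fetchone()
    if row is None:
        # First time this n-gram is seen.
        conn.execute(
            "INSERT INTO ngram (term, dc, wc, pmid) VALUES (?, 1, 1, ?)",
            (term, pmid),
        )
    elif row[0] != pmid:
        # Saved PMID differs: count a new document as well as a new word.
        conn.execute(
            "UPDATE ngram SET dc = dc + 1, wc = wc + 1, pmid = ? WHERE term = ?",
            (pmid, term),
        )
    else:
        # Same document as the previous occurrence: only WC grows.
        conn.execute("UPDATE ngram SET wc = wc + 1 WHERE term = ?", (term,))


def counts(conn, term):
    """Return (DC, WC) for term, or None if it was never recorded."""
    return conn.execute(
        "SELECT dc, wc FROM ngram WHERE term = ?", (term,)
    ).fetchone()
```

A call sequence such as `add_occurrence(conn, "heart", "1")` twice followed by `add_occurrence(conn, "heart", "2")` leaves the row with DC = 2 and WC = 3, at the cost of three queries and three writes.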

    • Use Prediction Filter
    • Use Split, Group, Filter, & Combine algorithm