SPECIALIST Lexicon

N-Gram Set Basic from MEDLINE

This page describes the basic steps for generating the MEDLINE N-gram set.

I. Basic N-gram Set Procedures
N-grams (uniGram, biGram, triGram, fourthGram, fifthGram) are retrieved from MEDLINE as follows:

All TI (titles) and AB (abstracts) are retrieved from MEDLINE, 2014.
TI and AB are tokenized into sentences.
Sentences are then tokenized into words (using spaces as word boundaries).
- A warning message is sent out if the input sentence is illegal.
N-grams are then generated for N = 1 to 5 to cover more than 99% of words.
WC (word count) and DC (document count) are recorded for each generated N-gram:
- Generate one N-gram at a time
- Use a HashMap to store the information for each N-gram:
  
  Key: N-gram; Value: N-gram object (DC|WC|PMID)
  - PMID is a local variable used to calculate the document count. When the PMID differs from the saved value, the N-gram occurs in a new document.
- The final format is:
  
  DC WC N-gram
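
The counting steps above can be sketched in Java as follows. This is only a minimal illustration, not the actual SPECIALIST code: the class and method names (`NGramCounter`, `NGramObj`, `addSentence`, `toLines`) are hypothetical, the tab separator in the output is an assumption, and the PMID comparison assumes that all sentences of one document are processed consecutively.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal sketch of in-memory N-gram counting; all names are hypothetical. */
class NGramCounter {
    /** Value stored per N-gram: DC, WC, and the last PMID seen. */
    static final class NGramObj {
        int dc;          // DC: number of documents containing the N-gram
        int wc;          // WC: total number of occurrences
        String lastPmid; // PMID of the most recent document containing it
    }

    private final Map<String, NGramObj> table = new HashMap<>();

    /** Adds every N-gram of size n found in one sentence of document pmid. */
    void addSentence(String pmid, String sentence, int n) {
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return;
        }
        String[] words = trimmed.split(" +"); // spaces are the word boundaries
        for (int i = 0; i + n <= words.length; i++) {
            String ngram = String.join(" ", Arrays.copyOfRange(words, i, i + n));
            NGramObj obj = table.computeIfAbsent(ngram, k -> new NGramObj());
            obj.wc++;
            // A PMID different from the saved value means a new document.
            // Assumption: sentences of the same document arrive consecutively.
            if (!pmid.equals(obj.lastPmid)) {
                obj.dc++;
                obj.lastPmid = pmid;
            }
        }
    }

    /** Returns "DC<TAB>WC<TAB>N-gram" lines sorted by N-gram (separator assumed). */
    List<String> toLines() {
        List<String> keys = new ArrayList<>(table.keySet());
        Collections.sort(keys);
        List<String> lines = new ArrayList<>();
        for (String key : keys) {
            NGramObj obj = table.get(key);
            lines.add(obj.dc + "\t" + obj.wc + "\t" + key);
        }
        return lines;
    }

    public static void main(String[] args) {
        NGramCounter counter = new NGramCounter();
        counter.addSentence("1", "heart attack risk", 2);
        counter.addSentence("2", "heart attack", 2);
        counter.toLines().forEach(System.out::println);
    }
}
```

Because every distinct N-gram stays in the map until the end of the run, memory use grows with the number of distinct N-grams.
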

II. Issues & Solutions:
Too many N-gram terms to fit in computer memory (48 GB) when N = 3, 4, 5 for the 2014 release
- Limited hardware resources (our Linux machine has only 48 GB of memory)
- Many illegal words exist for N >= 3
- Some N-gram terms are very long (even with very low frequency)
  - The maximum length of an N-gram is set to 50 to cover more than 99% of words.
- For the above two reasons, retrieving biGram (and above) requires a huge amount of memory if all N-gram terms are kept in memory. Performance is also very poor. In our case (for MEDLINE.2014), it takes only about 25 min. to retrieve uniGram, but 4 days to retrieve biGram.
- We tried an embedded database (HSqlDb) instead of keeping N-grams in an in-memory hash table. It took 3.5 days just to complete uniGram.
- We also tried a database server (JavaDb) instead of keeping N-grams in an in-memory hash table. After more than 7 days, uniGram was still not complete.
- Use Prediction Filter
- Use Split, Group, Filter, & Combine algorithm
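
The maximum-length limit of 50 mentioned in the list above could be applied as a simple check before an N-gram is stored. The sketch below is only an illustration: the name `NGramLengthFilter` is hypothetical, and it assumes the limit is measured in characters, which the text does not state.

```java
/** Minimal sketch of a length cap on N-gram terms; names are hypothetical. */
class NGramLengthFilter {
    // Limit taken from the text; the unit (characters) is an assumption.
    static final int MAX_LENGTH = 50;

    /** Returns true if the N-gram is short enough to be kept. */
    static boolean isWithinLimit(String ngram) {
        return ngram.length() <= MAX_LENGTH;
    }

    public static void main(String[] args) {
        String candidate = "myocardial infarction";
        if (isWithinLimit(candidate)) {
            System.out.println("keep: " + candidate);
        }
    }
}
```

Dropping over-long terms before they reach the in-memory table keeps rare, very long N-grams from taking up space.
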
