N-Gram Set Basic from MEDLINE
This page describes the basic steps for generating the MEDLINE N-gram set.
I. Basic N-gram Set Procedures
N-grams (uniGram, biGram, triGram, fourthGram, fifthGram) are retrieved from MEDLINE as follows:
- All TI (titles) and AB (abstracts) are retrieved from MEDLINE, 2014.
- TI and AB are tokenized into sentences.
- Sentences are then tokenized into words, using spaces as word boundaries.
- A warning message is issued if the input sentence is illegal.
- N-grams are generated for N = 1 to 5, which covers more than 99% of words.
- Each N-gram is generated with its WC (word count) and DC (document count):
- Generate one N-gram at one time
- Use a HashMap to store the information of nGram:
| Key | Value |
|---|---|
| N-gram | N-Gram Obj (DC, WC, PMID) |
- PMID: a local variable used to calculate the document count. When the current PMID differs from the saved value, the N-gram is in a new document.
- The final format is:
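
As an illustration of the counting steps in this section (word tokenization on spaces, generating one N at a time, and tracking WC and DC through a saved PMID), here is a minimal Java sketch. It is not the original implementation: the class, field, and method names are hypothetical, the PMIDs in the demo are made up, and storing the last-seen PMID inside each N-gram object is one assumed reading of the PMID bullet above. It does not reproduce the final output format.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the WC/DC bookkeeping; all names are illustrative.
class NGramCounter {
    /** Value stored per N-gram: word count, document count, last PMID seen. */
    static final class NGramObj {
        long wc;          // WC: total occurrences of this N-gram
        long dc;          // DC: number of documents containing this N-gram
        String lastPmid;  // saved PMID, used to detect a new document
    }

    private final int n;  // one N per run ("generate one N-gram at one time")
    private final Map<String, NGramObj> table = new HashMap<>();

    NGramCounter(int n) {
        this.n = n;
    }

    /** Split a sentence into words, using one or more spaces as the boundary. */
    static List<String> tokenize(String sentence) {
        List<String> words = new ArrayList<>();
        for (String w : sentence.trim().split(" +")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    /**
     * Count every N-gram in one sentence of document pmid.
     * Assumes all sentences of a document are added consecutively.
     */
    void addSentence(String pmid, String sentence) {
        List<String> words = tokenize(sentence);
        for (int i = 0; i + n <= words.size(); i++) {
            String gram = String.join(" ", words.subList(i, i + n));
            NGramObj obj = table.computeIfAbsent(gram, k -> new NGramObj());
            obj.wc++;
            // A PMID different from the saved value means a new document.
            if (!pmid.equals(obj.lastPmid)) {
                obj.dc++;
                obj.lastPmid = pmid;
            }
        }
    }

    long wordCount(String gram) {
        NGramObj obj = table.get(gram);
        return obj == null ? 0 : obj.wc;
    }

    long docCount(String gram) {
        NGramObj obj = table.get(gram);
        return obj == null ? 0 : obj.dc;
    }

    public static void main(String[] args) {
        NGramCounter biGram = new NGramCounter(2);
        biGram.addSentence("1001", "heart failure and heart failure");
        biGram.addSentence("1002", "chronic heart failure");
        System.out.println(biGram.wordCount("heart failure") + " "
                + biGram.docCount("heart failure"));
    }
}
```

In the demo, "heart failure" occurs twice in the first made-up document and once in the second, so its WC is 3 and its DC is 2. Keeping the saved PMID per entry avoids a separate set of documents for each N-gram, but it only works when documents are processed one after another.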
II. Issues & Solutions
Too many N-gram terms to fit in computer memory (48 GB) when N = 3, 4, 5 for the 2014 release
- Limited hardware resources (our Linux machine has only 48 GB of memory)
- Many illegal words exist for N >= 3
- Some n-gram terms are very long (even with very low frequency)
- The maximum length of an N-gram is set to 50, which covers more than 99% of words.
- For the two reasons above, retrieving biGram (and above) requires a huge amount of memory if all n-gram terms are kept in memory, and performance is very poor. In our case (for MEDLINE.2014), it takes only about 25 minutes to retrieve uniGram, but 4 days to retrieve biGram.
- We tried an embedded database (HSqlDb) instead of keeping the n-grams in a HashMap in memory. It took 3.5 days just to complete uniGram.
- We also tried a database server (JavaDb) instead of keeping the n-grams in a HashMap in memory. It ran for more than 7 days without completing uniGram.
- Use Prediction Filter
- Use Split, Group, Filter, & Combine algorithm