N-gram Set by MapDb (persistent key-value DB)
TBD...
This page describes how n-grams (n = 1-5) are generated from MEDLINE using a split-and-combine algorithm. Because the full n-gram set is too large to hold in an in-memory Java HashMap, the n-gram retrieval process can be split (by
I. Basic N-gram Set
N-grams (unigrams, bigrams, trigrams, fourgrams, fivegrams) are retrieved from MEDLINE as follows:
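The sketch below is not the actual retrieval program. It is only a minimal Java illustration of what extracting and counting n-grams for n = 1-5 can look like, assuming plain whitespace tokenization. The class `NGramSketch` and its methods are hypothetical names introduced for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the original program.
class NGramSketch {
    // Returns every run of exactly n consecutive tokens, joined by one space.
    // Assumption: tokens are separated by whitespace only.
    static List<String> nGrams(String text, int n) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty() || n < 1) {
            return out;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i + n <= tokens.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return out;
    }

    // Adds one to the count of every n-gram (n = 1..maxN) found in text.
    static void count(String text, int maxN, Map<String, Integer> counts) {
        for (int n = 1; n <= maxN; n++) {
            for (String gram : nGrams(text, n)) {
                counts.merge(gram, 1, Integer::sum);
            }
        }
    }
}
```

Calling `NGramSketch.count(sentence, 5, counts)` for each sentence accumulates counts for unigrams through fivegrams in a single map. With a corpus the size of MEDLINE, such a map may outgrow memory, which is the problem the split-and-combine approach on this page addresses.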
II. MapDb:
Split the total input MEDLINE files into N portions.
    DB db = DBMaker
        .newFileDB(file)
        .transactionDisable()
        .mmapFileEnable()
        .cacheSize(1000000000)
        .make();
    Map<String, NGramObjS> nGramMap = db.createHashMap("nGram")
        .valueSerializer(serializer)
        .make();
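The split step in this section can be sketched as follows. This is a minimal illustration, assuming the input files are known as a list and are divided into N contiguous portions of nearly equal size. The class `SplitSketch` and the portioning strategy are assumptions made for this example, not the page's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the original program.
class SplitSketch {
    // Splits items into n contiguous portions whose sizes differ by at most one.
    // When n exceeds the number of items, the trailing portions are empty.
    static <T> List<List<T>> split(List<T> items, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        List<List<T>> portions = new ArrayList<>();
        int base = items.size() / n;   // minimum size of each portion
        int extra = items.size() % n;  // the first 'extra' portions get one more item
        int start = 0;
        for (int i = 0; i < n; i++) {
            int size = base + (i < extra ? 1 : 0);
            portions.add(new ArrayList<>(items.subList(start, start + size)));
            start += size;
        }
        return portions;
    }
}
```

Each portion can then be processed on its own, so that only one portion's n-grams need to be held in a map at a time.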
III. Filter (WC) and combine:
Combine the partial results and filter out n-grams by WC; the filtered-out n-grams account for the largest share of the higher-order n-grams.
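The combine-and-filter step can be sketched as follows. This is a minimal illustration, assuming WC denotes the word count of an n-gram and that each portion has produced its own map from n-gram to WC. The class `CombineSketch`, its methods, and the `minWc` threshold parameter are hypothetical names introduced for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of the original program.
class CombineSketch {
    // Sums the counts of the same n-gram across all partial maps.
    static Map<String, Long> combine(List<Map<String, Long>> parts) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> part : parts) {
            for (Map.Entry<String, Long> e : part.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return total;
    }

    // Keeps only n-grams whose combined count is at least minWc.
    // minWc is an assumed parameter; the page does not state the threshold here.
    static Map<String, Long> filterByWc(Map<String, Long> counts, long minWc) {
        Map<String, Long> kept = new TreeMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            if (e.getValue() >= minWc) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```

This sketch filters only after combining, because an n-gram that falls below the threshold in every single portion may still reach it once the portions are summed. Whether the actual program filters at other points as well is not stated in this section.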
IV. Example Walk-through (MEDLINE.2014):
| Program | Preprocess | unigrams | bigrams | trigrams | fourgrams | fivegrams |
|---|---|---|---|---|---|---|
| N | | n=1 | n=2 | n=3 | n=4 | n=5 |
| Option 1 | | | | | | |
| Option 9 | | | | | | |
| Option 9a | | nGram.out.1.50.MapDbS.30 -> nGram.out.1.${YEAR} | nGram.out.2.50.MapDbS.30 -> nGram.out.2.${YEAR} | nGram.out.3.50.MapDbS.30 -> nGram.out.3.${YEAR} | nGram.out.4.50.MapDbS.30 -> nGram.out.4.${YEAR} | nGram.out.5.50.MapDbS.30 -> nGram.out.5.${YEAR} |