N-gram Set by MapDb (persistent key-value DB)
TBD...
This page describes how n-grams (n = 1-5) are generated from MEDLINE using a split-and-combine algorithm. Because the n-grams are too numerous to fit within the limits of a Java HashMap, the n-gram retrieval process can be split (by
I. Basic N-gram Set
N-grams (unigrams, bigrams, trigrams, fourgrams, and fivegrams) are retrieved from MEDLINE as follows:
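As a rough illustration of what collecting n-grams for a given n involves, here is a minimal sketch. It is not the program used to build this set: the class name `NGramCounterSketch`, the whitespace tokenization, and the lowercasing are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: count n-grams of one order over a list of sentences.
class NGramCounterSketch {

    // Returns every run of n consecutive tokens in the sentence.
    // Assumption: tokens are separated by whitespace and lowercased.
    static List<String> nGrams(String sentence, int n) {
        List<String> result = new ArrayList<>();
        String trimmed = sentence.trim();
        if (trimmed.isEmpty() || n < 1) {
            return result;
        }
        String[] tokens = trimmed.toLowerCase().split("\\s+");
        for (int i = 0; i + n <= tokens.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return result;
    }

    // Accumulates an occurrence count for each n-gram across all sentences.
    static Map<String, Integer> count(List<String> sentences, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String sentence : sentences) {
            for (String gram : nGrams(sentence, n)) {
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("Heart attack risk"), 2));
    }
}
```

An in-memory `HashMap` like this one is exactly what stops scaling at MEDLINE size, which is the motivation for the persistent map in the next section.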
II. MapDb:
Split the total input MEDLINE files into N portions.
    DB db = DBMaker
        .newFileDB(file)
        .transactionDisable()
        .mmapFileEnable()
        .cacheSize(1000000000)
        .make();
    Map<String, NGramObjS> nGramMap = db.createHashMap("nGram")
        .valueSerializer(serializer)
        .make();
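The splitting step can be sketched with the standard library alone. This is a minimal illustration under assumptions: the class name `SplitSketch`, the method `split`, and the choice of contiguous, near-equal portions are hypothetical, and the source does not say how its portions are chosen.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: divide a list of input files into N contiguous portions.
class SplitSketch {

    // Splits items into the requested number of portions whose sizes differ by at most one.
    static <T> List<List<T>> split(List<T> items, int portions) {
        if (portions < 1) {
            throw new IllegalArgumentException("portions must be at least 1");
        }
        List<List<T>> result = new ArrayList<>();
        int size = items.size();
        for (int p = 0; p < portions; p++) {
            // Long arithmetic avoids overflow when size * portions is large.
            int from = (int) ((long) size * p / portions);
            int to = (int) ((long) size * (p + 1) / portions);
            result.add(new ArrayList<>(items.subList(from, to)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder file names, not real MEDLINE file names.
        System.out.println(split(List.of("f1", "f2", "f3", "f4", "f5"), 2));
    }
}
```

Each portion could then be processed on its own, with its counts written to a separate persistent map, so that no single run has to hold every n-gram at once.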
III. Filter (WC) and combine:
Combine the n-grams from all portions and filter them by word count (WC). The n-grams removed by this filter account for most of the entries among the higher-order n-grams.
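A minimal sketch of this step, assuming each portion yields a map from n-gram to word count. The class name `CombineFilterSketch`, the method names, and the `minWc` threshold parameter are hypothetical. The source does not state the actual threshold or whether any filtering happens before the portions are merged.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: merge per-portion counts, then drop low-count n-grams.
class CombineFilterSketch {

    // Sums the word counts of identical n-grams across all portions.
    static Map<String, Long> combine(List<Map<String, Long>> partials) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> part : partials) {
            part.forEach((gram, wc) -> total.merge(gram, wc, Long::sum));
        }
        return total;
    }

    // Keeps only n-grams whose combined word count is at least minWc (assumed threshold).
    static Map<String, Long> filterByWc(Map<String, Long> counts, long minWc) {
        Map<String, Long> kept = new TreeMap<>();
        counts.forEach((gram, wc) -> {
            if (wc >= minWc) {
                kept.put(gram, wc);
            }
        });
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Long> total = combine(List.of(
            Map.of("heart attack", 2L, "attack risk", 1L),
            Map.of("heart attack", 3L)));
        System.out.println(filterByWc(total, 3L));
    }
}
```

In this sketch the portions are combined before filtering, so an n-gram that falls below the threshold in every individual portion is still kept when its total reaches the threshold.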
IV. Example Walk-through (MEDLINE.2014):
| Program | Preprocess | unigrams | bigrams | trigrams | fourgrams | fivegrams |
|---|---|---|---|---|---|---|
| N | | n=1 | n=2 | n=3 | n=4 | n=5 |
| Option 1 | | | | | | |
| Option 9 | | | | | | |
| Option 9a | | nGram.out.1.50.MapDbS.30 -> nGram.out.1.${YEAR} | nGram.out.2.50.MapDbS.30 -> nGram.out.2.${YEAR} | nGram.out.3.50.MapDbS.30 -> nGram.out.3.${YEAR} | nGram.out.4.50.MapDbS.30 -> nGram.out.4.${YEAR} | nGram.out.5.50.MapDbS.30 -> nGram.out.5.${YEAR} |