The SPECIALIST Lexicon

N-gram Set by MapDb (persistent key-value DB)

TBD...
This page describes how n-grams (n = 1-5) are generated from MEDLINE using a split-and-combine algorithm. Because the n-gram set is too large for a Java HashMap, the n-gram retrieval process can be split (by

I. Basic N-gram Set
N-grams (uniGram, biGram, triGram, fourthGram, fifthGram) are retrieved from MEDLINE as follows:
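
As a rough illustration of what retrieving n-grams from one sentence involves, here is a minimal sketch. It is not the SPECIALIST implementation: the class name NGramSketch, the method extract, and the whitespace tokenization are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NGramSketch {
    /**
     * Returns every contiguous run of n tokens in the sentence,
     * with the tokens of each run joined by a single space.
     */
    static List<String> extract(String sentence, int n) {
        List<String> result = new ArrayList<>();
        String trimmed = sentence.trim();
        if (n < 1 || trimmed.isEmpty()) {
            return result;
        }
        // Assumption: tokens are separated by whitespace.
        String[] tokens = trimmed.split("\\s+");
        // Slide a window of n tokens over the sentence.
        for (int i = 0; i + n <= tokens.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return result;
    }
}
```

For the sentence "the quick brown fox", extract with n = 2 yields "the quick", "quick brown", and "brown fox"; with n = 5 it yields nothing, because the sentence has only four tokens.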

II. MapDb:

Split the total input MEDLINE files into N portions.
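
The split step can be sketched as below. This is only an illustration under assumptions: PortionSplitter and split are hypothetical names, and the portions are simply made as even as possible; the actual tool may divide the files differently.

```java
import java.util.ArrayList;
import java.util.List;

class PortionSplitter {
    /**
     * Splits a list of input files into the given number of consecutive
     * portions whose sizes differ by at most one.
     */
    static <T> List<List<T>> split(List<T> files, int portions) {
        if (portions < 1) {
            throw new IllegalArgumentException("portions must be >= 1");
        }
        List<List<T>> result = new ArrayList<>();
        int base = files.size() / portions;   // minimum size of each portion
        int extra = files.size() % portions;  // the first 'extra' portions get one more file
        int start = 0;
        for (int p = 0; p < portions; p++) {
            int len = base + (p < extra ? 1 : 0);
            result.add(new ArrayList<>(files.subList(start, start + len)));
            start += len;
        }
        return result;
    }
}
```

Splitting 7 files into 3 portions this way gives portions of 3, 2, and 2 files, in the original order.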

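// Open a file-backed MapDB store; 'file' is the database file, defined elsewhere
DB db = DBMaker
	.newFileDB(file)
	.transactionDisable()	// no transaction log
	.mmapFileEnable()	// use memory-mapped files
	.cacheSize(1000000000)	// instance cache size
	.make();

// Persistent hash map from n-gram to its NGramObjS value;
// 'serializer' is the value serializer for NGramObjS, defined elsewhere
Map<String, NGramObjS> nGramMap = db.createHashMap("nGram")
	.valueSerializer(serializer)
	.make();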
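
Once such a map exists, counting can go through the plain java.util.Map interface. The sketch below is an assumption-laden illustration, not the tool's code: Counts is a hypothetical stand-in for NGramObjS, its docCount and wordCount fields are guesses at what is tracked, and an in-memory map can be passed in place of nGramMap for testing. Each update writes a new value back with put, because a persistent store serializes values and would not see in-place changes to an object that was read from it.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class NGramCounter {
    /** Hypothetical stand-in for NGramObjS: immutable counts for one n-gram. */
    static final class Counts {
        final int docCount;    // number of documents containing the n-gram
        final long wordCount;  // total number of occurrences
        Counts(int docCount, long wordCount) {
            this.docCount = docCount;
            this.wordCount = wordCount;
        }
    }

    /** Adds the n-grams of a single document to the map. */
    static void addDocument(Map<String, Counts> map, Iterable<String> nGramsInDoc) {
        Set<String> seenInDoc = new HashSet<>();
        for (String nGram : nGramsInDoc) {
            // The document count goes up only on the first occurrence in this document.
            boolean firstInDoc = seenInDoc.add(nGram);
            Counts old = map.get(nGram);
            int docCount = (old == null ? 0 : old.docCount) + (firstInDoc ? 1 : 0);
            long wordCount = (old == null ? 0L : old.wordCount) + 1L;
            // Write a fresh value back so a persistent map stores the update.
            map.put(nGram, new Counts(docCount, wordCount));
        }
    }
}
```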

III. Filter (WC) and combine:

Combine the n-grams and filter them by word count (WC); the n-grams filtered out make up most of the higher-order n-gram sets.
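
A minimal sketch of combining per-portion counts and then filtering by WC follows. The names NGramCombiner, combine, filterByWc, and the threshold parameter minWc are hypothetical, and word counts are reduced to a single long per n-gram for brevity. The sketch combines before filtering, since an n-gram whose count is below the threshold in every portion can still reach it in total.

```java
import java.util.Map;
import java.util.TreeMap;

class NGramCombiner {
    /** Sums the word counts of each n-gram across all portions. */
    static Map<String, Long> combine(Iterable<Map<String, Long>> portions) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> portion : portions) {
            for (Map.Entry<String, Long> e : portion.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return total;
    }

    /** Keeps only the n-grams whose word count is at least minWc. */
    static Map<String, Long> filterByWc(Map<String, Long> counts, long minWc) {
        Map<String, Long> kept = new TreeMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            if (e.getValue() >= minWc) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```

For example, an n-gram counted 20 times in one portion and 15 times in another has a combined WC of 35 and survives a threshold of 30, although neither portion alone would pass.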

IV. Example Walk-through (MEDLINE.2014):

Program | Preprocess | unigrams | bigrams | trigrams | fourgrams | fivegrams
N | | n=1 | n=2 | n=3 | n=4 | n=5
Option 1
  • GenPmidTiAbSentenceFiles
  • PmidTiAbSentences{YY}n{DDDD}.txt
  • ~45 min.
  • PmidTiAbS14: 1-746
     
Option 9
  • Gen nGram
  • GetNGramFromSFilesByMapDbS
  • MAX_CL = 50
 
  • nGram.out.1.50.MapDbS (unigrams, n=1)
     • Total time: 20629.90 Sec. (5.7 Hr.)
     • Documents: 22,356,869
     • Sentences: 126,612,705
     • Tokens: 2,610,209,406 (100%)
     • n-grams (unique tokens): 21,530,469
  • nGram.out.2.50.MapDbS (bigrams, n=2)
     • Total time: 139372.43 Sec. (38.7 Hr.)
     • Documents: 22,356,869
     • Sentences: 126,612,705
     • Tokens: 2,610,209,406 (100%)
     • n-grams (unique tokens): 205,868,398
  • nGram.out.3.50.MapDbS (trigrams, n=3)
     • Total time:
     • Documents: 22,356,869
     • Sentences: 126,612,705
     • Tokens: 2,610,209,406 (100%)
     • n-grams (unique tokens):
  • nGram.out.4.50.MapDbS (fourgrams, n=4)
     • Total time:
     • Documents: 22,356,869
     • Sentences: 126,612,705
     • Tokens: 2,610,209,406 (100%)
     • n-grams (unique tokens):
  • nGram.out.5.50.MapDbS (fivegrams, n=5)
     • Total time:
     • Documents: 22,356,869
     • Sentences: 126,612,705
     • Tokens: 2,610,209,406 (100%)
     • n-grams (unique tokens):
Option 9a
  • Filter nGrams by WC and then sort by dwt
  • NGramFilter
  • nGram.out.1.50.MapDbS.30 -> nGram.out.1.${YEAR}
  • nGram.out.2.50.MapDbS.30 -> nGram.out.2.${YEAR}
  • nGram.out.3.50.MapDbS.30 -> nGram.out.3.${YEAR}
  • nGram.out.4.50.MapDbS.30 -> nGram.out.4.${YEAR}
  • nGram.out.5.50.MapDbS.30 -> nGram.out.5.${YEAR}