This page describes how n-grams (n = 1-5) were generated from MEDLINE.2014 using the split-and-combine algorithm.
| Program | Preprocess | unigrams | bigrams | trigrams | fourgrams | fivegrams |
|---|---|---|---|---|---|---|
| N |   | n=1 | n=2 | n=3 | n=4 | n=5 |
| Option 1 - GenPmidTiAbSentenceFiles<br>- PmidTiAbSentences{YY}n{DDDD}.txt | < 1 hr.<br>PmidTiAbS14: 1-746 |   |   |   |   |   |
| Option 10 - Gen split n-gram<br>- GetNGramFromSentenceFiles<br>- MAX_CL = 50<br>- 1.Split |   | Documents: 22,356,869<br>Sentences: 126,612,705<br>Tokens: 2,610,209,406 (100%)<br>split = 1, no split<br>n-grams (unique tokens): 21,530,469<br>nGram.out.1.heap.50.s1.1<br>split = 4<br>n-grams (not unique): 28,877,339 | split = 4<br>2-gram (not unique): 270,862,934<br>nGram.out.2.heap.50.c4.1<br>...<br>nGram.out.2.heap.50.c4.4 | split = 8<br>3-gram (not unique): 952,453,940<br>nGram.out.3.heap.50.c8.1<br>...<br>nGram.out.3.heap.50.c8.8 | split = 20<br>4-gram (not unique): 1,659,414,636<br>nGram.out.4.heap.50.c20.1<br>...<br>nGram.out.4.heap.50.c20.20 | split = 20<br>5-gram (not unique): 1,882,559,441<br>nGram.out.5.heap.50.s20.1<br>...<br>nGram.out.5.heap.50.s20.20 |
| Option 11 - Group split n-grams by alphabetic characters<br>- GroupSpliteNGrams<br>- 2.Group |   | 1-gram: 21,530,469<br>Size: 348 MB<br>g1.NO-A<br>g2.A-Z<br>g3.Z-NO | 2-gram: 205,868,398<br>Size: 4.2 GB<br>g1.NO-A<br>g2.A-Z<br>g3.Z-NO | 3-gram: 703,148,136<br>Size: 18 GB<br>g1.NO-A<br>g2.A-Z<br>g3.Z-g<br>g4.g-s<br>g5.s-NO | 4-gram: 1,295,096,308<br>Size: 40 GB<br>g01.No-A<br>g02.A-M<br>g03.M-Z<br>g04.Z-c<br>g05.c-e<br>g06.e-k<br>g07.k-p<br>g08.p-s<br>g09.s-v<br>g10.v-NO | 5-gram: 1,665,248,566<br>Size: 61 GB<br>g01.NO-A<br>g02.A-M<br>g03.M-Z<br>g04.Z-b<br>g05.b-d<br>g06.d-e<br>g07.e-i<br>g08.i-n<br>g09.n-p<br>g10.p-q<br>g11.q-s<br>g12.s-t<br>g13.t-u<br>g14.u-NO |
| Option 12 - Filter by WC and combine alphabetic n-grams<br>- FilterWcCombineNGrams<br>- WC = 30, 50<br>- 3.FilterCombine |   | 1-gram.2014.30 (804,382)<br>1-gram.2014.50 (564,244) | 2-gram.2014.30 (4,587,349)<br>2-gram.2014.50 (2,979,558) | 3-gram.2014.30 (6,287,536)<br>3-gram.2014.50 (3,691,583) | 4-gram.2014.30 (3,799,377)<br>4-gram.2014.50 (2,039,445) | 5-gram.2014.30 (1,545,175)<br>5-gram.2014.50 (773,277) |
| Option 13 - Sort n-grams by dwt, tdw<br>- NGramFilter<br>- 4.nGram |   |   |   |   |   |   |
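The split, group, and filter-combine steps listed in the table can be illustrated with a small Python sketch. This is not the code of GetNGramFromSentenceFiles, GroupSpliteNGrams, or FilterWcCombineNGrams; every function name below is hypothetical. It assumes that sentences are assigned to splits round-robin, that grouping uses the first character of the n-gram, and that WC is a minimum count threshold applied inclusively. MAX_CL is not modeled.

```python
from bisect import bisect_right
from collections import Counter


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a space-joined string."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])


def count_splits(sentences, n, splits):
    """Split step: count n-grams separately in `splits` chunks.

    The same n-gram can occur in several chunks, so the total number of
    entries over all chunks is "not unique".
    """
    chunks = [Counter() for _ in range(splits)]
    for idx, sentence in enumerate(sentences):
        # Round-robin assignment is an assumption; any partition would do.
        chunks[idx % splits].update(ngrams(sentence.split(), n))
    return chunks


def group_by_first_char(chunks, boundaries):
    """Group step: merge chunk counts into alphabetic ranges.

    `boundaries` is a sorted list of characters. An n-gram lands in group k
    when exactly k boundaries are <= its first character, so ["A", "Z"]
    gives three groups comparable to NO-A, A-Z, and Z-NO.
    """
    groups = [Counter() for _ in range(len(boundaries) + 1)]
    for chunk in chunks:
        for gram, count in chunk.items():
            groups[bisect_right(boundaries, gram[0])][gram] += count
    return groups


def filter_combine(groups, min_wc):
    """Filter-combine step: keep n-grams whose merged count is >= min_wc."""
    combined = {}
    for group in groups:
        for gram in sorted(group):
            if group[gram] >= min_wc:
                combined[gram] = group[gram]
    return combined


# Toy input standing in for the sentence files.
sentences = ["the cat sat", "the cat ran", "a dog sat", "the cat sat"]
chunks = count_splits(sentences, n=2, splits=2)
groups = group_by_first_char(chunks, boundaries=["A", "Z"])
result = filter_combine(groups, min_wc=2)
```

In this toy run the two chunks hold 7 entries in total, while grouping merges them into 5 unique bigrams, which mirrors how the "not unique" totals in the split row shrink in the group row. Each group can be merged and filtered on its own, which is presumably why the larger n-gram sets are divided into more alphabetic ranges.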