This page describes the details of generating n-grams (n = 1-5) from MEDLINE.2015 using the split-and-combine algorithm.
| Program | Preprocess | unigrams | bigrams | trigrams | fourgrams | fivegrams |
|---|---|---|---|---|---|---|
| N |   | n=1 | n=2 | n=3 | n=4 | n=5 |
| Approx. time (hr.) | 1 | 0.4 | 1.4 | 4.1 | 18.7 | 25.5 |

Option 1- GenPmidTiAbSentenceFiles
 - PmidTiAbSentences{YY}n{DDDD}.txt

Preprocess:
 - < 1 hr.
 - PmidTiAbS15: 1-779

Option 10- Gen split n-gram
 - GetNGramFromSentenceFiles
 - MAX_CL = 50 
 - 1.Split:

unigrams:
 - param: 10,1
 - 20 min.
 - split: 1, no split
 - Documents: 23,343,329
 - Sentences: 134,834,507
 - Tokens: 2,786,085,158
 - n-grams (unique tokens): 22,779,973
 - nGram.out.1.heap.50.s01.0001-0779

bigrams:
 - param: 10,2
 - 55 min.
 - split: 2
 - 2-gram (not unique): 252,869,058
 - ${NGram}.s01.0001-0591
 - ${NGram}.s02.0592-0779

trigrams:
 - param: 10,3
 - 1 hr. 50 min.
 - split: 7
 - 3-gram (not unique): 1,018,482,231
 - s01.0001-0298
 - s02.0299-0403
 - s03.0404-0492
 - s04.0493-0573
 - s05.0574-0651
 - s06.0652-0725
 - s07.0726-0779

fourgrams:
 - param: 10,4
 - 5 hr. 43 min.
 - split: 12
 - 4-gram (not unique): 1,717,419,118
 - s01.0001-0238
 - s02.0239-0315
 - s03.0316-0376
 - s04.0377-0431
 - s05.0432-0482
 - s06.0483-0532
 - s07.0533-0580
 - s08.0581-0627
 - s09.0628-0673
 - s10.0674-0718
 - s11.0719-0762
 - s12.0763-0779

fivegrams:
 - param: 10,5
 - 11 hr.
 - split: 14
 - 5-gram (not unique): 1,991,428,282
 - s01.0001-0226
 - s02.0227-0295
 - s03.0296-0350
 - s04.0351-0399
 - s05.0400-0444
 - s06.0445-0488
 - s07.0489-0531
 - s08.0532-0573
 - s09.0574-0614
 - s10.0615-0655
 - s11.0656-0695
 - s12.0696-0734
 - s13.0735-0772
 - s14.0773-0779

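The split step can be illustrated with a minimal Python sketch. It is not the GetNGramFromSentenceFiles implementation. It assumes that MAX_CL is a maximum n-gram length in characters, that tokens are whitespace-delimited, and that a new split starts once the in-memory table reaches a size limit. The names `ngrams`, `split_counts`, and `max_entries` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

MAX_CL = 50  # assumption: maximum n-gram length in characters


def ngrams(sentence: str, n: int) -> Iterator[str]:
    """Yield whitespace-delimited n-grams of at most MAX_CL characters."""
    tokens = sentence.split()
    for i in range(len(tokens) - n + 1):
        gram = " ".join(tokens[i:i + n])
        if len(gram) <= MAX_CL:
            yield gram


def split_counts(
    files: Iterable[Tuple[int, List[str]]], n: int, max_entries: int
) -> Iterator[Tuple[int, int, Counter]]:
    """Yield (first_file, last_file, counts) for each split.

    A split is closed after the file that brings the table to at least
    max_entries distinct n-grams (hypothetical memory limit).
    """
    counts: Counter = Counter()
    first = last = None
    for file_no, sentences in files:
        if first is None:
            first = file_no
        last = file_no
        for sentence in sentences:
            counts.update(ngrams(sentence, n))
        if len(counts) >= max_entries:
            yield first, last, counts
            counts, first = Counter(), None
    if first is not None:
        yield first, last, counts


# Toy input: three "sentence files", numbered like the 0001-0779 ranges above.
files = [(1, ["a b c", "a b"]), (2, ["b c d"]), (3, ["a b"])]
splits = list(split_counts(files, n=2, max_entries=3))
```

In this toy run the bigram "a b" lands in both splits, so the per-split totals add up to more than the number of distinct bigrams. This is one plausible reading of the "(not unique)" counts reported for n = 2 to 5.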
Option 11- Group split n-grams by alphabetic characters
 - GroupSpliteNGrams
 - 2.Group:

unigrams:
 - param:
 - 1 min.
 - ${NGram}.g01.NO-NO
 - 1-gram: 22,779,973
 - size: 367 MB

bigrams:
 - param:
 - 16 min.
 - ${NGram}.g01.NO-a
 - ${NGram}.g02.a-NO
 - 2-gram: 217,447,811
 - size: 4.4 GB

trigrams:
 - param:
   - 11,3,01,NO,Z
   - 11,3,02,Z,e
   - 11,3,03,e,k
   - 11,3,04,k,s
   - 11,3,05,s,NO
 - 2 hr. 5 min.
 - g01.NO-Z
 - g02.Z-e
 - g03.e-k
 - g04.k-s
 - g05.s-NO
 - 3-gram: 744,721,406
 - size: 19 GB

fourgrams:
 - param:
   - g01.NO-F
   - g02.F-a
   - g03.a-c
   - g04.c-e
   - g05.e-f
   - g06.f-k
   - g07.k-p
   - g08.p-s
   - g09.s-u
   - g10.u-NO
 - 12 hr. 40 min.
 - 4-gram: 1,375,850,664
 - size: 42 GB

fivegrams:
 - param:
   - g01.NO-A
   - g02.A-M
   - g03.M-a
   - g04.a-b
   - g05.b-d
   - g06.d-f
   - g07.f-i
   - g08.i-l
   - g09.l-n
   - g10.n-p
   - g11.p-r
   - g12.r-s
   - g13.s-t
   - g14.t-u
   - g15.u-NO
 - 14 hr. 5 min.
 - 5-gram: 1,772,937,004
 - size: 64 GB

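The grouping step can be sketched as follows. This is a rough illustration rather than the GroupSpliteNGrams code. It assumes that each group covers a half-open range [lower, upper) of n-grams compared by character code (so uppercase sorts before lowercase, matching bounds such as Z-e), and that NO means "no bound". The helpers `in_group` and `group_splits` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def in_group(gram: str, lower: Optional[str], upper: Optional[str]) -> bool:
    """True if gram is in [lower, upper); None stands for the NO bound."""
    return (lower is None or gram >= lower) and (upper is None or gram < upper)


def group_splits(
    splits: Iterable[Counter], lower: Optional[str], upper: Optional[str]
) -> Counter:
    """Merge the counts of every split for the n-grams inside one group."""
    merged: Counter = Counter()
    for counts in splits:
        for gram, count in counts.items():
            if in_group(gram, lower, upper):
                merged[gram] += count
    return merged


# Toy split tables; "cell line" occurs in both splits.
splits = [
    Counter({"Zinc finger": 2, "cell line": 1}),
    Counter({"cell line": 3, "the cell": 5}),
]
# Bounds in the style of g01.NO-Z, g02.Z-e, g03.e-NO.
bounds = [(None, "Z"), ("Z", "e"), ("e", None)]
groups = [group_splits(splits, lo, hi) for lo, hi in bounds]
```

Because the ranges do not overlap, each distinct n-gram ends up in exactly one group with its counts summed. That would be consistent with the grouped totals above being smaller than the "(not unique)" totals of the split step.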
Option 12- Filter by WC and combine alphabetic n-grams
 - FilterWcCombineNGrams
 - WC = 30
 - 3.FilterCombine:

unigrams:
 - param: 12, 1, 30
 - 1 min.
 - 1-gram.2015.30: 843,206

bigrams:
 - param: 12, 2, 30
 - 3 min.
 - 2-gram.2015.30: 4,845,965

trigrams:
 - param: 12, 3, 30
 - 10 min.
 - 3-gram.2015.30: 6,702,194

fourgrams:
 - param: 12, 4, 30
 - 18 min.
 - 4-gram.2015.30: 4,082,612

fivegrams:
 - param: 12, 5, 30
 - 25 min.
 - 5-gram.2015.30: 1,674,715

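A minimal sketch of the filter-and-combine step follows. It assumes that WC is the word count (total occurrences) of an n-gram and that n-grams with WC of at least 30 are kept; whether the real threshold is inclusive is not stated on this page. The function `filter_combine` is a hypothetical name, not the FilterWcCombineNGrams API.

```python
from typing import Dict, List, Tuple

WC = 30  # threshold from this page; inclusive comparison is an assumption


def filter_combine(
    groups: List[Dict[str, int]], min_wc: int = WC
) -> List[Tuple[str, int]]:
    """Concatenate the groups in order, keeping n-grams with count >= min_wc."""
    kept: List[Tuple[str, int]] = []
    for counts in groups:
        for gram in sorted(counts):
            if counts[gram] >= min_wc:
                kept.append((gram, counts[gram]))
    return kept


# Toy groups in alphabetic group order (g01, g02).
groups = [{"Zinc finger": 45, "cell line": 29}, {"the cell": 30}]
combined = filter_combine(groups)
```

Since the groups cover disjoint alphabetic ranges, concatenating them in group order needs no further merging of counts.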
Option 13- Sort n-grams by dwt, tdw
 - NGramFilter
 - 3.FilterCombine:

unigrams:
 - param: 13, 1, 30
 - 1 min.
 - 1-gram.2015.30.dwt: 843,206

bigrams:
 - param: 13, 2, 30
 - 1 min.
 - 2-gram.2015.30.dwt: 4,845,965

trigrams:
 - param: 13, 3, 30
 - 1 min.
 - 3-gram.2015: 6,702,194

fourgrams:
 - param: 13, 4, 30
 - 1 min.
 - 4-gram.2015: 4,082,612

fivegrams:
 - param: 13, 5, 30
 - 1 min.
 - 5-gram.2015: 1,674,715

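The sort step might look like the sketch below. It rests on an unconfirmed reading of the suffixes: dwt is taken to mean "document count, word count, term" and tdw "term, document count, word count", with counts in descending order and terms in ascending order. The `NGram` record and both sort functions are hypothetical and do not reflect the NGramFilter interface.

```python
from typing import List, NamedTuple


class NGram(NamedTuple):
    doc_count: int   # number of documents containing the term (assumed field)
    word_count: int  # total occurrences of the term (assumed field)
    term: str


def sort_dwt(rows: List[NGram]) -> List[NGram]:
    """Order by document count, then word count (both descending), then term."""
    return sorted(rows, key=lambda r: (-r.doc_count, -r.word_count, r.term))


def sort_tdw(rows: List[NGram]) -> List[NGram]:
    """Order by term, then document count and word count (both descending)."""
    return sorted(rows, key=lambda r: (r.term, -r.doc_count, -r.word_count))


rows = [
    NGram(10, 40, "cell line"),
    NGram(25, 31, "the cell"),
    NGram(10, 55, "Zinc finger"),
]
by_dwt = [r.term for r in sort_dwt(rows)]
by_tdw = [r.term for r in sort_tdw(rows)]
```

Sorting only reorders rows, which agrees with the n-gram counts for Option 13 being identical to those for Option 12.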