This page describes how n-grams (n = 1 to 5) are generated from MEDLINE.2015 using the split-and-combine algorithm.
Program | Preprocess | unigrams | bigrams | trigrams | fourgrams | fivegrams
---|---|---|---|---|---|---
N | | n=1 | n=2 | n=3 | n=4 | n=5
Approx. time (hr.) | 1 | 0.4 | 1.4 | 4.1 | 18.7 | 25.5
Option 1- GenPmidTiAbSentenceFiles
- PmidTiAbSentences{YY}n{DDDD}.txt
| - < 1 hr.
- PmidTiAbS15: 1-779
| | | | |
|
Option 10- Generate split n-grams
- GetNGramFromSentenceFiles
- MAX_CL = 50
- 1.Split:
|
| - param: 10,1
- 20 min.
- split: 1, no split
- Documents: 23,343,329
- Sentences: 134,834,507
- Tokens: 2,786,085,158
- n-grams (unique tokens): 22,779,973
- nGram.out.1.heap.50.s01.0001-0779
| - param: 10,2
- 55 min.
- split: 2
- 2-gram (not unique): 252,869,058
- ${NGram}.s01.0001-0591
- ${NGram}.s02.0592-0779
| - param: 10,3
- 1 hr. 50 min.
- split: 7
- 3-gram (not unique): 1,018,482,231
- s01.0001-0298
- s02.0299-0403
- s03.0404-0492
- s04.0493-0573
- s05.0574-0651
- s06.0652-0725
- s07.0726-0779
| - param: 10,4
- 5 hr. 43 min.
- split: 12
- 4-gram (not unique): 1,717,419,118
- s01.0001-0238
- s02.0239-0315
- s03.0316-0376
- s04.0377-0431
- s05.0432-0482
- s06.0483-0532
- s07.0533-0580
- s08.0581-0627
- s09.0628-0673
- s10.0674-0718
- s11.0719-0762
- s12.0763-0779
| - param: 10,5
- 11 hr.
- split: 14
- 5-gram (not unique): 1,991,428,282
- s01.0001-0226
- s02.0227-0295
- s03.0296-0350
- s04.0351-0399
- s05.0400-0444
- s06.0445-0488
- s07.0489-0531
- s08.0532-0573
- s09.0574-0614
- s10.0615-0655
- s11.0656-0695
- s12.0696-0734
- s13.0735-0772
- s14.0773-0779
|
Option 11- Group split n-grams by alphabetic characters
- GroupSpliteNGrams
- 2.Group:
|
| - param:
- 1 min.
- ${NGram}.g01.NO-NO
- 1-gram: 22,779,973
- size: 367 Mb
| - param:
- 16 min.
- ${NGram}.g01.NO-a
- ${NGram}.g02.a-NO
- 2-gram: 217,447,811
- size: 4.4 Gb
| - param:
- 11,3,01,NO,Z
- 11,3,02,Z,e
- 11,3,03,e,k
- 11,3,04,k,s
- 11,3,05,s,NO
- 2 hr. 5 min.
- g01.NO-Z
- g02.Z-e
- g03.e-k
- g04.k-s
- g05.s-NO
- 3-gram: 744,721,406
- size: 19 Gb
| - param:
- g01.NO-F
- g02.F-a
- g03.a-c
- g04.c-e
- g05.e-f
- g06.f-k
- g07.k-p
- g08.p-s
- g09.s-u
- g10.u-NO
- 12 hr. 40 min.
- 4-gram: 1,375,850,664
- size: 42 Gb
| - param:
- g01.NO-A
- g02.A-M
- g03.M-a
- g04.a-b
- g05.b-d
- g06.d-f
- g07.f-i
- g08.i-l
- g09.l-n
- g10.n-p
- g11.p-r
- g12.r-s
- g13.s-t
- g14.t-u
- g15.u-NO
- 14 hr. 5 min.
- 5-gram: 1,772,937,004
- size: 64 Gb
|
Option 12- Filter by WC and combine alphabetic n-grams
- FilterWcCombineNGrams
- WC = 30
- 3.FilterCombine:
|
| - param: 12, 1, 30
- 1 min.
- 1-gram.2015.30: 843,206
| - param: 12, 2, 30
- 3 min.
- 2-gram.2015.30: 4,845,965
| - param: 12, 3, 30
- 10 min.
- 3-gram.2015.30: 6,702,194
| - param: 12, 4, 30
- 18 min.
- 4-gram.2015.30: 4,082,612
| - param: 12, 5, 30
- 25 min.
- 5-gram.2015.30: 1,674,715
|
Option 13- Sort n-grams by dwt, tdw
- NGramFilter
- 3.FilterCombine:
|
| - param: 13, 1, 30
- 1 min.
- 1-gram.2015.30.dwt: 843,206
| - param: 13, 2, 30
- 1 min.
- 2-gram.2015.30.dwt: 4,845,965
| - param: 13, 3, 30
- 1 min.
- 3-gram.2015: 6,702,194
| - param: 13, 4, 30
- 1 min.
- 4-gram.2015: 4,082,612
| - param: 13, 5, 30
- 1 min.
- 5-gram.2015: 1,674,715
|
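The three steps in the table (1.Split, 2.Group, 3.FilterCombine) can be illustrated on toy data. The following is a minimal Python sketch, not the actual GetNGramFromSentenceFiles, GroupSpliteNGrams, or FilterWcCombineNGrams programs. All function names are hypothetical. It assumes that n-grams are compared in plain character-code order (which matches ranges such as NO-Z followed by Z-e, since uppercase sorts before lowercase), that `NO` means "no bound", and that the lower bound of a range is inclusive and the upper bound exclusive. The source does not state the last two points.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a space-joined string."""
    return (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def count_split(sentences, n):
    """Step 1 (split): count the n-grams in one slice of the sentence files."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.split(), n))
    return counts


def in_range(ngram, low, high):
    """Range test for one group; None stands in for the NO bound (assumed)."""
    return (low is None or ngram >= low) and (high is None or ngram < high)


def group_splits(split_counts, low, high):
    """Step 2 (group): merge the counts from every split for one range."""
    merged = Counter()
    for counts in split_counts:
        for ngram, count in counts.items():
            if in_range(ngram, low, high):
                merged[ngram] += count
    return merged


def filter_combine(groups, min_wc):
    """Step 3 (filter and combine): keep n-grams whose count is >= min_wc."""
    combined = {}
    for group in groups:
        combined.update({g: c for g, c in group.items() if c >= min_wc})
    return combined


# Toy input: two splits of sentences, bigrams (n = 2), two groups, WC = 2.
splits = [
    ["heart attack risk", "heart attack"],
    ["acute heart attack", "risk of heart attack"],
]
split_counts = [count_split(sentences, 2) for sentences in splits]
ranges = [(None, "h"), ("h", None)]  # analogous to g01.NO-h and g02.h-NO
groups = [group_splits(split_counts, low, high) for low, high in ranges]
result = filter_combine(groups, min_wc=2)
print(result)  # {'heart attack': 4}
```

In this toy run the two splits hold 6 bigram entries in total, but only 5 remain after grouping, because "heart attack" occurs in both splits and is merged into one entry. This is presumably why the "not unique" counts reported under Option 10 are larger than the per-n totals under Option 11.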