This page describes how n-grams (n = 1 to 5) are generated from MEDLINE.2014 using the split-and-combine algorithm.
Program | Preprocess | unigrams (n=1) | bigrams (n=2) | trigrams (n=3) | fourgrams (n=4) | fivegrams (n=5)
---|---|---|---|---|---|---
Option 1- GenPmidTiAbSentenceFiles
- PmidTiAbSentences{YY}n{DDDD}.txt
| - < 1 hr.
- PmidTiAbS14: 1-746
| | | | |
|
Option 10- Gen split n-gram
- GetNGramFromSentenceFiles
- MAX_CL = 50
- 1.Split
|
|
- Documents: 22,356,869
- Sentences: 126,612,705
- Tokens: 2,610,209,406 (100%)
- split = 1, no split
- 1-gram (unique): 21,530,469
- nGram.out.1.heap.50.s1.1
- split = 4
- 1-gram (not unique): 28,877,339
| - split = 4
- 2-gram (not unique): 270,862,934
- nGram.out.2.heap.50.c4.1
- ...
- nGram.out.2.heap.50.c4.4
| - split = 8
- 3-gram (not unique): 952,453,940
- nGram.out.3.heap.50.c8.1
- ...
- nGram.out.3.heap.50.c8.8
| - split = 20
- 4-gram (not unique): 1,659,414,636
- nGram.out.4.heap.50.c20.1
- ...
- nGram.out.4.heap.50.c20.20
| - split = 20
- 5-gram (not unique): 1,882,559,441
- nGram.out.5.heap.50.s20.1
- ...
- nGram.out.5.heap.50.s20.20
|
Option 11- Group split n-grams by alphabetic characters
- GroupSpliteNGrams
- 2.Group
|
| - 1-gram: 21,530,469
- size: 348 Mb
- g1.NO-A
- g2.A-Z
- g3.Z-NO
| - 2-gram: 205,868,398
- size: 4.2 Gb
- g1.NO-A
- g2.A-Z
- g3.Z-NO
| - 3-gram: 703,148,136
- size: 18 Gb
- g1.NO-A
- g2.A-Z
- g3.Z-g
- g4.g-s
- g5.s-NO
| - 4-gram: 1,295,096,308
- size: 40 Gb
- g01.No-A
- g02.A-M
- g03.M-Z
- g04.Z-c
- g05.c-e
- g06.e-k
- g07.k-p
- g08.p-s
- g09.s-v
- g10.v-NO
| - 5-gram: 1,665,248,566
- size: 61 Gb
- g01.NO-A
- g02.A-M
- g03.M-Z
- g04.Z-b
- g05.b-d
- g06.d-e
- g07.e-i
- g08.i-n
- g09.n-p
- g10.p-q
- g11.q-s
- g12.s-t
- g13.t-u
- g14.u-NO
|
Option 12- Filter by word count (WC) and combine alphabetic n-grams
- FilterWcCombineNGrams
- WC = 30, 50
- 3.FilterCombine
|
| - 1-gram.2014.30 (804,382)
- 1-gram.2014.50 (564,244)
- 2-gram.2014.30 (4,587,349)
- 2-gram.2014.50 (2,979,558)
| - 3-gram.2014.30 (6,287,536)
- 3-gram.2014.50 (3,691,583)
| - 4-gram.2014.30 (3,799,377)
- 4-gram.2014.50 (2,039,445)
| - 5-gram.2014.30 (1,545,175)
- 5-gram.2014.50 (773,277)
|
Option 13- Sort n-grams by dwt, tdw
- NGramFilter
- 4.nGram
|
| |
|
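The three steps in the table (1.Split, 2.Group, 3.FilterCombine) can be illustrated with a minimal Python sketch. It is not the code of GetNGramFromSentenceFiles, GroupSpliteNGrams, or FilterWcCombineNGrams. The function names, whitespace tokenization, round-robin assignment of sentences to splits, and the boundary strings are all hypothetical. MAX_CL is read here as a maximum character length per n-gram and WC as a minimum word-count threshold; both readings are assumptions.

```python
from bisect import bisect_right
from collections import Counter

MAX_CL = 50  # assumption: maximum character length of an n-gram


def ngrams(tokens, n):
    """Yield every n-gram of a token list as a space-joined string."""
    return (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def split_count(sentences, n, splits):
    """Step 1 (split): count n-grams separately in each of `splits` parts.

    The same n-gram can appear in several parts, so the summed number of
    entries across parts is "not unique".
    """
    parts = [Counter() for _ in range(splits)]
    for idx, sentence in enumerate(sentences):
        tokens = sentence.split()  # assumption: whitespace tokenization
        for gram in ngrams(tokens, n):
            if len(gram) <= MAX_CL:
                parts[idx % splits][gram] += 1  # hypothetical round-robin split
    return parts


def group_by_range(parts, boundaries):
    """Step 2 (group): merge the split counts into alphabetic groups.

    `boundaries` is a sorted list of strings; an n-gram goes into the group
    whose index equals the number of boundaries that are <= the n-gram.
    """
    groups = [Counter() for _ in range(len(boundaries) + 1)]
    for part in parts:
        for gram, count in part.items():
            groups[bisect_right(boundaries, gram)][gram] += count
    return groups


def filter_combine(groups, min_wc):
    """Step 3 (filter and combine): keep n-grams with count >= min_wc."""
    combined = {}
    for group in groups:
        for gram, count in group.items():
            if count >= min_wc:
                combined[gram] = count
    return combined


if __name__ == "__main__":
    sentences = ["the cell line", "the cell wall", "the cell line"]
    parts = split_count(sentences, n=2, splits=2)
    groups = group_by_range(parts, boundaries=["d"])
    print(filter_combine(groups, min_wc=2))
```

Because each n-gram lands in exactly one group, merging the parts removes the duplicates created by splitting. This is consistent with the table, where the split unigram count (28,877,339, not unique) drops back to 21,530,469 after grouping.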