The SPECIALIST Lexicon

The MEDLINE.2015 N-gram Set

This page describes how n-grams (n = 1-5) were generated from MEDLINE.2015 using a split-and-combine algorithm.

Program              Preprocess   unigrams   bigrams   trigrams   fourgrams   fivegrams
N                                 n=1        n=2       n=3        n=4         n=5
Approx. Time (Hr.)   1            0.4        1.4       4.1        18.7        25.5
Option 1
  • GenPmidTiAbSentenceFiles
  • PmidTiAbSentences{YY}n{DDDD}.txt
  • < 1 hr.
  • PmidTiAbS15: 1-779
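
As an illustration of the file-name pattern above, the following Python sketch expands `PmidTiAbSentences{YY}n{DDDD}.txt` for the 2015 set. It assumes `{YY}` is the two-digit year, `{DDDD}` is a zero-padded four-digit file number, and `n` is a literal character. The page does not spell these out, and the helper name is hypothetical.

```python
def sentence_file_name(year: int, file_no: int) -> str:
    """Build a sentence-file name from the pattern PmidTiAbSentences{YY}n{DDDD}.txt.

    Assumption: YY is the two-digit year and DDDD is a zero-padded file number.
    """
    return f"PmidTiAbSentences{year % 100:02d}n{file_no:04d}.txt"


# File numbers 1-779, as listed for PmidTiAbS15 above.
names = [sentence_file_name(2015, i) for i in range(1, 780)]
```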
     
Option 10
  • Gen split n-gram
  • GetNGramFromSentenceFiles
  • MAX_CL = 50

  • 1. Split:
 
  • param: 10,1
  • 20 min.

  • split: 1, no split
  • Documents: 23,343,329
  • Sentences: 134,834,507
  • Tokens: 2,786,085,158
  • n-grams (unique tokens): 22,779,973

  • nGram.out.1.heap.50.s01.0001-0779
  • param: 10,2
  • 55 min.

  • split: 2
  • 2-gram (not unique): 252,869,058

  • ${NGram}.s01.0001-0591
  • ${NGram}.s02.0592-0779
  • param: 10,3
  • 1 hr. 50 min.

  • split: 7
  • 3-gram (not unique): 1,018,482,231

  • s01.0001-0298
  • s02.0299-0403
  • s03.0404-0492
  • s04.0493-0573
  • s05.0574-0651
  • s06.0652-0725
  • s07.0726-0779
  • param: 10,4
  • 5 hr. 43 min.

  • split: 12
  • 4-gram (not unique): 1,717,419,118

  • s01.0001-0238
  • s02.0239-0315
  • s03.0316-0376
  • s04.0377-0431
  • s05.0432-0482
  • s06.0483-0532
  • s07.0533-0580
  • s08.0581-0627
  • s09.0628-0673
  • s10.0674-0718
  • s11.0719-0762
  • s12.0763-0779
  • param: 10,5
  • 11 hr.

  • split: 14
  • 5-gram (not unique): 1,991,428,282

  • s01.0001-0226
  • s02.0227-0295
  • s03.0296-0350
  • s04.0351-0399
  • s05.0400-0444
  • s06.0445-0488
  • s07.0489-0531
  • s08.0532-0573
  • s09.0574-0614
  • s10.0615-0655
  • s11.0656-0695
  • s12.0696-0734
  • s13.0735-0772
  • s14.0773-0779
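
The split step can be pictured as counting n-grams sentence by sentence within one range of sentence files. The Python sketch below is a minimal illustration, not the GetNGramFromSentenceFiles program itself. It assumes whitespace tokenization and that `MAX_CL` caps the n-gram length in characters, which the page does not define. The function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List

MAX_CL = 50  # assumption: maximum character length of an n-gram


def ngrams(tokens: List[str], n: int, max_cl: int = MAX_CL) -> List[str]:
    """Return the n-grams of one tokenized sentence, skipping overly long ones."""
    # A sentence with k tokens yields max(0, k - n + 1) candidate n-grams.
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return [g for g in grams if len(g) <= max_cl]


def count_ngrams(sentences: Iterable[str], n: int) -> Counter:
    """Count n-gram occurrences over whitespace-tokenized sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.split(), n))
    return counts
```

Running `count_ngrams` separately on each file range (for example 0001-0591 and 0592-0779 for bigrams) would give one partial count table per split, and the same n-gram can then appear in more than one split.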
Option 11
  • Group split n-grams by alphabetic characters
  • GroupSpliteNGrams

  • 2. Group:
 
  • param:
    • 11,1,01,NO,NO
  • 1 min.

  • ${NGram}.g01.NO-NO

  • 1-gram: 22,779,973
  • size: 367 Mb
  • param:
    • 11,2,01,NO,a
    • 11,2,02,a,NO
  • 16 min.

  • ${NGram}.g01.NO-a
  • ${NGram}.g02.a-NO

  • 2-gram: 217,447,811
  • size: 4.4 Gb
  • param:
    • 11,3,01,NO,Z
    • 11,3,02,Z,e
    • 11,3,03,e,k
    • 11,3,04,k,s
    • 11,3,05,s,NO
  • 2 hr. 5 min.

  • g01.NO-Z
  • g02.Z-e
  • g03.e-k
  • g04.k-s
  • g05.s-NO

  • 3-gram: 744,721,406
  • size: 19 Gb
  • param:
    • g01.NO-F
    • g02.F-a
    • g03.a-c
    • g04.c-e
    • g05.e-f
    • g06.f-k
    • g07.k-p
    • g08.p-s
    • g09.s-u
    • g10.u-NO
  • 12 hr. 40 min.

  • 4-gram: 1,375,850,664
  • size: 42 Gb
  • param:
    • g01.NO-A
    • g02.A-M
    • g03.M-a
    • g04.a-b
    • g05.b-d
    • g06.d-f
    • g07.f-i
    • g08.i-l
    • g09.l-n
    • g10.n-p
    • g11.p-r
    • g12.r-s
    • g13.s-t
    • g14.t-u
    • g15.u-NO
  • 14 hr. 5 min.

  • 5-gram: 1,772,937,004
  • size: 64 Gb
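
The grouping step can be sketched as merging the per-split counts for every n-gram that falls inside one alphabetic range. The Python example below is a hedged illustration, not the actual implementation. It assumes each range is half-open (start inclusive, end exclusive), that `NO` means no bound on that side, and that n-grams are compared in code-point order, which puts uppercase letters before lowercase ones, as ranges such as `NO-Z` followed by `Z-e` suggest. The function names are hypothetical.

```python
from collections import Counter
from typing import Iterable

NO = "NO"  # marker used in the parameters above; assumed to mean "no bound"


def in_group(ngram: str, start: str, end: str) -> bool:
    """True if ngram lies in [start, end); "NO" leaves that side open."""
    after_start = start == NO or ngram >= start
    before_end = end == NO or ngram < end
    return after_start and before_end


def group_counts(split_counts: Iterable[Counter], start: str, end: str) -> Counter:
    """Merge counts from all splits for the n-grams inside one range."""
    merged: Counter = Counter()
    for counts in split_counts:
        for ngram, wc in counts.items():
            if in_group(ngram, start, end):
                merged[ngram] += wc
    return merged
```

With half-open ranges, every n-gram lands in exactly one group, so an n-gram that occurs in several splits is merged into a single entry.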
Option 12
  • Filter by WC and combine alphabetic n-grams
  • FilterWcCombineNGrams
  • WC = 30

  • 3. FilterCombine:
 
  • param: 12, 1, 30
  • 1 min.

  • 1-gram.2015.30: 843,206
  • param: 12, 2, 30
  • 3 min.

  • 2-gram.2015.30: 4,845,965
  • param: 12, 3, 30
  • 10 min.

  • 3-gram.2015.30: 6,702,194
  • param: 12, 4, 30
  • 18 min.

  • 4-gram.2015.30: 4,082,612
  • param: 12, 5, 30
  • 25 min.

  • 5-gram.2015.30: 1,674,715
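
A minimal Python sketch of the filter-and-combine idea follows. It assumes `WC = 30` is an inclusive minimum word count (keep n-grams seen at least 30 times) and that the alphabetic groups are concatenated in order. The page states neither detail explicitly, and the function name is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

WC = 30  # word-count threshold from the parameters above


def filter_and_combine(
    groups: Iterable[Dict[str, int]], min_wc: int = WC
) -> List[Tuple[str, int]]:
    """Keep n-grams with word count >= min_wc and concatenate the groups in order."""
    kept: List[Tuple[str, int]] = []
    for group in groups:
        kept.extend((ngram, wc) for ngram, wc in group.items() if wc >= min_wc)
    return kept
```

Filtering each group before combining keeps the combined result small: for 5-grams the counts above drop from 1,772,937,004 grouped entries to 1,674,715 after the filter.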
Option 13
  • Sort n-grams by dwt, tdw
  • NGramFilter

  • 3. FilterCombine:
 
  • param: 13, 1, 30
  • 1 min.

  • 1-gram.2015.30.dwt: 843,206
  • param: 13, 2, 30
  • 1 min.

  • 2-gram.2015.30.dwt: 4,845,965
  • param: 13, 3, 30
  • 1 min.

  • 3-gram.2015: 6,702,194
  • param: 13, 4, 30
  • 1 min.

  • 4-gram.2015: 4,082,612
  • param: 13, 5, 30
  • 1 min.

  • 5-gram.2015: 1,674,715
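
The two sort orders can be illustrated with the Python sketch below. It assumes `dwt` and `tdw` name field orders, with d = document count, w = word count, and t = term; the page does not define the abbreviations. The sort directions (counts descending for `dwt`, term ascending for `tdw`) and all names in the code are likewise assumptions.

```python
from typing import Iterable, List, NamedTuple


class NGramRecord(NamedTuple):
    doc_count: int   # d: number of documents containing the n-gram (assumed)
    word_count: int  # w: total occurrences of the n-gram (assumed)
    term: str        # t: the n-gram itself


def sort_dwt(records: Iterable[NGramRecord]) -> List[NGramRecord]:
    """Order by document count, then word count (both descending), then term."""
    return sorted(records, key=lambda r: (-r.doc_count, -r.word_count, r.term))


def sort_tdw(records: Iterable[NGramRecord]) -> List[NGramRecord]:
    """Order by term, then document count, then word count (all ascending)."""
    return sorted(records, key=lambda r: (r.term, r.doc_count, r.word_count))
```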