
The SPECIALIST Lexicon

The MEDLINE.2017 N-gram Set

This page describes how n-grams (n = 1-5) were generated from MEDLINE.2017 using a split, combine, and filter algorithm.

Columns: Program, N, Approx. Time (Hr.), followed by one column per processing option:

Option 1
  • GenPmidTiAbSentenceFiles
  • PmidTiAbSentences{YY}n{DDDD}.txt
Option 10 (1. Split)
  • Gen split n-gram
  • GetNGramFromSentenceFiles
  • MAX_CL = 50
Option 11 (2. Group)
  • Group split n-grams by alphabetic characters
  • GroupSpliteNGrams
Option 12 (3. FilterCombine)
  • Filter by WC and combine alphabetic n-grams
  • FilterWcCombineNGrams
  • WC = 30
Option 13 (3. FilterCombine)
  • Sort n-grams by dwt, tdw
  • NGramFilter
Preprocess 1
  • ~ 1 hr.
  • PmidTiAbS17: 1-892
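
The split step (Option 10) can be pictured with a minimal Python sketch. It is not the GetNGramFromSentenceFiles program itself: whitespace tokenization, the function names, the reading of MAX_CL as a maximum character length, and the reading of the parenthesized param value as an in-memory cap are all assumptions made for illustration. The sketch shows why n-grams counted across several split files are "not unique": the same n-gram can be listed in more than one partial table.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def ngrams(tokens: List[str], n: int) -> Iterator[str]:
    """Yield every run of n consecutive tokens, joined by single spaces."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])


def split_counts(sentences: Iterable[str], n: int,
                 max_in_memory: int, max_chars: int = 50) -> Iterator[Counter]:
    """Count n-grams, emitting a partial table whenever the cap is reached.

    max_in_memory and max_chars are hypothetical parameters: the first stands
    in for the value shown in parentheses after "param", the second for MAX_CL.
    """
    counts: Counter = Counter()
    for sentence in sentences:
        for gram in ngrams(sentence.split(), n):  # assumed: whitespace tokens
            if len(gram) <= max_chars:
                counts[gram] += 1
        # The cap is checked once per sentence, so a table can slightly exceed it.
        if len(counts) >= max_in_memory:
            yield counts
            counts = Counter()
    if counts:
        yield counts


parts = list(split_counts(["a b c", "a b d", "a b c"], n=2, max_in_memory=3))
```

With a cap of 3, the three toy sentences produce two partial tables, and `a b` is listed in both. The per-split entries therefore add up to 5, although only 3 distinct bigrams exist.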
    
unigrams (n=1), approx. 0.5 hr.
  • param: 10,1, (150000000)
  • 20 min.

  • Documents: 26,759,399
  • Sentences: 163,021,640
  • Tokens: 3,386,661,350

  • split: 1, no split
  • 1-grams (not unique): 27,261,960
    (already unique because there is no split)

  • Files:
    • nGram.out.1.heap.50.s01.0001-0892 (440 Mb)
  • param:
    • 11,1,01,NO,NO
  • 1 min.

  • Group Alphabetically
  • 1-gram (unique): 27,261,960

  • Files:
    • ${NGram}.g01.NO-NO (440 Mb)
  • param: 12, 1, 30
  • 1 min.

  • 1-gram (WC >= 30): 976,872

  • File:
    • 1-gram.${YEAR}.30 (16 Mb)
  • param: 13, 1, 30
  • 1 min.

  • 1-gram (sorted): 976,872

  • File:
    • 1-gram.${YEAR}.30.dwt (16 Mb)
bigrams (n=2), approx. 1.5 hr.
  • param: 10,2, (150000000)
  • 1.1 hr.

  • split: 2
  • 2-gram (not unique): 300,462,134

  • Files:
    • s01.0001-0580 (3.1 Gb)
    • s02.0581-0892 (3.0 Gb)
  • param: see file names below
    • 11,2,01,NO,a
    • 11,2,02,a,NO
  • 20 min.

  • Group Alphabetically
  • 2-gram (unique, use wc -l): 258,150,841

  • Files:
    • ${NGram}.g01.NO-a (1.9 Gb)
    • ${NGram}.g02.a-NO (3.4 Gb)
  • param: 12, 2, 30
  • 2 min.

  • 2-gram (WC >= 30): 5,722,210

  • File:
    • 2-gram.${YEAR}.30 (122 Mb)
  • param: 13, 2, 30
  • 1 min.

  • 2-gram (sorted): 5,722,210

  • File:
    • 2-gram.${YEAR}.30.dwt (122 Mb)
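
The group step (Option 11) can be sketched in the same spirit. The bigram row above regroups the two split files into g01.NO-a and g02.a-NO. The sketch below assumes that a suffix such as NO-a means "no lower bound, up to a", that keys compare in plain code-point order (uppercase before lowercase, which is consistent with boundaries such as U-d in the trigram row), and that a key equal to a boundary goes to the upper group. The function name and these conventions are illustrative and are not taken from GroupSpliteNGrams.

```python
from bisect import bisect_right
from collections import Counter
from typing import Iterable, List


def group_by_range(parts: Iterable[Counter], boundaries: List[str]) -> List[Counter]:
    """Merge partial tables into len(boundaries) + 1 alphabetic groups.

    boundaries must be sorted; ["a"] gives two groups, keys < "a" and keys >= "a".
    """
    groups = [Counter() for _ in range(len(boundaries) + 1)]
    for part in parts:
        for gram, count in part.items():
            # Every n-gram maps to exactly one group, so counts coming from
            # different split files are summed and each entry becomes unique.
            groups[bisect_right(boundaries, gram)][gram] += count
    return groups


split_tables = [Counter({"a b": 2, "B c": 1}), Counter({"a b": 1, "z y": 4})]
grouped = group_by_range(split_tables, boundaries=["a"])
```

Here `B c` lands in the first group and `a b` (now with a merged count of 3) and `z y` land in the second, so the number of entries after grouping is the number of distinct n-grams.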
trigrams (n=3), approx. 7 hr.
  • param: 10,3, (150000000)
  • 2.4 hr.

  • split: 9
  • 3-gram (not unique): 1,260,815,630

  • Files:
    • s01.0001-0207 (3.8 Gb)
    • s02.0208-0322 (3.7 Gb)
    • s03.0323-0420 (3.8 Gb)
    • s04.0421-0557 (3.8 Gb)
    • s05.0558-0644 (3.8 Gb)
    • s06.0645-0727 (3.8 Gb)
    • s07.0728-0800 (3.8 Gb)
    • s08.0801-0870 (3.8 Gb)
    • s09.0871-0892 (1.4 Gb)
  • param: see file names below
  • 4.5 hr.

  • Group Alphabetically
  • 3-gram (unique): 887,664,290

  • Files:
    • g01.NO-U (4.9 Gb)
    • g02.U-d (4.3 Gb)
    • g03.d-k (4.2 Gb)
    • g04.k-re (4.3 Gb)
    • g05.re-NO (4.8 Gb)
  • param: 12, 3, 30
  • 10 min.

  • 3-gram (WC >= 30): 8,096,532

  • File:
    • 3-gram.${YEAR}.30 (207 Mb)
  • param: 13, 3, 30
  • 1 min.

  • 3-gram (sorted): 8,096,532

  • File:
    • 3-gram.${YEAR}.30.dwt (207 Mb)
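
The filter-and-combine step (Option 12) takes the alphabetic group files and writes a single file of n-grams with WC >= 30. A minimal sketch, assuming each group is an in-memory mapping from n-gram to word count (the real FilterWcCombineNGrams program works on files, and the function name here is hypothetical):

```python
from typing import Dict, Iterable


def filter_and_combine(groups: Iterable[Dict[str, int]],
                       min_wc: int = 30) -> Dict[str, int]:
    """Keep n-grams whose word count is at least min_wc, across all groups."""
    combined: Dict[str, int] = {}
    for group in groups:  # groups are assumed to hold disjoint key ranges
        for gram, wc in group.items():
            if wc >= min_wc:
                combined[gram] = wc
    return combined


kept = filter_and_combine([{"a b c": 40, "a b d": 29}, {"x y z": 30}])
```

Because the groups cover disjoint key ranges, combining them is a plain concatenation. Filtering first keeps the combined result small, which matches the drop from several gigabytes of group files to a few hundred megabytes in the rows above.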
fourgrams (n=4), approx. 10.5 hr.
  • param: 10,4, (130000000)
  • 3 hr.

  • split: 17
  • 4-gram (not unique): 2,127,650,711

  • Files:
    • s01.0001-0076 (3.9 Gb)
    • s02.0077-0199 (3.9 Gb)
    • s03.0200-0270 (3.9 Gb)
    • s04.0271-0317 (3.9 Gb)
    • s05.0318-0370 (4.0 Gb)
    • s06.0371-0417 (4.0 Gb)
    • s07.0418-0515 (4.0 Gb)
    • s08.0516-0557 (3.9 Gb)
    • s09.0558-0604 (4.0 Gb)
    • s10.0605-0648 (4.0 Gb)
    • s11.0649-0695 (4.0 Gb)
    • s12.0696-0734 (4.0 Gb)
    • s13.0735-0772 (4.0 Gb)
    • s14.0773-0810 (4.0 Gb)
    • s15.0811-0847 (4.0 Gb)
    • s16.0848-0885 (4.0 Gb)
    • s17.0886-0892 (816 Mb)
  • param: see file names below
  • 7.2 hr.

  • Group Alphabetically
  • 4-gram (unique): 1,650,912,612

  • Files:
    • g01.NO-F (4.8 Gb)
    • g02.F-ab (4.9 Gb)
    • g03.ab-b (4.6 Gb)
    • g04.b-d (4.5 Gb)
    • g05.d-fq (4.7 Gb)
    • g06.fq-is (4.7 Gb)
    • g07.is-o (4.2 Gb)
    • g08.o-pm (4.4 Gb)
    • g09.pm-si (4.4 Gb)
    • g10.si-th (2.5 Gb)
    • g11.th-u (4.3 Gb)
    • g12.u-NO (2.9 Gb)
  • param: 12, 4, 30
  • 20 min.

  • 4-gram (WC >= 30): 5,044,153

  • File:
    • 4-gram.${YEAR}.30 (152 Mb)
  • param: 13, 4, 30
  • 1 min.

  • 4-gram (sorted): 5,044,153

  • File:
    • 4-gram.${YEAR}.30.dwt (152 Mb)
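
The sort step (Option 13) is described only as "dwt, tdw". The sketch below rests on an unconfirmed reading: that the letters name a field order, with d = document count, w = word count, and t = term. The pipe delimiter, the sort keys, and the function names are likewise assumptions for illustration and do not describe NGramFilter itself.

```python
from typing import List, Tuple

# (term, document count, word count); the field meanings are assumed
Record = Tuple[str, int, int]


def to_dwt(records: List[Record]) -> List[str]:
    """Order by descending document count, then word count, then term."""
    ordered = sorted(records, key=lambda r: (-r[1], -r[2], r[0]))
    return [f"{d}|{w}|{t}" for t, d, w in ordered]


def to_tdw(records: List[Record]) -> List[str]:
    """Order alphabetically by term (code-point order)."""
    return [f"{t}|{d}|{w}" for t, d, w in sorted(records)]


records = [("of the", 5, 9), ("in the", 5, 12), ("a b", 7, 7)]
dwt_lines = to_dwt(records)
tdw_lines = to_tdw(records)
```

Sorting changes only the order of the lines, which agrees with the rows above: the sorted files have the same n-gram counts and sizes as the filtered files.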
fivegrams (n=5), approx. 17 hr.
  • param: 10,5, (120000000)
  • 3.6 hr.

  • split: 21
  • 5-gram (not unique): 2,448,680,409

  • Files:
    • s01.0001-0064 (4.3 Gb)
    • s02.0065-0112 (4.4 Gb)
    • s03.0113-0232 (4.3 Gb)
    • s04.0233-0279 (4.4 Gb)
    • s05.0280-0316 (4.4 Gb)
    • s06.0317-0360 (4.4 Gb)
    • s07.0361-0398 (4.4 Gb)
    • s08.0399-0481 (4.3 Gb)
    • s09.0482-0523 (4.4 Gb)
    • s10.0524-0557 (4.4 Gb)
    • s11.0558-0596 (4.4 Gb)
    • s12.0597-0632 (4.4 Gb)
    • s13.0633-0668 (4.4 Gb)
    • s14.0669-0704 (4.4 Gb)
    • s15.0705-0735 (4.4 Gb)
    • s16.0736-0766 (4.4 Gb)
    • s17.0767-0797 (4.4 Gb)
    • s18.0798-0827 (4.4 Gb)
    • s19.0828-0857 (4.4 Gb)
    • s20.0858-0888 (4.4 Gb)
    • s21.0889-0892 (619 Mb)
  • param: see file names below
  • 13 hr.

  • Group Alphabetically
  • 5-gram (unique): 2,138,854,513

  • Files:
    • g01.NO-A (3.8 Gb)
    • g02.A-M (4.5 Gb)
    • g03.M-a (4.6 Gb)
    • g04.a-and (3.6 Gb)
    • g05.and-b (4.7 Gb)
    • g06.b-cf (3.5 Gb)
    • g07.cf-d (3.3 Gb)
    • g08.d-em (3.5 Gb)
    • g09.em-g (4.3 Gb)
    • g10.g-inc (4.6 Gb)
    • g11.inc-m (4.2 Gb)
    • g12.m-o (4.1 Gb)
    • g13.o-on (3.7 Gb)
    • g14.on-pp (3.9 Gb)
    • g15.pp-s (4.6 Gb)
    • g16.s-t (4.5 Gb)
    • g17.t-thf (4.8 Gb)
    • g18.thf-v (3.8 Gb)
    • g19.v-NO (3.9 Gb)
  • param: 12, 5, 30
  • 25 min.

  • 5-gram (WC >= 30): 1,812,223

  • File:
    • 5-gram.${YEAR}.30 (64 Mb)
  • param: 13, 5, 30
  • 1 min.

  • 5-gram (sorted): 1,812,223

  • File:
    • 5-gram.${YEAR}.30.dwt (64 Mb)
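
To put the WC >= 30 filter in perspective, the share of unique n-grams that survives it can be computed from the counts reported in the rows above. The snippet is only a convenience calculation; the variable and function names are illustrative.

```python
# n -> (unique n-grams after grouping, n-grams with WC >= 30), copied from the rows above
COUNTS = {
    1: (27_261_960, 976_872),
    2: (258_150_841, 5_722_210),
    3: (887_664_290, 8_096_532),
    4: (1_650_912_612, 5_044_153),
    5: (2_138_854_513, 1_812_223),
}


def retained_percent(unique: int, kept: int) -> float:
    """Percentage of unique n-grams that pass the word-count filter."""
    return 100.0 * kept / unique


for n, (unique, kept) in COUNTS.items():
    print(f"{n}-gram: {retained_percent(unique, kept):.2f}% of unique n-grams have WC >= 30")
```

The retained share falls steadily as n grows, from roughly 3.6% of unique unigrams to under 0.1% of unique five-grams, because longer word sequences repeat far less often.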