The SPECIALIST Lexicon

The MEDLINE.2016 N-gram Set

This page describes how n-grams (n = 1-5) are generated from MEDLINE.2016 using the split-and-combine algorithm.

Program | N | Approx. Time (Hr.)
Option 1
  • GenPmidTiAbSentenceFiles
  • PmidTiAbSentences{YY}n{DDDD}.txt
Option 10
  • Gen split n-gram
  • GetNGramFromSentenceFiles
  • MAX_CL = 50

  • 1.Split:
Option 11
  • Group split n-grams by alphabetic characters
  • GroupSpliteNGrams

  • 2.Group:
Option 12
  • Filter by WC and combine alphabetic n-grams
  • FilterWcCombineNGrams
  • WC = 30

  • 3.FilterCombine:
Option 13
  • Sort n-grams by dwt, tdw
  • NGramFilter

  • 3.FilterCombine:
Preprocess 1
  • ~ 1 hr.
  • PmidTiAbS15: 1-812
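
The split step (option 10) extracts n-grams from the sentence files. The following is a minimal Python sketch of that idea, not the actual GetNGramFromSentenceFiles program. It assumes whitespace tokenization and that MAX_CL is a cap on the character length of an n-gram; neither assumption is confirmed by this page, and the function name `ngrams_from_sentence` is hypothetical.

```python
def ngrams_from_sentence(sentence, n, max_cl=50):
    """Return the n-grams of one sentence using a sliding token window.

    Assumptions (not confirmed by the source page): tokens are split on
    whitespace, and max_cl is the maximum character length of an n-gram.
    """
    tokens = sentence.split()
    result = []
    # Slide a window of n tokens across the sentence.
    for i in range(len(tokens) - n + 1):
        ngram = " ".join(tokens[i:i + n])
        if len(ngram) <= max_cl:
            result.append(ngram)
    return result


if __name__ == "__main__":
    print(ngrams_from_sentence("heart rate variability", 2))
```

A sentence with fewer than n tokens yields no n-grams, so short titles contribute only to the lower-order sets.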
    
unigrams | n=1 | 0.5 hr.
  • param: 10,1, (150000000)
  • 23 min.

  • Documents: 24,358,442
  • Sentences: 143,471,776
  • Tokens: 2,971,013,236

  • split: 1, no split
  • 1-gram (not unique): 24,121,470
    (already unique because there is no split)

  • Files:
    • nGram.out.1.heap.50.s01.0001-0812 (389 Mb)
  • param:
    • 11,1,01,NO,NO
  • 1 min.

  • Group Alphabetically
  • 1-gram (unique): 24,121,470

  • Files:
    • ${NGram}.g01.NO-NO (389 Mb)
  • param: 12, 1, 30
  • 1 min

  • 1-gram (WC >= 30): 883,287

  • File:
    • 1-gram.2016.30 (15 Mb)
  • param: 13, 1, 30
  • 1 min.

  • 1-gram (sorted): 883,287

  • File:
    • 1-gram.2016.30.dwt (15 Mb)
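
The "not unique" counts reported for the split step can be illustrated with a small sketch. It assumes that the number in parentheses in the option 10 parameters (for example 150000000) is a cap on how many distinct n-grams are held in memory before a split file is written; that reading is an assumption, and `count_in_splits` is a hypothetical name.

```python
from collections import Counter


def count_in_splits(ngram_batches, max_ngrams):
    """Count n-grams batch by batch, starting a new split when the cap is hit.

    ngram_batches: iterable of lists of n-grams (e.g. one list per sentence file).
    max_ngrams: hypothetical cap on distinct n-grams held in memory at once.
    Returns a list of Counters, one per split.
    """
    splits = []
    current = Counter()
    for batch in ngram_batches:
        current.update(batch)
        if len(current) >= max_ngrams:
            splits.append(current)  # "write" the split and start a new one
            current = Counter()
    if current:
        splits.append(current)
    return splits


if __name__ == "__main__":
    splits = count_in_splits([["a", "b"], ["a", "c"], ["a"]], max_ngrams=3)
    # "a" occurs in both splits, so the summed split sizes exceed the unique count.
    print(sum(len(s) for s in splits), len(set().union(*splits)))
```

With a single split, as for the unigrams, every n-gram appears once in the output, which is why the unique and not-unique counts are both 24,121,470. With several splits the same n-gram can appear in more than one split file, so the not-unique count is larger.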
bigrams | n=2 | 1.5 hr.
  • param: 10,2, (150000000)
  • 1 hr.

  • split: 2
  • 2-gram (not unique): 267,460,094

  • Files:
    • s01.0001-0591 (3.1 Gb)
    • s02.0592-0812 (2.4 Gb)
  • param: see file names below
    • 11,2,01,NO,a
    • 11,2,02,a,NO
  • 19 min.

  • Group Alphabetically
  • 2-gram (unique): 229,691,126

  • Files:
    • ${NGram}.g01.NO-a (1.7 Gb)
    • ${NGram}.g02.a-NO (3.1 Gb)
  • param: 12, 2, 30
  • 4 min.

  • 2-gram (WC >= 30): 5,114,547

  • File:
    • 2-gram.2016.30 (109 Mb)
  • param: 13, 2, 30
  • 1 min.

  • 2-gram (sorted): 5,114,547

  • File:
    • 2-gram.2016.30.dwt (109 Mb)
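
The group step (option 11) redistributes the split files into alphabetic ranges such as NO-a and a-NO, so that every occurrence of an n-gram lands in the same group and its counts can be summed. Below is a hedged sketch of that routing. It assumes that "NO" means "no bound", that ranges include their lower boundary, and that ordering is by character code (uppercase before lowercase, which matches range names such as NO-U followed by U-d). The name `group_splits` is hypothetical.

```python
import bisect
from collections import Counter


def group_splits(splits, boundaries):
    """Merge per-split counts into alphabetic groups.

    boundaries: sorted list of range starts. ["a"] gives two groups,
    (< "a") and (>= "a"); ["U", "d", "k", "re"] gives five groups.
    Treating the lower boundary as inclusive is an assumption.
    """
    groups = [Counter() for _ in range(len(boundaries) + 1)]
    for split in splits:
        for ngram, count in split.items():
            # bisect_right sends an n-gram equal to a boundary to the upper group.
            index = bisect.bisect_right(boundaries, ngram)
            groups[index][ngram] += count
    return groups


if __name__ == "__main__":
    s01 = Counter({"Blood pressure": 2, "of the": 5})
    s02 = Counter({"of the": 7, "a patient": 1})
    for group in group_splits([s01, s02], ["a"]):
        print(dict(group))
```

Because each n-gram is confined to one group, every group file can be made unique independently of the others.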
trigrams | n=3 | 5 hr.
  • param: 10,3, (150000000)
  • 2 hr.

  • split: 8
  • 3-gram (not unique): 1,092,117,562

  • Files:
    • s01.0001-0298 (3.8 Gb)
    • s02.0299-0403 (3.7 Gb)
    • s03.0404-0492 (3.7 Gb)
    • s04.0493-0573 (3.7 Gb)
    • s05.0574-0651 (3.8 Gb)
    • s06.0652-0725 (3.7 Gb)
    • s07.0726-0797 (3.8 Gb)
    • s08.0798-0812 (953 Mb)
  • param: see file names below
  • 2.5 hr.

  • Group Alphabetically
  • 3-gram (unique): 788,417,523

  • Files:
    • g01.NO-U (4.2 Gb)
    • g02.U-d (3.9 Gb)
    • g03.d-k (3.7 Gb)
    • g04.k-re (3.9 Gb)
    • g05.re-NO (4.3 Gb)
  • param: 12, 3, 30
  • 10 min.

  • 3-gram (WC >= 30): 7,134,807

  • File:
    • 3-gram.2016.30 (182 Mb)
  • param: 13, 3, 30
  • 1 min.

  • 3-gram (sorted): 7,134,807

  • File:
    • 3-gram.2016.30.dwt (182 Mb)
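
The filter-and-combine step (option 12) keeps only n-grams with WC >= 30 and merges the alphabetic groups into one file. A minimal sketch follows, assuming each group is already unique and is represented as a mapping from n-gram to word count; `filter_and_combine` is a hypothetical name, not the FilterWcCombineNGrams program itself.

```python
def filter_and_combine(groups, min_wc=30):
    """Keep n-grams whose word count is at least min_wc and merge all groups.

    groups: iterable of dicts mapping n-gram -> word count. The groups are
    assumed to be disjoint, as they would be after alphabetic grouping.
    """
    combined = {}
    for group in groups:
        for ngram, wc in group.items():
            if wc >= min_wc:
                combined[ngram] = wc
    return combined


if __name__ == "__main__":
    groups = [{"as well as": 30, "a rare case": 29}, {"the effect of": 100}]
    print(filter_and_combine(groups))
```

Filtering before combining is what makes the final file small: for the 3-grams, the five group files of about 4 Gb each reduce to a single 182 Mb file.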
fourgrams | n=4 | 10 hr.
  • param: 10,4, (130000000)
  • 2.5 hr.

  • split: 15
  • 4-gram (not unique): 1,855,460,574

  • Files:
    • s01.0001-0226 (4.0 Gb)
    • s02.0227-0296 (3.9 Gb)
    • s03.0297-0351 (3.9 Gb)
    • s04.0352-0400 (4.0 Gb)
    • s05.0401-0446 (4.0 Gb)
    • s06.0447-0490 (3.9 Gb)
    • s07.0491-0533 (4.0 Gb)
    • s08.0534-0574 (4.0 Gb)
    • s09.0575-0614 (3.9 Gb)
    • s10.0615-0654 (4.0 Gb)
    • s11.0655-0693 (4.0 Gb)
    • s12.0694-0731 (4.0 Gb)
    • s13.0732-0769 (4.0 Gb)
    • s14.0770-0806 (4.0 Gb)
    • s15.0807-0812 (653 Mb)
  • param: see file names below
  • 7 hr.

  • Group Alphabetically
  • 4-gram (unique): 1,460,588,176

  • Files:
    • g01.NO-F (4.2 Gb)
    • g02.F-ab (4.2 Gb)
    • g03.ab-b (4.1 Gb)
    • g04.b-d (4.0 Gb)
    • g05.d-fq (4.1 Gb)
    • g06.fq-is (4.2 Gb)
    • g07.is-o (3.7 Gb)
    • g08.o-pm (3.9 Gb)
    • g09.pm-si (3.9 Gb)
    • g10.si-th (2.2 Gb)
    • g11.th-u (3.8 Gb)
    • g12.u-NO (2.5 Gb)
  • param: 12, 4, 30
  • 20 min.

  • 4-gram (WC >= 30): 4,380,474

  • File:
    • 4-gram.2016.30 (132 Mb)
  • param: 13, 4, 30
  • 1 min.

  • 4-gram (sorted): 4,380,474

  • File:
    • 4-gram.2016.30.dwt (132 Mb)
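
The final step (option 13) sorts the filtered n-grams "by dwt". The sketch below assumes that "dwt" stands for document count, word count, term, and that counts sort in descending order with ties broken alphabetically by term. Both the expansion of the abbreviation and the sort directions are assumptions, and `sort_dwt` is a hypothetical name.

```python
def sort_dwt(records):
    """Sort (doc_count, word_count, term) records.

    Assumed order: document count descending, then word count descending,
    then term ascending.
    """
    return sorted(records, key=lambda r: (-r[0], -r[1], r[2]))


if __name__ == "__main__":
    records = [(5, 9, "b"), (7, 8, "a"), (5, 9, "a"), (5, 10, "c")]
    for record in sort_dwt(records):
        print(record)
```

Sorting does not change the number of n-grams, which is consistent with the identical counts and file sizes reported before and after this step.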
fivegrams | n=5 | 18 hr.
  • param: 10,5, (120000000)
  • 3 hr.

  • split: 18
  • 5-gram (not unique): 2,143,203,249

  • Files:
    • s01.0001-0208 (4.3 Gb)
    • s02.0209-0268 (4.3 Gb)
    • s03.0269-0317 (4.3 Gb)
    • s04.0318-0359 (4.3 Gb)
    • s05.0360-0398 (4.4 Gb)
    • s06.0399-0435 (4.4 Gb)
    • s07.0436-0470 (4.4 Gb)
    • s08.0471-0505 (4.4 Gb)
    • s09.0506-0539 (4.4 Gb)
    • s10.0540-0572 (4.4 Gb)
    • s11.0573-0605 (4.4 Gb)
    • s12.0606-0638 (4.5 Gb)
    • s13.0639-0670 (4.4 Gb)
    • s14.0671-0702 (4.5 Gb)
    • s15.0703-0733 (4.4 Gb)
    • s16.0734-0764 (4.5 Gb)
    • s17.0765-0794 (4.4 Gb)
    • s18.0795-0812 (2.6 Gb)
  • param: see file names below
  • 12 hr.

  • Group Alphabetically
  • 5-gram (unique): 1,885,969,537

  • Files:
    • g01.NO-A (3.3 Gb)
    • g02.A-M (3.9 Gb)
    • g03.M-a (3.9 Gb)
    • g04.a-and (3.2 Gb)
    • g05.and-b (4.1 Gb)
    • g06.b-cf (3.1 Gb)
    • g07.cf-d (2.9 Gb)
    • g08.d-em (3.1 Gb)
    • g09.em-g (3.8 Gb)
    • g10.g-inc (4.1 Gb)
    • g11.inc-m (3.8 Gb)
    • g12.m-o (3.6 Gb)
    • g13.o-on (3.3 Gb)
    • g14.on-pp (3.4 Gb)
    • g15.pp-s (4.1 Gb)
    • g16.s-t (4.0 Gb)
    • g17.t-thf (4.2 Gb)
    • g18.thf-v (3.3 Gb)
    • g19.v-NO (3.4 Gb)
  • param: 12, 5, 30
  • 30 min.

  • 5-gram (WC >= 30): 1,812,223

  • File:
    • 5-gram.2016.30 (64 Mb)
  • param: 13, 5, 30
  • 1 min.

  • 5-gram (sorted): 1,812,223

  • File:
    • 5-gram.2016.30.dwt (64 Mb)
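
To put the WC >= 30 filter in perspective, the share of unique n-grams that survives it can be computed from the counts reported above. The snippet only restates numbers from this page; the names `UNIQUE`, `KEPT`, and `retention_percent` are hypothetical.

```python
# Unique n-gram counts after grouping, as reported on this page.
UNIQUE = {
    1: 24_121_470,
    2: 229_691_126,
    3: 788_417_523,
    4: 1_460_588_176,
    5: 1_885_969_537,
}

# N-gram counts remaining after the WC >= 30 filter, as reported on this page.
KEPT = {
    1: 883_287,
    2: 5_114_547,
    3: 7_134_807,
    4: 4_380_474,
    5: 1_812_223,
}


def retention_percent(n):
    """Percentage of unique n-grams of order n that pass the WC >= 30 filter."""
    return 100.0 * KEPT[n] / UNIQUE[n]


if __name__ == "__main__":
    for n in sorted(UNIQUE):
        print(f"{n}-gram: {retention_percent(n):.3f}% kept")
    print("total kept:", sum(KEPT.values()))
```

The retained share shrinks as n grows, because longer n-grams repeat less often across the corpus.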