The SPECIALIST Lexicon

The MEDLINE.2025 N-gram Set

This page describes how the n-grams (n = 1-5) in the 2025 MEDLINE n-gram set release are generated from MEDLINE using the split, combine, and filter algorithm. Make sure that all n-grams are generated correctly in step 11 (group) with the correct setup.
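The split, group, and filter-combine steps can be illustrated with a minimal Python sketch. It is not the code behind the programs listed on this page: the function names, the whitespace tokenizer, the `max_entries` flush threshold, and the `boundaries` list are assumptions made for illustration, and only word counts (WC) are tracked.

```python
import bisect
from collections import Counter


def ngrams(tokens, n):
    """Yield each run of n consecutive tokens joined by single spaces."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])


def split_counts(sentences, n, max_entries):
    """Split: count n-grams, flushing a partial table once it holds
    max_entries distinct n-grams so that memory stays bounded."""
    table = Counter()
    for sentence in sentences:
        table.update(ngrams(sentence.split(), n))
        if len(table) >= max_entries:
            yield table
            table = Counter()
    if table:
        yield table


def group_counts(split_tables, boundaries):
    """Group: merge the partial tables into one table per alphabetic range.
    The same n-gram can occur in several split tables, but it is unique
    once its counts are merged inside a group."""
    groups = [Counter() for _ in range(len(boundaries) + 1)]
    for table in split_tables:
        for gram, wc in table.items():
            # Code-point order: uppercase letters sort before lowercase ones.
            groups[bisect.bisect_right(boundaries, gram)][gram] += wc
    return groups


def filter_combine(groups, min_wc):
    """Filter and combine: keep n-grams with WC >= min_wc from every group."""
    combined = {}
    for group in groups:
        for gram, wc in group.items():
            if wc >= min_wc:
                combined[gram] = wc
    return combined


# Toy input; the real pipeline reads sentence files and uses WC = 30.
sentences = ["the cat sat", "the cat ran", "a cat sat"]
splits = list(split_counts(sentences, n=2, max_entries=2))
groups = group_counts(splits, boundaries=["c"])
kept = filter_combine(groups, min_wc=2)
print(sum(len(t) for t in splits), "split entries (not unique)")
print(sum(len(g) for g in groups), "grouped entries (unique)")
print(kept)
```

The sketch shows why the split counts reported on this page are "not unique" while the grouped counts are: an n-gram that appears in several split tables is counted once per table until the group step merges it.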
The data in these tables come from:

I. Log and data files

Description | Location | Notes
Input options for runGen${N}GramAll | ${MULTIWORDS}/bin/02.NGramGenAll/inData/${YEAR}/${N}-gram
  • Parameters for options 10-13
runGen${N}GramAll log | ${MULTIWORDS}/bin/02.NGramGenAll/log.${N}
  • Run time
Log files for options 10-13 | ${MULTIWORDS}/bin/02.NGramGenAll/logData/${YEAR}/${N}-gram
  • Detailed logs for options 10-13
  • Not used in the table below
1.Split log | ${MULTIWORDS}/bin/Log.${YEAR}/02.NGramGen/log.heap.${N}.50
  • Total document, sentence, and token counts
  • Unique unigram count
  • Non-unique split n-gram count
N-gram output files | ${MULTIWORDS}/data/${YEAR}/outData/02.NGram/${N}-gram/*
  • Result files from Split, Group, and FilterCombine
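The locations above are templates in `${MULTIWORDS}`, `${YEAR}`, and `${N}`. As a small convenience sketch, Python's `string.Template` can expand them for a given run; the dictionary keys, the `resolve` helper, and the example root directory are hypothetical and not part of the n-gram tools.

```python
from string import Template

# Location templates copied from the table above; the short keys are
# illustrative labels only.
LOCATIONS = {
    "input options": "${MULTIWORDS}/bin/02.NGramGenAll/inData/${YEAR}/${N}-gram",
    "run log": "${MULTIWORDS}/bin/02.NGramGenAll/log.${N}",
    "option logs": "${MULTIWORDS}/bin/02.NGramGenAll/logData/${YEAR}/${N}-gram",
    "split log": "${MULTIWORDS}/bin/Log.${YEAR}/02.NGramGen/log.heap.${N}.50",
    "output files": "${MULTIWORDS}/data/${YEAR}/outData/02.NGram/${N}-gram/*",
}


def resolve(multiwords, year, n):
    """Expand every template for one installation root, year, and n."""
    values = {"MULTIWORDS": multiwords, "YEAR": year, "N": n}
    return {
        name: Template(path).substitute(values)
        for name, path in LOCATIONS.items()
    }


# Hypothetical installation root; replace with the real ${MULTIWORDS} value.
paths = resolve("/path/to/multiwords", 2025, 3)
for name, path in paths.items():
    print(f"{name}: {path}")
```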

II. Detail logs

Program | N | Approx. Time (Hr.)
Option 1.1
  • GenPmidTiAbSentenceFromXmls
  • pubmed{YY}n{DDDD}.xml
  • PmidTiAbSentences{YY}n{DDDD}.txt
Option 10
  • Gen split n-gram
  • GetNGramFromSentenceFiles
  • MAX_CL = 50

  • 1.Split:
Option 11
  • Group split n-grams by alphabetic characters
  • GroupSpliteNGrams

  • 2.Group:
Option 12
  • Filter by WC and combine alphabetic n-grams
  • FilterWcCombineNGrams
  • WC = 30

  • 3.FilterCombine:
Option 13
  • Sort n-grams by dwt, tdw
  • NGramFilter

  • 3.FilterCombine:
Preprocess 4.0
  • ~5.0 hr (~300 files/hr).
  • PmidTiAbS24n: 0001-1274
    
unigrams | n=1 | < 1.0 hr.
(from ./bin/02.NGramGenAll/logData/${YEAR}/N-gram/*.log)
 
  • param: 10,1, (150000000)
  • 50 min.


    from ./bin/Log.${YEAR}/02.NGramGen/log.heap.1.50:

  • Documents: 38,201,553
  • Sentences: 270,098,242
  • Tokens: 5,676,864,905

  • split: 1, no split
  • 1-grams (not unique, from log.heap.1.50): 46,147,938
    (it is unique because there is no split; use wc -l)

  • Files:
    • nGram.out.1.heap.50.s01.0001-1274 (760 MB, use ls -alh)
  • param:
    • 11,1,01,NO,NO
  • 2 min.

  • Group Alphabetically
  • 1-gram (unique): 46,147,938

  • Files:
    • ${NGram}.g01.NO-NO (760 MB|46MB, from ./logData/${YEAR}/1-gram/11-1.log)
  • param: 12, 1, 30
  • 1 min

  • 1-gram (WC >= 30): 1,441,038

  • File:
    • 1-gram.${YEAR}.30 (24 MB)
  • param: 13, 1, 30
  • 1 min.

  • 1-gram (sorted): 1,441,038

  • File:
    • 1-gram.${YEAR}.30.dwt (24 MB)
bigrams | n=2 | 7.1 hr.
  • param: 10,2, (150000000)
  • 3.0 hr.

  • split: 4
  • 2-gram (not unique from log.heap.2.50): 548,125,974

  • Files:
    • s01.0001-0583 (3.1 GB)
    • s02.0584-0890 (3.0 GB)
    • s03.0891-1139 (3.0 GB)
    • s04.1140-1274 (2.0 GB)
  • param: see file names below
    • 11,2,01,NO,M
    • 11,2,02,M,k
    • 11,2,03,k,NO
  • 1.0 hr.

  • Group Alphabetically
  • 2-gram (unique, use wc -l): 407,154,719

  • Files:
    • ${NGram}.g01.NO-M (2.3GB|120MB)
    • ${NGram}.g02.M-k (3.3GB|157MB)
    • ${NGram}.g03.k-NO (2.8GB|128MB)
  • param: 12, 2, 30
  • 3.0 hr.

  • 2-gram (WC >= 30): 8,825,402

  • File:
    • 2-gram.${YEAR}.30 (189 MB)
  • param: 13, 2, 30
  • 2 min.

  • 2-gram (sorted): 8,825,402

  • File:
    • 2-gram.${YEAR}.30.dwt (189 MB)
trigrams | n=3 | 14.0 hr.
  • param: 10,3, (150000000)
  • 4.5 hr.

  • split: 14
  • 3-gram (not unique - from log.heap.3.50): 2,190,857,442

  • Files:
    • s01.0001-0151 (3.5 GB)
    • s02.0152-0309 (3.5 GB)
    • s03.0310-0400 (3.5 GB)
    • s04.0401-0532 (3.5 GB)
    • s05.0533-0614 (3.5 GB)
    • s06.0615-0692 (3.5 GB)
    • s07.0693-0760 (3.5 GB)
    • s08.0761-0826 (3.5 GB)
    • s09.0827-0890 (3.5 GB)
    • s10.0891-0954 (3.6 GB)
    • s11.0955-1014 (3.5 GB)
    • s12.1015-1073 (3.5 GB)
    • s13.1074-1131 (3.6 GB)
    • s14.1132-1189 (3.6 GB)
    • s15.1190-1246 (3.6 GB)
    • s16.1247-1274 (2.0 GB)
  • param: see file names below
  • 9.0 hr.

  • Group Alphabetically
  • 3-gram (unique, use wc -l): 1,407,583,824

  • Files:
    • g01.NO-E (4.7GB|199MB)
    • g02.E-Z (4.1GB|165MB)
    • g03.Z-c (4.3GB|168MB)
    • g04.c-f (4.4GB|162MB)
    • g05.f-j (4.1GB|156MB)
    • g06.j-o (2.8GB|103MB)
    • g07.o-r (3.8GB|147MB)
    • g08.r-th (3.6GB|133MB)
    • g09.th-NO (4.3GB|170MB)
  • param: 12, 3, 30
  • 0.5 hr.

  • 3-gram (WC >= 30): 13,303,488

  • File:
    • 3-gram.${YEAR}.30 (344 MB)
  • param: 13, 3, 30
  • 3 min.

  • 3-gram (sorted): 13,303,488

  • File:
    • 3-gram.${YEAR}.30.dwt (344 MB)
fourgrams | n=4 | 31.5 hr.
  • param: 10,4, (130000000)
  • 5.5 hr.

  • split: 25
  • 4-gram (not unique - from log.heap.4.50): 3,587,660,090

  • Files:
    • s01.0001-0077 (4.0 GB)
    • s02.0078-0204 (3.9 GB)
    • s03.0205-0272 (3.9 GB)
    • s04.0273-0319 (3.9 GB)
    • s05.0320-0372 (4.0 GB)
    • s06.0373-0419 (4.0 GB)
    • s07.0420-0517 (4.0 GB)
    • s08.0518-0559 (3.9 GB)
    • s09.0560-0605 (3.9 GB)
    • s10.0606-0649 (4.0 GB)
    • s11.0650-0695 (3.9 GB)
    • s12.0696-0734 (4.0 GB)
    • s13.0735-0772 (4.0 GB)
    • s14.0773-0810 (4.0 GB)
    • s15.0811-0847 (4.0 GB)
    • s16.0848-0884 (4.0 GB)
    • s17.0885-0920 (4.0 GB)
    • s18.0921-0958 (4.1 GB)
    • s19.0959-0994 (4.1 GB)
    • s20.0995-1029 (4.1 GB)
    • s21.1030-1064 (4.1 GB)
    • s22.1065-1098 (4.0 GB)
    • s23.1099-1131 (4.0 GB)
    • s24.1132-1165 (4.1 GB)
    • s25.1166-1199 (4.1 GB)
    • s26.1200-1233 (4.1 GB)
    • s27.1234-1266 (4.1 GB)
    • s28.1267-1274 (1.1 GB)
  • param: see file names below
  • 25 hr.

  • Group Alphabetically
  • 4-gram (unique): 2,664,791,419

  • Files:
    • g01.NO-8 (4.6GB|164MB)
    • g02.8-H (4.4GB|144MB)
    • g03.H-S (4.2GB|135MB)
    • g04.S-ad (4.4GB|143MB)
    • g05.ad-anl (4.6GB|151MB)
    • g06.anl-c (4.3GB|139MB)
    • g07.c-d (4.9GB|148MB)
    • g08.d-es (4.4GB|134MB)
    • g09.es-gm (4.6GB|145MB)
    • g10.gm-ine (4.7GB|151MB)
    • g11.ine-m (3.7GB|115MB)
    • g12.m-o (4.5GB|139MB)
    • g13.o-p (4.9GB|167MB)
    • g14.p-r (4.7GB|142MB)
    • g15.r-sh (4.1GB|127MB)
    • g16.sh-th (4.3GB|133MB)
    • g17.th-to (4.3GB|144MB)
    • g18.to-w (4.1GB|134MB)
    • g19.w-NO (3.0GB|101MB)
  • param: 12, 4, 30
  • 1.0 hr.

  • 4-gram (WC >= 30): 8,817,816

  • File:
    • 4-gram.${YEAR}.30 (270 MB)
  • param: 13, 4, 30
  • 3 min.

  • 4-gram (sorted): 8,817,816

  • File:
    • 4-gram.${YEAR}.30.dwt (270 MB)
fivegrams | n=5 | 44.7 hr.
  • param: 10,5, (120000000)
  • 6.0 hr.

  • split: 30
  • 5-gram (not unique): 4,108,847,218

  • Files:
    • s01.0001-0064 (4.3 GB)
    • s02.0065-0112 (4.3 GB)
    • s03.0113-0233 (4.3 GB)
    • s04.0234-0279 (4.3 GB)
    • s05.0280-0316 (4.4 GB)
    • s06.0317-0360 (4.4 GB)
    • s07.0361-0398 (4.4 GB)
    • s08.0399-0482 (4.4 GB)
    • s09.0483-0524 (4.4 GB)
    • s10.0525-0558 (4.4 GB)
    • s11.0559-0597 (4.4 GB)
    • s12.0598-0633 (4.4 GB)
    • s13.0634-0671 (4.4 GB)
    • s14.0672-0705 (4.4 GB)
    • s15.0706-0736 (4.4 GB)
    • s16.0737-0767 (4.4 GB)
    • s17.0768-0798 (4.4 GB)
    • s18.0799-0828 (4.4 GB)
    • s19.0829-0858 (4.4 GB)
    • s20.0859-0889 (4.5 GB)
    • s21.0890-0918 (4.4 GB)
    • s22.0919-0950 (4.5 GB)
    • s23.0951-0979 (4.4 GB)
    • s24.0980-1007 (4.4 GB)
    • s25.1008-1035 (4.4 GB)
    • s26.1036-1063 (4.5 GB)
    • s27.1064-1091 (4.5 GB)
    • s28.1092-1118 (4.4 GB)
    • s29.1119-1145 (4.4 GB)
    • s30.1146-1172 (4.4 GB)
    • s31.1173-1199 (4.4 GB)
    • s32.1200-1226 (4.4 GB)
    • s33.1227-1253 (4.6 GB)
    • s34.1254-1274 (3.5 GB)
  • param: see file names below
  • 40.5 hr.

  • Group Alphabetically
  • 5-gram (unique): 3,504,818,804

  • Files:
    • g01.NO-2 (4.2GB|121MB)
    • g02.2-C (4.3GB|127MB)
    • g03.C-I (4.3GB|115MB)
    • g04.I-R (4.6GB|124MB)
    • g05.R-a (5.0GB|136MB)
    • g06.a-an (5.1GB|139MB)
    • g07.an-ane (5.1GB|144MB)
    • g08.ane-b (3.4GB|93MB)
    • g09.b-c (3.7GB|101MB)
    • g10.c-com (3.6GB|95MB)
    • g11.com-d (3.7GB|94MB)
    • g12.d-ef (4.9GB|127MB)
    • g13.ef-fol (5.1GB|132MB)
    • g14.fol-h (4.4GB|120MB)
    • g15.h-in (3.6GB|93MB)
    • g16.in-int (4.5GB|123MB)
    • g17.int-m (4.8GB|129MB)
    • g18.m-n (4.7GB|123MB)
    • g19.n-of (2.5GB|64MB)
    • g20.of-ofa (5.2GB|152MB)
    • g21.ofa-pl (5.3GB|142MB)
    • g22.pl-re (5.0GB|129MB)
    • g23.re-s (3.9GB|101MB)
    • g24.s-st (4.2GB|111MB)
    • g25.st-the (5.1GB|137MB)
    • g26.the-thea (5.0GB|142MB)
    • g27.thea-toa (3.8GB|108MB)
    • g28.toa-w (4.5GB|119MB)
    • g29.w-NO (5.3GB|150MB)
  • param: 12, 5, 30
  • 1.0 hr.

  • 5-gram (WC >= 30): 3,982,724

  • File:
    • 5-gram.${YEAR}.30 (142 MB)
  • param: 13, 5, 30
  • 2 min.

  • 5-gram (sorted): 3,982,724

  • File:
    • 5-gram.${YEAR}.30.dwt (142 MB)
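
As a quick arithmetic check on the counts above, the share of unique n-grams that survive the WC >= 30 filter can be computed directly. The figures are copied from this page; the variable and function names are illustrative only.

```python
# n -> (unique n-grams after grouping, n-grams kept with WC >= 30),
# copied from the counts reported on this page.
COUNTS = {
    1: (46_147_938, 1_441_038),
    2: (407_154_719, 8_825_402),
    3: (1_407_583_824, 13_303_488),
    4: (2_664_791_419, 8_817_816),
    5: (3_504_818_804, 3_982_724),
}


def kept_percent(unique, kept):
    """Percentage of unique n-grams that pass the word-count filter."""
    return 100.0 * kept / unique


percent = {n: kept_percent(u, k) for n, (u, k) in COUNTS.items()}
for n, p in percent.items():
    print(f"{n}-gram: {p:.2f}% of unique n-grams have WC >= 30")
```

The retained share falls steadily as n grows, which is consistent with longer n-grams repeating less often across MEDLINE.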