The SPECIALIST Lexicon

The MEDLINE.2019 N-gram Set

This page describes the details of generating n-grams (n = 1-5) from MEDLINE.2019 using the split, combine, and filter algorithm. Please make sure all n-grams are generated correctly in step 11 (group) with the correct setup.
The data in these tables come from:

I. Log and data files

Description | Location | Notes
Input options for runGen${N}GramAll | ${MULTIWORDS}/bin/02.NGramGenAll/inData/${YEAR}/${N}-gram
  • Parameters for options 10-13
runGen${N}GramAll log | ${MULTIWORDS}/bin/02.NGramGenAll/log.${N}
  • Run time
Log file for options 10-13 | ${MULTIWORDS}/bin/02.NGramGenAll/logData/${YEAR}/${N}-gram
  • Detailed log for options 10-13
  • Not used in the table below
1.Split log | ${MULTIWORDS}/bin/Log.${YEAR}/02.NGramGen/log.heap.${N}.50
  • Total document, sentence, and token counts
  • Unique unigram count
  • Non-unique split n-gram count
N-gram output files | ${MULTIWORDS}/data/${YEAR}/outData/02.NGram/${N}-gram/*
  • Result files from Split, Group, and FilterCombine
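
The location templates above can be expanded mechanically. The following is a minimal sketch, assuming the placeholders are plain string substitutions; the helper name `resolve`, the dictionary keys, and the root directory `/opt/multiwords` are hypothetical and not part of the original tooling.

```python
from string import Template

# Location templates copied from the table above; the dictionary keys are
# illustrative labels, not names used by the original scripts.
LOCATIONS = {
    "input options": "${MULTIWORDS}/bin/02.NGramGenAll/inData/${YEAR}/${N}-gram",
    "run log": "${MULTIWORDS}/bin/02.NGramGenAll/log.${N}",
    "split log": "${MULTIWORDS}/bin/Log.${YEAR}/02.NGramGen/log.heap.${N}.50",
    "output files": "${MULTIWORDS}/data/${YEAR}/outData/02.NGram/${N}-gram/*",
}


def resolve(name, multiwords, year, n):
    """Fill in ${MULTIWORDS}, ${YEAR}, and ${N} for one table entry."""
    return Template(LOCATIONS[name]).substitute(MULTIWORDS=multiwords, YEAR=year, N=n)


# "/opt/multiwords" is a made-up root directory used only for illustration.
print(resolve("split log", "/opt/multiwords", 2019, 3))
```

The last entry keeps its trailing `*`, so the resolved string is a glob pattern rather than a single file path.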

II. Detail logs

Program | N | Approx. Time (Hr.)
Option 1.1
  • GenPmidTiAbSentenceFromXmls
  • pubmed{YY}n{DDDD}.xml
  • PmidTiAbSentences{YY}n{DDDD}.txt
Option 10
  • Gen split n-gram
  • GetNGramFromSentenceFiles
  • MAX_CL = 50

  • 1.Split:
Option 11
  • Group split n-grams by alphabetic characters
  • GroupSpliteNGrams

  • 2.Group:
Option 12
  • Filter by WC and combine alphabetic n-grams
  • FilterWcCombineNGrams
  • WC = 30

  • 3.FilterCombine:
Option 13
  • Sort n-grams by dwt, tdw
  • NGramFilter

  • 3.FilterCombine:
Preprocess 2.5
  • ~ 2.5 hr.
  • PmidTiAbS19n: 0001-0972
    
unigrams | n=1 | < 1.0 hr.
  • param: 10,1, (150000000)
  • 30 min.

  • Documents: 29,138,919
  • Sentences: 185,619,887
  • Tokens: 3,824,268,997

  • split: 1, no split
  • 1-grams (not unique): 32,469,239
    (unique because there is no split; count with wc -l)

  • Files:
    • nGram.out.1.heap.50.s01.0001-0972 (529 MB, use ls -alh)
  • param:
    • 11,1,01,NO,NO
  • 2 min.

  • Group Alphabetically
  • 1-gram (unique): 32,469,239

  • Files:
    • ${NGram}.g01.NO-NO (529MB|32M)
  • param: 12, 1, 30
  • 1 min

  • 1-gram (WC >= 30): 1,075,227

  • File:
    • 1-gram.${YEAR}.30 (18 MB)
  • param: 13, 1, 30
  • 1 min.

  • 1-gram (sorted): 1,075,227

  • File:
    • 1-gram.${YEAR}.30.dwt (18 MB)
bigrams | n=2 | 2.5 hr.
  • param: 10,2, (150000000)
  • 1.6 hr.

  • split: 3
  • 2-gram (not unique): 363,115,724

  • Files:
    • s01.0001-0584 (3.1 GB)
    • s02.0585-0889 (3.0 GB)
    • s03.0890-0972 (1.3 GB)
  • param: see file names below
    • 11,2,01,NO,c
    • 11,2,02,c,NO
  • 40 min.

  • Group Alphabetically
  • 2-gram (unique, use wc -l): 290,801,313

  • Files:
    • ${NGram}.g01.NO-c (2.7GB|139M)
    • ${NGram}.g02.c-NO (3.2GB|150M)
  • param: 12, 2, 30
  • 5 min.

  • 2-gram (WC >= 30): 6,336,698

  • File:
    • 2-gram.${YEAR}.30 (135 MB)
  • param: 13, 2, 30
  • 2 min.

  • 2-gram (sorted): 6,336,698

  • File:
    • 2-gram.${YEAR}.30.dwt (135 MB)
trigrams | n=3 | 9.0 hr.
  • param: 10,3, (150000000)
  • 3.5 hr.

  • split: 10
  • 3-gram (not unique): 1,434,436,582

  • Files:
    • s01.0001-0209 (3.8 GB)
    • s02.0210-0324 (3.7 GB)
    • s03.0325-0422 (3.8 GB)
    • s04.0423-0559 (3.7 GB)
    • s05.0560-0647 (3.8 GB)
    • s06.0648-0729 (3.8 GB)
    • s07.0730-0801 (3.8 GB)
    • s08.0802-0871 (3.8 GB)
    • s09.0872-0939 (3.8 GB)
    • s10.0940-0972 (2.0 GB)
  • param: see file names below
  • 5.0 hr.

  • Group Alphabetically
  • 3-gram (unique): 993,227,476

  • Files:
    • g01.NO-M (4.1GB|171M)
    • g02.M-b (4.2GB|165M)
    • g03.b-fn (4.1GB|155M)
    • g04.fn-n (4.1GB|158M)
    • g05.n-sh (4.4GB|170M)
    • g06.sh-NO (4.3GB|171M)
  • param: 12, 3, 30
  • 24 min.

  • 3-gram (WC >= 30): 9,078,536

  • File:
    • 3-gram.${YEAR}.30 (233 MB)
  • param: 13, 3, 30
  • 2 min.

  • 3-gram (sorted): 9,078,536

  • File:
    • 3-gram.${YEAR}.30.dwt (233 MB)
fourgrams | n=4 | 19.5 hr.
  • param: 10,4, (130000000)
  • 4.5 hr.

  • split: 19
  • 4-gram (not unique): 2,406,442,157

  • Files:
    • s01.0001-0077 (4.0 GB)
    • s02.0078-0204 (3.9 GB)
    • s03.0205-0272 (3.9 GB)
    • s04.0273-0319 (3.9 GB)
    • s05.0320-0372 (4.0 GB)
    • s06.0373-0419 (4.0 GB)
    • s07.0420-0517 (4.0 GB)
    • s08.0518-0559 (3.9 GB)
    • s09.0560-0606 (4.0 GB)
    • s10.0607-0650 (4.0 GB)
    • s11.0651-0697 (4.0 GB)
    • s12.0698-0736 (4.0 GB)
    • s13.0737-0774 (4.0 GB)
    • s14.0775-0812 (4.0 GB)
    • s15.0813-0849 (4.0 GB)
    • s16.0850-0886 (4.0 GB)
    • s17.0887-0923 (4.0 GB)
    • s18.0924-0960 (4.0 GB)
    • s19.0961-0972 (1.3 GB)
  • param: see file names below
  • 13.5 hr.

  • Group Alphabetically
  • 4-gram (unique): 1,849,662,679

  • Files:
    • g01.NO-C (4.2GB|149M)
    • g02.C-S (4.4GB|142M)
    • g03.S-and (4.4GB|141M)
    • g04.and-bcm (3.4GB|113M)
    • g05.bcm-cs (4.7GB|146M)
    • g06.cs-f (4.0GB|124M)
    • g07.f-i (3.8GB|122M)
    • g08.i-j (3.2GB|102M)
    • g09.j-o (4.3GB|137M)
    • g10.o-pe (4.1GB|140M)
    • g11.pe-ri (4.5GB|136M)
    • g12.ri-th (4.0GB|125M)
    • g13.th-tr (4.0GB|136M)
    • g14.tr-NO (4.0GB|130M)
  • param: 12, 4, 30
  • 30 min.

  • 4-gram (WC >= 30): 5,729,590

  • File:
    • 4-gram.${YEAR}.30 (174 MB)
  • param: 13, 4, 30
  • 2 min.

  • 4-gram (sorted): 5,729,590

  • File:
    • 4-gram.${YEAR}.30.dwt (174 MB)
fivegrams | n=5 | 27.5 hr.
  • param: 10,5, (120000000)
  • 5.0 hr.

  • split: 23
  • 5-gram (not unique): 2,762,785,456

  • Files:
    • s01.0001-0064 (4.3 GB)
    • s02.0065-0112 (4.3 GB)
    • s03.0113-0233 (4.3 GB)
    • s04.0234-0279 (4.3 GB)
    • s05.0280-0316 (4.4 GB)
    • s06.0317-0360 (4.4 GB)
    • s07.0361-0398 (4.4 GB)
    • s08.0399-0482 (4.4 GB)
    • s09.0483-0524 (4.4 GB)
    • s10.0525-0558 (4.4 GB)
    • s11.0559-0597 (4.4 GB)
    • s12.0598-0633 (4.4 GB)
    • s13.0634-0671 (4.4 GB)
    • s14.0672-0705 (4.4 GB)
    • s15.0706-0736 (4.4 GB)
    • s16.0737-0767 (4.4 GB)
    • s17.0768-0798 (4.4 GB)
    • s18.0799-0828 (4.4 GB)
    • s19.0829-0858 (4.4 GB)
    • s20.0859-0889 (4.5 GB)
    • s21.0890-0919 (4.5 GB)
    • s22.0920-0951 (4.5 GB)
    • s23.0952-0972 (3.1 GB)
  • param: see file names below
  • 20.0 hr.

  • Group Alphabetically
  • 5-gram (unique): 2,401,418,219

  • Files:
    • g01.NO-A (4.3GB|128M)
    • g02.A-I (4.1GB|110M)
    • g03.I-T (4.0GB|109M)
    • g04.T-a (2.3GB|61M)
    • g05.a-am (3.4GB|93M)
    • g06.am-ao (4.0GB|112M)
    • g07.ao-c (4.5GB|126M)
    • g08.c-d (5.0GB|131M)
    • g09.d-ep (4.4GB|114M)
    • g10.ep-g (4.3GB|116M)
    • g11.g-in (3.5GB|93M)
    • g12.in-io (3.9GB|105M)
    • g13.io-mk (4.6GB|126M)
    • g14.mk-of (2.9GB|75M)
    • g15.of-og (3.8GB|110M)
    • g16.og-pm (3.8GB|102M)
    • g17.pm-rem (4.4GB|112M)
    • g18.rem-sp (4.0GB|107M)
    • g19.sp-the (3.9GB|104M)
    • g20.the-thf (4.0GB|114M)
    • g21.thf-v (4.2GB|117M)
    • g22.v-NO (4.5GB|126M)
  • param: 12, 5, 30
  • 50 min.

  • 5-gram (WC >= 30): 2,446,765

  • File:
    • 5-gram.${YEAR}.30 (86 MB)
  • param: 13, 5, 30
  • 1 min.

  • 5-gram (sorted): 2,446,765

  • File:
    • 5-gram.${YEAR}.30.dwt (86 MB)
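
The four steps listed for options 10 to 13 (split, group, filter and combine, sort) can be illustrated on a toy corpus. This is a sketch under stated assumptions, not the code of GetNGramFromSentenceFiles or the other programs named above: it assumes WC is an occurrence count, that the number in parentheses after option 10 caps the distinct n-grams held in memory per split, and that group boundaries compare in plain code-point order. The final sort order (descending count, then alphabetic) is also an assumption, since this page does not define dwt or tdw. All function names are hypothetical.

```python
import bisect
from collections import Counter


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as one space-joined string."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])


def split_counts(sentences, n, max_unique):
    """Step 1 (split): count n-grams, starting a new partial table whenever
    the current one reaches max_unique distinct n-grams."""
    parts, current = [], Counter()
    for sentence in sentences:
        for gram in ngrams(sentence.split(), n):
            current[gram] += 1
            if len(current) >= max_unique:
                parts.append(current)
                current = Counter()
    if current:
        parts.append(current)
    return parts


def group_counts(parts, boundaries):
    """Step 2 (group): merge partial tables into alphabetic buckets.
    Bucket i holds n-grams g with boundaries[i-1] <= g < boundaries[i]."""
    buckets = [Counter() for _ in range(len(boundaries) + 1)]
    for part in parts:
        for gram, count in part.items():
            buckets[bisect.bisect_right(boundaries, gram)][gram] += count
    return buckets


def filter_combine(buckets, min_wc):
    """Step 3 (filter and combine): keep n-grams whose count is at least
    min_wc and collect them from all buckets into one table."""
    combined = Counter()
    for bucket in buckets:
        for gram, count in bucket.items():
            if count >= min_wc:
                combined[gram] = count
    return combined


def sort_by_count(table):
    """Step 4 (sort): descending count, ties broken alphabetically (assumed order)."""
    return sorted(table.items(), key=lambda kv: (-kv[1], kv[0]))


sentences = ["the cat sat", "the cat ran", "a dog sat"]
parts = split_counts(sentences, n=2, max_unique=2)
buckets = group_counts(parts, boundaries=["c"])
result = sort_by_count(filter_combine(buckets, min_wc=2))
print(result)  # [('the cat', 2)]
```

Because each n-gram can appear in several split tables, its count is only complete after grouping. That is why the non-unique totals from the split step are larger than the unique totals from the group step, and why the WC threshold is applied only afterwards.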