MEDLINE N-Gram Set

The MEDLINE N-gram Set 2025: by Split, Group, Filter, Combine and Sort Algorithm

The MEDLINE n-gram set, generated by the split, group, filter, combine and sort (SGFCS) algorithm, is listed below. For each MEDLINE record, the title and abstract are used as the source of n-grams. They are combined, split into sentences, and then tokenized into words (using spaces as word boundaries). Finally, n-grams are generated, and terms with more than 50 characters or a total word count below 30 are filtered out. The specifications for generating these n-grams are as follows:

  • MEDLINE: 2025 - TI and AB (from MEDLINE Baseline Repository - MBR, pubmed25nXXXX.xml -> PmidTiAbS25nXXXX.txt: 1 ~ 1274)
  • Method: Split, Group, Filter, Combine and Sort Algorithm
  • Max. Character Size: 50
  • Min. word count: 30
  • Min. document count: 1

  • Total document count: 38,201,553
  • Total sentence count: 270,098,242
  • Total token count: 5,676,864,905
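
The generation steps described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the SGFCS algorithm itself, which exists to handle a corpus of this size. The sentence splitter, the function names, and the upper limit of five words per n-gram (taken from the download list) are assumptions made for the example, and no case or punctuation normalization is applied.

```python
import re
from collections import Counter

MAX_CHARS = 50        # "Max. Character Size" from the specifications
MIN_WORD_COUNT = 30   # "Min. word count" from the specifications
MIN_DOC_COUNT = 1     # "Min. document count" from the specifications
MAX_N = 5             # assumption: the download list stops at five-grams


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def ngrams(tokens, n):
    # All contiguous runs of n tokens, joined back with single spaces.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(documents, max_n=MAX_N):
    """Return (document count, word count) counters for all n-grams."""
    word_count = Counter()  # total occurrences across the corpus
    doc_count = Counter()   # number of documents containing the n-gram
    for text in documents:
        seen = set()
        for sentence in split_sentences(text):
            tokens = sentence.split()  # whitespace as the word boundary
            for n in range(1, max_n + 1):
                for gram in ngrams(tokens, n):
                    word_count[gram] += 1
                    seen.add(gram)
        doc_count.update(seen)  # each document counts at most once per n-gram
    return doc_count, word_count


def filter_ngrams(doc_count, word_count, max_chars=MAX_CHARS,
                  min_wc=MIN_WORD_COUNT, min_dc=MIN_DOC_COUNT):
    """Keep n-grams within the length limit and at or above both thresholds."""
    return {
        gram: (doc_count[gram], wc)
        for gram, wc in word_count.items()
        if len(gram) <= max_chars and wc >= min_wc and doc_count[gram] >= min_dc
    }
```

Each input string stands for one record's combined title and abstract. N-grams never cross sentence boundaries in this sketch, because they are built one sentence at a time.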

  • N-gram files
    • File format - 3 fields:
      Document count | Word count | N-gram
    • Each n-gram file is sorted by document count, then word count, then alphabetical order of the n-grams. The combined N-gram Set is not sorted; it can be sorted with the nGramUtil package.
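
As a rough illustration of the file format and sort order, the sketch below parses three-field records and orders them by document count, word count, and n-gram. The `|` field separator and the descending order of the two counts are assumptions (the page names the fields and the sort keys but not the delimiter or direction), and `parse_line` and `sort_records` are hypothetical helpers, not part of the nGramUtil package.

```python
def parse_line(line, sep="|"):
    """Parse one record into (document count, word count, n-gram).

    Assumes fields are separated by `sep`. The split is limited to two
    separators so that any separator inside the n-gram text is preserved.
    """
    doc_count, word_count, gram = line.rstrip("\n").split(sep, 2)
    return int(doc_count), int(word_count), gram


def sort_records(records):
    """Sort by document count, then word count, then n-gram.

    Assumption: both counts are sorted in descending order, and ties are
    broken by ascending alphabetical order of the n-gram.
    """
    return sorted(records, key=lambda r: (-r[0], -r[1], r[2]))


lines = [
    "2|5|heart failure",
    "3|4|of the",
    "2|5|blood pressure",
    "2|7|in the",
]
records = sort_records(parse_line(line) for line in lines)
```

The sample lines are made-up values for illustration only and do not come from the published files.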

  • Download:
    N-grams               File                     Zip Size  Actual Size  No. of n-grams
    Unigrams              1-gram.2025.tgz          9.6 MB    24 MB        1,441,038
    Bigrams               2-gram.2025.tgz          64 MB     189 MB       8,825,402
    Trigrams              3-gram.2025.tgz          104 MB    344 MB       13,303,488
    Four-grams            4-gram.2025.tgz          77 MB     270 MB       8,817,816
    Five-grams            5-gram.2025.tgz          40 MB     142 MB       3,982,724
    N-gram Set            nGramSet.2025.30.tgz     293 MB    967 MB       36,370,468
    Distilled N-gram Set  distilledNGram.2025.tgz  119 MB    392 MB       14,722,972
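
Because the larger archives expand to several hundred megabytes, it can be convenient to read a downloaded file line by line without unpacking it first. The following is a minimal sketch using Python's standard `tarfile` module. The helper name `iter_ngram_lines` is hypothetical, and the UTF-8 encoding and the layout of files inside the archive are assumptions, so the function simply reads every regular file it finds.

```python
import tarfile


def iter_ngram_lines(archive_path, encoding="utf-8"):
    """Yield text lines from every regular file inside a .tgz archive.

    Assumptions: the archive is gzip-compressed tar and its members are
    text files in the given encoding.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            stream = archive.extractfile(member)
            for raw in stream:
                yield raw.decode(encoding).rstrip("\n")
```

A call such as `iter_ngram_lines("1-gram.2025.tgz")` would then stream the unigram records one at a time, keeping memory use small regardless of file size.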