
The SPECIALIST Lexicon

The MEDLINE N-gram Set Specifications

This page describes the specifications of the MEDLINE n-gram set generated by LSG:

  • How many grams?
    First, we need to decide the range of N. We assessed all terms (valid words) in the Lexicon, under the assumption that the Lexicon is a representative subset of general English. The result shows that terms of up to five words cover 99.47% of the Lexicon, as shown in the following table. Thus, we decided to generate 1- to 5-grams.

    There are 875,090 terms (words) in the Lexicon:

    • Single words: 457,335 (52.2615%)
    • Multiwords: 417,755 (47.7385%)

    N     Word Count            Cumulative Word Count
    1     457,335 (52.2615%)    457,335 (52.2615%)
    2     281,857 (32.2089%)    739,192 (84.4704%)
    3      93,011 (10.6287%)    832,203 (95.0991%)
    4      29,905 (3.4174%)     862,108 (98.5165%)
    5       8,358 (0.9551%)     870,466 (99.4716%)
    6       2,846 (0.3252%)     873,312 (99.7968%)
    7       1,211 (0.1384%)     874,523 (99.9352%)
    8         390 (0.0446%)     874,913 (99.9798%)
    9         104 (0.0119%)     875,017 (99.9917%)
    10         29 (0.0033%)     875,046 (99.9950%)
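
    The cumulative column of the table can be recomputed from the per-N counts and the total of 875,090 terms. The following is a minimal sketch for checking the figures, not the tool LSG used; the function name `cumulative_coverage` is hypothetical.

```python
# Per-N term counts from the table (N = 1..10).
# The N = 5 count (8,358) is inferred from the cumulative column:
# 870,466 - 862,108 = 8,358.
COUNTS = [457335, 281857, 93011, 29905, 8358, 2846, 1211, 390, 104, 29]
TOTAL_TERMS = 875090  # total number of terms in the Lexicon, as stated above


def cumulative_coverage(counts, total):
    """Return one (n, count, cumulative_count, cumulative_percent) tuple per N."""
    rows = []
    running = 0
    for n, count in enumerate(counts, start=1):
        running += count
        rows.append((n, count, running, 100.0 * running / total))
    return rows


if __name__ == "__main__":
    for n, count, running, pct in cumulative_coverage(COUNTS, TOTAL_TERMS):
        print(f"{n:>2}  {count:>9,}  {running:>9,}  {pct:8.4f}%")
```

    The row for N = 5 gives a cumulative count of 870,466, which is the 99.47% coverage that motivates the choice of 1- to 5-grams.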

  • Contents
    Titles and abstracts from MEDLINE.2014

  • Tokenizer
    • Tokenize titles and abstracts into sentences
      • Each title is treated as a separate sentence (by appending a period and a space)
      • 126,612,705 sentences are tokenized
      • 14,314 unrecognized-pattern warnings were reported by the sentence tokenizer
    • Tokenize all sentences into words
      • Space and tab are used as word boundaries
      • 2,610,209,406 words are tokenized
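
    The word-level step can be sketched as follows. This is an illustration under stated assumptions, not LSG's tokenizer: the sentence tokenizer itself is not specified on this page and is left out, case and punctuation are left untouched, and the helper names `title_as_sentence`, `split_words`, and `ngrams` are hypothetical.

```python
import re


def title_as_sentence(title):
    """Treat a title as its own sentence by appending a period and a space."""
    return title + ". "


def split_words(sentence):
    """Split on runs of spaces and tabs only; other characters stay in the tokens."""
    return [token for token in re.split(r"[ \t]+", sentence) if token]


def ngrams(words, max_n=5):
    """Yield (n, n-gram) for every contiguous window of 1..max_n words."""
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            yield n, " ".join(words[start:start + n])


if __name__ == "__main__":
    words = split_words(title_as_sentence("Heart attack\tin adults"))
    for n, gram in ngrams(words, max_n=2):
        print(n, gram)
```

    Because only space and tab are word boundaries, punctuation stays attached to the neighboring token (for example, "adults." in the usage above).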

  • Other Information
    Word count and document count are calculated for each n-gram
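
    One way to compute both counts is sketched below. The page does not define the two counts, so the sketch assumes that word count is the total number of occurrences of an n-gram and document count is the number of documents containing it at least once. It also assumes that n-grams do not cross sentence boundaries. The function name `count_ngrams` is hypothetical.

```python
from collections import Counter


def count_ngrams(documents, max_n=5):
    """Count n-grams over documents.

    documents: iterable of documents, each a list of sentences,
               each sentence a list of word tokens.
    Returns (word_count, doc_count) as two Counters keyed by n-gram string.
    """
    word_count = Counter()  # assumed: total occurrences of each n-gram
    doc_count = Counter()   # assumed: number of documents containing it
    for sentences in documents:
        seen_in_doc = set()
        for words in sentences:
            for n in range(1, max_n + 1):
                for start in range(len(words) - n + 1):
                    gram = " ".join(words[start:start + n])
                    word_count[gram] += 1
                    seen_in_doc.add(gram)
        # Each distinct n-gram adds one to the document count per document.
        doc_count.update(seen_in_doc)
    return word_count, doc_count
```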

  • Filters
    • length of terms (<= 50 characters)
      Using Lexicon.2014 as an example:
      • shortest word: 1
        • a
        • ...
      • longest word: 103
        • matrix-assisted laser desorption/ionization Fourier-transform ion cyclotron resonance mass spectrometry
        • ...
      • Words of 50 characters or fewer cover 99.5508% (871,159/875,090)
    • word count (>= 30)
    • document count (>= 1)
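
    The three filters can be combined into a single predicate, as in the sketch below. It assumes that term length is measured in characters including spaces, which is consistent with the 103-character longest word quoted above. The constant and function names are hypothetical.

```python
MAX_TERM_LENGTH = 50  # characters, spaces included (assumed)
MIN_WORD_COUNT = 30
MIN_DOC_COUNT = 1


def passes_filters(ngram, word_count, doc_count):
    """Return True if an n-gram satisfies all three filters."""
    return (
        len(ngram) <= MAX_TERM_LENGTH
        and word_count >= MIN_WORD_COUNT
        and doc_count >= MIN_DOC_COUNT
    )


if __name__ == "__main__":
    longest = (
        "matrix-assisted laser desorption/ionization "
        "Fourier-transform ion cyclotron resonance mass spectrometry"
    )
    print(len(longest), passes_filters(longest, 100, 10))
    print(passes_filters("heart attack", 30, 1))
```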