The MEDLINE N-gram Set Specifications
This page describes the specifications of the MEDLINE n-gram set by LSG:
- How many grams?
First, we need to decide the range of N. We assessed all terms (valid words) in the Lexicon, under the assumption that the Lexicon is a representative subset of general English. The results show that terms of up to 5 words cover 99.47% of the Lexicon, as shown in the following table. Thus, we decided to generate 1~5 grams.
There are 875,090 terms in the Lexicon:
- Single-word terms: 457,335 (52.2615%)
- Multiword terms: 417,755 (47.7385%)
| N (words per term) | Term Count | Cumulative Term Count |
|---|---|---|
| 1 | 457,335 (52.2615%) | 457,335 (52.2615%) |
| 2 | 281,857 (32.2089%) | 739,192 (84.4704%) |
| 3 | 93,011 (10.6287%) | 832,203 (95.0991%) |
| 4 | 29,905 (3.4174%) | 862,108 (98.5165%) |
| 5 | 8,358 (0.9551%) | 870,466 (99.4716%) |
| 6 | 2,846 (0.3252%) | 873,312 (99.7968%) |
| 7 | 1,211 (0.1384%) | 874,523 (99.9352%) |
| 8 | 390 (0.0446%) | 874,913 (99.9798%) |
| 9 | 104 (0.0119%) | 875,017 (99.9917%) |
| 10 | 29 (0.0033%) | 875,046 (99.9950%) |
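The cumulative coverage figures above can be reproduced from the per-N term counts. The short Python sketch below does that arithmetic; it is illustrative only, not part of the LSG pipeline.

```python
# Per-N term counts from the table above; the total is all terms in the Lexicon.
term_counts = {
    1: 457_335, 2: 281_857, 3: 93_011, 4: 29_905, 5: 8_358,
    6: 2_846, 7: 1_211, 8: 390, 9: 104, 10: 29,
}
TOTAL_TERMS = 875_090

cumulative = 0
for n in sorted(term_counts):
    cumulative += term_counts[n]
    # Prints, e.g., " 5 |   8,358 (0.9551%) | 870,466 (99.4716%)"
    print(f"{n:2d} | {term_counts[n]:>7,} ({term_counts[n] / TOTAL_TERMS:.4%}) "
          f"| {cumulative:>7,} ({cumulative / TOTAL_TERMS:.4%})")
```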
- Contents
Titles and abstracts from MEDLINE.2014
- Tokenizer
- Tokenize titles and abstracts into sentences
- Each title is treated as a separate sentence (by appending a period and a space to it)
- 126,612,705 sentences are tokenized
- 14,314 unrecognized-pattern warnings are reported by the sentence tokenizer
- Tokenize all sentences into words (using space and tab as word boundaries), as sketched after this list
- 2,610,209,406 words are tokenized
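The two-step tokenization can be sketched as follows. This is a simplified illustration under stated assumptions: a naive period-based splitter stands in for the actual LSG sentence tokenizer, and the function names are made up.

```python
import re

def to_sentences(title: str, abstract: str) -> list[str]:
    # Each title is treated as a separate sentence: a period and a space
    # are appended before sentence tokenization.
    text = title + ". " + abstract
    # Naive stand-in splitter: break after ., ?, or ! followed by whitespace.
    return [s for s in re.split(r"(?<=[.?!])\s+", text) if s]

def to_words(sentence: str) -> list[str]:
    # Only space and tab act as word boundaries; punctuation stays attached.
    return [w for w in re.split(r"[ \t]+", sentence) if w]

for sentence in to_sentences("The MEDLINE n-gram set", "It is built from titles and abstracts."):
    print(to_words(sentence))
# ['The', 'MEDLINE', 'n-gram', 'set.']
# ['It', 'is', 'built', 'from', 'titles', 'and', 'abstracts.']
```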
- Other Information
Word count and document count are calculated for each n-gram.
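A minimal sketch of how these two counts can be accumulated per n-gram (the `ngrams` helper, the toy corpus, and the data layout are assumptions for illustration, not the LSG implementation):

```python
from collections import Counter

def ngrams(words: list[str], n: int) -> list[str]:
    # Contiguous n-word sequences, joined by single spaces.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

word_count: Counter[str] = Counter()  # total occurrences of each n-gram
doc_count: Counter[str] = Counter()   # number of documents containing it

# Toy corpus: each document is a list of tokenized sentences.
documents = [
    [["ion", "cyclotron", "resonance"], ["mass", "spectrometry"]],
    [["mass", "spectrometry", "methods"]],
]
for sentences in documents:
    seen_in_doc: set[str] = set()
    for words in sentences:
        for n in range(1, 6):  # 1~5 grams
            for gram in ngrams(words, n):
                word_count[gram] += 1
                seen_in_doc.add(gram)
    doc_count.update(seen_in_doc)  # +1 per document, however often it occurs

print(word_count["mass spectrometry"], doc_count["mass spectrometry"])  # 2 2
```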
- Filters
- length of terms (<= 50 characters)
Using Lexicon.2014 as an example:
- shortest term: 1 character
- longest term: 103 characters, e.g.
- matrix-assisted laser desorption/ionization Fourier-transform ion cyclotron resonance mass spectrometry
- ...
- Terms of up to 50 characters cover 99.5508% (871,159/875,090) of the Lexicon
- word count (>= 30)
- document count (>= 1)
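An n-gram is kept only if it passes all three filters above. A minimal sketch (the thresholds come from the list above; everything else is illustrative):

```python
MAX_TERM_LENGTH = 50  # characters
MIN_WORD_COUNT = 30   # total occurrences
MIN_DOC_COUNT = 1     # documents containing the n-gram

def keep_ngram(term: str, word_count: int, doc_count: int) -> bool:
    # All three filters must pass for the n-gram to stay in the set.
    return (len(term) <= MAX_TERM_LENGTH
            and word_count >= MIN_WORD_COUNT
            and doc_count >= MIN_DOC_COUNT)

print(keep_ngram("mass spectrometry", 120, 45))  # True
print(keep_ngram("mass spectrometry", 12, 4))    # False: word count below 30
```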