The MEDLINE N-gram Set Specifications
This page describes the specifications of the MEDLINE n-gram set by LSG:
- How many grams?
First, we need to decide the range of N. We assessed all terms (valid words) in the Lexicon, under the assumption that the Lexicon is a representative subset of general English. The results show that terms of up to 5 words cover 99.47% of the Lexicon, as shown in the following table. Thus, we decided to generate 1~5 grams.
There are 875,090 terms in the Lexicon:
- Single-word terms: 457,335 (52.2615%)
- Multiword terms: 417,755 (47.7385%)
| N (words per term) | Term Count | Cumulative Term Count |
|---|---|---|
| 1 | 457,335 (52.2615%) | 457,335 (52.2615%) |
| 2 | 281,857 (32.2089%) | 739,192 (84.4704%) |
| 3 | 93,011 (10.6287%) | 832,203 (95.0991%) |
| 4 | 29,905 (3.4174%) | 862,108 (98.5165%) |
| 5 | 8,358 (0.9551%) | 870,466 (99.4716%) |
| 6 | 2,846 (0.3252%) | 873,312 (99.7968%) |
| 7 | 1,211 (0.1384%) | 874,523 (99.9352%) |
| 8 | 390 (0.0446%) | 874,913 (99.9798%) |
| 9 | 104 (0.0119%) | 875,017 (99.9917%) |
| 10 | 29 (0.0033%) | 875,046 (99.9950%) |
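The cumulative coverage figures above can be reproduced from the per-N term counts. The short Python sketch below does that arithmetic; it is illustrative only, not part of the LSG pipeline.

```python
# Per-N term counts from the table above; the total is all terms in the Lexicon.
term_counts = {
    1: 457_335, 2: 281_857, 3: 93_011, 4: 29_905, 5: 8_358,
    6: 2_846, 7: 1_211, 8: 390, 9: 104, 10: 29,
}
TOTAL_TERMS = 875_090

cumulative = 0
for n in sorted(term_counts):
    cumulative += term_counts[n]
    # Prints, e.g., " 5 |   8,358 (0.9551%) | 870,466 (99.4716%)"
    print(f"{n:2d} | {term_counts[n]:>7,} ({term_counts[n] / TOTAL_TERMS:.4%}) "
          f"| {cumulative:>7,} ({cumulative / TOTAL_TERMS:.4%})")
```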
- Contents
Titles and abstracts from MEDLINE.2014
- Tokenizer
- Tokenize titles and abstracts into sentences
- Each title is treated as a separate sentence (by appending a period and a space to it)
- 126,612,705 sentences are tokenized
- 14,314 unrecognized-pattern warnings are reported by the sentence tokenizer
- Tokenize all sentences into words (using space and tab as word boundaries), as sketched after this list
- 2,610,209,406 words are tokenized
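The two-step tokenization can be sketched as follows. This is a simplified illustration under stated assumptions: a naive period-based splitter stands in for the actual LSG sentence tokenizer, and the function names are made up.

```python
import re

def to_sentences(title: str, abstract: str) -> list[str]:
    # Each title is treated as a separate sentence: a period and a space
    # are appended before sentence tokenization.
    text = title + ". " + abstract
    # Naive stand-in splitter: break after ., ?, or ! followed by whitespace.
    return [s for s in re.split(r"(?<=[.?!])\s+", text) if s]

def to_words(sentence: str) -> list[str]:
    # Only space and tab act as word boundaries; punctuation stays attached.
    return [w for w in re.split(r"[ \t]+", sentence) if w]

for sentence in to_sentences("The MEDLINE n-gram set", "It is built from titles and abstracts."):
    print(to_words(sentence))
# ['The', 'MEDLINE', 'n-gram', 'set.']
# ['It', 'is', 'built', 'from', 'titles', 'and', 'abstracts.']
```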
- Other Information
Word count and document count are calculated for each n-gram.
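A minimal sketch of how these two counts can be accumulated per n-gram (the `ngrams` helper, the toy corpus, and the data layout are assumptions for illustration, not the LSG implementation):

```python
from collections import Counter

def ngrams(words: list[str], n: int) -> list[str]:
    # Contiguous n-word sequences, joined by single spaces.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

word_count: Counter[str] = Counter()  # total occurrences of each n-gram
doc_count: Counter[str] = Counter()   # number of documents containing it

# Toy corpus: each document is a list of tokenized sentences.
documents = [
    [["ion", "cyclotron", "resonance"], ["mass", "spectrometry"]],
    [["mass", "spectrometry", "methods"]],
]
for sentences in documents:
    seen_in_doc: set[str] = set()
    for words in sentences:
        for n in range(1, 6):  # 1~5 grams
            for gram in ngrams(words, n):
                word_count[gram] += 1
                seen_in_doc.add(gram)
    doc_count.update(seen_in_doc)  # +1 per document, however often it occurs

print(word_count["mass spectrometry"], doc_count["mass spectrometry"])  # 2 2
```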
- Filters
- length of terms (<= 50 characters)
Using Lexicon.2014 as an example:
- shortest term: 1 character
- longest term: 103 characters, e.g.
- matrix-assisted laser desorption/ionization Fourier-transform ion cyclotron resonance mass spectrometry
- ...
- Terms of up to 50 characters cover 99.5508% (871,159/875,090) of the Lexicon
- word count (>= 30)
- document count (>= 1)
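An n-gram is kept only if it passes all three filters above. A minimal sketch (the thresholds come from the list above; everything else is illustrative):

```python
MAX_TERM_LENGTH = 50  # characters
MIN_WORD_COUNT = 30   # total occurrences
MIN_DOC_COUNT = 1     # documents containing the n-gram

def keep_ngram(term: str, word_count: int, doc_count: int) -> bool:
    # All three filters must pass for the n-gram to stay in the set.
    return (len(term) <= MAX_TERM_LENGTH
            and word_count >= MIN_WORD_COUNT
            and doc_count >= MIN_DOC_COUNT)

print(keep_ngram("mass spectrometry", 120, 45))  # True
print(keep_ngram("mass spectrometry", 12, 4))    # False: word count below 30
```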