The MEDLINE N-gram Set 2014: by Split, Group, Filter, and Combine Algorithm
The MEDLINE n-gram set for 2014, generated by the split, group, filter, and combine algorithm, is listed below. For each MEDLINE record, the title and abstract are used as the source of n-grams. They are combined, tokenized into sentences, and then tokenized into words (using the space character as the word boundary). Finally, n-grams are generated, filtering out terms that are longer than 50 characters or whose total word count is less than 30. Each entry in the generated n-gram sets has the following fields:
Document count | Word count | N-gram
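The generation steps above can be sketched as follows. This is a minimal illustration, not the actual NLM implementation: the naive sentence split on periods, the `n_max` of 5 (matching the five downloadable n-gram sizes), and the function name are all assumptions; only the space-as-word-boundary rule, the 50-character limit, and the word-count-below-30 filter come from the description above.

```python
from collections import defaultdict

MAX_CHARS = 50       # terms longer than 50 characters are filtered out
MIN_WORD_COUNT = 30  # terms with a total word count below 30 are filtered out

def generate_ngrams(records, n_max=5, min_count=MIN_WORD_COUNT):
    """Sketch of the split/group/filter steps.

    records: iterable of (doc_id, title, abstract) tuples.
    Returns {ngram: (document_count, word_count)}.
    """
    word_counts = defaultdict(int)
    doc_sets = defaultdict(set)
    for doc_id, title, abstract in records:
        # Title and abstract are combined as the n-gram source.
        text = f"{title} {abstract}"
        # Naive sentence split on '.'; the real tokenizer is more elaborate.
        for sentence in text.split("."):
            tokens = sentence.split()  # space as the word boundary
            for n in range(1, n_max + 1):
                for i in range(len(tokens) - n + 1):
                    ngram = " ".join(tokens[i:i + n])
                    if len(ngram) > MAX_CHARS:
                        continue  # filter: term too long
                    word_counts[ngram] += 1
                    doc_sets[ngram].add(doc_id)
    # filter: drop n-grams whose total word count is below the threshold
    return {g: (len(doc_sets[g]), c)
            for g, c in word_counts.items() if c >= min_count}
```

Lowering `min_count` makes the sketch usable on toy input; over the full 2014 baseline the 30-count threshold produces the set sizes in the table below.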
Download:
N-grams | File | Zip Size | Actual Size | No. of n-grams |
---|---|---|---|---|
Unigrams | 1-gram.2014.tgz | 5.4MB | 14MB | 804,382 |
Bigrams | 2-gram.2014.tgz | 33MB | 98MB | 4,587,349 |
Trigrams | 3-gram.2014.tgz | 49MB | 160MB | 6,287,536 |
Four-grams | 4-gram.2014.tgz | 33MB | 114MB | 3,799,377 |
Five-grams | 5-gram.2014.tgz | 15MB | 54MB | 1,545,175 |
N-gram Set | nGramSet.2014.30.tgz | 170MB | 437MB | 17,023,819 |
Distilled N-gram Set | distilledNGram.2014.tgz | 51MB | 164MB | 6,351,392 |
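After unpacking a downloaded set, each line carries the three fields listed above (document count, word count, n-gram). A minimal reader might look like the following; note that the tab delimiter and the function name are assumptions, only the field order comes from the specification above.

```python
def parse_ngram_lines(lines):
    """Parse lines of the assumed form 'doc_count<TAB>word_count<TAB>ngram'.

    Yields (document_count, word_count, ngram) tuples; the n-gram itself
    may contain spaces, so it is kept as the final, undivided field.
    """
    for line in lines:
        doc_count, word_count, ngram = line.rstrip("\n").split("\t", 2)
        yield int(doc_count), int(word_count), ngram
```

For example, keeping only n-grams seen in many documents is a single filter over the parsed tuples: `[g for d, w, g in parse_ngram_lines(f) if d >= 100]`.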