MEDLINE N-Gram Set

The MEDLINE N-gram Set 2014: by Split, Group, Filter, and Combine Algorithm

The 2014 MEDLINE n-gram set, generated by the split, group, filter, and combine algorithm, is listed below. For each MEDLINE record, the title and abstract are used as the source of n-grams. They are combined, tokenized into sentences, and then tokenized into tokens (words, using spaces as word boundaries). Finally, n-grams are generated, and those that have more than 50 characters or a total word count of less than 30 are filtered out. The specifications for generating these n-grams are as follows:

MEDLINE: 2014 - TI and AB (from PmidTiAbS14nXXXX.txt: 1 ~ 746)
Method: Split, Group, Filter, and Combine Algorithm
Max. Character Size: 50
Min. word count: 30
Min. document count: 1
Total document count: 22,356,869
Total sentence count: 126,612,705
Total token count: 2,610,209,406
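
The pipeline described above (combine title and abstract, split into sentences, split into space-delimited tokens, generate n-grams, then filter) can be sketched roughly as follows. This is a single-pass illustration only, not the actual split, group, filter, and combine implementation, which processes the corpus in parts. The function names, the naive sentence splitter, and the reading of "word count" as the total number of occurrences across all documents are assumptions.

```python
import re
from collections import Counter

MAX_CHARS = 50       # "Max. Character Size" from the specification
MIN_WORD_COUNT = 30  # "Min. word count" from the specification
MIN_DOC_COUNT = 1    # "Min. document count" from the specification


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def ngrams(tokens, n):
    # All contiguous runs of n tokens, joined back with single spaces.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(records, max_n=5):
    """records: iterable of (title, abstract) pairs.

    Returns two Counters: total occurrences per n-gram (assumed meaning of
    "word count") and number of records containing each n-gram.
    """
    word_count, doc_count = Counter(), Counter()
    for title, abstract in records:
        seen = set()
        for sentence in split_sentences(f"{title} {abstract}"):
            tokens = sentence.split()  # words use space as the word boundary
            for n in range(1, max_n + 1):
                for gram in ngrams(tokens, n):
                    word_count[gram] += 1
                    seen.add(gram)
        doc_count.update(seen)  # each record counts once per distinct n-gram
    return word_count, doc_count


def filter_ngrams(word_count, doc_count, max_chars=MAX_CHARS,
                  min_wc=MIN_WORD_COUNT, min_dc=MIN_DOC_COUNT):
    # Keep n-grams within the length limit that meet both count thresholds.
    return {
        gram: (doc_count[gram], wc)
        for gram, wc in word_count.items()
        if len(gram) <= max_chars and wc >= min_wc and doc_count[gram] >= min_dc
    }
```

Because tokens are split on spaces only, trailing punctuation stays attached to a word in this sketch ("therapy." and "therapy" are counted separately); whether the real pipeline normalizes such tokens is not stated here.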

N-gram files

File format - 3 fields:

Document count, Word count, N-gram
Sorted by document count, then word count, then alphabetical order of n-grams.
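
A minimal sketch of reading one line of such a file into its three fields. The field separator is not stated in this section, so the `sep="|"` default below is an assumption to be adjusted after inspecting a downloaded file; `NGramEntry` and `parse_line` are hypothetical names.

```python
from typing import NamedTuple


class NGramEntry(NamedTuple):
    doc_count: int   # number of documents containing the n-gram
    word_count: int  # the n-gram's word count field
    ngram: str


def parse_line(line, sep="|"):
    # sep is an assumed delimiter; check the real files before relying on it.
    # maxsplit=2 keeps any separator characters inside the n-gram itself intact.
    doc_count, word_count, ngram = line.rstrip("\n").split(sep, 2)
    return NGramEntry(int(doc_count), int(word_count), ngram)
```

For example, `parse_line("12|34|gene therapy")` yields an entry with `doc_count` 12, `word_count` 34, and `ngram` "gene therapy" under that assumed delimiter.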

Download:

N-grams	File	Zip Size	Actual Size	No. of n-grams
Unigrams	1-gram.2014.tgz	5.4MB	14MB	804,382
Bigrams	2-gram.2014.tgz	33MB	98MB	4,587,349
Trigrams	3-gram.2014.tgz	49MB	160MB	6,287,536
Four-grams	4-gram.2014.tgz	33MB	114MB	3,799,377
Five-grams	5-gram.2014.tgz	15MB	54MB	1,545,175

N-gram Set	nGramSet.2014.30.tgz	170MB	437MB	17,023,819
Distilled N-gram Set	distilledNGram.2014.tgz	51MB	164MB	6,351,392
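
As a quick consistency check on the table above, the per-order n-gram counts can be summed and compared with the figure listed for the combined N-gram Set. The numbers are copied from the table; reading the combined set as the union of the five per-order files is an inference from this arithmetic, not something stated explicitly.

```python
# Counts copied from the "No. of n-grams" column of the download table.
counts = {
    "1-gram": 804_382,
    "2-gram": 4_587_349,
    "3-gram": 6_287_536,
    "4-gram": 3_799_377,
    "5-gram": 1_545_175,
}
ngram_set_total = 17_023_819  # listed for nGramSet.2014.30.tgz

total = sum(counts.values())
print(total, total == ngram_set_total)
```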