The MEDLINE N-gram Set 2020: by Split, Group, Filter, and Combine Algorithm
The MEDLINE n-gram set (generated by the split, group, filter, and combine algorithm) is listed below. For each MEDLINE record, the title and abstract are used as the source of n-grams. They are combined, tokenized into sentences, and then tokenized into tokens (words, using spaces as word boundaries). Finally, n-grams are generated, and terms with more than 50 characters or a total word count of less than 30 are filtered out. The specifications for generating these n-grams are as follows:
Document count | Word count | N-gram |
N-grams | File | Zip Size | Actual Size | No. of n-grams |
---|---|---|---|---|
Unigrams | 1-gram.2020.tgz | 7.6 MB | 19 MB | 1,126,766 |
Bigrams | 2-gram.2020.tgz | 49 MB | 143 MB | 6,702,698 |
Trigrams | 3-gram.2020.tgz | 75 MB | 248 MB | 9,677,700 |
Four-grams | 4-gram.2020.tgz | 54 MB | 187 MB | 6,154,320 |
Five-grams | 5-gram.2020.tgz | 26 MB | 94 MB | 2,649,324 |
N-gram Set | nGramSet.2020.30.tgz | 210 MB | 689 MB | 26,310,808 |
Distilled N-gram Set | distilledNGram.2020.tgz | 84 MB | 271 MB | 10,354,021 |
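
The processing steps described above (combine title and abstract, split into sentences, tokenize on spaces, generate n-grams, then filter by length and frequency) can be illustrated with a minimal Python sketch. This is not the actual split, group, filter, and combine implementation. The function names, the naive regex sentence splitter, and the in-memory counting are all assumptions made for illustration. Only the two thresholds (50 characters, word count of 30) and the 1-to-5 n-gram range come from the text and table above.

```python
import re
from collections import Counter

MAX_CHARS = 50        # from the text: drop terms with more than 50 characters
MIN_WORD_COUNT = 30   # from the text: drop terms whose total word count is below 30
MAX_N = 5             # the table lists unigrams through five-grams


def split_sentences(text):
    """Assumption: a naive splitter on ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(records, max_n=MAX_N):
    """records: iterable of (title, abstract) pairs (hypothetical input shape).

    Returns (doc_count, word_count): the number of records containing each
    n-gram, and the total number of occurrences of each n-gram.
    """
    doc_count, word_count = Counter(), Counter()
    for title, abstract in records:
        seen = set()
        # Title and abstract are combined before sentence splitting.
        for sentence in split_sentences(f"{title} {abstract}"):
            tokens = sentence.split()  # whitespace is the word boundary
            for n in range(1, max_n + 1):
                for gram in ngrams(tokens, n):
                    word_count[gram] += 1
                    seen.add(gram)
        doc_count.update(seen)  # each record counts once per n-gram
    return doc_count, word_count


def filter_ngrams(doc_count, word_count,
                  max_chars=MAX_CHARS, min_wc=MIN_WORD_COUNT):
    """Keep n-grams with at most max_chars characters and count >= min_wc."""
    return {
        gram: (doc_count[gram], wc)
        for gram, wc in word_count.items()
        if len(gram) <= max_chars and wc >= min_wc
    }
```

Each surviving n-gram maps to a `(document count, word count)` pair, mirroring the "Document count | Word count | N-gram" field order shown above. At MEDLINE scale the counts would not fit in memory, so a real pipeline would need to process the data in pieces; this sketch only shows the logic of each step.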