The MEDLINE N-gram Set 2024: by Split, Group, Filter, Combine and Sort Algorithm
The MEDLINE n-gram set, generated by the split, group, filter, combine and sort (SGFCS) algorithm, is listed below. For each MEDLINE record, the title and abstract are used as the source of n-grams. They are combined, tokenized into sentences, and then tokenized into tokens (words, with spaces as word boundaries). Finally, n-grams are generated, and terms with more than 50 characters or a total word count of less than 30 are filtered out. The specifications for generating these n-grams are as follows:
Document count | Word count | N-gram |
N-grams | File | Zip Size | Actual Size | No. of n-grams |
---|---|---|---|---|
Unigrams | 1-gram.2024.tgz | 9.2 MB | 23 MB | 1,374,878 |
Bigrams | 2-gram.2024.tgz | 61 MB | 179 MB | 8,369,463 |
Trigrams | 3-gram.2024.tgz | 97 MB | 323 MB | 12,511,710 |
Four-grams | 4-gram.2024.tgz | 72 MB | 251 MB | 8,226,169 |
Five-grams | 5-gram.2024.tgz | 37 MB | 131 MB | 3,678,688 |
N-gram Set | nGramSet.2024.30.tgz | 275 MB | 905 MB | 34,160,908 |
Distilled N-gram Set | distilledNGram.2024.tgz | 112 MB | 366 MB | 13,775,979 |
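
The generation steps described above (combine title and abstract, split into sentences, tokenize on spaces, count n-grams, then filter) can be illustrated with a minimal Python sketch. This is not the SGFCS implementation: it counts everything in memory, its sentence splitter is a naive regular expression, and the function names (`split_sentences`, `ngrams`, `count_ngrams`) are hypothetical. Only the two thresholds (50 characters, count of 30) and the n-gram range of 1 to 5 come from this page.

```python
import re
from collections import Counter

MAX_CHARS = 50  # from the text: drop terms with more than 50 characters
MIN_COUNT = 30  # from the text: drop terms whose total count is less than 30


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def ngrams(tokens, n):
    # All contiguous runs of n tokens, joined back with single spaces.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(records, max_n=5, max_chars=MAX_CHARS, min_count=MIN_COUNT):
    """records: iterable of (title, abstract) string pairs."""
    counts = Counter()
    for title, abstract in records:
        text = f"{title} {abstract}"  # combine title and abstract
        for sentence in split_sentences(text):
            tokens = sentence.split()  # whitespace is the word boundary
            for n in range(1, max_n + 1):
                counts.update(ngrams(tokens, n))  # n-grams never cross sentences
    # Keep terms that pass both the length and the frequency filters.
    return {
        term: count
        for term, count in counts.items()
        if len(term) <= max_chars and count >= min_count
    }
```

Because tokens are split on whitespace only, punctuation stays attached to words in this sketch (for example `common.`). Calling `count_ngrams(records)` on a list of `(title, abstract)` pairs returns a dictionary mapping each surviving n-gram to its count. An in-memory counter of this kind is only suitable for small samples.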