MEDLINE N-Gram Set

The MEDLINE N-gram Set 2015: by Split, Group, Filter, and Combine Algorithm

The MEDLINE n-gram set (generated by the split, group, filter, and combine algorithm) is listed below. For each MEDLINE record, the title and abstract are used as the source of n-grams. They are combined, tokenized into sentences, and then tokenized into tokens (words, with spaces as word boundaries). Finally, n-grams are generated, filtering out terms that have more than 50 characters or a total word count of less than 30. The specifications for generating these n-grams are as follows: