Element Words from MEDLINE
I. What are Element Words?
Element words are lowercase single words that contain no punctuation and are not stop words (such as "the", "of", and "and"). They are used as seeds to find single words and multiwords that contain the associated element word.
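The definition above can be illustrated with a minimal Python sketch. It is not the actual MEDLINE processing code: the stop-word list is a hypothetical, abbreviated stand-in, and the tokenization rule (lowercase the text, then treat every run of ASCII letters or digits as one word) is an assumption made for illustration.

```python
import re

# Hypothetical, abbreviated stop-word list; the actual list is not given here.
STOP_WORDS = {"the", "of", "and", "a", "in", "to"}


def extract_element_words(text):
    """Return candidate element words from free text, in order of appearance."""
    # Assumption: lowercase first, then treat any run of ASCII letters or
    # digits as one word, so punctuation and whitespace act as separators.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Drop stop words; everything left is a candidate element word.
    return [token for token in tokens if token not in STOP_WORDS]


sample = "The treatment of diabetes mellitus, in vitro."
print(extract_element_words(sample))
# prints ['treatment', 'diabetes', 'mellitus', 'vitro']
```

Each surviving word can then serve as a seed for looking up the single words and multiwords that contain it.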
II. Algorithm
Type/Symbol | Description | Examples |
---|---|---|
EWT_LEXICON | Single words in Lexicon | diabetes, disease |
EWT_NUMBER | Numbers in Lexicon | five, fifty |
EWT_DIGIT | Pure digits | 5, 15, 015 |
EWT_MULTIWORD | Not a single word, but part of multiword in Lexicon | mellitus, vitro, aeruginosa |
EWT_NONWORD | Single words in Lexicon | 3h |
EWT_NEW | New element words, not in above types | cdh, mfi |
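The type table above can be read as a small decision procedure. The sketch below is one possible interpretation, not the actual algorithm: the function names, the toy Lexicon sets, and the order of the checks are all assumptions (for instance, number words are tested before general Lexicon words, because "five" would otherwise match both). EWT_NONWORD is left out because the table does not spell out its rule.

```python
def multiword_parts(multiwords):
    """Collect every individual word that occurs inside a Lexicon multiword."""
    parts = set()
    for term in multiwords:
        parts.update(term.split())
    return parts


def classify(word, lexicon_words, number_words, mw_parts):
    """Assign an element word type (EWT); the check order is an assumption."""
    if word.isascii() and word.isdigit():
        return "EWT_DIGIT"        # pure digits, e.g. 5, 15, 015
    if word in number_words:
        return "EWT_NUMBER"       # number words, e.g. five, fifty
    if word in lexicon_words:
        return "EWT_LEXICON"      # single words in the Lexicon
    if word in mw_parts:
        return "EWT_MULTIWORD"    # only seen as part of a multiword
    return "EWT_NEW"              # none of the above


# Toy data for illustration only; not taken from the real Lexicon files.
lexicon_words = {"diabetes", "disease", "five", "fifty"}
number_words = {"five", "fifty"}
mw_parts = multiword_parts({"diabetes mellitus", "in vitro", "pseudomonas aeruginosa"})

for w in ["diabetes", "five", "015", "mellitus", "cdh"]:
    print(w, classify(w, lexicon_words, number_words, mw_parts))
```

Precomputing the set of multiword parts keeps each lookup constant-time, which matters when millions of distinct words are classified.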
III. Results of Element Words from MEDLINE.2014
- Total word num: 3264205
  - Word num - Lexicon: 291271 - 8.9232%
  - Word num - Number: 61 - 0.0019%
  - Word num - Digit: 75406 - 2.3101%
  - Word num - Multiword: 42045 - 1.2881%
  - Word num - Nonword: 0 - 0.0000%
  - Word num - TBD: 2855422 - 87.4768%
- Total word count: 2725710505 (2725710505)
  - Word count - Lexicon: 2542758048 - 93.2879%
  - Word count - Number: 7797019 - 0.2861%
  - Word count - Digit: 126635190 - 4.6460%
  - Word count - Multiword: 18549715 - 0.6805%
  - Word count - Nonword: 0 - 0.0000%
  - Word count - TBD: 29970533 - 1.0995%
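The percentages above are each category's share of the corresponding total. As a quick arithmetic check, the sketch below recomputes the distinct-word shares from the reported numbers; the helper name `share` is hypothetical and the figures are copied from the list above.

```python
# Distinct-word numbers per category, as reported for MEDLINE.2014.
TOTAL_WORD_NUM = 3264205
WORD_NUM = {
    "Lexicon": 291271,
    "Number": 61,
    "Digit": 75406,
    "Multiword": 42045,
    "Nonword": 0,
    "TBD": 2855422,
}


def share(part, whole):
    """Format part/whole as a percentage with four decimal places."""
    return f"{100.0 * part / whole:.4f}%"


# The categories partition the vocabulary, so they add up to the total.
assert sum(WORD_NUM.values()) == TOTAL_WORD_NUM

for name, num in WORD_NUM.items():
    print(f"Word num - {name}: {num} - {share(num, TOTAL_WORD_NUM)}")
```

The same helper applies to the word counts: Lexicon words are under 9% of the distinct words but over 93% of all occurrences, while the TBD words show the opposite pattern.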
The format of the result file wordCount.rpt is:
Rank | Element word | Element Word Type (EWT) | Word Count (WC) | Cumm. WC | Cumm. % |
---|---|---|---|---|---|
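To make the columns concrete, the following sketch builds rows with these six fields from a table of word counts. It assumes that rank follows descending word count (ties broken alphabetically) and that the cumulative percentage is taken over the total word count; neither rule is stated here, and the function name and toy data are hypothetical. The on-disk delimiter of wordCount.rpt is also not specified, so the sketch returns tuples rather than formatted lines.

```python
from collections import Counter


def report_rows(counts, ewt_by_word):
    """Build (rank, word, EWT, WC, cumulative WC, cumulative %) tuples."""
    total = sum(counts.values())
    # Assumption: highest word count first, ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    rows = []
    cumulative = 0
    for rank, (word, wc) in enumerate(ranked, start=1):
        cumulative += wc
        cumulative_pct = round(100.0 * cumulative / total, 4)
        ewt = ewt_by_word.get(word, "EWT_NEW")  # unknown words default to EWT_NEW
        rows.append((rank, word, ewt, wc, cumulative, cumulative_pct))
    return rows


# Toy counts for illustration only; not real MEDLINE figures.
counts = Counter({"disease": 6, "diabetes": 3, "cdh": 1})
ewt_by_word = {"disease": "EWT_LEXICON", "diabetes": "EWT_LEXICON"}

for row in report_rows(counts, ewt_by_word):
    print(row)
```

Reading down the cumulative percentage column shows how many of the top-ranked element words are needed to cover a given fraction of all word occurrences.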
IV. Usage of Element Words
V. Updating Lexical Records in the Lexicon
The SPECIALIST Lexicon has been distributed annually by the National Library of Medicine (NLM) since 1994. New words and multiwords are created on a daily basis. LexBuild randomly picks 10 lexical records daily for each lexBuilder (linguist) to review and update. Words added to the Lexicon years ago might not have been updated since, and thus new multiwords (associated with existing element words) might not be covered in the Lexicon. The element-word approach allows us to systematically review and update lexical records associated with (high-frequency) element words.