Element Words from MEDLINE
I. What are Element Words?
Element words are lowercase single words that contain no punctuation and are not stop words (for example: the, of, and). They are used as seeds to find single words and multiwords (terms that contain the associated element word).
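The definition above can be sketched as a simple filter. This is a minimal illustration, not the tool used to produce the results on this page: the names `STOP_WORDS`, `is_element_word`, and `element_words` are hypothetical, the stop-word list contains only the three examples given above, and tokens are assumed to be obtained by lowercasing the text and splitting on whitespace, with any token that still contains punctuation discarded.

```python
# Illustrative stop-word list (assumption: only the three examples named in the text).
STOP_WORDS = frozenset({"the", "of", "and"})


def is_element_word(token, stop_words=STOP_WORDS):
    """Return True if the token fits the definition of an element word."""
    return (
        token == token.lower()        # lowercase
        and token.isalnum()           # single word: no spaces, no punctuation, not empty
        and token not in stop_words   # not a stop word
    )


def element_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace, and keep the element words."""
    return [t for t in text.lower().split() if is_element_word(t, stop_words)]


print(element_words("The diagnosis of diabetes mellitus"))
```

A real stop-word list would be much longer, and a production tokenizer would probably strip trailing punctuation rather than drop the whole token.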
II. Algorithm
Type/Symbol | Descriptions | Example |
---|---|---|
EWT_LEXICON | Single words in Lexicon | diabetes, disease |
EWT_NUMBER | Numbers in Lexicon | five, fifty |
EWT_DIGIT | Pure digits | 5, 15, 015 |
EWT_MULTIWORD | Not a single word in the Lexicon, but part of a multiword in the Lexicon | mellitus, vitro, aeruginosa |
EWT_NONWORD | Single words in Lexicon | 3h |
EWT_NEW | New element words, not in above types | cdh, mfi |
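
The types in the table can be read as a classification applied to each element word. The sketch below is one possible reading, under several assumptions: the function and parameter names are hypothetical, the Lexicon is represented as plain Python sets, and the order of the checks (digit, number, non-word, Lexicon, multiword, new) is assumed rather than taken from the source. The table does not give a distinguishing criterion for EWT_NONWORD, so non-words are passed in as a caller-supplied set.

```python
from enum import Enum


class EWT(Enum):
    LEXICON = "EWT_LEXICON"
    NUMBER = "EWT_NUMBER"
    DIGIT = "EWT_DIGIT"
    MULTIWORD = "EWT_MULTIWORD"
    NONWORD = "EWT_NONWORD"
    NEW = "EWT_NEW"


def classify(word, lexicon_words, number_words, multiword_parts, nonwords=frozenset()):
    """Assign an element word type; the precedence of the checks is an assumption."""
    if word.isascii() and word.isdigit():
        return EWT.DIGIT            # pure digits, e.g. "015"
    if word in number_words:
        return EWT.NUMBER           # checked before LEXICON: number words are also Lexicon words
    if word in nonwords:
        return EWT.NONWORD          # criterion not specified in the table
    if word in lexicon_words:
        return EWT.LEXICON          # single word in the Lexicon
    if word in multiword_parts:
        return EWT.MULTIWORD        # only appears inside a Lexicon multiword
    return EWT.NEW                  # none of the above


# Toy data built from the examples in the table.
lexicon = {"diabetes", "disease", "five", "fifty"}
numbers = {"five", "fifty"}
parts = {"mellitus", "vitro", "aeruginosa"}
nonwords = {"3h"}

for w in ["diabetes", "five", "015", "mellitus", "3h", "cdh"]:
    print(w, classify(w, lexicon, numbers, parts, nonwords).value)
```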
III. Results of Element Words from MEDLINE.2014
- Total word num: 3264205
  - Word num - Lexicon: 291271 - 8.9232%
  - Word num - Number: 61 - 0.0019%
  - Word num - Digit: 75406 - 2.3101%
  - Word num - Multiword: 42045 - 1.2881%
  - Word num - Nonword: 0 - 0.0000%
  - Word num - TBD: 2855422 - 87.4768%
- Total word count: 2725710505 (2725710505)
  - Word count - Lexicon: 2542758048 - 93.2879%
  - Word count - Number: 7797019 - 0.2861%
  - Word count - Digit: 126635190 - 4.6460%
  - Word count - Multiword: 18549715 - 0.6805%
  - Word count - Nonword: 0 - 0.0000%
  - Word count - TBD: 29970533 - 1.0995%
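
The percentages above are each category's share of the corresponding total. A short sketch that recomputes them from the reported figures follows; the variable and function names are hypothetical, and "word num" is read here as the number of distinct words while "word count" is the number of occurrences.

```python
# Figures copied from the results above.
WORD_NUM = {
    "Lexicon": 291271, "Number": 61, "Digit": 75406,
    "Multiword": 42045, "Nonword": 0, "TBD": 2855422,
}
WORD_COUNT = {
    "Lexicon": 2542758048, "Number": 7797019, "Digit": 126635190,
    "Multiword": 18549715, "Nonword": 0, "TBD": 29970533,
}


def shares(counts):
    """Return each category's share of the total, in percent."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


for label, counts in (("Word num", WORD_NUM), ("Word count", WORD_COUNT)):
    print(f"Total {label.lower()}: {sum(counts.values())}")
    for name, pct in shares(counts).items():
        print(f"  {label} - {name}: {counts[name]} - {pct:.4f}%")
```

The category figures sum to the reported totals. Lexicon words make up about 9% of the distinct words but about 93% of all occurrences, while the TBD words are about 87% of the distinct words and only about 1% of occurrences.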
The format of the result file wordCount.rpt is:
Rank | Element word | Element Word Type (EWT) | Word Count (WC) | Cumulative WC | Cumulative % |
---|---|---|---|---|---|
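
To make the columns concrete, the sketch below builds rows in this layout from a table of word counts. It is an illustration only: the function names are hypothetical, rows are assumed to be ranked by descending word count (ties broken alphabetically), the cumulative percentage is assumed to be relative to the total word count, and the pipe-delimited output with four decimal places is an assumption about the file format rather than a documented fact.

```python
def build_report_rows(word_counts, word_types):
    """Return (rank, word, EWT, WC, cumulative WC, cumulative %) tuples."""
    total = sum(word_counts.values())
    # Highest count first; alphabetical order breaks ties (assumed).
    ranked = sorted(word_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    rows = []
    running = 0
    for rank, (word, count) in enumerate(ranked, start=1):
        running += count
        ewt = word_types.get(word, "EWT_NEW")  # unknown words default to EWT_NEW
        rows.append((rank, word, ewt, count, running, 100.0 * running / total))
    return rows


def format_row(row):
    """Render one row as a pipe-delimited line (delimiter is an assumption)."""
    rank, word, ewt, wc, cum_wc, cum_pct = row
    return f"{rank}|{word}|{ewt}|{wc}|{cum_wc}|{cum_pct:.4f}%"


# Toy counts, not real MEDLINE data.
counts = {"disease": 6, "5": 3, "cdh": 1}
types = {"disease": "EWT_LEXICON", "5": "EWT_DIGIT"}

rows = build_report_rows(counts, types)
for row in rows:
    print(format_row(row))
```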
IV. Usage of Element Words
V. Updating Lexical Records in the Lexicon
The SPECIALIST Lexicon has been distributed annually by the National Library of Medicine (NLM) since 1994. New words and multiwords are created on a daily basis. LexBuild randomly picks 10 lexical records each day for each lexBuilder (linguist) to review and update. Words that were added to the Lexicon years ago might not have been updated since, so new multiwords associated with existing element words might not be covered in the Lexicon. The element word approach allows us to systematically review and update the lexical records associated with high-frequency element words.
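
One way to picture the systematic review is as a queue ordered by element word frequency, in contrast to a random daily pick. The sketch below is hypothetical and does not describe how LexBuild is implemented: `frequency_batch`, its parameters, and the toy data are all assumptions, and the lookup of the lexical records associated with each selected element word is left out.

```python
from itertools import islice


def frequency_batch(ranked_element_words, reviewed, batch_size=10):
    """Return the next unreviewed element words, highest frequency first.

    ranked_element_words: element words already sorted by descending word count.
    reviewed: set of element words whose lexical records were already reviewed.
    """
    pending = (w for w in ranked_element_words if w not in reviewed)
    return list(islice(pending, batch_size))


# Toy ranking, not real MEDLINE data.
ranked = ["disease", "diabetes", "cell", "protein"]
reviewed = {"diabetes"}

print(frequency_batch(ranked, reviewed, batch_size=2))
```

Working down the ranking this way means the element words that occur most often in MEDLINE are checked first, so missing multiwords with the widest coverage are likely to be found earliest.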