The SPECIALIST Lexicon

Element Words from MEDLINE

I. What are Element Words?

Element words are lowercase single words that contain no punctuation and are not stop words (such as "the", "of", and "and"). They serve as seeds for finding single words and multiwords (terms that contain the associated element word).

II. Algorithm

  • Retrieve element words from MEDLINE titles and abstracts (or another corpus of interest) by tokenizing the text: strip punctuation, convert to lowercase, and use spaces and tabs as word boundaries.
  • Categorize these element words into the following types:
    Type/Symbol   | Description                                         | Example
    EWT_LEXICON   | Single words in Lexicon                             | diabetes, disease
    EWT_NUMBER    | Numbers in Lexicon                                  | five, fifty
    EWT_DIGIT     | Pure digits                                         | 5, 15, 015
    EWT_MULTIWORD | Not a single word, but part of multiword in Lexicon | mellitus, vitro, aeruginosa
    EWT_NONWORD   | Single words in Lexicon                             | 3h
    EWT_NEW       | New element words, not in above types               | cdh, mfi
  • Calculate word count (WC)
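
The steps above can be sketched in Python as follows. This is a minimal illustration, not the actual SPECIALIST tooling: the stop-word list and the three lookup sets are hypothetical stand-ins for the real Lexicon, the order in which types are checked is an assumption, and tokens that still contain punctuation after trimming their edges are simply discarded (also an assumption). EWT_NONWORD is left out because the text does not say how nonwords are detected.

```python
import re
import string
from collections import Counter

# Hypothetical stand-ins for the stop-word list and Lexicon lookups.
STOP_WORDS = {"the", "of", "and", "in", "a"}
LEXICON_WORDS = {"diabetes", "disease"}
LEXICON_NUMBERS = {"five", "fifty"}
MULTIWORD_PARTS = {"mellitus", "vitro", "aeruginosa"}


def element_words(text):
    """Tokenize on spaces/tabs/newlines, lowercase, and trim punctuation."""
    words = []
    for token in re.split(r"[ \t\n]+", text.strip()):
        word = token.lower().strip(string.punctuation)
        if not word or word in STOP_WORDS:
            continue
        # Assumption: a token with punctuation inside it is not an element word.
        if any(ch in string.punctuation for ch in word):
            continue
        words.append(word)
    return words


def classify(word):
    """Assign an element word type; the check order is an assumption."""
    if word.isascii() and word.isdigit():
        return "EWT_DIGIT"
    if word in LEXICON_NUMBERS:
        return "EWT_NUMBER"
    if word in LEXICON_WORDS:
        return "EWT_LEXICON"
    if word in MULTIWORD_PARTS:
        return "EWT_MULTIWORD"
    return "EWT_NEW"


def word_counts(text):
    """Word count (WC) for every element word found in the text."""
    return Counter(element_words(text))
```

Calling `word_counts` on a title or abstract and then `classify` on each key gives the per-type word numbers and word counts for that text.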

III. Results of Element Words from MEDLINE.2014

- Total word num: 3264205
-- Word num - Lexicon: 291271 - 8.9232%
-- Word num - Number: 61 - 0.0019%
-- Word num - Digit: 75406 - 2.3101%
-- Word num - Multiword: 42045 - 1.2881%
-- Word num - Nonword: 0 - 0.0000%
-- Word num - TBD: 2855422 - 87.4768%

- Total word count: 2725710505 (2725710505)
-- Word count - Lexicon: 2542758048 - 93.2879%
-- Word count - Number: 7797019 - 0.2861%
-- Word count - Digit: 126635190 - 4.6460%
-- Word count - Multiword: 18549715 - 0.6805%
-- Word count - Nonword: 0 - 0.0000%
-- Word count - TBD: 29970533 - 1.0995%

The format of the result file wordCount.rpt is:

Rank | Element word | Element Word Type (EWT) | Word Count (WC) | Cum. WC | Cum. %
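
The columns above can be derived from the word counts as in the following sketch. It assumes that rows are ranked by descending word count with ties broken alphabetically, and that percentages are rounded to four decimal places as in the results listed earlier; neither is stated in the text. The function name `report_rows` is hypothetical, and the actual delimiter used in wordCount.rpt is not specified here.

```python
from itertools import accumulate


def report_rows(word_counts, word_types):
    """Build (rank, word, type, WC, cumulative WC, cumulative %) tuples.

    word_counts maps element word -> WC; word_types maps element word -> EWT.
    Assumption: highest WC first, ties broken alphabetically.
    """
    ranked = sorted(word_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    total = sum(word_counts.values())
    running = accumulate(wc for _, wc in ranked)
    rows = []
    for rank, ((word, wc), cum_wc) in enumerate(zip(ranked, running), start=1):
        cum_pct = round(100.0 * cum_wc / total, 4)
        rows.append((rank, word, word_types.get(word, "EWT_NEW"), wc, cum_wc, cum_pct))
    return rows
```

The cumulative percentage shows how much of the corpus is covered by the element words up to a given rank, which helps decide how far down the list a review needs to go.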

IV. Usage of Element Words

  • New element words (EWT_NEW) with high frequency are sent to linguists for review, to extend coverage of single words and multiwords in MEDLINE.
  • A list of candidate multiwords for element words (EWT_NEW, EWT_LEXICON, and EWT_MULTIWORD) can be generated and sent to linguists for review, to cover more multiwords or update existing lexical records.
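
The text does not say how candidate multiwords are generated. One simple possibility, shown here only as an illustration, is to count the word n-grams in the corpus that contain a given element word. The function name `candidate_multiwords` and the `max_len` parameter are hypothetical.

```python
from collections import Counter


def candidate_multiwords(tokens, element_word, max_len=3):
    """Count n-grams (2..max_len tokens) that contain the element word."""
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if element_word in gram:
                counts[" ".join(gram)] += 1
    return counts


tokens = "type 2 diabetes mellitus and gestational diabetes mellitus".split()
candidates = candidate_multiwords(tokens, "mellitus", max_len=2)
```

In this toy input, "diabetes mellitus" occurs twice and would rank above "mellitus and"; frequent candidates are the ones worth sending for review.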

V. Updating Lexical Records in the Lexicon

The SPECIALIST Lexicon has been distributed annually by the National Library of Medicine (NLM) since 1994. New words and multiwords are added on a daily basis. LexBuild randomly picks 10 lexical records each day for each lexBuilder (linguist) to review and update. Words that were added to the Lexicon years ago might not have been updated since, so new multiwords associated with existing element words might not be covered in the Lexicon. The element word approach allows us to review and update lexical records associated with (high-frequency) element words systematically.