The SPECIALIST Lexicon

Frequency Analysis

I. Introduction

This page describes the frequency analysis of valid words (single words and multiwords). Word counts (frequencies) from the MEDLINE n-gram set are used for the analysis. The 2015 data are used in the examples below.

II. Word Count Class vs. Term Number

  • Approach:
    • Split the Lexicon into single words (464,781) and multiwords (431,432)
    • LMW candidates come from applying the Acronym Expansion Pattern matcher to the MEDLINE n-gram set
    • Word count class: WC in increments of 100
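The word count classes above can be sketched as a simple binning step. This is only an illustration: the half-open class boundaries ([0, 100), [100, 200), ...), the function names, and the sample terms with their counts are assumptions made for the example, not values from the Lexicon or the MEDLINE n-gram set.

```python
from collections import Counter

WC_CLASS_WIDTH = 100  # class width of 100, as stated in the approach


def wc_class(word_count, width=WC_CLASS_WIDTH):
    """Return the lower bound of the class containing word_count.

    Assumes half-open classes: [0, 100), [100, 200), ...
    """
    return (word_count // width) * width


def terms_per_class(term_counts, width=WC_CLASS_WIDTH):
    """Count how many terms fall into each word count class."""
    return Counter(wc_class(wc, width) for wc in term_counts.values())


# Hypothetical terms and word counts, for illustration only.
sample = {
    "heart attack": 42,
    "blood pressure": 150,
    "in vitro": 199,
    "cell line": 1050,
}

histogram = terms_per_class(sample)
for lower_bound in sorted(histogram):
    print(f"WC {lower_bound}-{lower_bound + WC_CLASS_WIDTH - 1}: {histogram[lower_bound]} term(s)")
```

Plotting the term number against the class lower bound gives the kind of word count class vs. term number view described in this section.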

  • Results (word count class vs. term number):

    • Most valid words are located in the low WC range
    • This is the same result as for "Alice in Wonderland"

III. Word Count Class vs. Precision, Recall, F1

  • Approach:
    • Use only the LMW candidates obtained by applying the Acronym Expansion Pattern matcher to the MEDLINE n-gram set
    • Word count class: WC in increments of 100
    • Local precision = (valid tags / total tags)

      Computed only within the local word count class.

    • Local recall = (valid tags / total valid tags)

      Normalized to the range 0 to 1, with the maximum recall set to 1.
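A minimal sketch of these per-class metrics follows. It assumes that "total valid tags" is the sum of valid tags over all word count classes and that F1 is the usual harmonic mean of local precision and normalized local recall; the page states neither explicitly. The function name and the sample counts are hypothetical.

```python
def local_metrics(classes):
    """Compute (precision, normalized recall, F1) per word count class.

    classes maps a class lower bound to (valid_tags, total_tags).
    """
    # Assumption: total valid tags are summed across all classes.
    total_valid = sum(valid for valid, _ in classes.values())
    raw_recall = {
        c: (valid / total_valid if total_valid else 0.0)
        for c, (valid, _) in classes.items()
    }
    # Normalize so that the class with the maximum recall gets 1.
    max_recall = max(raw_recall.values(), default=0.0)

    metrics = {}
    for c, (valid, total) in classes.items():
        precision = valid / total if total else 0.0
        recall = raw_recall[c] / max_recall if max_recall else 0.0
        denom = precision + recall
        # Assumption: standard F1 (harmonic mean of precision and recall).
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[c] = (precision, recall, f1)
    return metrics


# Hypothetical counts per class: (valid tags, total tags).
sample = {0: (80, 100), 100: (40, 50), 200: (10, 40)}

metrics = local_metrics(sample)
for c in sorted(metrics):
    p, r, f1 = metrics[c]
    print(f"WC class {c}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because recall is normalized by its maximum, the values are comparable across classes but are not absolute recall figures.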

  • Results:

    • The low-frequency range has higher recall and F1 scores, with precision above 0.8.
    • LMW acquisition is focused on the low WC range (100 - 10000)
    • LSW (single word) acquisition is focused on the high WC range