Frequency Analysis
I. Introduction
This page describes the relationship between frequency and valid words (single words and multiwords). Frequencies (word counts) from the MEDLINE n-gram set are used for the frequency analysis. Data from 2015 are used in the example below.
II. Word Count Class vs. Term Number
- Approach:
- Split the Lexicon into single words (464,781) and multiwords (431,432)
- Obtain LMW candidates by applying the Acronym Expansion Pattern matcher to the MEDLINE n-gram set
- Word count class: WC in increments of 100
- Results (word count class vs. term number):
- Most valid words are located in the low WC range
- Same result as "Alice in Wonderland"
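The word count class grouping can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the function names, the class boundaries (0-99, 100-199, ...), and the sample counts are all assumptions made for the example.

```python
from collections import Counter


def wc_class(word_count, step=100):
    """Return the lower bound of the word count class containing word_count.

    Assumes classes of the form [0, 99], [100, 199], ... for step=100.
    """
    return (word_count // step) * step


def terms_per_class(term_counts, step=100):
    """Count how many terms fall into each word count class."""
    histogram = Counter(wc_class(wc, step) for wc in term_counts.values())
    return dict(sorted(histogram.items()))


# Hypothetical term -> word count pairs, for illustration only.
sample = {"term a": 42, "term b": 99, "term c": 150, "term d": 1234}
print(terms_per_class(sample))
```

Tabulating the number of valid terms per class in this way is what allows the low and high WC ranges to be compared.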
III. Word Count Class vs. Precision, Recall, F1
- Approach:
- Results:
- The low-frequency range has higher recall and F1 scores, with precision above 0.8.
- LMW acquisition is set to the low WC range (100 - 10000)
- LSW (single word) acquisition is set to the high WC range
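For reference, precision, recall, and F1 for a given word count class can be computed as in the sketch below. It uses the standard definitions. How candidates are judged as true positives, false positives, or false negatives is not specified on this page, so the counts in the example are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts.

    tp: candidates in the class that are valid terms
    fp: candidates in the class that are not valid terms
    fn: valid terms that were not retrieved as candidates
    Returns 0.0 for any measure whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts for a single word count class.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because F1 is the harmonic mean of precision and recall, a class can have high precision and still score a low F1 when its recall is low.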