Multiwords: Normalization
I. Why Normalization?
The same term can appear in many different forms (differing in genitive, punctuation, and case) in MEDLINE. For example, "diabetes mellitus" appears in the following n-gram terms from MEDLINE:
- diabetes mellitus
- diabetes mellitus,
- diabetes mellitus]
- diabetes mellitus:
- diabetes mellitus.
- [diabetes mellitus
- diabetes mellitus)
- (diabetes mellitus
- (diabetes mellitus,
- diabetes mellitus),
- (diabetes mellitus;
- diabetes mellitus?]
- (diabetes mellitus)
- diabetes mellitus -
- Diabetes mellitus
- Diabetes Mellitus
- DIABETES MELLITUS
- Diabetes mellitus,
- Diabetes mellitus.
- [Diabetes mellitus
- [Diabetes Mellitus:
- [Diabetes mellitus]
- Diabetes Mellitus:
- Diabetes Mellitus,
- DIABETES MELLITUS]
Normalization (abstracting away from genitive, punctuation, and case) is applied to n-gram terms so that these variants can be grouped for further review and analysis. In addition, because the counts of all surface variants are combined, the word count of a normalized n-gram term reflects the true frequency of usage of that term.
II. Normalization
- Normalization uses the following Lexical Tools flow components:
- -f:g (remove genitive)
- -f:o (replace punctuation with space)
- -f:l (lowercase)
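The three steps above can be approximated in a few lines of Python. This is only a sketch of the idea, not the Lexical Tools implementation: the function name normalize is hypothetical, the genitive rule is a simple pattern that may differ from what -f:g does, and collapsing the leftover whitespace at the end is an added assumption.

```python
import re
import string

# Map every ASCII punctuation character to a single space.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def normalize(term: str) -> str:
    """Rough approximation of the flow components g, o, and l (hypothetical helper)."""
    # Step 1 (approximates -f:g): drop the genitive marker 's.
    # A trailing s' needs no special rule here, because the bare
    # apostrophe is replaced by a space in step 2.
    term = re.sub(r"(?<=\w)'s\b", "", term)
    # Step 2 (approximates -f:o): replace punctuation with a space.
    term = term.translate(_PUNCT_TO_SPACE)
    # Step 3 (approximates -f:l): lowercase.
    term = term.lower()
    # Assumption: collapse runs of spaces and trim the ends so that
    # variants such as "(diabetes mellitus," compare equal.
    return " ".join(term.split())


variants = ["diabetes mellitus,", "[Diabetes Mellitus:", "DIABETES MELLITUS]"]
normalized = {normalize(v) for v in variants}
```

With this sketch, all three variants in the example collapse to the single normalized form "diabetes mellitus".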
III. Use of Normalization in Generating (Multi)words from N-grams
We used normalization as follows:
- Use the word count (WC) of normalized terms in the prediction filter to generate high-frequency n-grams.
- Candidate multiwords filtered from MEDLINE n-grams are grouped by normalized term. Both the normalized and the original n-grams are sent to linguists for review.
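The two uses above can be sketched as a grouping step followed by a frequency cut. Everything in this example is illustrative: simple_normalize is a stand-in for the Lexical Tools flows, the counts in sample are made up, and the min_wc threshold is a hypothetical parameter, since the actual prediction filter is not described here.

```python
import re
from collections import defaultdict


def simple_normalize(term: str) -> str:
    """Stand-in normalizer: strip 's, replace punctuation with space, lowercase."""
    term = re.sub(r"(?<=\w)'s\b", "", term)
    term = re.sub(r"[^\w\s]", " ", term)
    return " ".join(term.lower().split())


def group_ngrams(ngram_counts):
    """Group original n-grams by normalized term and sum their word counts."""
    groups = defaultdict(lambda: {"wc": 0, "originals": []})
    for ngram, wc in ngram_counts.items():
        entry = groups[simple_normalize(ngram)]
        entry["wc"] += wc                  # WC of the normalized term
        entry["originals"].append(ngram)   # kept for linguist review
    return dict(groups)


def high_frequency(groups, min_wc):
    """Keep only groups whose summed WC reaches a (hypothetical) threshold."""
    return {term: g for term, g in groups.items() if g["wc"] >= min_wc}


# Made-up counts, for illustration only.
sample = {
    "diabetes mellitus": 5,
    "Diabetes Mellitus,": 2,
    "(diabetes mellitus)": 1,
    "insulin pump.": 1,
}
groups = group_ngrams(sample)
candidates = high_frequency(groups, min_wc=3)
```

Each surviving entry carries both the normalized term (the dictionary key) and the list of original n-grams, which mirrors the pairing sent to linguists for review.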