Synonym Norm Development
I. Requirements
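Use normalization to aggressively map a term to its synonyms by abstracting away from the following differences (each is prefixed with the lvg flow that handles it):
- g: genitives
- rs: parenthetical plural forms (s), (es), (ies)
- o: punctuation
- l: case
- Ct: spelling variants and inflectional variants
The normalization also:
- removes duplicated spaces
- trims leading and trailing spaces
- removes duplicated results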
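The requirements above can be sketched in a few lines of Python. This is only an illustration of the intent, not LVG's implementation: the function names and regular expressions are assumptions, and the Ct step is left out because it needs a lexicon.

```python
import re

def normalize(term: str) -> str:
    """Toy normalization: g, rs, o, l, then whitespace cleanup (no Ct)."""
    term = re.sub(r"'s\b", "", term)               # g: Addison's -> Addison
    term = re.sub(r"s'(?=\s|$)", "s", term)        # g: nurses' -> nurses
    term = re.sub(r"\((?:s|es|ies)\)", "", term)   # rs: finding(s) -> finding
    term = re.sub(r"[^\w\s]", " ", term)           # o: punctuation -> space
    term = term.lower()                            # l: lowercase
    return " ".join(term.split())                  # collapse spaces and trim

def normalize_all(terms):
    """Normalize each term and drop duplicated results, keeping first-seen order."""
    return list(dict.fromkeys(normalize(t) for t in terms))

if __name__ == "__main__":
    print(normalize("CLOTTING FACTOR DEFICIENCY, CONGENITAL"))
    print(normalize_all(["Finding(s)", "finding", " FINDING "]))
```

The steps run in the same order as the flow string (g, rs, o, l), so the genitive is handled before the apostrophe would otherwise be replaced as punctuation.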
II. Developments
- Approach 1 (Ct on the input term):
  - Use lvg -f:g:rs:o:l:Ct
  - Ct gets the citation form of the input term as a whole
  - Fast performance
  - Lower coverage rate (98% of that of the methods below)
  - Example 1:

    | ID | Term | Norm term | Synonym substitutions | CUI |
    |---|---|---|---|---|
    | KP102818 | CLOTTING FACTOR DEFICIENCY, CONGENITAL | | not found | |
- Approach 2 (Ct on every word of the input term):
  - Use lvg -f:g:rs:o:l:Ct
  - Customize Ct to get the citation form of every word of the input term
  - Produces more mutations and results, which means slower performance but a higher coverage rate
  - Example 1:

    | ID | Term | Norm term | Synonym substitutions | CUI |
    |---|---|---|---|---|
    | KP102818 | CLOTTING FACTOR DEFICIENCY, CONGENITAL | clot factor deficiency congenital | coagulation factor deficiency hereditary; ... | C0272316 |
  - However, it still misses some mappings when the citation form contains punctuation. For example, "carcino-embryonic" is the citation form of "carcinoembryonic"
  - Example 2:

    | ID | Term | Norm term | Synonym substitutions | CUI |
    |---|---|---|---|---|
    | KP194142 | Elevated carcinoembryonic antigen | elevate carcino-embryonic antigen | increase carcino-embryonic antigen; increased carcino-embryonic antigen; high carcino-embryonic antigen; ... | C0549371 |
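The difference between Approach 1 and Approach 2 can be illustrated with a toy lexicon. The sketch below is not LVG code: the `CITATION` dictionary and both function names are hypothetical, and the entries are taken from the examples in this section.

```python
# Hypothetical toy lexicon: inflected or variant form -> citation form.
CITATION = {
    "clotting": "clot",
    "elevated": "elevate",
    "carcinoembryonic": "carcino-embryonic",
}

def citation_whole_term(term: str) -> str:
    """Approach 1 style: look up the whole (already lowercased) term at once."""
    return CITATION.get(term, term)

def citation_per_word(term: str) -> str:
    """Approach 2 style: look up every word of the term separately."""
    return " ".join(CITATION.get(word, word) for word in term.split())

if __name__ == "__main__":
    term = "clotting factor deficiency congenital"
    print(citation_whole_term(term))  # unchanged: no entry for the full term
    print(citation_per_word(term))    # "clotting" is replaced by "clot"
```

Because the per-word lookup rewrites each word independently, it generates more candidate strings to match, which is consistent with the slower performance and higher coverage noted above.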
- Approach 3 (move Ct before removing punctuation):
  - Use lvg -f:g:rs:Ct:l:o
  - Example 2:

    | ID | Term | Norm term | Synonym substitutions | CUI |
    |---|---|---|---|---|
    | KP194142 | Elevated carcinoembryonic antigen | elevate carcino embryonic antigen | increase carcino embryonic antigen; increased carcino embryonic antigen; high carcino embryonic antigen; ... | C0549371 |
    | | | | | C0742014 |
  - Add genitive removal after Ct:
    - E0000135|Addison's disease|Addisons disease
  - No records have a citation form (Ct) containing (s), (es), or (ies), so -f:rs is not needed
  - Use a database for CUI mapping to improve performance
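The effect of moving Ct ahead of punctuation removal can be sketched by composing the steps in two different orders. This is a simplified stand-in, not LVG itself: the step functions, the `CITATION` lexicon, and `run_flow` are assumptions for illustration, and the g and rs steps are omitted because they do not affect this example.

```python
import re

# Hypothetical toy lexicon; entries mirror Example 2 above.
CITATION = {"elevated": "elevate", "carcinoembryonic": "carcino-embryonic"}

def ct(term: str) -> str:
    """Per-word citation lookup (case-insensitive on the lookup key)."""
    return " ".join(CITATION.get(w.lower(), w) for w in term.split())

def l(term: str) -> str:
    """Lowercase."""
    return term.lower()

def o(term: str) -> str:
    """Replace punctuation with spaces, then collapse repeated spaces."""
    return " ".join(re.sub(r"[^\w\s]", " ", term).split())

def run_flow(term: str, steps) -> str:
    """Apply the steps left to right, like a colon-separated flow string."""
    for step in steps:
        term = step(term)
    return term

APPROACH_2 = [o, l, ct]  # ...:o:l:Ct  (punctuation stripped before Ct)
APPROACH_3 = [ct, l, o]  # ...:Ct:l:o  (punctuation stripped after Ct)

if __name__ == "__main__":
    term = "Elevated carcinoembryonic antigen"
    print(run_flow(term, APPROACH_2))  # hyphen introduced by Ct survives
    print(run_flow(term, APPROACH_3))  # hyphen is replaced by a space
```

With punctuation removal last, any hyphen introduced by the citation form is normalized the same way as punctuation in the original term, so both spellings end up with the same key.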
III. Comparisons
| | Approach 1 (Ct on term) | Approach 2 (CuiMap) | Approach 3 (Smt) |
|---|---|---|---|
| Performance | Fast | Slow | Fast |
| Coverage-KP (26890 terms) | | | |
| CUI with Norm | 12165 - 45.24% | 12165 - 45.24% | 12165 - 45.24% |
| CUI with 1 synonym | 1673 - 6.22% | 1692 - 6.29% | 1692 - 6.29% |
| CUI with 2 synonyms | 168 - 0.62% | 174 - 0.65% | 174 - 0.65% |
| No CUI found | 12884 - 47.91% | 12859 - 47.82% | 12859 - 47.82% |
| Total term-CUIs found | 31643 | 31660 | 31661 |
| Coverage-VA (21221 terms) | | | |
| CUI with Norm | 16937 - 79.81% | 16937 - 79.81% | 16937 - 79.81% |
| CUI with 1 synonym | 221 - 1.04% | 228 - 1.07% | 228 - 1.07% |
| CUI with 2 synonyms | 12 - 0.06% | 15 - 0.07% | 15 - 0.07% |
| No CUI found | 4051 - 19.09% | 4041 - 19.04% | 4041 - 19.04% |
| Total term-CUIs found | 27478 | 27498 | 27498 |
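The percentages in the comparison are each category's count divided by the number of terms in the test set. A small check, using the Approach 1 counts for the KP set (the `coverage` helper and the dictionary keys are illustrative names, not part of the tools):

```python
def coverage(counts: dict, total: int) -> dict:
    """Return each category's share of the test set as a percentage string."""
    # The four categories partition the term set, so the counts must sum to total.
    assert sum(counts.values()) == total, "categories must partition the term set"
    return {label: f"{100 * n / total:.2f}%" for label, n in counts.items()}

KP_APPROACH_1 = {
    "norm": 12165,        # CUI found with the normalized term alone
    "1 synonym": 1673,    # CUI found after one synonym substitution
    "2 synonyms": 168,    # CUI found after two synonym substitutions
    "not found": 12884,   # no CUI found
}

if __name__ == "__main__":
    for label, pct in coverage(KP_APPROACH_1, 26890).items():
        print(f"{label}: {pct}")
```

"Total term-CUIs found" is not a share of the term set and is therefore not covered by this calculation.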
IV. Notes
In practice, we normalize only the key of a synonym pair; the synonym itself is stored as is. This can make lookups non-symmetric. For example:
synonym pair: impaired|abnormality
is stored as follows in the database table:
| Normalized key | Synonym |
|---|---|
| impair | abnormality |
| abnormality | impaired |
The mapping results in a non-symmetric lookup (input -> normalized key -> synonym):
- abnormality -> abnormality -> impaired
- impair -> impair -> abnormality (not symmetric: abnormality maps to impaired, not back to impair)
- impaired -> impair -> abnormality
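The asymmetry can be reproduced with a small in-memory table. This is a sketch under stated assumptions: `NORM` is a hypothetical stand-in for the real normalizer, and `build_table` and `lookup` are illustrative names rather than the actual database code.

```python
# Hypothetical stand-in for the normalizer: only "impaired" changes.
NORM = {"impaired": "impair"}

def norm(word: str) -> str:
    return NORM.get(word, word)

def build_table(pairs):
    """Store each pair in both directions, normalizing only the key."""
    table = {}
    for a, b in pairs:
        table.setdefault(norm(a), set()).add(b)  # key normalized, synonym as is
        table.setdefault(norm(b), set()).add(a)
    return table

def lookup(table, word: str):
    """Normalize the input, then fetch its stored synonyms."""
    return table.get(norm(word), set())

if __name__ == "__main__":
    table = build_table([("impaired", "abnormality")])
    for word in ("abnormality", "impair", "impaired"):
        print(word, "->", norm(word), "->", sorted(lookup(table, word)))
```

Here "impair" reaches "abnormality", but "abnormality" returns "impaired" rather than "impair", which is the non-symmetric behavior described above.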