CC Source Model - Co-occurrence in Corpus (MEDLINE)
I. Introduction
The co-occurrence hypothesis is one of the most popular approaches to antonym identification [1989 Charles & Miller, 1995 Fellbaum, 2015 Tesfaye]. In this Co-occurrence in Corpus (CC) model, we first extended the co-occurrence patterns from previous research [Justeson and Katz, 1991] to identify 10 co-occurrence patterns. These patterns were derived from a collection of 1000 antonyms from the internet domain [Lu, 2021]. The MEDLINE n-gram set [Lu 2015] is used as the corpus. The patterns have the format [X keyword Y], where the keywords are: -and-, -or-, -to-, -versus-, -than-, -vs-, -from-, -nor-, -and/or- and -as well as-. Term pairs that co-occur with high frequency in these patterns in the corpus (MEDLINE n-gram set), are not Lexicon synonyms [Lu 2017], have CUIs, and meet the STI rules are retrieved as aPair candidates, such as [above|below|prep], [accept|reject|verb], [sick|well|adj] and [birth|death|noun]. Both the frequency in MEDLINE (word count) and the frequency across keywords (pattern count) are taken into consideration during this process.
II. Design
Two MEDLINE n-gram files are used for this model:
- 3-gram.2024.30.core: for [X keyword Y], where keywords are: -and-, -or-, -to-, -versus-, -than-, -vs-, -from-, -nor-, -and/or-.
- 5-gram.2024.30.core: for [X as well as Y]
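The pattern check described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the class and method names (`PatternMatcher`, `match`) are hypothetical, and it assumes each n-gram arrives as a whitespace-separated, already normalized string.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

/** Sketch: split an n-gram into [X keyword Y] and test the keyword (hypothetical names). */
class PatternMatcher {
    // Single-token keywords, expected in the middle of a 3-gram.
    static final Set<String> KEYWORDS_3GRAM = new HashSet<>(Arrays.asList(
        "and", "or", "to", "versus", "than", "vs", "from", "nor", "and/or"));

    // Multi-token keyword, expected in the middle of a 5-gram.
    static final String KEYWORD_5GRAM = "as well as";

    /** Returns {X, keyword, Y} if the n-gram fits one of the patterns, else empty. */
    static Optional<String[]> match(String ngram) {
        String[] t = ngram.trim().split("\\s+");
        if (t.length == 3 && KEYWORDS_3GRAM.contains(t[1])) {
            return Optional.of(new String[] {t[0], t[1], t[2]});
        }
        if (t.length == 5) {
            String middle = t[1] + " " + t[2] + " " + t[3];
            if (middle.equals(KEYWORD_5GRAM)) {
                return Optional.of(new String[] {t[0], middle, t[4]});
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String[] samples = {
            "normal and abnormal", "internal as well as external", "normal in abnormal"};
        for (String s : samples) {
            System.out.println(s + " -> " + match(s).map(Arrays::toString).orElse("no match"));
        }
    }
}
```

Keeping the 3-gram and 5-gram keywords separate mirrors the two input files: a single-token keyword can only sit in the middle of a 3-gram, while "as well as" needs the three middle positions of a 5-gram.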
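Derived pattern details (please see the design documents for more):
Ant-1 | Ant-2 | Co-occurrence Examples (frequency|n-gram)
---|---|---
normal | abnormal | 11160|normal and abnormal
normal | abnormal | 2387|normal nor abnormal
normal | abnormal | 1917|normal or abnormal
normal | abnormal | 463|abnormal and normal
normal | abnormal | 385|normal from abnormal
normal | abnormal | 243|normal versus abnormal
normal | abnormal | 159|normal to abnormal
normal | abnormal | 125|abnormal or normal
normal | abnormal | 69|abnormal as well as normal
external | internal | 15160|internal and external
external | internal | 6836|external and internal
external | internal | 1667|internal or external
external | internal | 898|external or internal
external | internal | 184|internal versus external
external | internal | 164|internal as well as external
external | internal | 124|internal to external
external | internal | 122|internal, and external
external | internal | 116|internal and/or external
external | internal | 114|external to internal
... | ... | ...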
From the table above, we observed:
- Most of these aPairs fall into the collocation pattern [Ant-1 keyword Ant-2]. The keyword sits in the middle of the n-gram and includes “and”, “or”, “versus”, “to”, etc.
- Some aPairs, such as calm|excited and buyer|seller, do not co-occur in the MEDLINE n-grams. The likely explanations are:
  - The MEDLINE n-gram set does not cover these aPairs. In such cases, we suggest applying this co-occurrence model to another corpus to find collocation patterns.
  - These aPairs cannot be derived by the collocation model. In such cases, we suggest further research focused on semantics. These types of aPairs are categorized with the source [SN] (semantic in corpus).
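The two frequency measures used by this model, word count and pattern count, can be illustrated with a small aggregation sketch. It assumes input lines in the `frequency|n-gram` form shown in the table above and assumes they have already passed the keyword check. The class name `PairStats` is hypothetical, and reading "pattern count" as the number of distinct keywords seen for a pair is an interpretation, not something the design documents confirm.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Sketch: per-pair word count and pattern count (hypothetical names). */
class PairStats {
    long wordCount = 0;                            // summed n-gram frequency
    final Set<String> keywords = new HashSet<>();  // distinct keywords seen for the pair

    /** Aggregates "frequency|X keyword Y" lines by unordered pair, keyed as "a|b" with a <= b. */
    static Map<String, PairStats> aggregate(String[] lines) {
        Map<String, PairStats> stats = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split("\\|", 2);
            if (fields.length < 2) continue;       // skip lines without a frequency field
            String[] t = fields[1].trim().split("\\s+");
            if (t.length < 3) continue;            // need at least X, keyword, Y
            String x = t[0];
            String y = t[t.length - 1];
            String keyword = String.join(" ", Arrays.copyOfRange(t, 1, t.length - 1));
            // Order-independent key so "normal and abnormal" and "abnormal and normal" merge.
            String key = x.compareTo(y) <= 0 ? x + "|" + y : y + "|" + x;
            PairStats s = stats.computeIfAbsent(key, k -> new PairStats());
            s.wordCount += Long.parseLong(fields[0].trim());
            s.keywords.add(keyword);
        }
        return stats;
    }

    public static void main(String[] args) {
        String[] lines = {
            "11160|normal and abnormal", "463|abnormal and normal", "1917|normal or abnormal"};
        for (Map.Entry<String, PairStats> e : aggregate(lines).entrySet()) {
            System.out.println(e.getKey() + " words=" + e.getValue().wordCount
                + " patterns=" + e.getValue().keywords.size());
        }
    }
}
```

Merging both word orders under one key reflects that the table lists "normal and abnormal" and "abnormal and normal" as evidence for the same aPair.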
III. Implementation
The Java source code is in the Medline directory:
- GetAntCandFrom3GramPatMid.java
- GetAntCandFrom5GramPatMid.java
Algorithm:
- Go through all n-grams (N = 3 or 5) and take the normalized (core-term) first and last grams as antonym candidates. The middle word(s) are treated as the keyword.
- Check whether the middle word(s) match one of the keywords.
- Check whether the normalized first and last grams meet the criteria for antonyms:
  - have EUIs (are in the Lexicon)
  - are single words
  - have the same POS
  - are not invalid words for antonyms in the CC model, such as "the", "a", "which", "not", etc.
  - are not synonyms
  - have CUIs
  - have STIs, either the same STI or a legal STI pair
Legal STI pairs were derived from tagged aPair candidates: an STI pair is legal if it occurs 10 or more times among canonical aPairs. The report is in the file: ${ANTONYM}/${YEAR}/output/Analysis/antCand.data.tag.cuiSti.rpt.
STI-1 | STI-2 | Frequency
---|---|---
T033|Finding | T080|Qualitative Concept | 38
T033|Finding | T121|Pharmacologic Substance | 10
T033|Finding | T169|Functional Concept | 19
T033|Finding | T170|Intellectual Product | 11
T033|Finding | T184|Sign or Symptom | 15
T078|Idea or Concept | T080|Qualitative Concept | 10
T080|Qualitative Concept | T081|Quantitative Concept | 13
T080|Qualitative Concept | T082|Spatial Concept | 10
T080|Qualitative Concept | T121|Pharmacologic Substance | 10
T080|Qualitative Concept | T169|Functional Concept | 37
T121|Pharmacologic Substance | T169|Functional Concept | 10
- Convert aPair candidates to base form (citation form).
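The candidate checks in the algorithm above can be sketched as a single filter. This is an illustration under stated assumptions, not the code in GetAntCandFrom3GramPatMid.java: `CandidateFilter` and `TermInfo` are hypothetical names, the Lexicon and CUI/STI lookups are replaced by pre-filled fields, legal STI pairs are assumed to be order-independent, and only a few pairs from the table are included. Base-form conversion is left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Sketch of the aPair candidate checks (hypothetical names and data model). */
class CandidateFilter {
    /** Hypothetical lookup result for one normalized term. */
    static class TermInfo {
        final String term;
        final String pos;
        final boolean hasEui;   // term is in the Lexicon
        final boolean hasCui;   // term maps to at least one concept
        final Set<String> stis; // STIs of the term's concepts

        TermInfo(String term, String pos, boolean hasEui, boolean hasCui, String... stis) {
            this.term = term;
            this.pos = pos;
            this.hasEui = hasEui;
            this.hasCui = hasCui;
            this.stis = new HashSet<>(Arrays.asList(stis));
        }
    }

    // Words listed in the text as invalid for antonyms in the CC model.
    static final Set<String> INVALID_WORDS =
        new HashSet<>(Arrays.asList("the", "a", "which", "not"));

    // A subset of the legal STI pairs from the table, stored as "lower|higher".
    static final Set<String> LEGAL_STI_PAIRS = new HashSet<>(Arrays.asList(
        "T033|T080", "T033|T169", "T080|T081", "T080|T169"));

    /** True if the two terms share an STI or have a legal STI pair (either order). */
    static boolean stiCompatible(Set<String> a, Set<String> b) {
        for (String s1 : a) {
            for (String s2 : b) {
                if (s1.equals(s2)) return true;
                String key = s1.compareTo(s2) < 0 ? s1 + "|" + s2 : s2 + "|" + s1;
                if (LEGAL_STI_PAIRS.contains(key)) return true;
            }
        }
        return false;
    }

    /** Applies the checks in the order listed; synonymPairs holds "x|y" strings. */
    static boolean isCandidate(TermInfo x, TermInfo y, Set<String> synonymPairs) {
        return x.hasEui && y.hasEui
            && !x.term.contains(" ") && !y.term.contains(" ")
            && x.pos.equals(y.pos)
            && !INVALID_WORDS.contains(x.term) && !INVALID_WORDS.contains(y.term)
            && !synonymPairs.contains(x.term + "|" + y.term)
            && !synonymPairs.contains(y.term + "|" + x.term)
            && x.hasCui && y.hasCui
            && stiCompatible(x.stis, y.stis);
    }

    public static void main(String[] args) {
        // STI assignments here are illustrative only, not looked up in any source.
        TermInfo normal = new TermInfo("normal", "adj", true, true, "T080");
        TermInfo abnormal = new TermInfo("abnormal", "adj", true, true, "T033");
        System.out.println(isCandidate(normal, abnormal, new HashSet<>()));
    }
}
```

Ordering the cheap string checks before the STI comparison lets the filter reject most n-grams, such as those starting with "the", without touching the concept data.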
IV. References
- Walter G. Charles, George A. Miller, Contexts of antonymous adjectives, Applied Psycholinguistics (1989) 10, 357-375
- Christiane Fellbaum, Co-Occurrence and Antonymy, International Journal of Lexicography, Vol 8 no 4, 1995 Oxford University Press, 281-303
- Debela Tesfaye, Carita Paradis, On the use of antonyms and synonyms from a domain perspective, Proceedings of the NetWordS Final Conference, Pisa, March 30-April 1, 2015, 150-154
- John S. Justeson, Slava M. Katz, Co-occurrences of Antonymous Adjectives and Their Contexts, Computational Linguistics, Vol 17, No 1, Association for Computational Linguistics, 1991, 1-19