The SPECIALIST Lexicon

Antonym - Processes for Antonym Generation

I. Directory & Set Up

  • base directory: ${ANTONYM_DIR}
  • binary scripts: ./bin
  • data: ./data
    • 0.Antonym
    • 1.Lexicon
    • 2.SuffixD
    • 3.PrefixD
    • 4.TtSet
    • 5.Medline
    • 6.WordNet

The same aPair might come from more than one source, for example:

absolute|E0006593|relative|E0052609|adj|Y|B|O|quality|CC
absolute|E0006593|relative|E0052609|adj|Y|B|O|quality|SN
adjusted|E0007411|unadjusted|E0312692|adj|Y|B|O|quality|PD
adjusted|E0007411|unadjusted|E0312692|adj|Y|B|O|quality|SN
Thus, a heuristic rule uses the following source priority: LEX > SD > PD > CC > SN
Accordingly, the processes below must be run in the order given.
Also, directories and input/output files require manual operations.
shell> cd ${ANTONYM_DIR}/bin
shell> GetAntonyms ${YEAR}
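
The source-priority rule above can be illustrated with a minimal Python sketch. It is not the project's Java implementation. It assumes each record is a 10-field, pipe-delimited line whose last field is the source, and that two records describe the same aPair when their first nine fields match. The function name `pick_by_source_priority` is hypothetical.

```python
# Source priority taken from the rule above: LEX > SD > PD > CC > SN.
PRIORITY = {"LEX": 0, "SD": 1, "PD": 2, "CC": 3, "SN": 4}


def pick_by_source_priority(records):
    """Keep one record per aPair, preferring the highest-priority source.

    Assumption: the last of the 10 pipe-delimited fields is the source,
    and the first nine fields identify the aPair and its tags.
    """
    best = {}
    for line in records:
        fields = line.strip().split("|")
        key = tuple(fields[:9])  # everything except the source field
        src = fields[9]
        if key not in best or PRIORITY[src] < PRIORITY[best[key]]:
            best[key] = src
    return ["|".join(key + (src,)) for key, src in best.items()]


records = [
    "absolute|E0006593|relative|E0052609|adj|Y|B|O|quality|CC",
    "absolute|E0006593|relative|E0052609|adj|Y|B|O|quality|SN",
    "adjusted|E0007411|unadjusted|E0312692|adj|Y|B|O|quality|PD",
    "adjusted|E0007411|unadjusted|E0312692|adj|Y|B|O|quality|SN",
]
result = pick_by_source_priority(records)
for line in result:
    print(line)
```

With the four example records, the CC record is kept for absolute/relative and the PD record for adjusted/unadjusted.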

II. Processes

  • Antonyms from the Lexicon
    Use the latest Lexicon
    Option and Description | Input | Output | Notes
    10
    • generate antonym candidates from the Lexicon (with negative tags)
    • Lexicon.GenAntCandFromLexicon.java
    • 1.Lexicon/${YEAR}/input/LEXICON
    • 0.Antonym/${YEAR}/input/antCand.data.tag
    • 0.Antonym/${YEAR}/input/domain.data
    • ./output/antCandLexicon.data
    • ./output/Cand/antCandLexicon.data.tag
    • ./output/Cand/antCandLexicon.data.tbd
    • ./output/candTagged/antCandLexicon.data.tag.tagged
    • mkdir needed directories (Cand, candTagged)
    • copy needed files
    • send antCandLexicon.data.tbd to the linguist to tag, add the results to antCand.data.tag, and rerun until tbd = 0
    • copy antCandLexicon.data.tag.tagged to antCandLexicon.data.tag.tagged.${YEAR}
      These are the antonyms from the LEX source
    11
    • Validate and fix tags of antonym candidates (LEX)
    • Antonym.ValidateTaggedCand.java
    • ./output/candTagged/antCandLexicon.data.tag.tagged
    • 0.Antonym/${YEAR}/input/domain.data
    • ./output/candTagged/antCandLexicon.data.tag.fixed
    • Make sure the tag and fixed files are the same
    12
    • update release antonyms tagged file from LEX
    • automatically assign type to [NA] and domain to [DOMAIN_NONE] if Canon is [N]
    • check for new domains
    • Antonym.UpdateAllTaggedFile.java
    • ./output/candTagged/antCandLexicon.data.tag.tagged.${YEAR}
    • ${ANT_DIR}/input/antCand.data.tag.${YEAR}
    • ${ANT_DIR}/input/domain.data
    • ${ANT_DIR}/input/antCand.data.tag.updated
    • This step automatically updates the tag file of all antonym candidates
    • The output file is used to generate antonym and negation files for the release.
    • Manually copy antCand.data.tag.updated to antCand.data.tag.updated.1.LEX
      shell> cp -rp antCand.data.tag.updated antCand.data.tag.updated.1.LEX

  • Antonyms from SuffixD
    Use the latest SuffixD (derivation.data.${YEAR}) and inflVars.data
    Option and Description | Input | Output | Notes
    20
    • Get antonym candidates from SuffixD
    • Derivation.GetAntCandFromSuffixD.java
    • ${SD_DIR}/input/derivation.data
    • ${LEX_DIR}/input/inflVars.data
    • ${ANT_DIR}/input/antCand.data.tag.${YEAR}
    • ${ANT_DIR}/input/domain.data
    • ./output/Cand/antCandSuffixD.data
    • ./output/Cand/antCandSuffixD.data.tag
      => aPairs already tagged
    • ${SD_DIR}/output/Cand/antCandSuffixD.data.tbd
      => aPairs to be tagged; send to linguists; the count must reach 0
    • ./output/candTagged/antCandSuffixD.data.tag.tagged
    • If the first time:
      • mkdir ./${YEAR}/output/Cand
      • mkdir ./${YEAR}/output/candTagged
    • Use updated derivation.data and inflVars.data
    • Send antCandSuffixD.data.tbd to the linguist to complete the tags
    • Complete Steps 21-22, then re-run this step until TBD = 0
    21
    • Validate and fix tags of antonym candidates (SD)
    • Antonym.ValidateTaggedCand.java
    • ./output/candTagged/antCandSuffixD.data.tag.tagged
    • ${ANT_DIR}/input/domain.data
    • ./output/candTagged/antCandSuffixD.data.tag.fixed
    • Append linguist's tags to ${SD_DIR}/output/candTagged/antCandSuffixD.data.tag.tagged
    • Run this step until the tag and fixed files are the same
      • The fixed file applies the automatic fixes of [TYPE_TBD] to [NA] and [DOMAIN_TBD] to [DOMAIN_NONE].
      • The fixed file is sorted alphabetically.
      • Manually copy the fixed file to the tagged file
    • Manually copy antCandSuffixD.data.tag.tagged to antCandSuffixD.data.tag.tagged.${YEAR}
    • Rerun this step after Step 20 has TBD = 0, so that the fixed file is sorted alphabetically
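
The validate-and-fix loop of this step can be sketched in Python as follows. This is only an illustration of the documented behavior (auto-fix of [TYPE_TBD] and [DOMAIN_TBD], alphabetical sort, repeat until the tag and fixed files match), not the code of Antonym.ValidateTaggedCand.java. It assumes the type tag is the 7th field and the domain tag the 9th field of a 10-field record. The records are made up for illustration and `fix_tags` is a hypothetical name.

```python
def fix_tags(lines):
    """Replace TBD type/domain tags with their defaults and sort the records."""
    fixed = []
    for line in lines:
        fields = line.strip().split("|")
        if fields[6] == "TYPE_TBD":      # assumed position of the type tag
            fields[6] = "NA"
        if fields[8] == "DOMAIN_TBD":    # assumed position of the domain tag
            fields[8] = "DOMAIN_NONE"
        fixed.append("|".join(fields))
    return sorted(fixed)


# Illustrative tagged file content (not real release data).
tagged = [
    "adjusted|E0007411|unadjusted|E0312692|adj|N|TYPE_TBD|O|DOMAIN_TBD|SD",
    "absolute|E0006593|relative|E0052609|adj|Y|B|O|quality|SD",
]

passes = 0
while True:
    fixed = fix_tags(tagged)
    passes += 1
    if fixed == tagged:
        break            # tag and fixed files are the same: done
    tagged = fixed       # corresponds to copying the fixed file to the tagged file

print(passes)
print("\n".join(tagged))
```

In this example the first pass changes the file (tags fixed, records sorted) and the second pass confirms that the tagged and fixed content are identical.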
    22
    • Update release antonyms tagged file from SD
    • Antonym.UpdateAllTaggedFile.java
    • ./output/candTagged/antCandSuffixD.data.tag.tagged.${YEAR}
    • ${ANT_DIR}/input/antCand.data.tag.${YEAR}
    • ${ANT_DIR}/input/domain.data
    • ${ANT_DIR}/input/antCand.data.tag.updated
    • This step automatically updates the tag file of all antonym candidates
    • Manually copy antCand.data.tag.updated to antCand.data.tag.updated.2.SD
    • Manually copy antCand.data.tag.updated to antCand.data.tag.${YEAR}
    • The output file is used to generate antonym and negation files for the release.
    • Re-run steps 20-22 until all steps pass:
      • tag conflict no = 0
      • source conflict no = 0
      • duplicate tag = 0

  • Antonyms from the PrefixD
    Use the latest PrefixD
    antCand from the previous PrefixD are not completed yet (~7,320 for the 2024 release)
    Option and Description | Input | Output | Notes
    30
    • Get antonym candidates from PrefixD
    • Derivation.GetAntCandFromPrefixD.java
    • ${PD_DIR}/input/derivation.data
    • ${LEX_DIR}/input/inflVars.data
    • ${ANT_DIR}/input/antCand.data.tag.${YEAR}
    • ${ANT_DIR}/input/domain.data
    • ./output/Cand/antCandPrefixD.data
    • ./output/Cand/antCandPrefixD.data.tag
      => aPairs already tagged
    • ./output/Cand/antCandPrefixD.data.tbd
      => aPairs to be done; the count must reach 0
    • ./output/candTagged/antCandPrefixD.data.tag.tagged
    • If the first time:
      • mkdir ./${YEAR}/output/Cand
      • mkdir ./${YEAR}/output/candTagged
    • Use updated derivation.data and inflVars.data
    • Send antCandPrefixD.data.tbd to the linguist to complete the tags
      As of the 2024 release, 7K+ TBD aPairs still need to be tagged. Once this backlog is completely tagged, the number in each annual release is expected to be much smaller (only the annual growth of the PrefixD).
    31
    • Validate and fix tags of antonym candidates (PD)
    • Antonym.ValidateTaggedCand.java
    • ./output/candTagged/antCandPrefixD.data.tag.tagged
    • ${ANT_DIR}/input/domain.data
    • ./output/candTagged/antCandPrefixD.data.tag.fixed
    • Append the linguist's tags to ${PD_DIR}/output/candTagged/antCandPrefixD.data.tag.tagged
    • Run this step until the tag and fixed files are the same
      • The fixed file applies the automatic fixes of [TYPE_TBD] to [NA] and [DOMAIN_TBD] to [DOMAIN_NONE].
      • shell> sort -u antCandPrefixD.data.tag.fixed > antCandPrefixD.data.tag.fixed.uSort
      • Manually copy the sorted fixed file to the tagged file
    • Manually copy antCandPrefixD.data.tag.tagged to antCandPrefixD.data.tag.tagged.${YEAR}.${NO}
    32
    • Update release antonyms tagged file from PD
    • Antonym.UpdateAllTaggedFile.java
    • ./output/candTagged/antCandPrefixD.data.tag.tagged.${YEAR}
    • ${ANT_DIR}/input/antCand.data.tag.${YEAR}
    • ${ANT_DIR}/input/domain.data
    • ${ANT_DIR}/input/antCand.data.tag.updated

    • ${ANT_DIR}/input/antCand.data.tag.updated.srcConflict
    • ${ANT_DIR}/input/antCand.data.tag.updated.tagConflict
    • This step automatically updates the tag file of all antonym candidates
    • Manually copy antCand.data.tag.updated to antCand.data.tag.updated.3.PD
    • Manually copy/link antCand.data.tag.updated to antCand.data.tag.${YEAR}
    • The source (src) may conflict when the same aPair comes from different sources, for example:
      • activate|E0007090|deactivate|E0417566|verb|Y|UB|BN2|quality|SN
      • activate|E0007090|deactivate|E0417566|verb|Y|UB|BN2|quality|PD
    • The output file is used to generate antonym and negation files for the release.
    • Re-run steps 30-32 until all steps pass.

  • Antonyms from the Training and Test set
    Option and Description | Input | Output | Notes
    40
    • Collect antonyms in the training and test set and retag their source from [TT] to [CC|SN]
    • TtSet.CollectAntonyms.java
    • TtSet.RetagSrcOnAntRaw.java
    • ${TT_DIR}/input/antonymSource.data (use 2021)

    • ${ML_DIR}/input/3-gram.${YEAR}.30.core (previous_year)
      Use shell> 06.NGramUtil ${PREV_YEAR}, option 3.
    • ./output/PreCand/antonymTtSet.data.TT

    • ./output/PreCand/antonymTtSet.data
    • If this is the first run:
      • shell> mkdir ./output/PreCand
      • link ${ML_DIR}/input/3-gram.${YEAR}.30.core
        => first run option 3 of ${LMW}/bin/06.NGramUtil ${PREV_YEAR}
    • Retag [TT] to sources of [LEX|SD|PD|CC|SN]
    41
    • Not needed for the release.
    42
    • Get antonym candidates from TtSet Collections
    • TtSet.GenAntCandFromTtSet.java
    • ${TT_DIR}/output/antonymTtSet.data
    • ${ANT_DIR}/input/antCand.data.tag.${YEAR}
    • ${LEX_DIR}/input/inflVars.data
    • ${ANT_DIR}/input/domain.data
    • ./output/Cand/antCandTtSet.data
    • ./output/Cand/antCandTtSet.data.tbd
    • ./output/Cand/antCandTtSet.data.tag
    • ./output/candTagged/antCandTtSet.data.tag.tagged
    • The TBD file should be 0 (or the same number as the following known exceptions):
      post|E0049060|pre|EUI_TBD|noun|CANON_TBD|TYPE_TBD|NEG_TBD|DOMAIN_TBD|CC
      post|E0049061|pre|EUI_TBD|verb|CANON_TBD|TYPE_TBD|NEG_TBD|DOMAIN_TBD|CC
      Convert these to:
      post|E0049060|pre|EUI_NONE|noun|N|NA|O|DOMAIN_NONE|CC
      post|E0049061|pre|EUI_NONE|verb|N|NA|O|DOMAIN_NONE|CC
    • Send the TBD file (other than the 2 records above) to the linguists to tag
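
The conversion of the two known exceptions can be expressed as a small Python sketch. The placeholder-to-value mapping is read off the two records shown above. Applying it to any other TBD record would be an assumption, since other records are meant to be tagged by linguists. `TBD_DEFAULTS` and `convert_exception` are hypothetical names.

```python
# Mapping inferred from the two converted records in this step.
TBD_DEFAULTS = {
    "EUI_TBD": "EUI_NONE",
    "CANON_TBD": "N",
    "TYPE_TBD": "NA",
    "NEG_TBD": "O",
    "DOMAIN_TBD": "DOMAIN_NONE",
}


def convert_exception(line):
    """Replace every TBD placeholder in a known-exception record."""
    return "|".join(TBD_DEFAULTS.get(field, field) for field in line.split("|"))


tbd = [
    "post|E0049060|pre|EUI_TBD|noun|CANON_TBD|TYPE_TBD|NEG_TBD|DOMAIN_TBD|CC",
    "post|E0049061|pre|EUI_TBD|verb|CANON_TBD|TYPE_TBD|NEG_TBD|DOMAIN_TBD|CC",
]
converted = [convert_exception(line) for line in tbd]
for line in converted:
    print(line)
```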
    43
    • Validate and fix tags of antonym candidates (TT)
    • Antonym.ValidateTaggedCand.java
    • ./output/candTagged/antCandTtSet.data.tag.tagged
    • ${ANT_DIR}/input/domain.data
    • ./output/candTagged/antCandTtSet.data.tag.fixed
    • Append tagged candidates to antCandTtSet.data.tag.tagged
      post|E0049060|pre|EUI_NONE|noun|N|NA|O|DOMAIN_NONE|CC
      post|E0049061|pre|EUI_NONE|verb|N|NA|O|DOMAIN_NONE|CC
    • Run this step until the tag and fixed files are the same
      • The fixed file applies the automatic fixes of [TYPE_TBD] to [NA] and [DOMAIN_TBD] to [DOMAIN_NONE].
      • Manually fix the known exceptions (2).
      • Manually copy the fixed file to the tagged file
    • Manually copy antCandTtSet.data.tag.tagged to antCandTtSet.data.tag.tagged.${YEAR}
    44
    • Update release antonyms tagged file from TT
    • Antonym.UpdateAllTaggedFile.java
    • ./output/candTagged/antCandTtSet.data.tag.tagged.${YEAR}
    • ${ANT_DIR}/input/antCand.data.tag.${YEAR}
    • ${ANT_DIR}/input/domain.data
    • ${ANT_DIR}/input/antCand.data.tag.updated
    • This step automatically updates the tag file of all antonym candidates
    • Manually copy antCand.data.tag.updated to antCand.data.tag.updated.TT
    • The output file is used to generate antonym and negation files for the release.
    • Re-run steps 40-44 until all steps pass
    • For 2023 and later releases, TT should be run once and pass steps 40-44.

  • Antonyms from the collocates in a corpus
    Option and Description | Input | Output | Notes
    65
    • Get antonyms from MEDLINE 3-grams by a specified middle keyword (and/or):
    • Medline.GetAntCandFrom3GramPatMid.java
    • ${ML_NGRAM_DIR}/input/3-gram.${YEAR}.30.core
    • ${META_DIR}/input/normTermCui.data
    • ${META_DIR}/input/MRSTY.RRF
    • ${LEX_DIR}/input/inflVars.data
    • ${LEX_DIR}/input/synonym.data
    • ${ANT_DIR}/input/antCand.data.tag.${YEAR}
    • ${ANT_DIR}/input/domain.data
    • /nfsvol/lex/Lu/Projects/LVG/lvg${LVG_YEAR}/data/config/lvg.properties
    • ./output/PreCand/antCandPatMid.andOr.data
    • This step is not part of the annual process, but it might be needed before Step 66.
    • This step is a trial run for Step 66: it uses 1 middle word in 3-grams to get collocates for antonyms. Run it first to make sure everything is OK before running Step 66.
    • If this is the first run:
      • shell> mkdir ./output/PreCand
      • make sure all input files are set up correctly
    • Different versions of the data are used because the data are released on different dates:
      • Lexicon Antonym release: ${YEAR}
      • META-thesaurus: ${PREV_YEAR}AA
      • MEDLINE: ${PREV_YEAR}
      • LVG: ${PREV_YEAR}
    • This program sets the default keyword to "and/or".
    66
    • Get antonyms from MEDLINE 3-grams by specified middle keywords
    • Medline.GetAntCandFrom3GramPatMid.java
    • ${ML_NGRAM_DIR}/input/3-gram.${YEAR}.30.core
    • ${META_DIR}/input/normTermCui.data
    • ${META_DIR}/input/MRSTY.RRF
    • ${LEX_DIR}/input/inflVars.data
    • ${LEX_DIR}/input/synonym.data
    • ${ANT_DIR}/input/antCand.data.tag.${YEAR}
    • ${ANT_DIR}/input/domain.data
    • /nfsvol/lex/Lu/Projects/LVG/lvg${LVG_YEAR}/data/config/lvg.properties
    • ./output/PreCand/antCandPatMid.${KEY_WORD}.data
    • Currently, this program includes the 8 highest-frequency keywords: "and or to versus than vs from and|or", as defined in the scripts.
    • The latest data are used, in different versions because the data are released on different dates:
      • Lexicon Antonym release: ${YEAR}
      • Lexicon: ${YEAR}
      • META-thesaurus: ${PREV_YEAR}AA
      • MEDLINE: ${PREV_YEAR}
      • LVG: ${PREV_YEAR}
    67
    • Get antCand by combining results from above steps: 65-66
    • Medline.CombineAntCandFrom3GramPatMid.java
    • ./output/PreCand/antCandPatMid.${KEY_WORD}.data.wc
    • ./output/PreCand/keyWords.data
    • ./output/PreCand/antCandPatMid.cand.data.raw
      => includes raw collocates that occur once in 1 of the 8 keywords
    • ./output/PreCand/antCandPatMid.cand.data.filtered
      Heuristic filter rules:
      => includes filtered collocates: those that occur in 3 of the 8 keywords, do not include "other|E0044444", and are not self-aPairs
      => is the sum of the tag and tbd files
    • ./output/PreCand/antCandPatMid.cand.data.tag
    • ./output/PreCand/antCandPatMid.cand.data.tag.CC
    • ${ML_DIR}/output/Cand/antCandPatMid.cand.data.tbd
    • If this is the first run:
      • shell> mkdir Cand
      • shell> mkdir candTagged
      • copy ${PreCand}/keyWords.data from ${PREV_YEAR}
    • TBD should be 0
    • If not, send ${ML_DIR}/output/Cand/antCandPatMid.cand.data.tbd to the linguist to tag
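
The heuristic filter rules of this step can be sketched in Python as shown below. This is a rough illustration, not the logic of Medline.CombineAntCandFrom3GramPatMid.java. It assumes that "3 of 8 keywords" means at least 3 keywords, and that each collocate pair is available as two (word, EUI) tuples grouped by keyword. The occurrence data are invented for the example and `filter_collocates` is a hypothetical name.

```python
from collections import Counter

EXCLUDED_TERM = ("other", "E0044444")  # term excluded by the filter rule


def filter_collocates(pairs_by_keyword, min_keywords=3):
    """Keep pairs seen with enough keywords, excluding 'other' and self-aPairs."""
    counts = Counter()
    for pairs in pairs_by_keyword.values():
        counts.update(set(pairs))  # count each pair once per keyword
    kept = []
    for pair, n in counts.items():
        left, right = pair
        if n < min_keywords:
            continue               # assumption: "3 of 8" means at least 3
        if EXCLUDED_TERM in pair:
            continue               # pair contains other|E0044444
        if left == right:
            continue               # self-aPair
        kept.append(pair)
    return sorted(kept)


# Invented occurrence data, for illustration only.
A = (("absolute", "E0006593"), ("relative", "E0052609"))
B = (("absolute", "E0006593"), ("other", "E0044444"))
C = (("relative", "E0052609"), ("relative", "E0052609"))
D = (("adjusted", "E0007411"), ("unadjusted", "E0312692"))
pairs_by_keyword = {
    "and": [A, B, C, D],
    "or": [A, B, C],
    "versus": [A],
    "to": [B],
    "vs": [C],
}
kept = filter_collocates(pairs_by_keyword)
print(kept)
```

Only pair A survives: B contains the excluded term, C is a self-aPair, and D occurs with a single keyword.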
    68
    • Validate and fix tags of antonym candidates (CC)
    • Antonym.ValidateTaggedCand.java
    • ${CC_DIR}/output/candTagged/antCandPatMid.data.tag.tagged
    • ${ANT_DIR}/input/domain.data
    • ${CC_DIR}/output/candTagged/antCandPatMid.data.tag.fixed
    • Prepare/add tagged candidates to antCandPatMid.data.tag.tagged
      • convert the tagged candidate file to the standard format:
        shell> flds 3,4,5,6,7,8,9,10,11,12 antCandPatMid.cand.data.tbd.${YEAR}.${NO}.tagged > antCandPatMid.data.data.tbd.${YEAR}.${NO}.tagged.3-12
      • append antCandPatMid.data.data.tbd.${YEAR}.${NO}.tagged.3-12 to antCandPatMid.data.tag.tagged.${YEAR}.${NO}
      • shell> sort -u antCandPatMid.data.tag.tagged.${YEAR}.${NO} > antCandPatMid.data.tag.tagged.${YEAR}.${NO}.uSort
      • shell> ln -sf antCandPatMid.data.tag.tagged.${YEAR}.${NO}.uSort antCandPatMid.data.tag.tagged
    • Run this step (68) until the tag and fixed files are the same
      • The fixed file applies the automatic fixes of [TYPE_TBD] to [NA] and [DOMAIN_TBD] to [DOMAIN_NONE].
      • Manually copy the fixed file to the tagged file, then run the step again until they are the same
    • Manually copy antCandPatMid.data.tag.tagged to antCandPatMid.data.tag.tagged.${YEAR}
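
The preparation of the tagged file (field selection, append, unique sort) can be approximated in Python. This sketch assumes that `flds` selects pipe-delimited fields by 1-based position, which is not confirmed by this page. The two leading fields `f1|f2` are placeholders, because the content of the first two fields of the tbd file is not described here. Python's `sorted` orders by code point, which may differ from `sort -u` under some locales.

```python
def select_fields(line, first=3, last=12, sep="|"):
    """Return fields first..last (1-based, inclusive) of a delimited line."""
    fields = line.rstrip("\n").split(sep)
    return sep.join(fields[first - 1:last])


def unique_sorted(lines):
    """Rough equivalent of 'sort -u' on a list of lines."""
    return sorted(set(lines))


# 'f1|f2' are placeholder leading fields (hypothetical content).
linguist_tagged = [
    "f1|f2|absolute|E0006593|relative|E0052609|adj|Y|B|O|quality|CC",
]
existing_tagged = [
    "absolute|E0006593|relative|E0052609|adj|Y|B|O|quality|CC",
]

converted = [select_fields(line) for line in linguist_tagged]
merged = unique_sorted(existing_tagged + converted)
print(merged)
```

Because the converted record already exists in the tagged list, the merged result contains it only once.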
    69
    • Update release antonyms tagged file from CC
    • Antonym.UpdateAllTaggedFile.java
    • ${CC_DIR}/output/candTagged/antCandPatMid.data.tag.tagged.${YEAR}
    • ${ANT_DIR}/input/antCand.data.tag.${YEAR}
    • ${ANT_DIR}/input/domain.data
    • ${ANT_DIR}/input/antCand.data.tag.updated
    • This step automatically updates the tag file of all antonym candidates
    • Manually copy antCand.data.tag.updated to antCand.data.tag.updated.CC
    • Manually copy antCand.data.tag.updated to antCand.data.tag.${YEAR}
    • The output file is used to generate antonym and negation files for the release.
    • Re-run steps 66-69 until all steps pass
    • Re-run steps 66-67 to generate the latest aPair candidate list for the linguists

  • Antonyms from the semantics in a corpus (WordNet)
    Use the latest inflVars.data and WordNet 3.0
    antCand are not completed yet (about 5K left)
    Option and Description | Input | Output | Notes
    70
    • Unify and sort aPairs from antonyms in WordNet
    • WordNet.WnAPairFile.java
    • ${WN_DIR}inData/WnAPairs.data.${WN_YEAR}
    • ./output/PreCand/WnAPairs.unique.data.${WN_YEAR}
    • If this is the first run:
      • shell> mkdir ./output/PreCand
    • Sort and unify aPairs in WordNet
    • The format is [antonym-1|antonym-2|POS]
    • antonym-1 and antonym-2 are sorted alphabetically
    • The output file should be the same as long as the same WordNet version (3.0) is used, so data from previous years can be reused
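
The unify-and-sort operation of this step can be sketched in Python. The sketch assumes the raw input uses the same 3-field [antonym-1|antonym-2|POS] layout as the output, which this page does not state. The input lines are illustrative, not taken from WordNet, and `unify_apairs` is a hypothetical name.

```python
def unify_apairs(lines):
    """Order each pair alphabetically, drop duplicates, and sort the result."""
    unique = set()
    for line in lines:
        ant1, ant2, pos = line.strip().split("|")
        first, second = sorted((ant1, ant2))  # antonym-1 <= antonym-2
        unique.add(f"{first}|{second}|{pos}")
    return sorted(unique)


# Illustrative input: the first two lines are the same aPair in both directions.
raw = [
    "relative|absolute|adj",
    "absolute|relative|adj",
    "unadjusted|adjusted|adj",
]
unified = unify_apairs(raw)
for line in unified:
    print(line)
```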
    71
    • Not needed for 2023 and later
    • Generate word candidates from aPairs in WordNet
    • GenWordCandFromAPairs.java
    • antCandWordNet.data.b4 (must run step 72 first)

    • ${LMW_DIR}/inData/invalidLeadTerms.data.abs
    • ${LMW_DIR}/inData/invalidEndTerms.data.abs
    • ${LMW_DIR}/inData/invalidLeadEndTermCandidates.data
    • ${LMW_DIR}/inData/validLeadTerms.data.pat
    • ${LMW_DIR}/inData/validEndTerms.data.pat
    • wn.Ap.wordCand
    • Word candidates must be completed before completing the aPairs from WordNet, because all antonyms must be in the Lexicon.

    • ${LMW_DIR} uses ${PREV_YEAR}
    • This step was completed once and is not needed for 2023 and later.
    • The output should be the same because WordNet does not change and the same releases of MEDLINE and the Metathesaurus are used. Updating the two filters above would take too much effort for too little difference.
    72
    • Generate antonym candidates from WordNet
    • WordNet.GenAPairCand.java
    • ./output/PreCand/WnAPairs.unique.data

    • ${LEXICON_DIR}input/inflVars.data
    • ${LEXICON_DIR}input/synonym.data

    • ${META_DIR}/input/normTermCui.data
    • ${META_DIR}/input/MRSTY.RRF

    • /nfsvol/lex/Lu/Projects/LVG/lvg${LVG_YEAR}/data/config/lvg.properties

    • ./output/Cand/antCand.data.tag
    • ...
    • ./output/Cand/antCandWordNet.data.b4tag
    • ./output/Cand/antCandWordNet.data.all
    • ./output/Cand/antCandWordNet.data.notLex
    • ./output/Cand/antCandWordNet.data.yes
    • ./output/Cand/antCandWordNet.data.no
    • ./output/Cand/antCandWordNet.data.tbd

    • ./output/Cand/antCandWordNet.data.trap.spVar
    • ./output/Cand/antCandWordNet.data.trap.cf
    • ./output/Cand/antCandWordNet.data.trap.sp
    • Retrieve unique aPair candidates from aPairs in WordNet
    • Filter out illegal aPairs:
      • spelling variants
      • combined filter to exclude illegal words
      • multiwords
      • synonyms
    • Check antonym criteria: STI, CUI, etc. (not used in the current 2023 model)
    • Convert aPairs from WordNet (3 fields) to the aPair candidate format (10 fields)
      • Get EUI by inflVars|pos
      • Check if known to the Lexicon
      • Auto-tag from previous tags if known to the Lexicon
    • outNo = notLexNo + noNo + yesNo + tbdNo

    • TBD should be 0 (antCandWordNet.data.tbd)
    • If not, send antCandWordNet.data.tbd to the linguist to tag
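
The count identity and the TBD check of this step can be verified with a short Python sketch. It assumes that each count is the number of records in the corresponding output category. Which file supplies outNo is not stated on this page, so the `all_records` argument is a guess, and the records below are placeholders.

```python
def summarize(all_records, not_lex, no, yes, tbd):
    """Check outNo = notLexNo + noNo + yesNo + tbdNo and whether TBD is 0."""
    balanced = len(all_records) == len(not_lex) + len(no) + len(yes) + len(tbd)
    return {
        "balanced": balanced,
        "tbd_left": len(tbd),
        "ready": balanced and len(tbd) == 0,
    }


# Placeholder records, for illustration only.
yes = ["r1", "r2"]
no = ["r3"]
not_lex = ["r4"]
tbd = ["r5"]
all_records = yes + no + not_lex + tbd

status = summarize(all_records, not_lex, no, yes, tbd)
print(status)
```

Here the counts balance, but one TBD record remains, so the candidates would still need to go to the linguist.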
    73
    • Validate and fix tags of candidates (SN)
    • Antonym.ValidateTaggedCand.java
    • ./output/candTagged/antCandWordNet.data.tag.tagged
    • ${ANT_DIR}/input/domain.data
    • ./output/candTagged/antCandWordNet.data.tag.fixed
    • Append tagged candidates to antCandWordNet.data.tag.tagged
    • Run this step until the tag and fixed files are the same
      • The fixed file applies the automatic fixes of [TYPE_TBD] to [NA] and [DOMAIN_TBD] to [DOMAIN_NONE].
      • The fixed file is sorted alphabetically.
      • Manually copy the fixed file to the tagged file
    • Manually copy antCandWordNet.data.tag.tagged to antCandWordNet.data.tag.tagged.${YEAR}
    74
    • Update tagged candidates (SN) to release tagged file
    • Antonym.UpdateAllTaggedFile.java
    • ./output/candTagged/antCandWordNet.data.tag.tagged.${YEAR}
    • ${ANT_DIR}/input/antCand.data.tag.${YEAR}
    • ${ANT_DIR}/input/domain.data
    • ${ANT_DIR}/input/antCand.data.tag.updated

    • ${ANT_DIR}/input/antCand.data.tag.updated.srcConflict
      => same aPair with the same tag, but a different source model
      => may be > 0
    • ${ANT_DIR}/input/antCand.data.tag.updated.tagConflict
      => same aPair with a different tag; send to the linguist to re-tag
      => must be 0
    • This step automatically updates the tag file of all antonym candidates
    • Manually copy antCand.data.tag.updated to antCand.data.tag.updated.5.SN
    • Manually copy antCand.data.tag.updated to antCand.data.tag.${YEAR}
    • The output file is used to generate antonym and negation files for the release.
    • Re-run steps 72-74 until all steps pass
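
The difference between source conflicts and tag conflicts described in this step can be sketched in Python. This is not the logic of Antonym.UpdateAllTaggedFile.java. It assumes an aPair is identified by its first five fields (both antonyms, both EUIs, and POS), that fields 6-9 are the tags, and that field 10 is the source. The second pair of records carries an invented tag difference, and `find_conflicts` is a hypothetical name.

```python
from collections import defaultdict


def find_conflicts(records):
    """Split aPairs into source conflicts and tag conflicts."""
    by_pair = defaultdict(set)
    for line in records:
        fields = line.strip().split("|")
        pair = tuple(fields[:5])   # assumed aPair identity
        tags = tuple(fields[5:9])  # assumed tag fields
        by_pair[pair].add((tags, fields[9]))
    src_conflicts, tag_conflicts = [], []
    for pair, entries in by_pair.items():
        distinct_tags = {tags for tags, _ in entries}
        distinct_srcs = {src for _, src in entries}
        if len(distinct_tags) > 1:
            tag_conflicts.append("|".join(pair))   # must be re-tagged
        elif len(distinct_srcs) > 1:
            src_conflicts.append("|".join(pair))   # tolerated
    return src_conflicts, tag_conflicts


records = [
    "activate|E0007090|deactivate|E0417566|verb|Y|UB|BN2|quality|SN",
    "activate|E0007090|deactivate|E0417566|verb|Y|UB|BN2|quality|PD",
    # Invented tag difference, for illustration only:
    "absolute|E0006593|relative|E0052609|adj|Y|B|O|quality|CC",
    "absolute|E0006593|relative|E0052609|adj|N|NA|O|DOMAIN_NONE|SN",
]
src_conflicts, tag_conflicts = find_conflicts(records)
print(src_conflicts)
print(tag_conflicts)
```

The activate/deactivate records differ only in source, so they count as a source conflict. The two absolute/relative records carry different tags, so they would be sent back for re-tagging.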