SPECIALIST Lexicon

Antonym - Processes for Antonym Generation

base directory: ${ANTONYM_DIR}
binary scripts: ./bin
data: ./data
- 0.Antonym
- 1.Lexicon
- 2.SuffixD
- 3.PrefixD
- 4.TtSet
- 5.Medline
- 6.WordNet

The same aPair may come from more than one source, for example:

absolute|E0006593|relative|E0052609|adj|Y|B|O|quality|CC
absolute|E0006593|relative|E0052609|adj|Y|B|O|quality|SN

adjusted|E0007411|unadjusted|E0312692|adj|Y|B|O|quality|PD
adjusted|E0007411|unadjusted|E0312692|adj|Y|B|O|quality|SN

Thus, a heuristic rule ranks the sources by priority as follows:


LEX > SD > PD > CC > SN
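
The priority rule can be sketched in a few lines of Python. This is only an illustration, not the package's own code: it assumes the 10-field pipe-delimited layout of the records above, with the source as the last field, and treats two records as the same aPair when their first nine fields match. The name `pick_by_source_priority` is hypothetical.

```python
# Source priority from the rule above: earlier entries win.
SOURCE_PRIORITY = ["LEX", "SD", "PD", "CC", "SN"]
RANK = {src: i for i, src in enumerate(SOURCE_PRIORITY)}


def pick_by_source_priority(lines):
    """Keep one record per aPair, preferring the highest-priority source.

    Assumption: duplicate records differ only in the final source field.
    """
    best = {}
    for line in lines:
        *tag_fields, source = line.strip().split("|")
        key = tuple(tag_fields)
        # Replace the stored source only if this one ranks higher.
        if key not in best or RANK[source] < RANK[best[key]]:
            best[key] = source
    return ["|".join(key + (src,)) for key, src in best.items()]


records = [
    "absolute|E0006593|relative|E0052609|adj|Y|B|O|quality|CC",
    "absolute|E0006593|relative|E0052609|adj|Y|B|O|quality|SN",
    "adjusted|E0007411|unadjusted|E0312692|adj|Y|B|O|quality|PD",
    "adjusted|E0007411|unadjusted|E0312692|adj|Y|B|O|quality|SN",
]
for rec in pick_by_source_priority(records):
    print(rec)
```

For the four example records, the sketch keeps the CC record for absolute/relative and the PD record for adjusted/unadjusted, since both outrank SN.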

Accordingly, the processes below must be run in the order given.
Also, directories and input/output files require manual operations.

shell> cd ${ANTONYM_DIR}/bin
shell> GetAntonyms ${YEAR}

II. Processes

Antonyms from the Lexicon
Use the latest Lexicon

Option and Description	Input	Output	Notes
10 Generate antonym candidates from the Lexicon (with negative tags) Lexicon.GenAntCandFromLexicon.java	1.Lexicon/${YEAR}/input/LEXICON 0.Antonym/${YEAR}/input/antCand.data.tag 0.Antonym/${YEAR}/input/domain.data	./output/antCandLexicon.data ./output/Cand/antCandLexicon.data.tag ./output/Cand/antCandLexicon.data.tbd ./output/candTagged/antCandLexicon.data.tag.tagged	Create the needed directories (Cand, candTagged) with mkdir and copy the needed files. Send antCandLexicon.data.tbd to a linguist to tag, add the tags to antCand.data.tag, and re-run until tbd = 0. Copy antCandLexicon.data.tag.tagged to antCandLexicon.data.tag.tagged.${YEAR}. These are the antonyms from source LEX.
11 Validate and fix tags of antonym candidates (LEX) Antonym.ValidateTaggedCand.java	./output/candTagged/antCandLexicon.data.tag.tagged 0.Antonym/${YEAR}/input/domain.data	./output/candTagged/antCandLexicon.data.tag.fixed	Make sure the tagged and fixed files are the same.
12 Update the release antonyms tagged file from LEX. Automatically assigns type [NA] and domain [DOMAIN_NONE] if Canon is [N], and checks for new domains. Antonym.UpdateAllTaggedFile.java	./output/candTagged/antCandLexicon.data.tag.tagged.${YEAR} ${ANT_DIR}/input/antCand.data.tag.${YEAR} ${ANT_DIR}/input/domain.data	${ANT_DIR}/input/antCand.data.tag.updated	This step automatically updates the tag file of all antonym candidates. The output file is used to generate the antonym and negation files for the release. Manually copy antCand.data.tag.updated to antCand.data.tag.updated.1.LEX: `cp -rp antCand.data.tag.updated antCand.data.tag.updated.1.LEX`

Antonyms from SuffixD
Use the latest SuffixD (derivation.data.${YEAR}) and inflVars.data

Option and Description	Input	Output	Notes
20 Get antonym candidates from SuffixD Derivation.GetAntCandFromSuffixD.java	${SD_DIR}/input/derivation.data ${LEX_DIR}/input/inflVars.data ${ANT_DIR}/input/antCand.data.tag.${YEAR} ${ANT_DIR}/input/domain.data	./output/Cand/antCandSuffixD.data; ./output/Cand/antCandSuffixD.data.tag => aPairs already tagged; ${SD_DIR}/output/Cand/antCandSuffixD.data.tbd => aPairs to be tagged, send to linguists, must be 0; ./output/candTagged/antCandSuffixD.data.tag.tagged	If running for the first time: mkdir ./${YEAR}/output/Cand; mkdir ./${YEAR}/output/candTagged. Use the updated derivation.data and inflVars.data. Send antCandSuffixD.data.tbd to a linguist to complete the tags. Complete Steps 21-22, then re-run this step until TBD = 0.
21 Validate and fix tags of antonym candidates (SD) Antonym.ValidateTaggedCand.java	./output/candTagged/antCandSuffixD.data.tag.tagged ${ANT_DIR}/input/domain.data	./output/candTagged/antCandSuffixD.data.tag.fixed	Append the linguist's tags to ${SD_DIR}/output/candTagged/antCandSuffixD.data.tag.tagged. Run this step until the tagged and fixed files are the same. The fixed file contains the auto-fixes of [TYPE_TBD] and [DOMAIN_TBD] to [NA] and [DOMAIN_NONE], and is sorted alphabetically. Manually copy the fixed file to the tagged file. Manually copy antCandSuffixD.data.tag.tagged to antCandSuffixD.data.tag.tagged.${YEAR}. Re-run this step after Step 20 has TBD = 0, so that the fixed file is sorted alphabetically.
22 Update the release antonyms tagged file from SD Antonym.UpdateAllTaggedFile.java	./output/candTagged/antCandSuffixD.data.tag.tagged.${YEAR} ${ANT_DIR}/input/antCand.data.tag.${YEAR} ${ANT_DIR}/input/domain.data	${ANT_DIR}/input/antCand.data.tag.updated	This step automatically updates the tag file of all antonym candidates. Manually copy antCand.data.tag.updated to antCand.data.tag.updated.2.SD. Manually copy antCand.data.tag.updated to antCand.data.tag.${YEAR}. The output file is used to generate the antonym and negation files for the release. Re-run Steps 20-22 until all checks pass: tag conflict no = 0, source conflict no = 0, duplicate tag = 0.
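
The validate-and-fix loop of Step 21 (repeated for the other sources in Steps 11, 31, 43, 68 and 73) can be pictured with the Python sketch below. It models only the two auto-fixes and the alphabetical sort named in the notes; the actual Antonym.ValidateTaggedCand.java may perform further checks. The function names are hypothetical, plain string sorting is assumed for "alphabetical order", and the first sample record carries made-up placeholder tags for illustration.

```python
# Auto-fixes named in the notes: placeholder tag -> default value.
AUTO_FIXES = {"TYPE_TBD": "NA", "DOMAIN_TBD": "DOMAIN_NONE"}


def fix_tagged(lines):
    """Return the 'fixed' view of a tagged file: placeholders replaced, sorted."""
    fixed = []
    for line in lines:
        fields = line.strip().split("|")
        fixed.append("|".join(AUTO_FIXES.get(f, f) for f in fields))
    return sorted(fixed)


def is_converged(tagged_lines):
    """True when the tagged file already equals its fixed version."""
    return [line.strip() for line in tagged_lines] == fix_tagged(tagged_lines)


tagged = [
    # Hypothetical record: tags invented to show the auto-fix.
    "adjusted|E0007411|unadjusted|E0312692|adj|N|TYPE_TBD|O|DOMAIN_TBD|SD",
    "absolute|E0006593|relative|E0052609|adj|Y|B|O|quality|SN",
]
print(is_converged(tagged))   # False: placeholders and order still differ
fixed = fix_tagged(tagged)
print(is_converged(fixed))    # True: the fixed file can replace the tagged file
```

Copying the fixed file over the tagged file and running the check again corresponds to "run this step until the tagged and fixed files are the same".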

Antonyms from PrefixD
Use the latest PrefixD
antCand from the previous PrefixD are not completed yet (~7,320 for the 2024 release)

Option and Description	Input	Output	Notes
30 Get antonym candidates from PrefixD Derivation.GetAntCandFromPrefixD.java	${PD_DIR}/input/derivation.data ${LEX_DIR}/input/inflVars.data ${ANT_DIR}/input/antCand.data.tag.${YEAR} ${ANT_DIR}/input/domain.data	./output/Cand/antCandPrefixD.data; ./output/Cand/antCandPrefixD.data.tag => aPairs already tagged; ./output/Cand/antCandPrefixD.data.tbd => aPairs to be done, must be 0; ./output/candTagged/antCandPrefixD.data.tag.tagged	If running for the first time: mkdir ./${YEAR}/output/Cand; mkdir ./${YEAR}/output/candTagged. Use the updated derivation.data and inflVars.data. Send antCandPrefixD.data.tbd to a linguist to complete the tags. As of the 2024 release, 7K+ TBD aPairs still need to be tagged. Once this backlog is completely tagged, the number is expected to be much smaller in each annual release (only the annual growth of PrefixD).
31 Validate and fix tags of antonym candidates (PD) Antonym.ValidateTaggedCand.java	./output/candTagged/antCandPrefixD.data.tag.tagged ${ANT_DIR}/input/domain.data	./output/candTagged/antCandPrefixD.data.tag.fixed	Append the linguist's tags to ${PD_DIR}/output/candTagged/antCandPrefixD.data.tag.tagged. Run this step until the tagged and fixed files are the same. The fixed file contains the auto-fixes of [TYPE_TBD] and [DOMAIN_TBD] to [NA] and [DOMAIN_NONE]. `shell> sort -u antCandPrefixD.data.tag.fixed > antCandPrefixD.data.tag.fixed.uSort` Manually copy the sorted fixed file to the tagged file. Manually copy antCandPrefixD.data.tag.tagged to antCandPrefixD.data.tag.tagged.${YEAR}.${NO}.
32 Update the release antonyms tagged file from PD Antonym.UpdateAllTaggedFile	./output/candTagged/antCandPrefixD.data.tag.tagged.${YEAR} ${ANT_DIR}/input/antCand.data.tag.${YEAR} ${ANT_DIR}/input/domain.data	${ANT_DIR}/input/antCand.data.tag.updated ${ANT_DIR}/input/antCand.data.tag.updated.srcConflict ${ANT_DIR}/input/antCand.data.tag.updated.tarConflict	This step automatically updates the tag file of all antonym candidates. Manually copy antCand.data.tag.updated to antCand.data.tag.updated.3.PD. Manually copy or link antCand.data.tag.updated to antCand.data.tag.${YEAR}. Sources may conflict (the same aPair from different sources), for example: activate|E0007090|deactivate|E0417566|verb|Y|UB|BN2|quality|SN and activate|E0007090|deactivate|E0417566|verb|Y|UB|BN2|quality|PD. The output file is used to generate the antonym and negation files for the release. Re-run Steps 30-32 until all steps pass.
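
The conflict checks performed by the update step can be sketched as follows. The sketch assumes that a source conflict is the same aPair with the same tags but a different source (as in the activate/deactivate example), and that a tag conflict is the same aPair with different tags. It also assumes that the first five fields (word, EUI, word, EUI, POS) identify an aPair, the next four are its tags, and the last is the source. `find_conflicts` is a hypothetical name, not part of Antonym.UpdateAllTaggedFile.

```python
from collections import defaultdict


def find_conflicts(lines):
    """Split aPairs into source conflicts and tag conflicts.

    Assumed layout: 5 key fields (word, EUI, word, EUI, POS),
    4 tag fields, then the source.
    """
    seen = defaultdict(set)  # aPair key -> {(tags, source), ...}
    for line in lines:
        fields = line.strip().split("|")
        key, tags, source = tuple(fields[:5]), tuple(fields[5:9]), fields[9]
        seen[key].add((tags, source))

    src_conflicts, tag_conflicts = [], []
    for key, entries in seen.items():
        tag_sets = {tags for tags, _ in entries}
        sources = {src for _, src in entries}
        if len(tag_sets) > 1:
            tag_conflicts.append(key)   # different tags for one aPair
        elif len(sources) > 1:
            src_conflicts.append(key)   # same tags, different sources
    return src_conflicts, tag_conflicts


lines = [
    "activate|E0007090|deactivate|E0417566|verb|Y|UB|BN2|quality|SN",
    "activate|E0007090|deactivate|E0417566|verb|Y|UB|BN2|quality|PD",
]
src, tag = find_conflicts(lines)
print(len(src), len(tag))  # 1 0
```

A source conflict of this kind can be settled by the source priority rule, whereas a tag conflict needs a linguist's decision.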

Antonyms from the Training and Test set

Option and Description	Input	Output	Notes
40 Collect antonyms in the training and test set and retag their source from [TT] to [CC|SN] TtSet.CollectAntonyms.java TtSet.RetagSrcOnAntRaw.java	${TT_DIR}/input/antonymSource.data (use 2021) ${ML_DIR}/input/3-gram.${YEAR}.30.core (previous_year) Use `shell> 06.NGramUtil ${PREV_YEAR}, option 3`.	./output/PreCand/antonymTtSet.data.TT ./output/PreCand/antonymTtSet.data	If running for the first time: shell> mkdir ./output/PreCand. Link ${ML_DIR}/input/3-gram.${YEAR}.30.core; this requires first running option 3 of ${LMW}/bin/06.NGramUtil ${PREV_YEAR}. Retag [TT] to one of the sources [LEX|SD|PD|CC|SN].
41 Not needed for the release.
42 Get antonym candidates from TtSet collections TtSet.GenAntCandFromTtSet	${TT_DIR}/output/antonymTtSet.data ${ANT_DIR}/input/antCand.data.tag.${YEAR} ${LEX_DIR}/input/inflVars.data ${ANT_DIR}/input/domain.data	./output/Cand/antCandTtSet.data ./output/Cand/antCandTtSet.data.tbd ./output/Cand/antCandTtSet.data.tag ./output/candTagged/antCandTtSet.data.tag.tagged	The TBD file should have 0 entries (or only the following known exceptions): `post|E0049060|pre|EUI_TBD|noun|CANON_TBD|TYPE_TBD|NEG_TBD|DOMAIN_TBD|CC` `post|E0049061|pre|EUI_TBD|verb|CANON_TBD|TYPE_TBD|NEG_TBD|DOMAIN_TBD|CC` Convert these to: `post|E0049060|pre|EUI_NONE|noun|N|NA|O|DOMAIN_NONE|CC` `post|E0049061|pre|EUI_NONE|verb|N|NA|O|DOMAIN_NONE|CC` Send any TBD entries other than these 2 to linguists to tag.
43 Validate and fix tags of antonym candidates (TT) Antonym.ValidateTaggedCand.java	./output/candTagged/antCandTtSet.data.tag.tagged ${ANT_DIR}/input/domain.data	./output/candTagged/antCandTtSet.data.tag.fixed	Append the tagged candidates to antCandTtSet.data.tag.tagged: `post|E0049060|pre|EUI_NONE|noun|N|NA|O|DOMAIN_NONE|CC` `post|E0049061|pre|EUI_NONE|verb|N|NA|O|DOMAIN_NONE|CC` Run this step until the tagged and fixed files are the same. The fixed file contains the auto-fixes of [TYPE_TBD] and [DOMAIN_TBD] to [NA] and [DOMAIN_NONE]. Manually fix the 2 known exceptions. Manually copy the fixed file to the tagged file. Manually copy antCandTtSet.data.tag.tagged to antCandTtSet.data.tag.tagged.${YEAR}.
44 Update the release antonyms tagged file from TT Antonym.UpdateAllTaggedFile	./output/candTagged/antCandTtSet.data.tag.tagged.${YEAR} ${ANT_DIR}/input/antCand.data.tag.${YEAR} ${ANT_DIR}/input/domain.data	${ANT_DIR}/input/antCand.data.tag.updated	This step automatically updates the tag file of all antonym candidates. Manually copy antCand.data.tag.updated to antCand.data.tag.updated.TT. The output file is used to generate the antonym and negation files for the release. Re-run Steps 40-44 until all steps pass. For 2023 and later, TT should be run once and pass Steps 40-44.

Antonyms from the collocates in a corpus

Option and Description	Input	Output	Notes
65 Get antonyms from MEDLINE 3-grams by a specified middle keyword (and/or): Medline.GetAntCandFrom3GramPatMid.java	${ML_NGRAM_DIR}/input/3-gram.${YEAR}.30.core ${META_DIR}/input/normTermCui.data ${META_DIR}/input/MRSTY.RRF ${LEX_DIR}/input/inflVars.data ${LEX_DIR}/input/synonym.data ${ANT_DIR}/input/antCand.data.tag.${YEAR} ${ANT_DIR}/input/domain.data /nfsvol/lex/Lu/Projects/LVG/lvg${LVG_YEAR}/data/config/lvg.properties	./output/PreCand/antCandPatMid.andOr.data	This step is not part of the annual process, but it may be needed before Step 66. It is a trial run of Step 66 that uses a single middle word in 3-grams to get collocates for antonyms. Run it to make sure everything works before running Step 66. If running for the first time: shell> mkdir ./output/PreCand. Make sure all input files are set up correctly. Different versions of the data are used because the data have different release dates: Lexicon Antonym release: ${YEAR}; Metathesaurus: ${PREV_YEAR}AA; MEDLINE: ${PREV_YEAR}; LVG: ${PREV_YEAR}. This program sets the default keyword to "and/or".
66 Get antonyms from MEDLINE 3-grams by specified middle keywords Medline.GetAntCandFrom3GramPatMid.java	${ML_NGRAM_DIR}/input/3-gram.${YEAR}.30.core ${META_DIR}/input/normTermCui.data ${META_DIR}/input/MRSTY.RRF ${LEX_DIR}/input/inflVars.data ${LEX_DIR}/input/synonym.data ${ANT_DIR}/input/antCand.data.tag.${YEAR} ${ANT_DIR}/input/domain.data /nfsvol/lex/Lu/Projects/LVG/lvg${LVG_YEAR}/data/config/lvg.properties	./output/PreCand/antCandPatMid.${KEY_WORD}.data	Currently, this program includes the 8 highest-frequency keywords: "and or to versus than vs from and|or", as defined in the scripts. The latest data are used, in different versions because the data have different release dates: Lexicon Antonym release: ${YEAR}; Lexicon: ${YEAR}; Metathesaurus: ${PREV_YEAR}AA; MEDLINE: ${PREV_YEAR}; LVG: ${PREV_YEAR}.
67 Get antCand by combining the results of Steps 65-66 Medline.CombineAntCandFrom3GramPatMid.java	./output/PreCand/antCandPatMid.${KEY_WORD}.data.wc ./output/PreCand/keyWords.data	./output/PreCand/antCandPatMid.cand.data.raw => raw collocates that occur once in 1 of the 8 keywords; ./output/PreCand/antCandPatMid.cand.data.filtered => filtered collocates (heuristic filter rules: occur in 3 of the 8 keywords, do not include "other|E0044444", and are not self-aPairs), the sum of the tag and tbd files; ./output/PreCand/antCandPatMid.cand.data.tag; ./output/PreCand/antCandPatMid.cand.data.tag.CC; ${ML_DIR}/output/Cand/antCandPatMid.cand.data.tbd	If running for the first time: shell> mkdir Cand; shell> mkdir candTagged. Copy ${PreCand}/keyWords.data from ${PREV_YEAR}. TBD should be 0. If not, send ${ML_DIR}/output/Cand/antCandPatMid.cand.data.tbd to a linguist to tag.
68 Validate and fix tags of antonym candidates (CC) Antonym.ValidateTaggedCand.java	${CC_DIR}/output/candTagged/antCandPatMid.data.tag.tagged ${ANT_DIR}/input/domain.data	${CC_DIR}/output/candTagged/antCandPatMid.data.tag.fixed	Prepare and add the tagged candidates to antCandPatMid.data.tag.tagged. Convert the tagged candidate file to the standard format: `shell> flds 3,4,5,6,7,8,9,10,11,12 antCandPatMid.cand.data.tbd.${YEAR}.${NO}.tagged > antCandPatMid.data.data.tbd.${YEAR}.${NO}.tagged.3-12` Append antCandPatMid.data.data.tbd.${YEAR}.${NO}.tagged.3-12 to antCandPatMid.data.tag.tagged.${YEAR}.${NO}. `shell> sort -u antCandPatMid.data.tag.tagged.${YEAR}.${NO} > antCandPatMid.data.tag.tagged.${YEAR}.${NO}.uSort` `shell> ln -sf antCandPatMid.data.tag.tagged.${YEAR}.${NO}.uSort antCandPatMid.data.tag.tagged` Run this step (68) until the tagged and fixed files are the same. The fixed file contains the auto-fixes of [TYPE_TBD] and [DOMAIN_TBD] to [NA] and [DOMAIN_NONE]. Manually copy the fixed file to the tagged file, then run the step again until they are the same. Manually copy antCandPatMid.data.tag.tagged to antCandPatMid.data.tag.tagged.${YEAR}.
69 Update the release antonyms tagged file from CC Antonym.UpdateAllTaggedFile.java	${CC_DIR}/output/candTagged/antCandPatMid.data.tag.tagged.${YEAR} ${ANT_DIR}/input/antCand.data.tag.${YEAR} ${ANT_DIR}/input/domain.data	${ANT_DIR}/input/antCand.data.tag.updated	This step automatically updates the tag file of all antonym candidates. Manually copy antCand.data.tag.updated to antCand.data.tag.updated.CC. Manually copy antCand.data.tag.updated to antCand.data.tag.${YEAR}. The output file is used to generate the antonym and negation files for the release. Re-run Steps 66-69 until all steps pass. Re-run Steps 66-67 to generate the latest aPair candidate list for linguists.
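
The heuristic filter of Step 67 can be sketched as below. This is a reading of the three rules in the table, not the code of Medline.CombineAntCandFrom3GramPatMid.java: "occur in 3 of the 8 keywords" is assumed to mean at least 3, a pair is assumed to be unordered, and pairs are represented by their words only (no EUIs). The function name and the sample collocates are hypothetical.

```python
from collections import Counter

MIN_KEYWORDS = 3            # assumed to mean "at least 3 of the 8 keywords"
EXCLUDED_TERMS = {"other"}  # the notes exclude "other|E0044444"


def filter_collocates(pairs_by_keyword):
    """pairs_by_keyword: {keyword: iterable of (term1, term2)}."""
    counts = Counter()
    for pairs in pairs_by_keyword.values():
        # Count each unordered pair once per keyword.
        counts.update({tuple(sorted(p)) for p in pairs})
    return sorted(
        pair for pair, n in counts.items()
        if n >= MIN_KEYWORDS
        and pair[0] != pair[1]                # drop self-aPairs
        and not EXCLUDED_TERMS & set(pair)    # drop pairs containing "other"
    )


# Made-up collocates for four of the keywords.
sample = {
    "and": [("high", "low"), ("other", "same"), ("left", "left")],
    "or": [("low", "high"), ("other", "same"), ("left", "left")],
    "versus": [("high", "low"), ("other", "same"), ("left", "left")],
    "to": [("up", "down")],
}
print(filter_collocates(sample))  # [('high', 'low')]
```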

Antonyms from the semantics in a corpus (WordNet)
Use the latest inflVars.data and WordNet 3.0
antCand are not completed yet (about 5K left)

Option and Description	Input	Output	Notes
70 Unify and sort aPairs from antonyms in WordNet WordNet.WnAPairFile.java	${WN_DIR}inData/WnAPairs.data.${WN_YEAR}	./output/PreCand/WnAPairs.unique.data.${WN_YEAR}	If running for the first time: shell> mkdir ./output/PreCand. Sort and unify the aPairs in WordNet. The format is [antonym-1|antonym-2|POS], where antonym-1 and antonym-2 are in alphabetical order. The output file should be the same as long as the same WordNet version (3.) is used, so data from previous years can be reused.
71 Not needed for 2023 and later. Generate word candidates from aPairs in WordNet GenWordCandFromAPairs.java	antCandWordNet.data.b4 (must run Step 72 first) ${LMW_DIR}/inData/invalidLeadTerms.data.abs ${LMW_DIR}/inData/invalidEndTerms.data.abs ${LMW_DIR}/inData/invalidLeadEndTermCandidates.data ${LMW_DIR}/inData/validLeadTerms.data.pat ${LMW_DIR}/inData/validEndTerms.data.pat	wn.Ap.wordCand	Word candidates must be completed before completing the aPairs from WordNet, because all antonyms must be in the Lexicon. ${LMW_DIR} uses ${PREV_YEAR}. This step was completed once and is not needed for 2023 and later. The output should be the same because WordNet does not change and the same releases of MEDLINE and the Metathesaurus are used. The differences are too small and the effort too great to justify updating the above two filters.
72 Generate antonym candidates from WordNet WordNet.GenAPairCand.java	./output/PreCand/WnAPairs.unique.data ${LEXICON_DIR}input/inflVars.data ${LEXICON_DIR}input/synonym.data ${META_DIR}/input/normTermCui.data ${META_DIR}/input/MRSTY.RRF /nfsvol/lex/Lu/Projects/LVG/lvg${LVG_YEAR}/data/config/lvg.properties ./output/Cand/antCand.data.tag ...	./output/Cand/antCandWordNet.data.b4tag ./output/Cand/antCandWordNet.data.all ./output/Cand/antCandWordNet.data.notLex ./output/Cand/antCandWordNet.data.yes ./output/Cand/antCandWordNet.data.no ./output/Cand/antCandWordNet.data.tbd ./output/Cand/antCandWordNet.data.trap.spVar ./output/Cand/antCandWordNet.data.trap.cf ./output/Cand/antCandWordNet.data.trap.sp	Retrieve unique aPair candidates from the aPairs in WordNet. Filter out illegal aPairs and spelling variants. Combine filters to exclude illegal words, multiwords, and synonyms. Check antonym criteria: STI, CUI, etc. (not used in the current 2023 model). Convert aPairs from WordNet (3 fields) to the aPair candidate format (10 fields). Get the EUI by inflVars|pos. Check if known to the Lexicon. Auto-tag from previous tags if known to the Lexicon. outNo = notLexNo + noNo + yesNo + tbdNo. TBD should be 0 (antCandWordNet.data.tbd). If not, send antCandWordNet.data.tbd to a linguist to tag.
73 Validate and fix tags of candidates (SN) Antonym.ValidateTaggedCand.java	./output/candTagged/antCandWordNet.data.tag.tagged ${ANT_DIR}/input/domain.data	./output/candTagged/antCandWordNet.data.tag.fixed	Append the tagged candidates to antCandWordNet.data.tag.tagged. Run this step until the tagged and fixed files are the same. The fixed file contains the auto-fixes of [TYPE_TBD] and [DOMAIN_TBD] to [NA] and [DOMAIN_NONE], and is sorted alphabetically. Manually copy the fixed file to the tagged file. Manually copy antCandWordNet.data.tag.tagged to antCandWordNet.data.tag.tagged.${YEAR}.
74 Update tagged candidates (SN) to the release tagged file Antonym.UpdateAllTaggedFile.java	./output/candTagged/antCandWordNet.data.tag.tagged.${YEAR} ${ANT_DIR}/input/antCand.data.tag.${YEAR} ${ANT_DIR}/input/domain.data	${ANT_DIR}/input/antCand.data.tag.updated; ${ANT_DIR}/input/antCand.data.tag.updated.srcConflict => same aPair with the same tag but a different source model, may be > 0; ${ANT_DIR}/input/antCand.data.tag.updated.tagConflict => same aPair with a different tag, send to a linguist to re-tag, must be 0	This step automatically updates the tag file of all antonym candidates. Manually copy antCand.data.tag.updated to antCand.data.tag.updated.5.SN. Manually copy antCand.data.tag.updated to antCand.data.tag.${YEAR}. The output file is used to generate the antonym and negation files for the release. Re-run Steps 72-75 until all steps pass.
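
The "unify and sort" operation of Step 70 can be sketched as follows, assuming the 3-field [antonym-1|antonym-2|POS] format described in its notes and plain string comparison for alphabetical order. `unify_apairs` is a hypothetical name and the sample pairs are made up for illustration; this is not the code of WordNet.WnAPairFile.java.

```python
def unify_apairs(lines):
    """Order the two antonyms within each line, then de-duplicate and sort."""
    unique = set()
    for line in lines:
        first, second, pos = line.strip().split("|")
        a, b = sorted((first, second))
        unique.add(f"{a}|{b}|{pos}")
    return sorted(unique)


# Made-up input: the first two lines are the same aPair in both directions.
raw = ["hot|cold|adj", "cold|hot|adj", "rise|fall|verb"]
print(unify_apairs(raw))  # ['cold|hot|adj', 'fall|rise|verb']
```

Because the result depends only on the input pairs, re-running it on the same WordNet version yields the same file, which is why the output from previous years can be reused.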
