Auto-tag Processes
This page describes the auto-tag processes in LMW candidate list generation:
- I. Auto-tag raw LMW candidate list:
- Raw candidate lists are generated from:
- These raw candidate lists go through the processes in 00.CandidateList (step 10) to auto-tag:
- valid LMWs (terms in the latest Lexicon inflVars.data)
- invalid LMWs (known invalid LMWs, Total.data.no)
- CandList.rmYesNo contains the new terms that are not auto-tagged. They are used as the candidate list (to calculate precision), as shown at the end of the process in the following diagram.
- CandList.rmYesTagNo is the actual file sent to linguists so they can have another chance to review terms with AUTO_TAG_NO to ensure they are invalid LMWs.
- The precision of all models for the latest Lexicon is reported in prevCand.data.rpt.
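The auto-tagging of a raw candidate list (step 10) can be sketched as follows. This is a minimal illustration in Python, not the actual 00.CandidateList program. The function name auto_tag_raw, the in-memory sets, exact string matching, and the (term, tag) tuple layout are all assumptions made for the example. It also assumes that CandList.rmYesTagNo keeps the new terms alongside the terms tagged AUTO_TAG_NO. Only the AUTO_TAG_NO tag and the file names come from this page.

```python
from typing import Iterable, List, Set, Tuple


def auto_tag_raw(
    candidates: Iterable[str],
    valid_terms: Set[str],
    invalid_terms: Set[str],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split raw candidates into two lists (hypothetical helper).

    valid_terms   : terms assumed to be loaded from the latest inflVars.data
    invalid_terms : terms assumed to be loaded from Total.data.no
    Returns (rm_yes_no, rm_yes_tag_no):
      rm_yes_no     -> new terms only (valid and invalid terms removed)
      rm_yes_tag_no -> valid terms removed, invalid terms tagged AUTO_TAG_NO
    """
    rm_yes_no: List[str] = []
    rm_yes_tag_no: List[Tuple[str, str]] = []
    for term in candidates:
        if term in valid_terms:
            # Already a valid LMW in the Lexicon: drop from both lists.
            continue
        if term in invalid_terms:
            # Known invalid LMW: keep it, tagged, for a second review.
            rm_yes_tag_no.append((term, "AUTO_TAG_NO"))
        else:
            # New term: not auto-tagged, appears in both lists.
            rm_yes_no.append(term)
            rm_yes_tag_no.append((term, ""))
    return rm_yes_no, rm_yes_tag_no


if __name__ == "__main__":
    # Made-up terms, for illustration only.
    new_only, for_linguists = auto_tag_raw(
        ["heart attack", "of the", "new term"],
        valid_terms={"heart attack"},
        invalid_terms={"of the"},
    )
    print(new_only)
    print(for_linguists)
```

Keeping the two outputs separate mirrors the description above: the first list serves precision calculation, while the second is what linguists would review.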
- II. Auto-tag completed LMW candidate list:
- Once a candidate list is completed (submitted and approved) by the linguists, the list needs to go through the processes in 00.CandidateList (steps 1-3) to update the invalid LMWs (prevCand.data.no).
- Add the completed candidate list to the appropriate directory.
- Run 00.CandidateList step 1.
- Candidates that are in the latest inflVars.data are valid LMWs (prevCand.data.yes).
- Candidates that are not in the latest inflVars.data are invalid LMWs (prevCand.data.no).
- Invalid LMWs in the candidate list are updated automatically (prevCand.data.no in the diagram below).
- All skipped candidates are treated as invalid LMWs in this process. That is why the program gives linguists a second chance to review invalid LMWs (AUTO_TAG_NO) in the final tagging process.
- Once this process is completed, running CandList.rmYesNo through it again with the latest Lexicon and invalid LMWs should leave it empty:
- Valid candidates are added to the Lexicon.
- Invalid candidates are added to the invalid LMWs file.
- Other invalid terms (notBaseLmw.data.no) are static legacy data used in 2019 and earlier (no further updates from 2020 onward).
- The file Total.data.no (= prevCand.data.no + notBaseLmw.data.no) is used as the latest collection of invalid LMWs. This file should be used in LexAccess.Files from 2020 onward.
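The handling of a completed candidate list can be sketched in the same way. This is a minimal Python illustration under stated assumptions, not the actual 00.CandidateList program. The function names split_completed and merge_total_no are hypothetical, terms are treated as plain strings with exact matching, and reading "+" in the Total.data.no formula as a de-duplicated, sorted union is an assumption.

```python
from typing import Iterable, List, Set, Tuple


def split_completed(
    candidates: Iterable[str], lexicon_terms: Set[str]
) -> Tuple[List[str], List[str]]:
    """Split a completed candidate list (hypothetical helper).

    lexicon_terms is assumed to hold the terms of the latest inflVars.data.
    Returns (yes, no): candidates found in the Lexicon, and all others.
    Skipped candidates are not in the Lexicon, so they land in 'no'.
    """
    yes: List[str] = []
    no: List[str] = []
    for term in candidates:
        (yes if term in lexicon_terms else no).append(term)
    return yes, no


def merge_total_no(
    prev_cand_no: Iterable[str], not_base_lmw_no: Iterable[str]
) -> List[str]:
    """Combine both invalid-LMW sources into one sorted, de-duplicated list."""
    return sorted(set(prev_cand_no) | set(not_base_lmw_no))


if __name__ == "__main__":
    # Made-up terms, for illustration only.
    yes, no = split_completed(["a b", "c d", "e f"], {"a b"})
    total_no = merge_total_no(no, ["x y", "c d"])
    print(yes)
    print(no)
    print(total_no)
```

With an up-to-date Lexicon and invalid list, every completed candidate falls into one of the two groups, which is why CandList.rmYesNo is expected to end up empty.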