SPECIALIST Lexicon

Auto-tag Processes

This page describes the auto-tag processes in LMW candidate list generation:

I. Auto-tag raw LMW candidate list:
- Raw candidate lists are generated from:
- These raw candidate lists must go through the processes in 00.CandidateList (step 10) to auto-tag:
  - valid LMWs (terms in the latest Lexicon inflVars.data)
  - invalid LMWs (known invalid LMWs, Total.data.no)
  - CandList.rmYesNo contains the new terms that are not auto-tagged. It is used as the candidate list (to calculate precision), as shown at the end of the process in the following diagram.
  - CandList.rmYesTagNo is the actual file sent to linguists so they can have another chance to review terms with AUTO_TAG_NO to ensure they are invalid LMWs.
  - The precision of all models for the latest Lexicon is reported in prevCand.data.rpt.
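
The raw-list auto-tagging described above can be sketched as follows. This is a minimal illustration, not the actual 00.CandidateList program: the function name `auto_tag_raw`, the `normalize` helper (trim and lowercase), and the `term|AUTO_TAG_NO` output layout are all assumptions made for the example. It also assumes the valid and invalid terms have already been read from inflVars.data and Total.data.no into plain lists of strings.

```python
from typing import Iterable, List, Tuple

AUTO_TAG_NO = "AUTO_TAG_NO"  # tag name from the text; the output layout below is assumed


def normalize(term: str) -> str:
    """Hypothetical matching rule: ignore case and surrounding whitespace."""
    return term.strip().lower()


def auto_tag_raw(
    candidates: Iterable[str],
    valid_terms: Iterable[str],    # assumed: terms taken from inflVars.data
    invalid_terms: Iterable[str],  # assumed: terms taken from Total.data.no
) -> Tuple[List[str], List[str]]:
    """Return (rm_yes_no, rm_yes_tag_no) for a raw candidate list.

    rm_yes_no:     candidates that are neither known-valid nor known-invalid.
    rm_yes_tag_no: known-valid candidates removed, known-invalid ones kept
                   and marked with AUTO_TAG_NO for a second review.
    """
    valid = {normalize(t) for t in valid_terms}
    invalid = {normalize(t) for t in invalid_terms}

    rm_yes_no: List[str] = []
    rm_yes_tag_no: List[str] = []
    for term in candidates:
        key = normalize(term)
        if key in valid:
            continue  # already a valid LMW in the Lexicon: drop from both lists
        if key in invalid:
            rm_yes_tag_no.append(f"{term}|{AUTO_TAG_NO}")
        else:
            rm_yes_no.append(term)
            rm_yes_tag_no.append(term)
    return rm_yes_no, rm_yes_tag_no
```

Under these assumptions, only the second list carries the AUTO_TAG_NO entries, which matches the description of CandList.rmYesTagNo as the file sent to linguists.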
II. Auto-tag completed LMW candidate list:
- Once a candidate list is completed (submitted and approved) by the linguists, it must go through the processes in 00.CandidateList (steps 1-3) to update the invalid LMWs (prevCand.data.no):
  - Add the completed candidate list to the appropriate directory.
  - Run 00.CandidateList step 1.
  - Candidates that are in the latest inflVars.data are valid LMWs (prevCand.data.yes).
  - Candidates that are not in the latest inflVars.data are invalid LMWs (prevCand.data.no).
    - Invalid LMWs in the candidate list are updated automatically (prevCand.data.no in the diagram below).
    - All skipped candidates are treated as invalid LMWs in this process, which is why the program gives linguists a second chance to review invalid LMWs (AUTO_TAG_NO) in the final tagging process.
  - Once the list is completed, rerunning the above process with the latest Lexicon and invalid LMWs should leave CandList.rmYesNo empty:
    - Valid candidates are added to the Lexicon.
    - Invalid candidates are added to the invalid LMWs file.
- Other invalid terms (notBaseLmw.data.no) are static legacy data from 2019 and earlier; they are no longer updated (2020+).
- The file Total.data.no (= prevCand.data.no + notBaseLmw.data.no) is used as the latest collection of invalid LMWs. This file should be used in LexAccess.Files from 2020 onward.
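
The completed-list update and the Total.data.no combination can be sketched in the same way. Again, this is only an illustration under stated assumptions: `split_completed` and `merge_invalid` are hypothetical names, terms are compared by exact string match, and the "+" in the formula is read here as concatenation with duplicates removed.

```python
from typing import Iterable, List, Tuple


def split_completed(
    candidates: Iterable[str],
    infl_vars_terms: Iterable[str],  # assumed: terms taken from the latest inflVars.data
) -> Tuple[List[str], List[str]]:
    """Split a completed candidate list into (yes, no).

    yes: candidates found in the latest Lexicon (prevCand.data.yes).
    no:  everything else, including skipped candidates (prevCand.data.no).
    """
    valid = set(infl_vars_terms)
    yes: List[str] = []
    no: List[str] = []
    for term in candidates:
        (yes if term in valid else no).append(term)
    return yes, no


def merge_invalid(prev_cand_no: Iterable[str], not_base_lmw_no: Iterable[str]) -> List[str]:
    """Combine both invalid-term sources into one list (Total.data.no).

    Duplicates are dropped and first-seen order is kept; the de-duplication
    is an assumption, not something the text specifies.
    """
    return list(dict.fromkeys([*prev_cand_no, *not_base_lmw_no]))
```

For example, `merge_invalid(no, legacy)` would take the `no` list from `split_completed` together with the static notBaseLmw.data.no terms and return the combined invalid-LMW collection.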
