The SPECIALIST Lexicon

Multiword Candidates Routine Generation Procedures

In addition to user-requested terms and sources, LSG also follows routine procedures to generate multiword candidate lists for building the Lexicon:

Post-processes:

  • Before 2018-
    • Tagged Invalid LMW Candidates

      All invalid terms from the tagged candidate list should be retrieved and saved to "invalidLmwList.out". These invalid LMW terms, along with words already in the Lexicon, should be filtered out of the candidate list.

    • Calculate precision
      • Total precision: (newYes + autoYes)/total candidates
      • New precision: (newYes)/new candidates

      • total candidates = autoYes + autoNo + newYes + newNo
      • autoYes (terms in the Lexicon) and autoNo (terms in the invalid LMW candidates) are tagged automatically
      • newYes and newNo are manually tagged by a linguist
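The two precision formulas above can be sketched in Python as follows. This is a minimal illustration, not LSG's actual program: the `TagCounts` container and function names are hypothetical, and it assumes that "new candidates" means newYes + newNo (the manually tagged candidates), which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class TagCounts:
    """Tag counts for one candidate list (hypothetical container)."""
    auto_yes: int  # candidates already in the Lexicon
    auto_no: int   # candidates already in the invalid LMW list
    new_yes: int   # candidates tagged valid by a linguist
    new_no: int    # candidates tagged invalid by a linguist

    @property
    def total(self) -> int:
        # total candidates = autoYes + autoNo + newYes + newNo
        return self.auto_yes + self.auto_no + self.new_yes + self.new_no

    @property
    def new(self) -> int:
        # Assumption: new candidates are the manually tagged ones.
        return self.new_yes + self.new_no


def total_precision(c: TagCounts) -> float:
    """(newYes + autoYes) / total candidates; 0.0 for an empty list."""
    return (c.new_yes + c.auto_yes) / c.total if c.total else 0.0


def new_precision(c: TagCounts) -> float:
    """newYes / new candidates; 0.0 when nothing was manually tagged."""
    return c.new_yes / c.new if c.new else 0.0


counts = TagCounts(auto_yes=40, auto_no=10, new_yes=30, new_no=20)
print(total_precision(counts), new_precision(counts))
```

With these made-up counts, 70 of the 100 candidates are valid overall and 30 of the 50 new candidates are valid.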
  • After 2018+

    A systematic post-process was implemented to filter valid and invalid LMWs from the candidate list:

    • I. Logic: use the latest Lexicon to tag valid|invalid LMWs
      • Valid LMWs:
        • Collect all terms from the latest Lexicon (inflVars). These are valid LMWs.
        • Generate inflVars from LexBuild - postProcess
      • Invalid LMWs:
        • Get all terms from previous candidate list (without |AUTO_N tag)
        • Remove valid LMWs from above
        • The rest are invalid LMWs
        • Tag invalid LMWs in the new candidate list as [AUTO_N]
          => These can be removed if all of them are tagged [n] after running through several candidate lists.
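The logic in I. can be sketched as a set difference. The sketch below is an illustration under several assumptions that the text does not confirm: candidate lines look like `term|TAG` with `|` as the field separator, "without |AUTO_N tag" is read as "skip lines that already carry AUTO_N", and terms are matched by exact string equality. All function names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

AUTO_N = "AUTO_N"  # tag name from the text; the "|" separator is an assumption


def parse_candidate(line: str, sep: str = "|") -> Tuple[str, str]:
    """Split 'term|TAG' into (term, tag); tag is '' when no tag is present."""
    term, _, tag = line.strip().partition(sep)
    return term, tag


def derive_invalid_lmws(prev_lines: Iterable[str], valid_lmws: Set[str]) -> Set[str]:
    """Terms of the previous candidate list that are not valid LMWs.

    Lines already tagged AUTO_N are skipped (one reading of the text).
    """
    invalid = set()
    for line in prev_lines:
        term, tag = parse_candidate(line)
        if not term or tag == AUTO_N:
            continue
        if term not in valid_lmws:  # remove valid LMWs; the rest are invalid
            invalid.add(term)
    return invalid


def tag_new_candidates(new_terms: Iterable[str], invalid_lmws: Set[str],
                       sep: str = "|") -> List[str]:
    """Append the AUTO_N tag to new candidates that are known invalid LMWs."""
    return [f"{t}{sep}{AUTO_N}" if t in invalid_lmws else t for t in new_terms]


valid = {"heart attack", "heart attacks"}  # stands in for inflVars terms
prev = ["heart attack|y", "in the heart|n", "of the|AUTO_N", "blood of"]
invalid = derive_invalid_lmws(prev, valid)
print(tag_new_candidates(["in the heart", "cardiac arrest"], invalid))
```

Because validity is decided only by membership in the latest Lexicon, a term that was once tagged invalid but later added to the Lexicon drops out of the invalid set automatically on the next run.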
    • II. Root Directory:
      • Root directory: ${MULTIWORDS}/data/Candidates
    • III. Program:
      • Run: ${MULTIWORDS}/bin/00.CandidateList
      • This program needs to be run:
        • once a candidate list is generated
        • to remove candidates that are already in the Lexicon (inflVars.data)
        • to tag (|AUTO_N) or remove candidates that were previously tagged as invalid LMWs (notBaseForm.data and notLmw.data)

        • Also, after the candidate list is tagged, it is used to calculate the stats

      • Steps (each with Description, Input, Output, Notes):
        • Step 1: Aggregate and analyze all previous LMW candidate files
          => This program analyzes the precision of the candidate lists (candidates that are valid LMWs)
          • Input:
            • 0.LexiconInflVars/inflVars.data.current
            • 1.LexiconAbbAcrExpansion/newEuis.a[bc][br].tagged.txt.y.20NN
            • 2.MNSMatcherParAcr/acronymExp.tag.data.tag.final.tbd.20NN
            • 3.DMNSMatcherCuiEndWor/disNGram.Core.endword.new.out.gsp.20NN
            • 4.DMNSMatcherSpVarWc/*
          • Output:
            • prevCand.data
            • prevCand.data.no (invalid LMWs)
            • prevCand.data.yes (valid LMWs)
            • prevCand.data.rpt (detail stats report)
          • Notes:
            • Must update:
              • candidate list if completed tagging
              • inflVars (link to the latest inflVars from LexBuild)
            • Check the latest valid vs. invalid ratio
        • Step 2: Aggregate and analyze not baseForm/LMW from LexCheck/candidate files
          => This program analyzes the precision of invalid LMWs from notBaseForm.data and notLmw.data from the annual Lexicon tagging
          • Input:
            • 5.LexCheckNotBaseFor/notBaseForm.data.${YEAR}
            • 6.LexCheckNotLmw/notLmw.data.${YEAR}
          • Output:
            • notBaseLmw.data
            • notBaseLmw.data.no (invalid LMWs)
            • notBaseLmw.data.yes (valid LMWs)
            • notBaseLmw.data.rpt (detail stats report)
          • Notes:
            • Must update:
              • notBaseForm.data.${YEAR}
              • notLmw.data.${YEAR}
              • inflVars (link to latest inflVars from LexBuild)
            • Check the latest valid vs. invalid ratio
        • Step 3: Combine output files from steps 1 and 2 to get the total data set.
          • Input:
            • ./prevCand.data
            • ./notBaseLmw.data
          • Output:
            • ./totalData.data
            • ./totalData.data.yes
            • ./totalData.data.no
          • Notes:
            • Must run steps 1 and 2
            • Check the latest valid vs. invalid ratio
            • Can be used as tagged data for machine learning model
        • Step 10: Filter and tag valid/invalid LMWs for a candidate file
          • Input:
            • ./0.LexiconInflVars/inflVars.data.current (valid LMW file)
            • ./totalData.data.no (invalid LMW file)
            • Specify:
              • inFile.data
              • outFile.data
          • Output:
            • outFile.data
          • Notes:
            • Must complete/update steps 1 ~ 3
            • input the new candidate file (or link to ./inFile.data)
        • Step 20: Generate DL TtSet from valid/invalid LMWs candidate files
        • Step 21: Generate DL TtSet from inflVars (valid) and invalid LMWs in n-grams ..
    • IV. Results: please see previous candidate lists
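As a rough illustration of steps 3 and 10 of the program, the sketch below combines the yes/no term sets from steps 1 and 2, reports a valid share, and filters a candidate file. It is not the actual `00.CandidateList` program. The assumptions are: files hold one term per line, a term found in both the yes and no sets is treated as valid, the "valid vs. invalid ratio" is reported as the share of valid terms, and the function names are hypothetical.

```python
from pathlib import Path
from typing import Iterable, Set, Tuple


def combine_tagged(prev_yes: Iterable[str], prev_no: Iterable[str],
                   base_yes: Iterable[str], base_no: Iterable[str]
                   ) -> Tuple[Set[str], Set[str]]:
    """Step 3 sketch: union the yes/no sets from steps 1 and 2."""
    total_yes = set(prev_yes) | set(base_yes)
    # Assumption: a term tagged both ways counts as valid.
    total_no = (set(prev_no) | set(base_no)) - total_yes
    return total_yes, total_no


def valid_share(yes: Set[str], no: Set[str]) -> float:
    """Share of valid terms among all tagged terms (0.0 if there are none)."""
    total = len(yes) + len(no)
    return len(yes) / total if total else 0.0


def filter_candidate_file(in_path: str, out_path: str,
                          valid: Set[str], invalid: Set[str],
                          sep: str = "|") -> int:
    """Step 10 sketch: drop valid LMWs, tag known invalid LMWs with AUTO_N.

    Returns the number of candidates written to out_path.
    """
    kept = []
    for line in Path(in_path).read_text(encoding="utf-8").splitlines():
        term = line.strip()
        if not term or term in valid:
            continue  # already in the Lexicon: nothing left to tag
        kept.append(f"{term}{sep}AUTO_N" if term in invalid else term)
    Path(out_path).write_text("".join(k + "\n" for k in kept), encoding="utf-8")
    return len(kept)
```

A call such as `filter_candidate_file("inFile.data", "outFile.data", yes, no)` would leave only the untagged candidates for a linguist to review, plus the AUTO_N lines for later removal.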