The SPECIALIST Lexicon

Previous Candidate Lists

This page describes the analysis and aggregation of all previous Lexicon candidate lists. These lists include valid and invalid candidates from various models, as described below. The numbers are based on real-time data; in other words, this program needs to be re-run to get the latest numbers when:

  • Lexicon is updated
  • A candidate list is completed
  • The not-base/not-LMW files in LexCheck are updated.
Please note that the numbers shown below are a snapshot taken at the tagging completion of the latest candidate list.

The stats are based on the following implementation:

  • When a candidate list is completed, theoretically:
    • All valid words are in the Lexicon
    • Candidates that are not in the Lexicon are invalid words
    • Use the latest Lexicon to auto-tag valid and invalid words from a candidate list to determine the precision (a minimal sketch of this step follows this list)
    • Use those invalid words to update the invalid base/LMW files.
    • Accordingly, when the stats are re-run, the results should be (almost) identical unless a valid word becomes invalid or vice versa.
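
The sketch below illustrates the auto-tagging idea under some assumptions: a plain-text inflVars.data whose first |-delimited field is the term, and a candidate file with one term per line. File names, formats, and functions are illustrative only; this is not the actual 00.CandidateList implementation.

    # A minimal sketch of the auto-tagging step (assumed file formats; not the real program).
    def load_lexicon_terms(infl_vars_path):
        """Collect lowercased terms known to the Lexicon (first |-delimited field of each record)."""
        terms = set()
        with open(infl_vars_path, encoding="utf-8") as f:
            for line in f:
                term = line.split("|", 1)[0].strip().lower()
                if term:
                    terms.add(term)
        return terms

    def auto_tag(candidates_path, lexicon_terms):
        """Tag each candidate as valid (in the Lexicon) or invalid; return both lists and the precision."""
        valid, invalid = [], []
        with open(candidates_path, encoding="utf-8") as f:
            for line in f:
                cand = line.strip().lower()
                if cand:
                    (valid if cand in lexicon_terms else invalid).append(cand)
        total = len(valid) + len(invalid)
        precision = len(valid) / total if total else 0.0
        return valid, invalid, precision

    if __name__ == "__main__":
        lexicon = load_lexicon_terms("inflVars.data.current")
        valid, invalid, p = auto_tag("prevCand.lmw", lexicon)   # hypothetical candidate file name
        print(f"Total: {len(valid) + len(invalid)}, Valid: {len(valid)} ({p:.2%}), "
              f"Invalid: {len(invalid)} ({1 - p:.2%})")
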
Report

  • I. Program: ${MULTIWORDS}/bin/00.CandidateList

    Algorithm:

    • Combine all previous candidate lists
    • Use the latest Lexicon (inflVars.data from LexBuild) to auto-tag valid/invalid LMWs.
    • Calculate the stats (precision) for each candidate list, each model, and overall.
  • II. Data directory: ${MULTIWORDS}/data/Candidate/
  • III. In files:
    • ./0.LexiconInflVars/inflVars.data.current
      => Link to the latest inflVars.data from the LexBuild daily backup

    • ./1.LexiconAbbAcrExpansion/
    • ./2.MNSMatcherParAcr/
    • ./3.DMNSMatcherCuiEndWord/
    • ./4.DMNSMatcherSpVarWc/
      => Use all completed candidate lists.
  • IV. Out Files:
    • prevCand.lmw.tag
    • prevCand.lmw.yes
    • prevCand.lmw.no

    • prevCand.lmw.rpt
      => The result table shown below is based on this report; results might differ slightly over time due to updates to the Lexicon
    • 1.LexiconAbbAcrExpansion
      • candidates derived from the expansions of abbreviations/acronyms in the Lexicon release
      • includes both valid and invalid words
      • Since 2020, this candidate list has been generated in the preprocess step (${MULTIWORD}/12.LexAbbAcrCand/)

      Year  | Acronym Expansions                  | Abbreviation Expansions
            | Total | Valid         | Invalid     | Total | Valid        | Invalid
      2015  | 908   | 881 (97.03%)  | 27 (2.97%)  | 62    | 40 (64.52%)  | 22 (35.48%)
      2016  | 59    | 59 (100.00%)  | 0 (0.00%)   | 183   | 180 (98.36%) | 3 (1.64%)
      2017  | 39    | 39 (100.00%)  | 0 (0.00%)   | 22    | 19 (86.36%)  | 3 (13.64%)
      2018  | 17    | 16 (94.12%)   | 1 (5.88%)   | 28    | 26 (92.86%)  | 2 (7.14%)
      2019  | 151   | 142 (94.04%)  | 9 (5.96%)   | 13    | 12 (92.31%)  | 1 (7.69%)

      Year  | Total | Valid          | Invalid
      2020  | 148   | 112 (75.68%)   | 36 (24.32%)
      2021  | 158   | 129 (81.65%)   | 29 (18.35%)
      2022  | 94    | 53 (56.38%)    | 41 (43.62%)
      2023  | 2     | 2 (100.00%)    | 0 (0.00%)
      Accu. | 1808  | 1636 (90.49%)  | 172 (9.51%)

      * Some of the terms might be duplicated among years

    • 2.MNSMatcherParAcr
      • candidates derived from the (ACR) matcher in MNS (07.MatcherParAcr)
      • includes both valid and invalid words
      • acronymExp.tag.data.tag.final.tbd.${YEAR}
        => CandList/acronymExp.tag.data.tag.final.tbd.${YEAR}.used.rmYesNo: candidates only, does not include AUTO_N
        => CandList/acronymExp.tag.data.tag.final.tbd.${YEAR}.used.rmYesTagNo: includes AUTO_N (sent to linguists)

        Year  | Total | Valid          | Invalid        | Notes
        2015  | 4994  | 3679 (73.67%)  | 1315 (26.33%)  |
        2016  | 360   | 200 (55.56%)   | 160 (44.44%)   |
        2017  | 1855  | 1316 (70.94%)  | 539 (29.06%)   | Completed: 2018-11-15
        2018  | 808   | 604 (74.75%)   | 204 (25.25%)   | AUTO_N is not included (see details below); Completed: 2019-01-03
        2019  | 1081  | 663 (61.33%)   | 418 (38.67%)   | AUTO_N is not included (see details below); Completed: 2019-10-16
        2020  | 1061  | 786 (74.08%)   | 275 (25.92%)   | AUTO_N is not included (see details below); Completed: 2020-08-18
        2021  | 1262  | XXX (XX.XX%)   | XXX (XX.XX%)   | AUTO_N is not included (see details below); Processing, Completed: 20XX-XX-XX
        Accu. | 9816  | 7056 (71.88%)  | 2760 (28.12%)  |

        * Some of the terms might be duplicated among years

      • acronymExp.tag.data.tag.final.tbd.${YEAR}.rmYesTagNo includes AUTO_N
        => AUTO_N: monitor and calculate how many AUTO_N terms become valid LMWs. This feature shows the consistency of tagging.

        Year  | Total | Valid        | Invalid        | Notes
        2018  | 557   | 38 (6.82%)   | 519 (93.18%)   | 6.82% became valid
        2019  | 2533  | 231 (9.12%)  | 2302 (90.88%)  | 9.12% became valid
        2020  | 2771  | 53 (1.91%)   | 2718 (98.09%)  | 1.91% became valid

        Very consistent (small percentage).

    • 3.DMNSMatcherCuiEndWord
      • candidates derived from the CUI and Endword matchers in DMNS
      • includes both valid and invalid words
      • Use the precision from the last file (> 80%) and the number of candidates in the current file (36....rmYesNo: ~1000) to decide the number of top endwords
      • 36.disNGram.Core.endword.out.rmYesNo.gsp.${YEAR}

        Year  | Total | Valid           | Invalid        | Notes
        2016  | 6370  | 5725 (89.87%)   | 645 (10.13%)   | Top 33 endwords
        2017  | 1945  | 1764 (90.69%)   | 181 (9.31%)    | Top 43 endwords; AUTO_N is not included (detailed below); Completed: 2019-05-20
        2018  | 819   | 703 (85.84%)    | 116 (14.16%)   | Top 51 endwords; AUTO_N is not included (detailed below); Completed: 2019-08-02
        2019  | 2918  | 2588 (88.69%)   | 330 (11.31%)   | Top 57 endwords; AUTO_N is not included (detailed below); Completed: 2020-06-12
        2020  | 2846  | 2489 (87.46%)   | 357 (12.54%)   | Top 80 endwords; AUTO_N is not included (detailed below); Completed: 2021-03-01
        2021  | 1550  | TBD (87.46%)    | TBD (12.54%)   | Top 85 endwords; AUTO_N is not included (detailed below); Completed: TBD
        Accu. | 14898 | 13269 (89.07%)  | 1629 (10.93%)  |

        * Some of the terms might be duplicated among years

      • 36.disNGram.Core.endword.out.rmYesTagNo.gsp.${YEAR}
        => AUTO_N: monitor and calculate how many AUTO_N terms become valid LMWs

        Year  | Total | Valid         | Invalid        | Notes
        2017  | 1034  | 393 (38.01%)  | 641 (61.99%)   | 38.01% become valid; main reason is some candidates were not tagged
        2018  | 953   | 133 (13.96%)  | 820 (86.04%)   | 13.96% become valid; clean up
        2019  | 984   | 50 (5.08%)    | 934 (94.92%)   | 5.08% become valid; small percentage is consistent
        2020  | 1291  | 24 (1.86%)    | 1267 (98.14%)  | 1.86% become valid; small percentage is consistent

    • 4.DMNSMatcherSpVarWc
      • candidates derived from the SpVar and Frequency matchers in DMNS
      • includes both valid and invalid words

        Year  | Word Count | Total | Valid          | Invalid        | Accu. Precision
        2015  | 1000000    | 3368  | 2397 (71.17%)  | 971 (28.83%)   | 71.17%
              | 100000     | 2218  | 1520 (68.53%)  | 698 (31.47%)   | 70.12%
              | 10000      | 895   | 605 (67.60%)   | 290 (32.40%)   | 69.77%
              | 1000       | 588   | 249 (42.35%)   | 339 (57.65%)   | 67.49%
              | 100        | 538   | 119 (22.12%)   | 419 (77.88%)   | 64.28%
        Accu. | Accu.      | 7607  | 4890 (64.28%)  | 2712 (35.72%)  | 64.28%

        * This model is not run because it is time-consuming and resources are limited

    • 8.WordNet
      • candidates derived from the derivations, synonyms, and antonyms in WordNet 3.0
      • includes both valid and invalid words
      • unique lowercase terms are used (the input has both cases), so the total number is smaller than the actual number of input terms

        Model         | Total | Valid          | Invalid       | Notes
        zeroD, CUI    | 322   | 322 (100.00%)  | 0 (0.00%)     | WordNetCand.ZD.cui.2021
        zeroD, no CUI | 626   | 601 (96.01%)   | 25 (3.99%)    | WordNetCand.ZD.noCui.2021
        aPairs        | 1912  | 1412 (73.85%)  | 500 (26.15%)  | WordNetCand.AP.2021
        Accu.         | 2858  | 2333 (81.63%)  | 525 (18.37%)  |

    • Process to get stats on previously tagged candidate lists (a minimal sketch of the bookkeeping steps follows the report table at the end of this page):
      • Run 00.CandidateList
        • 1
          => Get invalid words from all previous candidate lists; the lists need to be in the candidate directories:
          • 1.LexiconAbbAcrExpansion
          • 2.MNSMatcherParAcr
          • 3.DMNSMatcherCuiEndWord
          • 4.DMNSMatcherSpVarWc
          • 8.WordNet

          => unique lowercase terms from prevCand are used.
          => generate prevCand.rpt (for the stats).
          => generate prevCand.*
        • 2
          => update the link ./5.LexCheckNotBaseForm/notBaseForm.data.current to the latest one in LexCheck
          => update the link ./6.LexCheckNotLmw/notLmw.data.current to the latest one in LexCheck
          => get invalid words from notBase and notLmw
          => generate notBaseLmw.*
        • 3
          => combine invalid words from the above two steps
          => generate totalTerms.1_2.*
        • 4
          => Copy results (reports) to ./DataLog/${YEAR}/${YEAR}_${MONTH}_${DAY}
        • 5, only used for non-routine word candidates
          • prevCand.lmw.rpt (used for updating the stats.)
          • notBaseLmw.lmw.rpta (used for reference)
      • candidates from all of the above sources whose tagging has been completed.
      • out files: prevCand.lmw.rpt

        Date       | Total | Valid           | Invalid        | Notes (completed candList)
        2018-11-15 | 21955 | 16096 (73.31%)  | 5859 (26.69%)  | 2.MNSMatcherParAcr, 2017
        2019-01-03 | 22763 | 16687 (73.31%)  | 6076 (26.69%)  | 2.MNSMatcherParAcr, 2018
        2019-07-19 | 24856 | 18915 (76.10%)  | 5941 (23.90%)  | 1.LexiconAbbAcrExpansion, 2020
        2019-08-02 | 25675 | 19608 (76.37%)  | 6067 (23.63%)  | 3.DMNSMatcherCuiEndWord, 2018
        2019-10-16 | 26756 | 20429 (76.35%)  | 6327 (23.65%)  | 2.MNSMatcherParAcr, 2019
        2020-06-12 | 29674 | 23041 (77.65%)  | 6633 (22.35%)  | 3.DMNSMatcherCuiEndWord, 2019
        2020-07-17 | 29832 | 23192 (77.74%)  | 6640 (22.26%)  | 1.LexiconAbbAcrExpansion, 2021
        2020-08-18 | 30892 | 23999 (77.69%)  | 6893 (22.31%)  | 2.MNSMatcherParAcr, 2020
        2021-03-01 | 33737 | 26512 (78.58%)  | 7225 (21.42%)  | 3.DMNSMatcherCuiEndWord, 2020
        2021-07-13 | 33831 | 26571 (78.54%)  | 7260 (21.46%)  | 1.LexiconAbbAcrExpansion, 2022
        2022-01-10 | 34128 | 26868 (78.73%)  | 7260 (21.27%)  | 8.WordNetCand.ZD.cui.2021
        2022-01-10 | 34754 | 27466 (79.03%)  | 7288 (20.97%)  | 8.WordNetCand.ZD.noCui.2021
        2022-07-06 | 34756 | 27471 (79.04%)  | 7285 (20.96%)  | 1.LexiconAbbAcrExpansion, 2023
        2022-09-27 | 36649 | 28865 (78.76%)  | 7784 (21.24%)  | 8.WordNetCand.AP.2021
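
To make the bookkeeping steps of the process above concrete, the sketch below mirrors steps 2-4 (refreshing the *.data.current links, combining invalid words, and copying the reports into a dated DataLog directory). Paths and file names follow the description on this page, but the helper functions are hypothetical and not part of the actual 00.CandidateList program.

    # Hypothetical helpers mirroring steps 2-4 of the process above; not the real 00.CandidateList.
    import shutil
    from datetime import date
    from pathlib import Path

    def update_current_link(link: Path, target: Path) -> None:
        """Step 2: point a *.data.current link (e.g. notBaseForm.data.current or
        notLmw.data.current) at the latest file in LexCheck."""
        if link.is_symlink() or link.exists():
            link.unlink()
        link.symlink_to(target)

    def combine_invalid_words(in_files, out_file: Path) -> None:
        """Step 3: merge unique lowercased invalid words from the previous steps
        into one file (e.g. totalTerms.1_2)."""
        terms = set()
        for f in in_files:
            for line in Path(f).read_text(encoding="utf-8").splitlines():
                if line.strip():
                    terms.add(line.strip().lower())
        out_file.write_text("\n".join(sorted(terms)) + "\n", encoding="utf-8")

    def archive_reports(reports, data_log: Path) -> None:
        """Step 4: copy result reports to ./DataLog/${YEAR}/${YEAR}_${MONTH}_${DAY}."""
        today = date.today()
        dest = data_log / str(today.year) / today.strftime("%Y_%m_%d")
        dest.mkdir(parents=True, exist_ok=True)
        for rpt in reports:
            shutil.copy2(rpt, dest)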