The SPECIALIST Lexicon

Previous Candidate Lists

This page describes the analysis and aggregation of all previous Lexicon candidate lists. These lists include valid and invalid candidates from various models, as described below. The numbers are based on real-time data; in other words, this program needs to be re-run to get the latest numbers when:

  • Lexicon is updated
  • A candidate list is completed
  • The not-base/not-LMW files in LexCheck are updated.
Please note that the numbers shown below are a snapshot taken at the tagging completion of the latest candidate list.

The stats are based on the following implementation:

  • When a candidate list is completed, theoretically:
    • All valid words are in the Lexicon
    • Candidates that are not in the Lexicon are invalid words
    • Use the latest Lexicon to auto-tag valid and invalid words from a candidate list to determine the precision (a minimal sketch of this step follows this list)
    • Use those invalid words to update the invalid base/LMW files.
    • Accordingly, when the stats are re-run, the results should be (almost) identical unless a valid word becomes invalid or vice versa.
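
The sketch below illustrates the auto-tagging idea under some assumptions: a plain-text inflVars.data whose first |-delimited field is the term, and a candidate file with one term per line. File names, formats, and functions are illustrative only; this is not the actual 00.CandidateList implementation.

    # A minimal sketch of the auto-tagging step (assumed file formats; not the real program).
    def load_lexicon_terms(infl_vars_path):
        """Collect lowercased terms known to the Lexicon (first |-delimited field of each record)."""
        terms = set()
        with open(infl_vars_path, encoding="utf-8") as f:
            for line in f:
                term = line.split("|", 1)[0].strip().lower()
                if term:
                    terms.add(term)
        return terms

    def auto_tag(candidates_path, lexicon_terms):
        """Tag each candidate as valid (in the Lexicon) or invalid; return both lists and the precision."""
        valid, invalid = [], []
        with open(candidates_path, encoding="utf-8") as f:
            for line in f:
                cand = line.strip().lower()
                if cand:
                    (valid if cand in lexicon_terms else invalid).append(cand)
        total = len(valid) + len(invalid)
        precision = len(valid) / total if total else 0.0
        return valid, invalid, precision

    if __name__ == "__main__":
        lexicon = load_lexicon_terms("inflVars.data.current")
        valid, invalid, p = auto_tag("prevCand.lmw", lexicon)   # hypothetical candidate file name
        print(f"Total: {len(valid) + len(invalid)}, Valid: {len(valid)} ({p:.2%}), "
              f"Invalid: {len(invalid)} ({1 - p:.2%})")
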
Report

  • I. Program: ${MULTIWORDS}/bin/00.CandidateList

    Algorithm:

    • Combine all previous candidate lists
    • Use the latest Lexicon (inflVars.data from LexBuild) to auto-tag valid/invalid LMWs.
    • Calculate the stats (precision) for each candidate list, each model, and overall.
  • II. Data directory: ${MULTIWORDS}/data/Candidate/
  • III. In files:
    • ./0.LexiconInflVars/inflVars.data.current
      => Link to the latest inflVars.data from the LexBuild daily backup

    • ./1.LexiconAbbAcrExpansion/
    • ./2.MNSMatcherParAcr/
    • ./3.DMNSMatcherCuiEndWord/
    • ./4.DMNSMatcherSpVarWc/
      => Use all completed candidate lists.
  • IV. Out Files:
    • prevCand.lmw.tag
    • prevCand.lmw.yes
    • prevCand.lmw.no

    • prevCand.lmw.rpt
      => The result table shown below is based on this report; results might differ slightly over time due to updates to the Lexicon
    • 1.LexiconAbbAcrExpansion
      • candidates derived from the expansions of abbreviations/acronyms in the Lexicon release
      • includes both valid and invalid words
      • Since 2020, this candidate list has been generated in the preprocess step (${MULTIWORD}/12.LexAbbAcrCand/)

      Year  | Acronym Expansions                  | Abbreviation Expansions
            | Total | Valid         | Invalid     | Total | Valid        | Invalid
      2015  | 908   | 881 (97.03%)  | 27 (2.97%)  | 62    | 40 (64.52%)  | 22 (35.48%)
      2016  | 59    | 59 (100.00%)  | 0 (0.00%)   | 183   | 180 (98.36%) | 3 (1.64%)
      2017  | 39    | 39 (100.00%)  | 0 (0.00%)   | 22    | 19 (86.36%)  | 3 (13.64%)
      2018  | 17    | 16 (94.12%)   | 1 (5.88%)   | 28    | 26 (92.86%)  | 2 (7.14%)
      2019  | 151   | 142 (94.04%)  | 9 (5.96%)   | 13    | 12 (92.31%)  | 1 (7.69%)

      Year  | Total | Valid          | Invalid
      2020  | 148   | 112 (75.68%)   | 36 (24.32%)
      2021  | 158   | 129 (81.65%)   | 29 (18.35%)
      2022  | 94    | 53 (56.38%)    | 41 (43.62%)
      2023  | 2     | 2 (100.00%)    | 0 (0.00%)
      Accu. | 1808  | 1636 (90.49%)  | 172 (9.51%)

      * Some of the terms might be duplicated among years

    • 2.MNSMatcherParAcr
      • candidates derived from the (ACR) matcher in MNS (07.MatcherParAcr)
      • includes both valid and invalid words
      • acronymExp.tag.data.tag.final.tbd.${YEAR}
        => CandList/acronymExp.tag.data.tag.final.tbd.${YEAR}.used.rmYesNo: candidates only, does not include AUTO_N
        => CandList/acronymExp.tag.data.tag.final.tbd.${YEAR}.used.rmYesTagNo: includes AUTO_N (sent to linguists)

        Year  | Total | Valid          | Invalid        | Notes
        2015  | 4994  | 3679 (73.67%)  | 1315 (26.33%)  |
        2016  | 360   | 200 (55.56%)   | 160 (44.44%)   |
        2017  | 1855  | 1316 (70.94%)  | 539 (29.06%)   | Completed: 2018-11-15
        2018  | 808   | 604 (74.75%)   | 204 (25.25%)   | AUTO_N is not included (see details below); Completed: 2019-01-03
        2019  | 1081  | 663 (61.33%)   | 418 (38.67%)   | AUTO_N is not included (see details below); Completed: 2019-10-16
        2020  | 1061  | 786 (74.08%)   | 275 (25.92%)   | AUTO_N is not included (see details below); Completed: 2020-08-18
        2021  | 1262  | XXX (XX.XX%)   | XXX (XX.XX%)   | AUTO_N is not included (see details below); Processing, Completed: 20XX-XX-XX
        Accu. | 9816  | 7056 (71.88%)  | 2760 (28.12%)  |

        * Some of the terms might be duplicated among years

      • acronymExp.tag.data.tag.final.tbd.${YEAR}.rmYesTagNo includes AUTO_N
        => AUTO_N: monitor and calculate how many AUTO_N terms become valid LMWs. This feature shows the consistency of tagging.

        Year  | Total | Valid        | Invalid        | Notes
        2018  | 557   | 38 (6.82%)   | 519 (93.18%)   | 6.82% became valid
        2019  | 2533  | 231 (9.12%)  | 2302 (90.88%)  | 9.12% became valid
        2020  | 2771  | 53 (1.91%)   | 2718 (98.09%)  | 1.91% became valid

        Very consistent (small percentage).

    • 3.DMNSMatcherCuiEndWord
      • candidates derived from the CUI and Endword matchers in DMNS
      • includes both valid and invalid words
      • Use the precision from the last file (> 80%) and the number of candidates in the current file (36....rmYesNo: ~1000) to decide the number of top endwords
      • 36.disNGram.Core.endword.out.rmYesNo.gsp.${YEAR}

        Year  | Total | Valid           | Invalid        | Notes
        2016  | 6370  | 5725 (89.87%)   | 645 (10.13%)   | Top 33 endwords
        2017  | 1945  | 1764 (90.69%)   | 181 (9.31%)    | Top 43 endwords; AUTO_N is not included (detailed below); Completed: 2019-05-20
        2018  | 819   | 703 (85.84%)    | 116 (14.16%)   | Top 51 endwords; AUTO_N is not included (detailed below); Completed: 2019-08-02
        2019  | 2918  | 2588 (88.69%)   | 330 (11.31%)   | Top 57 endwords; AUTO_N is not included (detailed below); Completed: 2020-06-12
        2020  | 2846  | 2489 (87.46%)   | 357 (12.54%)   | Top 80 endwords; AUTO_N is not included (detailed below); Completed: 2021-03-01
        2021  | 1550  | TBD (87.46%)    | TBD (12.54%)   | Top 85 endwords; AUTO_N is not included (detailed below); Completed: TBD
        Accu. | 14898 | 13269 (89.07%)  | 1629 (10.93%)  |

        * Some of the terms might be duplicated among years

      • 36.disNGram.Core.endword.out.rmYesTagNo.gsp.${YEAR}
        => AUTO_N: monitor and calculate how many AUTO_N terms become valid LMWs

        Year  | Total | Valid         | Invalid        | Notes
        2017  | 1034  | 393 (38.01%)  | 641 (61.99%)   | 38.01% become valid; main reason is some candidates were not tagged
        2018  | 953   | 133 (13.96%)  | 820 (86.04%)   | 13.96% become valid; clean up
        2019  | 984   | 50 (5.08%)    | 934 (94.92%)   | 5.08% become valid; small percentage is consistent
        2020  | 1291  | 24 (1.86%)    | 1267 (98.14%)  | 1.86% become valid; small percentage is consistent

    • 4.DMNSMatcherSpVarWc
      • candidates derived from the SpVar and Frequency matchers in DMNS
      • includes both valid and invalid words

        Year  | Word Count | Total | Valid          | Invalid        | Accu. Precision
        2015  | 1000000    | 3368  | 2397 (71.17%)  | 971 (28.83%)   | 71.17%
              | 100000     | 2218  | 1520 (68.53%)  | 698 (31.47%)   | 70.12%
              | 10000      | 895   | 605 (67.60%)   | 290 (32.40%)   | 69.77%
              | 1000       | 588   | 249 (42.35%)   | 339 (57.65%)   | 67.49%
              | 100        | 538   | 119 (22.12%)   | 419 (77.88%)   | 64.28%
        Accu. | Accu.      | 7607  | 4890 (64.28%)  | 2712 (35.72%)  | 64.28%

        * This model is not run because it is time-consuming and resources are limited

    • 8.WordNet
      • candidates derived from the derivations, synonyms, and antonyms in WordNet 3.0
      • includes both valid and invalid words
      • unique lowercase terms are used (the input has both cases), so the total number is smaller than the actual number of input terms

        Model         | Total | Valid          | Invalid       | Notes
        zeroD, CUI    | 322   | 322 (100.00%)  | 0 (0.00%)     | WordNetCand.ZD.cui.2021
        zeroD, no CUI | 626   | 601 (96.01%)   | 25 (3.99%)    | WordNetCand.ZD.noCui.2021
        aPairs        | 1912  | 1412 (73.85%)  | 500 (26.15%)  | WordNetCand.AP.2021
        Accu.         | 2858  | 2333 (81.63%)  | 525 (18.37%)  |

    • Process to get stats on previously tagged candidate lists (a minimal sketch of the bookkeeping steps follows the report table at the end of this page):
      • Run 00.CandidateList
        • 1
          => Get invalid words from all previous candidate lists; the lists need to be in the candidate directories:
          • 1.LexiconAbbAcrExpansion
          • 2.MNSMatcherParAcr
          • 3.DMNSMatcherCuiEndWord
          • 4.DMNSMatcherSpVarWc
          • 8.WordNet

          => unique lowercase terms from prevCand are used.
          => generate prevCand.rpt (for the stats).
          => generate prevCand.*
        • 2
          => update the link ./5.LexCheckNotBaseForm/notBaseForm.data.current to the latest one in LexCheck
          => update the link ./6.LexCheckNotLmw/notLmw.data.current to the latest one in LexCheck
          => get invalid words from notBase and notLmw
          => generate notBaseLmw.*
        • 3
          => combine invalid words from the above two steps
          => generate totalTerms.1_2.*
        • 4
          => Copy results (reports) to ./DataLog/${YEAR}/${YEAR}_${MONTH}_${DAY}
        • 5, only used for non-routine word candidates
          • prevCand.lmw.rpt (used for updating the stats.)
          • notBaseLmw.lmw.rpta (used for reference)
      • candidates from all of the above sources whose tagging has been completed.
      • out files: prevCand.lmw.rpt

        Date       | Total | Valid           | Invalid        | Notes (completed candList)
        2018-11-15 | 21955 | 16096 (73.31%)  | 5859 (26.69%)  | 2.MNSMatcherParAcr, 2017
        2019-01-03 | 22763 | 16687 (73.31%)  | 6076 (26.69%)  | 2.MNSMatcherParAcr, 2018
        2019-07-19 | 24856 | 18915 (76.10%)  | 5941 (23.90%)  | 1.LexiconAbbAcrExpansion, 2020
        2019-08-02 | 25675 | 19608 (76.37%)  | 6067 (23.63%)  | 3.DMNSMatcherCuiEndWord, 2018
        2019-10-16 | 26756 | 20429 (76.35%)  | 6327 (23.65%)  | 2.MNSMatcherParAcr, 2019
        2020-06-12 | 29674 | 23041 (77.65%)  | 6633 (22.35%)  | 3.DMNSMatcherCuiEndWord, 2019
        2020-07-17 | 29832 | 23192 (77.74%)  | 6640 (22.26%)  | 1.LexiconAbbAcrExpansion, 2021
        2020-08-18 | 30892 | 23999 (77.69%)  | 6893 (22.31%)  | 2.MNSMatcherParAcr, 2020
        2021-03-01 | 33737 | 26512 (78.58%)  | 7225 (21.42%)  | 3.DMNSMatcherCuiEndWord, 2020
        2021-07-13 | 33831 | 26571 (78.54%)  | 7260 (21.46%)  | 1.LexiconAbbAcrExpansion, 2022
        2022-01-10 | 34128 | 26868 (78.73%)  | 7260 (21.27%)  | 8.WordNetCand.ZD.cui.2021
        2022-01-10 | 34754 | 27466 (79.03%)  | 7288 (20.97%)  | 8.WordNetCand.ZD.noCui.2021
        2022-07-06 | 34756 | 27471 (79.04%)  | 7285 (20.96%)  | 1.LexiconAbbAcrExpansion, 2023
        2022-09-27 | 36649 | 28865 (78.76%)  | 7784 (21.24%)  | 8.WordNetCand.AP.2021
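
To make the bookkeeping steps of the process above concrete, the sketch below mirrors steps 2-4 (refreshing the *.data.current links, combining invalid words, and copying the reports into a dated DataLog directory). Paths and file names follow the description on this page, but the helper functions are hypothetical and not part of the actual 00.CandidateList program.

    # Hypothetical helpers mirroring steps 2-4 of the process above; not the real 00.CandidateList.
    import shutil
    from datetime import date
    from pathlib import Path

    def update_current_link(link: Path, target: Path) -> None:
        """Step 2: point a *.data.current link (e.g. notBaseForm.data.current or
        notLmw.data.current) at the latest file in LexCheck."""
        if link.is_symlink() or link.exists():
            link.unlink()
        link.symlink_to(target)

    def combine_invalid_words(in_files, out_file: Path) -> None:
        """Step 3: merge unique lowercased invalid words from the previous steps
        into one file (e.g. totalTerms.1_2)."""
        terms = set()
        for f in in_files:
            for line in Path(f).read_text(encoding="utf-8").splitlines():
                if line.strip():
                    terms.add(line.strip().lower())
        out_file.write_text("\n".join(sorted(terms)) + "\n", encoding="utf-8")

    def archive_reports(reports, data_log: Path) -> None:
        """Step 4: copy result reports to ./DataLog/${YEAR}/${YEAR}_${MONTH}_${DAY}."""
        today = date.today()
        dest = data_log / str(today.year) / today.strftime("%Y_%m_%d")
        dest.mkdir(parents=True, exist_ok=True)
        for rpt in reports:
            shutil.copy2(rpt, dest)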