SPECIALIST Lexicon

Validate and Fix LEXICON

Step 0-1:The input file: LEXICON
- Make a symbolic link in the development machine (lexdev)
  shell> cd ${LEXICON}/data/${YEAR}/data
  shell> ln -sf ./LEXICON.mmddyy LEXICON.freeze
Step 0-2:Trim extra space
- If extra space is found, trim it in LexBuild and go back to the previous step.
  shell> fgrep " " LEXICON.freeze | wc -l
  => Should be 0; all extra space is taken care of in LexBuild automatically.
  If not, the data in LexBuild must be fixed as well.
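A minimal sketch of the extra-space check above, assuming "extra space" means two or more consecutive spaces or a trailing space (the source does not define it). The function name count_extra_space_lines is hypothetical.

```shell
# Count the lines that contain two or more consecutive spaces, or a trailing space.
# Assumption: these two patterns are what "extra space" means here.
count_extra_space_lines() {
    # grep -c prints the count but exits 1 when the count is 0, so ignore the status.
    grep -c -E ' {2,}| $' "$1" || true
}

# Hypothetical usage:
#   n=$(count_extra_space_lines LEXICON.freeze)
#   [ "$n" -eq 0 ] || echo "extra space on $n line(s); fix in LexBuild"
```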
Step 1: Remove class_type, annotations & signatures
- shell> ${LEXICON}/bin/1.FinalizeLexicon <year>
  make sure the java version in the script is correct!
  - Input: LEXICON.freeze
  - Operations:
    - Remove annotations & signatures from freeze version to generate LEXICON.freeze.removeAnnotation
    - Check and fix noncompliant non-ASCII characters between HTML and Unicode (U+0080 ~ U+009F), and send the output to LEXICON.release
    - Correct the illegal non-ASCII characters in LEXICON.release
    - Remove class_type, this operation will be removed after the task of class_type tagging is completed.
  - Output:
    - LEXICON.freeze.removeAnnotation
    - LEXICON.freeze.removeAnnotation.nonAscii
    - LEXICON.freeze.removeAnnotation.nonAscii.Stat
    - LEXICON.release.1.NoAnnotationNoIllegalNonAscii
    - LEXICON.release.nonAscii
    - LEXICON.release.nonAscii.Stat
    - LEXICON.release (this is the file name used in the process after this step)
      => This file is the same as ./LEXICON.release.1.NoAnnotationNoIllegalNonAscii
    - mv ./LEXICON.release LEXICON.release.log.1.noAnno
    - Link LEXICON.release to LEXICON.release.log.1.noAnno for the next step
    - ln -sf ./LEXICON.release.log.1.noAnno LEXICON.release
    No need to go through the details of the output at this point.
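Before moving on, it may help to confirm that the link really points at the saved file. A minimal sketch; link_points_to is a hypothetical helper, not part of the Lexicon scripts.

```shell
# Succeed only if $1 is a symbolic link whose stored target is exactly $2.
link_points_to() {
    [ -L "$1" ] && [ "$(readlink "$1")" = "$2" ]
}

# Hypothetical usage, run in ${LEXICON}/data/${YEAR}/data:
#   link_points_to LEXICON.release ./LEXICON.release.log.1.noAnno \
#       || echo "LEXICON.release is not linked as expected"
```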

Step 2-1:Validate EUI, syntax, content, cross-reference, and illegal non-ASCII characters:

Check errors for syntax, content, and cross-ref, etc.
Send the list to LexBuilders to fix lexRecords through LexBuild
Fix step-by-step and rerun the program using the new Lexicon.release (link) until no errors are found (except for exceptions).
This step takes about 1 week to complete (between fixes in LB, LEXICON.release, and rerunning the program)

shell> ${LEXICON}/bin/2.ValidateLexicon <year> > log.2
Go through the log.2 file to ensure the following steps

Check EUI
=> Make sure the total number of EUIs is correct
shell> fgrep "entry=" LEXICON.release > Euis
=> Make sure no EUI is E0000000
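The EUI checks above can be sketched as one helper. Assumptions: each record carries one line containing "entry=" followed by its EUI, and duplicated EUIs are worth counting at this point as well. The name eui_summary is hypothetical.

```shell
# Print the total number of EUI lines, the number of E0000000 entries,
# and the number of EUIs that occur more than once.
# The body runs in a subshell so the variables do not leak to the caller.
eui_summary() (
    total=$(grep -c -F 'entry=' "$1" || true)
    zero=$(grep -c -F 'entry=E0000000' "$1" || true)
    dups=$(grep -F 'entry=' "$1" | sort | uniq -d | wc -l)
    echo "total=$total zero=$zero dup=$((dups))"
)

# Hypothetical usage:
#   eui_summary LEXICON.release
```

Compare "total" against the expected number of records; "zero" should be 0.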
Check Syntax
LexCheck.ValidateSyntaxFromTextFile
=> Must update the ${LEX_CHECK}/data/Files/preposition.data.${YEAR}
shell> cd ${LEX_CHECK_PROC}/data/GetFiles
shell> cp -p ${LEXICON}/data/${YEAR}/data/LEXICON.release.log.1.noAnno LEXICON.release.log.1.noAnno.${YEAR}
shell> cd ${LEX_CHECK_PROC}/bin
shell> GetFilesFromLexicon 2 (prepositions) 3 (particles) 12 13
=> Use LEXICON.release.log.1.noAnno (link to LEXICON.${YEAR})
- preposition.data.${YEAR}; the previous year's file might be used instead since it covers more prepositions.
- particles.data.${YEAR}
shell> cp -p ${LEX_CHECK}/data/Files/dupRecExceptions.data.${PRE_YEAR} dupRecExceptions.data.${YEAR}
shell> cp -p ${LEX_CHECK}/data/Files/irregExceptions.data.${PRE_YEAR} irregExceptions.data.${YEAR}
shell> cp -p ${LEX_CHECK}/data/Files/notBaseForm.data.${PRE_YEAR} notBaseForm.data.${YEAR}
=> If errors found, fix LEXICON.release and rerun the script
=> The fix might be to add the following to the updated file:
- prepositions
- particles
  - add near after 2022
Consult with a linguist in such cases. If new prepositions are added, they need to be added to the Lexicon (prep) as well.
=> Make sure "No error found"
=> The final fixed copy is saved as LEXICON.release.log.2.2.syntaxFix
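The three cp commands for the exception files can be sketched as a loop. The helper name carry_forward_exceptions and the choice to skip files that already exist for the current year are assumptions, not part of the original procedure.

```shell
# Copy last year's exception files forward, preserving timestamps (cp -p).
# $1 = source directory, $2 = previous year, $3 = current year,
# $4 = destination directory (optional, defaults to the current directory).
carry_forward_exceptions() {
    dst_dir=${4:-.}
    for name in dupRecExceptions irregExceptions notBaseForm; do
        src="$1/$name.data.$2"
        dst="$dst_dir/$name.data.$3"
        if [ -e "$dst" ]; then
            # Assumption: never overwrite edits already made for this year.
            echo "skip: $dst already exists" >&2
        else
            cp -p "$src" "$dst"
        fi
    done
}

# Hypothetical usage:
#   carry_forward_exceptions "${LEX_CHECK}/data/Files" "${PRE_YEAR}" "${YEAR}"
```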
Check Contents,
LexCheck.ValidateContentFromTextFile
=> update the ${LEX_CHECK}/data/Files/preposition.data.${YEAR}
=> update the ${LEX_CHECK}/data/Files/irregException.data.${YEAR}
=> If errors found, fix LEXICON.release and rerun the script
=> Make sure "Total error: 0"
=> The final fixed copy is saved as LEXICON.release.2.fixContent
=> Use ./LEXICON.release.2.fixContent for the next steps (if it is different from the input)
ln -sf ./LEXICON.release.log.2.3.contentFix LEXICON.release

Check Cross-Ref,
LexCheck.LexCrossCheck
=> update the ${LEX_CHECK}/data/Files/preposition.data.${YEAR}
=> update the ${LEX_CHECK}/data/Files/dupRecException.data.${YEAR}
=> If errors found, fix LEXICON.release, then link, and rerun the script
=> Fix errors in the same order as the reports
=> Errors are shown as Content Err in the log.2 file.
=> Go to the end of log.2 file to see the final stats.
=> This step is very time-consuming. It takes about 1-2 weeks if everything goes smoothly!

dup EUI: must be fixed (manually)

dup LexRecord: partially fixed manually, with updates to dupRecException.data

Send ${OUT_FILE}.fixCrossCheck.dupRec to linguists to tag "N|C":

N: not duplicate and no change
=> add all "N" to dupRecExceptions.data.${YEAR}
=> Manually remove [N] tag
C: change (delete or merge duplicate records)
=> records with tag of "C" need to be corrected in LB and will be updated in the next release or this release.

Re-run the program with the updated dupRecExceptions.data.${YEAR} until the following two numbers are the same:

the number of ${OUT_FILE}.dupRec (can be found in the stats at the end of the Step 3 section in the log.2 file)
the number of ${OUT_FILE}.dupRec.cTag (which contains only C tags)

At that point, all the N tags (exceptions, not duplicates) are eliminated by dupRecExceptions.data.${YEAR} and the C tags will be updated in the next release.
For the 2020 release and later, we choose to manually fix the Lexicon for [C] tags and save the fixed Lexicon as LEXICON.release.log.2.4.02.crDupRecFix. Then link it to LEXICON.release and rerun the program until no CR-dup-record is found.
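The stop condition for the dupRec re-runs can be sketched as a line-count comparison. This assumes both files hold one record per line, which the source does not state. The helper names are hypothetical.

```shell
# Print the number of lines in a file, without the padding some wc versions add.
line_count() {
    echo $(( $(wc -l < "$1") ))
}

# Succeed when the two files have the same number of lines.
dup_rec_counts_match() {
    [ "$(line_count "$1")" -eq "$(line_count "$2")" ]
}

# Hypothetical usage:
#   dup_rec_counts_match "${OUT_FILE}.dupRec" "${OUT_FILE}.dupRec.cTag" \
#       && echo "only C tags remain"
```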

Year	DupRec	N	C	Notes
2014	137	69	68	Only multiword (137/1184) are tagged due to limited resource and due date. The rest (abbreviations or acronyms) are updated in the next release.
2015	1183	1042	141	Changes are updated in LB and fixed for next release
2016	67	62	5	Changes are updated in LB and fixed for next release
2017	69	63	6	Changes are updated in LB and fixed for next release
2018	55	48	7	Changes are updated in LB and fixed for next release
2019	11	6	5	Changes are updated in LB and fixed for next release
2020	3	0	3	Changes are updated in LB and this release
2021	3	0	3	Changes are updated in LB and this release
2022	12	3	9	Changes are updated in LB and this release
2023	3	2	1	Changes are updated in LB and this release
2024	2	2	0	Changes are updated in LB and this release
2025	1	0	1	Changes are updated in LB and this release

no EUI:
This can be fixed in the steps "wrong citation (spVar)" and "wrong citation (spVar), duplicates".
- Explanation:
  - These are cross-ref terms used for abbreviation, acronym, and nominalization for which the program cannot find the associated citation and EUI by cross-ref check.
  - Cross-ref terms must be citations (not spVars).
  - Ideally, citations (legit LMWs) should have an associated lexRecord.
  - Cross-ref terms can be invalid LMWs (this does not happen often) for the expansion of abbreviations or acronyms.
  - Cross-ref terms must be valid LMWs for nominalization.
  - Ignore the suggestions from the computer report at the end of each line ("=> remove EUI")
- auto-fix for current release: by removing the EUI (for those EUIs that do not exist)
  => use LEXICON.release.3.fixCrossCheck as LEXICON.release (link) and rerun
  This is not a great solution!
  => For the 2020 release and later, linguists manually fix the Lexicon in LexBuild, and the daily generated Lexicon is used as Lexicon.freeze to re-run the above steps. The final run has 0 cases of no_EUI.
- Manual fix for the current (future) release in LB by linguists:
  shell>fgrep " no EUI (" log.2 > 2.4.03.noEui
  send 2.4.03.noEui to linguists for the following actions:
- Actions:
  Check the cross-ref terms
  Check if the associated EUI or citation exist in Lexicon:
  - Case-1: if it is misspelled: correct it
    Make sure you correct it to the citation form, not the spVar.
  - Case-2: if it is correctly spelled, and it is a valid LMW
    =>If it has an associated record as spVar, correct the cross-ref term to the citation (most cases).
    =>If it has an associated record, but is not in the record, add it to the record as spVar. Also, correct the cross-ref term to the citation.
    =>If no associated record/EUI is found, add a new record for this citation, and update the cr terms with the newly created record EUI.
  - Case-3: if it is correctly spelled, and it is not a legit citation (LMW)
    => Please let Chris know
    => Add to notBaseForm.data.${YEAR} (this happens, but not often).
- Synchronization:
  These issues are temporarily auto-fixed by removing the EUI for the current release. However, the data are permanently fixed in LexBuild, so the same issues are not expected in future releases. For the 2020 release and later, these issues are fixed in LB and the fixed Lexicon from LB is used to rerun this validation.
  These issues are partially taken care of in the pre-process.
- Log:
  
  Year	no EUI No.	notBaseForm No.
  2017	22	4
  2018	4	2
  2019	63	0
  2020	61	0
  2021	34	0
  2022	18	0
  2023	0	0
  2024	0	0
  2025	0	0
wrong citation (spVar):
- For the 2020 release and later, this issue is fixed in Step-3, using the fixed Lexicon from LexBuild.
- For the 2019 release and earlier, auto-fix for the current release by replacing with the correct citation
  => use LEXICON.release.3.fixCrossCheck as LEXICON.release (link) and rerun
- Manual fix for future release by linguists:
  shell>fgrep " wrong citation (spVar) (" log.2 |fgrep -v " wrong citation (spVar), duplicates (" > 2.4.04.wrongCitSpVar
  send 2.4.04.wrongCitSpVar to linguists for the following actions:
  - These are citations in the abb, acr, nom that are spVars (not cit); they are auto-fixed by the program
  - replace spVar with the correct citation
- Synchronization:
  These issues are auto-fixed by replacing spVar with the correct citation for the current release. However, the data are permanently fixed in LexBuild, so the same issue is not expected in future releases.
  This issue is part of 4.03-NO_EUI. It is fixed in 2020 due to the re-run on 4.03-No_EUI.
- Init Log:
  
  Year	wrong citation (spVar) No.
  2017	71
  2018	0
  2019	59
  2020	0
  2021	1
  2022	0
  2023	0
  2024	0
  2025	0
wrong citation (spVar), duplicates:
- auto-fix for current release by removing the spVar attribute
  => use LEXICON.release.3.fixCrossCheck as LEXICON.release (link) and rerun
- Manual fix for future release by linguists:
  shell>fgrep " wrong citation (spVar), duplicates (" log.2 > 2.4.05.wrongCitSpVarDup
  send 2.4.05.wrongCitSpVarDup to linguists for following actions:
  - These are citations in the nom that are spVars (not cit); after being replaced by the correct citation, they become duplicates and are thus removed (auto-fixed) by the program
  - remove the nom with spVar
- Synchronization:
  These issues are auto-fixed by removing the nom attribute with spVar for the current release. However, the data are permanently fixed in LexBuild, so the same issue is not expected in future releases.
- Init Log:
  
  Year	wrong citation (spVar), duplicates No.
  2017	12
  2018	0
  2019	2
  2020	1
  2021	6
  2022	2
  2023	9
  2024	20
  2025	11
Steps 3, 4, 5 are auto-fixed at the same time when running the validation program. So, use LEXICON.release.3.fixCrossCheck as LEXICON.release (link) and rerun.
shell> cp -p ./LEXICON.release.3.fixCrossCheck Lexicon.release.3.fixCrossCheck.2.5.cit
shell> ln -sf ./LEXICON.release.log.${No}.fixCrossRed Lexicon.release
rerun 2.ValidateLexicon ${YEAR} > log.2
Please check everything carefully, because the auto-fix in different steps might cause new issues, such as adding an EUI that causes duplicates. Rerun this until no errors are found!
missing EUI: auto-fix
shell>fgrep "missing EUI (" log.2 > 2.6.missingEui
Send to linguists to fix (add the EUI as suggested)

=> use LEXICON.release.3.fixCrossCheck and rerun
shell> cp -r LEXICON.release.3.fixCrossCheck Lexicon.release.log.${no}.missEuiFix
shell> ln -sf ./LEXICON.release.log.${no}.missEuiFix Lexicon.release
Save LEXICON.release.3.fixCrossCheck as LEXICON.release.log.${No}.missEuiFix (link to Lexicon.release) and rerun this step
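The save-and-relink pattern that recurs after each auto-fix can be sketched as one helper. Assumptions: the upper-case LEXICON.release name is used throughout (the commands above mix cases), and the helper name promote_fix and its two arguments are hypothetical.

```shell
# Save the auto-fixed output under a numbered log name, then point
# LEXICON.release at the saved copy.
# $1 = step number used in the log name, $2 = short tag such as missEuiFix.
promote_fix() {
    saved="LEXICON.release.log.$1.$2"
    cp -p ./LEXICON.release.3.fixCrossCheck "./$saved" &&
        ln -sf "./$saved" LEXICON.release
}

# Hypothetical usage, followed by a rerun of the validation:
#   promote_fix "${No}" missEuiFix
#   ${LEXICON}/bin/2.ValidateLexicon ${YEAR} > log.2
```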
wrong EUI: must be fixed manually
- wrong EUI: shell> fgrep "wrong EUI" log.2 > 2.4.7.wrongEui.nom
- Send the list to linguists to:
  - Confirm the correct EUI
  - Fix lexRecords in the LexBuild
  shell> cp -p LEXICON.release.3.fixCrossCheck Lexicon.release.log.${No}.wrongEuiFix
  shell> ln -sf ./LEXICON.release.log.${No}.wrongEuiFix Lexicon.release
- Save LEXICON.release.3.fixCrossCheck as LEXICON.release.log.${No}.wrongEuiFix (link to Lexicon.release) and rerun this step
missing EUIs: must be fixed manually
wrong EUIs: must be fixed manually
symmetric citation: must be fixed manually
- Delete cross-refs that are not citations (spVars).
symmetric category: must be fixed manually
symmetric none: must be fixed manually
- This feature checks the symmetric issue in nominalization
- All nominalizations should be symmetric, that is, nominalization and nominalization_of.
- nom:shell> fgrep " symmetric none @ [" log.2 > 2.12.symNone
- Send the list to linguists:
  - Fix lexRecords in the LexBuild:
    => if the nominalization is correct, add nominalizations
    => if the nominalization is not correct, delete nominalizations
    => if the fix is more than adding or deleting nominalizations (a complicated fix involving changes/additions in other LexRecords), notify Chris and tell him the details of the fixes.
- Save LEXICON.release.3.fixCrossCheck as LEXICON.release.log.${No}.symNoneFix
- Manually fix Lexicon.release.log.${No}.symNoneFix by synchronizing those fixed records in LB
- Link Lexicon.release.log.${No}.symNoneFix to LEXICON.release
- re-run the program until:
  - The number of log.2 for "12. symmetric none:" is 0
  - the input (LEXICON.release) and fixed output (LEXICON.release.3.fixCrossCheck) are the same
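The two stop conditions above can be sketched as a single check. It counts the " symmetric none @ [" error lines in the log instead of reading the "12. symmetric none:" stats line, which is an assumption about how the two relate. The name sym_none_done is hypothetical.

```shell
# Succeed when the log has no "symmetric none" error lines and the
# input and auto-fixed output are byte-for-byte identical.
# $1 = log file, $2 = input Lexicon, $3 = fixed output Lexicon.
sym_none_done() {
    remaining=$(grep -c -F ' symmetric none @ [' "$1" || true)
    [ "$remaining" -eq 0 ] && cmp -s "$2" "$3"
}

# Hypothetical usage:
#   sym_none_done log.2 LEXICON.release LEXICON.release.3.fixCrossCheck \
#       && echo "symmetric none: done"
```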

new EUI:

shell> fgrep " new EUI (" log.2 > 2.4.13.fixCrossRef-newEui
=> the line count should be the same as error count in log.2
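The extract-then-count pattern can be sketched as below, so that the line count is printed right away for comparison with the error count in log.2. The helper name extract_errors is hypothetical.

```shell
# Write the lines of log $1 that contain the fixed string $2 to file $3,
# then print how many lines were written.
extract_errors() {
    # grep exits 1 when nothing matches; an empty output file is still valid.
    grep -F -- "$2" "$1" > "$3" || true
    echo $(( $(wc -l < "$3") ))
}

# Hypothetical usage:
#   extract_errors log.2 " new EUI (" 2.4.13.fixCrossRef-newEui
```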
nom:shell> fgrep "nominalizations - new EUI (" log.2 > 2.13.newEui.nom
- This file includes all issues with nominalization: new EUI and non-symmetrical (2.13)
- Send to linguists to fix in LB and then fix manually by comparing to LB (similar step as in 2.13).
acr:shell> fgrep "acronyms - new EUI (" log.2 > 2.13.newEui.acr
abb:shell> fgrep "abbreviations - new EUI (" log.2 > 2.13.newEui.abb
- These two files are used as LMW candidate list to add multiwords to Lexicon
- The expansions of acr/abb are good candidates for LexMultiwords
- Not-base-form terms from previous releases are stored in ${LEX_CHECK}/data/Files/notBaseForm.data.
- This file is used to exclude false-positive (FP) error messages.
- This file is updated between releases as described below:
- The updates must be completed in LexCheck pre-process before running the next release.
- Ideally, all terms in these two files are:
  - valid LW (will be added to the Lexicon by the next release)
  - invalid LW (will be added to notBaseForm)
  So, all errors should disappear once these post-procedures are done.
  Post-Procedures:
  - send the list to linguist to tag [Y]|[I]|[N]
  - [Y]: a valid citation or base form
    => A new lexRecord should be added in the future Lexicon
  - [I]: a valid inflectional form
    => A new lexRecord should be added
    => The associated lexRecord might need to change from inflectional form to citation form
  - [N]: Other than above two tags, not a valid Lexicon word form for citations, spelling variants, or an inflectional form (such as plural form, past tense, etc.)
    => This list (notBaseForm.data.${YEAR}) is used to exclude exceptions for future releases (we are assuming an invalid base form won't become a valid base form over the time).
  - During this process, LexBuilders might need to delete invalid expansions, modify records, or add new records. However, we don't need this detailed information for the program.
    (This is the post-process that needs to be done for the current release, before the next release)
  - Make sure "Total error: 0" or that it equals the sum of (2. dupLexRecord + 13.newEUI)
    - The error number of 13.newEui should be 0 if expansions of ACR and ABB are pre-processed before freezing the Lexicon. If not, it should equal the number of [Y] and [I] tags.
    - [N] tags are handled in notBaseForm.data.${Year}
  Ideally, LEXICON.release should be identical to LEXICON.release.3.fixCrossCheck

Check non-ASCII characters

Check if newly appearing non-ASCII chars are legal

Compare to the previous year on all nonAscii.char
- The program compares the files LEXICON.release.NonAscii.line and LEXICON.release.NonAscii.char to the previous release and sends the difference to LEXICON.release.NonAscii.Char.1.3.diff.
- If the file (LEXICON.release.NonAscii.Char.1.3.diff) is empty, then OK.
- If not empty, go through all new non-ASCII characters in LEXICON.release.NonAscii.Char.1.3.diff in Lexicon.release (and compare to the last release), manually check and modify them if needed, and list all instances.
- Send this instance list to linguists to verify and tag: > non-ascii char|U+value replacing ascii char tag
  - [K]: keep the instance with non-ASCII char
  - [D]: delete the instance (with non-ASCII or replacing ASCII chars)
  - [R]: replace with ASCII-char and keep the instance
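The "empty diff means OK" rule can be sketched with a file-size test. The helper name nonascii_diff_status and its messages are hypothetical. A missing file is reported separately so it is not mistaken for an empty one.

```shell
# Report whether the non-ASCII diff file needs review.
# Returns 0 if the file is empty, 1 if it has content, 2 if it is missing.
nonascii_diff_status() {
    if [ ! -e "$1" ]; then
        echo "MISSING: $1" >&2
        return 2
    fi
    if [ -s "$1" ]; then
        echo "REVIEW: $(( $(wc -l < "$1") )) line(s) in $1"
        return 1
    fi
    echo "OK: $1 is empty"
}

# Hypothetical usage:
#   nonascii_diff_status LEXICON.release.NonAscii.Char.1.3.diff
```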

Find illegal non-ASCII chars

Some non-ASCII Unicode characters look the same as ASCII. However, they are different when read in by machine and cause issues downstream.
Go through the LEXICON.release.NonAscii.char file to see if any illegal non-ASCII characters listed in the following table exist (use U+value). If so, fix them.
For U+03BC and U+00B5: compare the count
Send to linguists to fix in LB if illegal non-ASCII chars are found

Name	Letter 1	Letter 2 (Illegal non-ASCII)	Notes
apostrophe	[']-(APOSTROPHE, U+0027)	[‘]-(LEFT SINGLE QUOTATION MARK, U+2018)	Replace illegal non-ASCII
apostrophe	[']-(APOSTROPHE, U+0027)	[’]-(RIGHT SINGLE QUOTATION MARK, U+2019)	Replace illegal non-ASCII => accepted after 2021+ release
hyphen	[-]-(HYPHEN-MINUS, U+002D)	[‑]-(NON-BREAKING HYPHEN, U+2011)	Replace illegal non-ASCII => accepted after 2021+ release
hyphen	[-]-(HYPHEN-MINUS, U+002D)	[–]-(EN DASH, U+2013)	Replace illegal non-ASCII => accepted after 2021+ release
beta	[β]-(GREEK SMALL LETTER BETA, U+03B2)	[ß]-(LATIN SMALL LETTER SHARP S, U+00DF)	Replace illegal non-ASCII
mu/micro	[μ]-(GREEK SMALL LETTER MU, U+03BC)	[µ]-(MICRO SIGN, U+00B5)	Both could be legal. Check the records to make sure the right chars are used.
Y/UPSILON	[Y]-(LATIN CAPITAL LETTER Y, U+0059)	[Υ]-(GREEK CAPITAL LETTER UPSILON, U+03A5)	Both could be legal. Check the records to make sure the right chars are used.

=> The final fixed copy is saved as LEXICON.release.log.2.5.nonAsciiFix
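Counting the look-alike characters in the table can be sketched with a byte-level grep. Assumptions: the file is UTF-8, and grep supports -o (GNU and BSD grep do; it is not in POSIX). The helper name count_char is hypothetical. The character is passed as octal bytes so the command line itself stays ASCII.

```shell
# Count occurrences (not lines) of one UTF-8 character in file $1.
# $2 is the character's UTF-8 bytes as printf octal escapes.
count_char() {
    ch=$(printf "$2")
    # LC_ALL=C makes grep compare raw bytes; -o prints one line per match.
    n=$(LC_ALL=C grep -o -F -- "$ch" "$1" | wc -l)
    echo $((n))
}

# Hypothetical usage:
#   count_char LEXICON.release '\342\200\230'   # U+2018 LEFT SINGLE QUOTATION MARK
#   count_char LEXICON.release '\303\237'       # U+00DF LATIN SMALL LETTER SHARP S
#   count_char LEXICON.release '\316\274'       # U+03BC GREEK SMALL LETTER MU
#   count_char LEXICON.release '\302\265'       # U+00B5 MICRO SIGN
```

For mu and micro, run both counts and compare them with the records, since both characters can be legal.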

Re-run the program with new Lexicon.release (link) until everything is OK: shell> ${LEXICON}/bin/2.ValidateLexicon <year> > log.2

Step 2-2:Check TradeMark:
TradeMark
- shell> ${LEXICON}/bin/2.ValidateLexicon <year> > log.2
- Check the word count (wc) of output files (tradeMark.data)
  Should be 0 because there is no annotation.
Step 2-3:Check Irreg Base:
- Skip - already checked in Check Content after 2014+ release.
- Old version of Check Irreg
Step 2-4:Check cross-Ref:
- Skip - already checked in the Step of Cross-Ref after 2014+ release.
- Cross-Ref: An enhanced cross-reference check program was implemented after 2014 and thus it was removed from LexBuild (web tool). So, this step has to be checked. If issues are found, fix them in LexBuild and go back to Step 0.
- Old version of Check cross-Ref

Completed: Clean up files and logs: move all logs and files to ./${year}
