The SPECIALIST Lexicon

Validate and Fix LEXICON

  • Step 0-1: The input file: LEXICON
      shell> cp -p LEXICON ${LEXICON}/data/${YEAR}/data/LEXICON.mmddyy
    • Make a symbolic link on the development machine (lexdev)
      shell> cd ${LEXICON}/data/${YEAR}/data
      shell> ln -sf ./LEXICON.mmddyy LEXICON.freeze
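The copy-and-link step above can be sketched as a small helper. This is only an illustration: archive_and_link is a hypothetical function name, not one of the Lexicon scripts, and it assumes the mmddyy suffix is simply today's date.

```shell
# Sketch: archive a dated copy of the Lexicon and point LEXICON.freeze at it.
# archive_and_link is a hypothetical helper, not part of the Lexicon scripts.
archive_and_link() {
    src=$1        # path to the input LEXICON file
    dest_dir=$2   # e.g. ${LEXICON}/data/${YEAR}/data
    stamp=$(date +%m%d%y)   # assumption: mmddyy is the current date
    cp -p "$src" "$dest_dir/LEXICON.$stamp" || return 1
    # Relative link, same form as the ln -sf command in the step above.
    (cd "$dest_dir" && ln -sf "./LEXICON.$stamp" LEXICON.freeze)
}
```

Usage would look like: archive_and_link LEXICON ${LEXICON}/data/${YEAR}/data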

  • Step 0-2: Trim extra space
    • If extra spaces are found, trim them in LexBuild and go back to the previous step
      shell> fgrep "  " LEXICON.freeze | wc -l
      => should be 0, since LexBuild trims extra spaces automatically
      If not, the data in LexBuild must be fixed as well
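A minimal sketch of the same check as a pass/fail test, so it can gate the next step. The function name check_extra_space is an assumption made for illustration.

```shell
# Sketch: fail when any line contains two consecutive spaces.
# check_extra_space is a hypothetical helper name.
check_extra_space() {
    file=$1
    [ -f "$file" ] || { echo "missing file: $file"; return 2; }
    # grep -c prints the number of matching lines; it exits 1 when the
    # count is 0, so "|| true" keeps that case from looking like a failure.
    count=$(grep -c -F "  " "$file" || true)
    echo "lines with extra space: $count"
    [ "$count" -eq 0 ]
}
```

For example: check_extra_space LEXICON.freeze || echo "fix the data in LexBuild first"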

  • Step 1: Remove class_type, annotations & signatures
    • shell> ${LEXICON}/bin/1.FinalizeLexicon <year>
      Make sure the Java version in the script is correct!
      • Input: LEXICON.freeze
      • Operations:
        • Remove annotations & signatures from the freeze version to generate LEXICON.freeze.removeAnnotation
        • Check and fix non-compliant non-ASCII characters between HTML and Unicode (U+0080 ~ U+009F), and send the output to LEXICON.release
        • Correct the illegal non-ASCII characters in LEXICON.release
        • Remove class_type. This operation will be dropped once the class_type tagging task is completed.
      • Output:
        • LEXICON.freeze.removeAnnotation
        • LEXICON.freeze.removeAnnotation.nonAscii
        • LEXICON.freeze.removeAnnotation.nonAscii.Stat
        • LEXICON.release.1.NoAnnotationNoIllegalNonAscii
        • LEXICON.release.nonAscii
        • LEXICON.release.nonAscii.Stat

        • LEXICON.release (this is the file name used in the process after this step)
          => This file is the same as ./LEXICON.release.1.NoAnnotationNoIllegalNonAscii
        • mv ./LEXICON.release LEXICON.release.log.1.noAnno
        • Link LEXICON.release to LEXICON.release.log.1.noAnno for the next step
        • ln -sf ./LEXICON.release.log.1.noAnno LEXICON.release

        No need to go through the details of the output at this point.
  • Step 2-1: Validate EUI, syntax, content, cross-reference, and illegal non-ASCII characters:
    • Check errors for syntax, content, cross-ref, etc.
    • Send the list to LexBuilders to fix lexRecords through LexBuild
    • Fix step by step and rerun the program using the new LEXICON.release (link) until no errors are found (except for exceptions).
    • This step takes about 1 week to complete (between fixes in LB, LEXICON.release, and rerunning the program)
    • shell> ${LEXICON}/bin/2.ValidateLexicon <year> > log.2
      Go through the log.2 file to verify the following steps
      1. Check EUI
        => Make sure the total number of EUIs is correct
        shell> fgrep "entry=" LEXICON.release | wc -l
        => Make sure no EUI is E0000000
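The two EUI checks can be combined in one small function. This is a sketch, assuming that every record has exactly one line containing "entry=" followed by its EUI (as the fgrep above implies); check_eui is a hypothetical name.

```shell
# Sketch: report the EUI total and fail if the placeholder EUI E0000000 appears.
# check_eui is a hypothetical helper; the "entry=<EUI>" line form is an assumption.
check_eui() {
    file=$1
    [ -f "$file" ] || { echo "missing file: $file"; return 2; }
    total=$(grep -c -F "entry=" "$file" || true)
    zero=$(grep -c -F "entry=E0000000" "$file" || true)
    echo "total=$total zeroEui=$zero"
    [ "$zero" -eq 0 ]
}
```

The total still has to be compared by hand against the expected number of records for the release.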

      2. Check Syntax
        LexCheck.ValidateSyntaxFromTextFile
        => Update the ${LEX_CHECK}/data/Files/preposition.data.${YEAR}
        shell> cd ${DEV_DIR}/LC/Proc/bin
        shell> GetFilesFromLexicon
        2
        3
        12
        13

        => Use LEXICON.release.log.1.noAnno as the input (link) to get:
        • preposition.data.${YEAR}; the previous year's file may be used instead, since it covers more prepositions.
        • particles.data.${YEAR}

        shell> cp -rp ${LEX_CHECK}/data/Files/dupRecExceptions.data.${PRE_YEAR} dupRecExceptions.data.${YEAR}
        shell> cp -rp ${LEX_CHECK}/data/Files/irregExceptions.data.${PRE_YEAR} irregExceptions.data.${YEAR}
        shell> cp -rp ${LEX_CHECK}/data/Files/notBaseForm.data.${PRE_YEAR} notBaseForm.data.${YEAR}
        => If errors are found, fix LEXICON.release and rerun the script
        => The fix might be to add prepositions or particles to the updated file. Consult a linguist in such cases. If new prepositions are added, they need to be added to the Lexicon (prep) as well.
        => Make sure "No error found"
        => The final fixed copy is saved as LEXICON.release.2.2.syntaxFix

      3. Check Contents,
        LexCheck.ValidateContentFromTextFile
        => update the ${LEX_CHECK}/data/Files/preposition.data.${YEAR}
        => update the ${LEX_CHECK}/data/Files/irregException.data.${YEAR}
        => If errors found, fix LEXICON.release and rerun the script
        => Make sure "Total error: 0"
        => The final fixed copy is saved as LEXICON.release.2.fixContent

        => Use ./LEXICON.release.2.fixContent for the next steps (if it is different from the input)
        ln -sf ./LEXICON.release.log.2.3.contentFix LEXICON.release

      4. Check Cross-Ref,
        LexCheck.LexCrossCheck
        => update the ${LEX_CHECK}/data/Files/preposition.data.${YEAR}
        => update the ${LEX_CHECK}/data/Files/dupRecException.data.${YEAR}
        => If errors found, fix LEXICON.release, then link, and rerun the script
        => Fix errors in the same order as the reports
        => Errors are shown as Content Err in the log.2 file.
        => Go to the end of log.2 file to see the final stats.
        => This step is very time-consuming. It takes about 1-2 weeks if everything goes smoothly!
        1. dup EUI: must be fixed (manually)
        2. dup LexRecord: partially fixed manually, and update dupRecException.data
          • Send ${OUT_FILE}.fixCrossCheck.dupRec to linguists to tag "N|C":
            • N: not duplicate and no change
              => add all "N" to dupRecExceptions.data.${YEAR}
              => Manually remove [N] tag
            • C: change (delete or merge duplicate records)
              => records tagged "C" need to be corrected in LB and will be updated in the next release or this release.
            • Re-run the program with the updated dupRecExceptions.data.${YEAR} until the following 2 numbers are the same:
              • the number of ${OUT_FILE}.dupRec (found in the stats at the end of the Step 3 section in the log.2 file)
              • the number of ${OUT_FILE}.dupRec.cTag (which contains only C tags)

              All the N tags (exceptions, not duplicates) are eliminated by dupRecExceptions.data.${YEAR}, and C tags will be updated in the next release.
              For 2020+ releases, we choose to manually fix the Lexicon for [C] tags and save the fixed Lexicon as LEXICON.release.log.2.4.02.crDupRecFix. Then link it to LEXICON.release and rerun the program until no CR-dup-record is found.

              Year | DupRec | N | C | Notes
              2014 | 137 | 69 | 68 | Only multiword (137/1184) are tagged due to limited resource and due date. The rest (abbreviations or acronyms) are updated in the next release.
              2015 | 1183 | 1042 | 141 | Changes are updated in LB and fixed for next release
              2016 | 67 | 62 | 5 | Changes are updated in LB and fixed for next release
              2017 | 69 | 63 | 6 | Changes are updated in LB and fixed for next release
              2018 | 55 | 48 | 7 | Changes are updated in LB and fixed for next release
              2019 | 11 | 6 | 5 | Changes are updated in LB and fixed for next release
              2020 | 3 | 0 | 3 | Changes are updated in LB and this release
              2021 | 3 | 0 | 3 | Changes are updated in LB and this release
              2022 | 12 | 3 | 9 | Changes are updated in LB and this release
              2023 | 3 | 2 | 1 | Changes are updated in LB and this release
              2024 | 2 | 2 | 0 | Changes are updated in LB and this release
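The stop condition for the dup LexRecord re-run loop (the dupRec count equals the cTag count) could be checked with a sketch like the one below. It assumes each file lists one duplicate per line, which the source does not state; dup_rec_resolved is a hypothetical name.

```shell
# Sketch: succeed when the dupRec file and the cTag file have the same line count.
# Assumption: one duplicate record per line in both files.
dup_rec_resolved() {
    dup_file=$1     # ${OUT_FILE}.dupRec
    ctag_file=$2    # ${OUT_FILE}.dupRec.cTag
    dup_n=$(wc -l < "$dup_file" | tr -d ' ')
    ctag_n=$(wc -l < "$ctag_file" | tr -d ' ')
    echo "dupRec=$dup_n cTag=$ctag_n"
    [ "$dup_n" -eq "$ctag_n" ]
}
```

When the two counts match, every remaining duplicate carries a C tag and all N-tagged records have been absorbed by dupRecExceptions.data.${YEAR}.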
        3. no EUI:
          This can be fixed together with 4. wrong citation (spVar) and 5. wrong citation (spVar), duplicates.
          • Explanation:
            • These are cross-ref terms used for abbreviation, acronym, and nominalization for which the program cannot find the associated citation and EUI by cross-ref check.
            • Cross-ref terms must be citations (not spVars).
            • Ideally, every citation (legit LMW) should have an associated lexRecord.
            • Cross-ref terms can be invalid LMWs (this does not happen often) for the expansions of abbreviations or acronyms.
            • Cross-ref terms must be valid LMWs for nominalization.
            • Ignore the suggestions from the computer report at the end of each line ("=> remove EUI")
          • Auto-fix for the current release: remove the EUI (for those EUIs that do not exist)
            => use LEXICON.release.3.fixCrossCheck as LEXICON.release (link) and rerun
            This is not a great solution!
            => For 2020+ releases, linguists manually fix the Lexicon in LexBuild, and the daily generated Lexicon is used as LEXICON.freeze to re-run the above steps. The final run has 0 cases of no_EUI

          • Manual fix for the current (future) release in LB by linguists:
            shell> fgrep " no EUI (" log.2 > 2.4.03.noEui
            Send 2.4.03.noEui to linguists for the following actions:
          • Actions:
            Check the cross-ref terms
            Check if the associated EUI or citation exists in the Lexicon:
            • Case-1: if it is misspelled: correct it
              Make sure you correct it to the citation form, not the spVar.
            • Case-2: if it is correctly spelled, and it is a valid LMW
              => If it has an associated record as spVar, correct the cross-ref term to the citation (most cases).
              => If it has an associated record but is not in the record, add it to the record as spVar. Also, correct the cross-ref term to the citation.
              => If no associated record/EUI is found, add a new record for this citation
            • Case-3: if it is correctly spelled, and it is not a legit citation (LMW), and update the cr terms with the newly created record EUI.
              => Please let Chris know
            • => Add to notBaseForm.data.${YEAR} (this happens, but not often).
          • Synchronization:
            These issues are temporarily auto-fixed by removing the EUI for the current release. However, the data are permanently fixed in LexBuild, so the same issues are not expected in future releases. For 2020+ releases, these issues are fixed in LB, and the fixed Lexicon from LB is used to rerun this validation.
            These issues are partially taken care of in the pre-process.
          • Log:
            Year | no EUI No. | notBaseForm No.
            2017 | 22 | 4
            2018 | 4 | 2
            2019 | 63 | 0
            2020 | 61 | 0
            2021 | 34 | 0
            2022 | 18 | 0
            2023 | 0 | 0
            2024 | 0 | 0
        4. wrong citation (spVar):
          • For 2020+ releases, this issue is fixed in Step-3, using the fixed Lexicon from LexBuild.

          • For releases up to 2019, auto-fix for the current release by replacing with the correct citation
            => use LEXICON.release.3.fixCrossCheck as LEXICON.release (link) and rerun
          • Manual fix for future releases by linguists:
            shell> fgrep " wrong citation (spVar) (" log.2 | fgrep -v " wrong citation (spVar), duplicates (" > 2.4.04.wrongCitSpVar
            Send 2.4.04.wrongCitSpVar to linguists for the following actions:
            • These are citations in abb, acr, or nom that are spVars (not citations); they are auto-fixed by the program
            • Replace the spVar with the correct citation
          • Synchronization:
            These issues are auto-fixed by replacing the spVar with the correct citation for the current release. However, the data are permanently fixed in LexBuild, so the same issue is not expected in future releases.
            This issue is part of 4.03-NO_EUI. It is fixed in 2020 due to the re-run on 4.03-No_EUI.
          • Init Log:
            Year | wrong citation (spVar) No.
            2017 | 71
            2018 | 0
            2019 | 59
            2020 | 0
            2021 | 1
            2022 | 0
            2023 | 0
            2024 | 0
        5. wrong citation (spVar), duplicates:
          • Auto-fix for the current release by removing the spVar attribute
            => use LEXICON.release.3.fixCrossCheck as LEXICON.release (link) and rerun
          • Manual fix for future releases by linguists:
            shell> fgrep " wrong citation (spVar), duplicates (" log.2 > 2.4.05.wrongCitSpVarDup
            Send 2.4.05.wrongCitSpVarDup to linguists for the following actions:
            • These are citations in nom that are spVars (not citations); after being replaced by the correct citation, they become duplicates and are thus removed (auto-fixed) by the program
            • Remove the nom with spVar
          • Synchronization:
            These issues are auto-fixed by removing the nom attribute with spVar for the current release. However, the data are permanently fixed in LexBuild, so the same issue is not expected in future releases.
          • Init Log:
            Year | wrong citation (spVar), duplicates No.
            2017 | 12
            2018 | 0
            2019 | 2
            2020 | 1
            2021 | 6
            2022 | 2
            2023 | 9
            2024 | 20

          Items 3, 4, and 5 are auto-fixed at the same time when the validation program runs. So, use LEXICON.release.3.fixCrossCheck as LEXICON.release (link) and rerun
          shell> cp -p ./LEXICON.release.3.fixCrossCheck LEXICON.release.log.${No}.fixCrossRef
          shell> ln -sf ./LEXICON.release.log.${No}.fixCrossRef LEXICON.release

          rerun 2.ValidateLexicon ${YEAR} > log.2
          Check everything carefully, because the auto-fixes in different steps might cause new issues, such as an added EUI causing duplicates. Rerun this until no errors are found!
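The per-category lists for items 3, 4, and 5 can be pulled out of log.2 in one pass. The sketch below reuses the fgrep patterns quoted above; the function names and the output file names passed to them are illustrative assumptions, and the exact layout of log.2 lines is not known here.

```shell
# Sketch: split cross-ref errors in log.2 into one file per category.
# extract_errors and split_cross_ref_log are hypothetical helper names.
extract_errors() {
    # $1 = fixed-string pattern, $2 = log file, $3 = output file
    grep -F -- "$1" "$2" > "$3" || true   # no match is not a failure
    echo "$3: $(wc -l < "$3" | tr -d ' ')"
}

split_cross_ref_log() {
    # $1 = log file (log.2), $2 = output directory
    extract_errors " no EUI (" "$1" "$2/2.4.03.noEui"
    extract_errors " wrong citation (spVar) (" "$1" "$2/2.4.04.wrongCitSpVar"
    extract_errors " wrong citation (spVar), duplicates (" "$1" "$2/2.4.05.wrongCitSpVarDup"
}
```

Running split_cross_ref_log log.2 . after each validation pass prints the three counts, which makes it easy to see whether an auto-fix in one category introduced new errors in another.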

        6. missing EUI: auto-fix
          shell> fgrep "missing EUI (" log.2 > 2.6.missingEui
          Send to linguists to fix (add the EUI as suggested)

          => use LEXICON.release.3.fixCrossCheck and rerun
          shell> cp -p LEXICON.release.3.fixCrossCheck LEXICON.release.log.${No}.missEuiFix
          shell> ln -sf ./LEXICON.release.log.${No}.missEuiFix LEXICON.release
          Save LEXICON.release.3.fixCrossCheck as LEXICON.release.log.${No}.missEuiFix (linked to LEXICON.release) and rerun this step

        7. wrong EUI: must be fixed manually
          • wrong EUI: shell> fgrep "wrong EUI" log.2 > 2.4.7.wrongEui.nom
          • Send the list to linguists to:
            • Confirm the correct EUI
            • Fix lexRecords in LexBuild

            shell> cp -p LEXICON.release.3.fixCrossCheck LEXICON.release.log.${No}.wrongEuiFix
            shell> ln -sf ./LEXICON.release.log.${No}.wrongEuiFix LEXICON.release
          • Save LEXICON.release.3.fixCrossCheck as LEXICON.release.log.${No}.wrongEuiFix (linked to LEXICON.release) and rerun this step
        8. missing EUIs: must be fixed manually
        9. wrong EUIs: must be fixed manually
        10. symmetric citation: must be fixed manually
          • Delete cross-refs that are not citations (spVars).
        11. symmetric category: must be fixed manually
        12. symmetric none: must be fixed manually
          • This feature checks for symmetry issues in nominalization
          • All nominalizations should be symmetric, that is, nominalization and nominalization_of.
          • nom: shell> fgrep " symmetric none @ [" log.2 > 2.12.symNone
          • Send the list to linguists:
            • Fix lexRecords in LexBuild:
              => if the nominalization is correct, add nominalizations
              => if the nominalization is not correct, delete nominalizations
              => if the fix involves more than adding or deleting nominalizations (a complicated fix with changes/additions in other LexRecords), notify Chris and tell him the details of the fix.
          • Save LEXICON.release.3.fixCrossCheck as LEXICON.release.log.${No}.symNoneFix
          • Manually fix LEXICON.release.log.${No}.symNoneFix by synchronizing those fixed records in LB
          • Link LEXICON.release.log.${No}.symNoneFix to LEXICON.release
          • Re-run the program until:
            • The number in log.2 for "12. symmetric none:" is 0
            • the input (LEXICON.release) and fixed output (LEXICON.release.3.fixCrossCheck) are the same

        13. new EUI:
          • shell> fgrep " new EUI (" log.2 > 2.4.13.fixCrossRef-newEui
            => the line count should be the same as error count in log.2
          • nom: shell> fgrep "nominalizations - new EUI (" log.2 > 2.13.newEui.nom
            • This file includes all issues with nominalization: new EUI and non-symmetrical (2.13)
            • Send to linguists to fix in LB and then fix manually by comparing to LB (similar step as in 2.13).

          • acr: shell> fgrep "acronyms - new EUI (" log.2 > 2.13.newEui.acr
          • abb: shell> fgrep "abbreviations - new EUI (" log.2 > 2.13.newEui.abb
            • These two files are used as the LMW candidate list for adding multiwords to the Lexicon
            • The expansions of acr/abb are good candidates for LexMultiwords
            • Not-base-form terms from previous releases are stored in ${LEX_CHECK}/data/Files/notBaseForm.data.
            • This file is used to exclude false-positive (FP) error messages.
            • This file is updated between releases as described below.
            • The updates must be completed in the LexCheck pre-process before running the next release.
            • Ideally, all terms in these two files are either:
              • valid LW (will be added to the Lexicon by the next release)
              • invalid LW (will be added to notBaseForm)

              So, all errors should disappear once these post-procedures are done.

              Post-Procedures:

              • send the list to linguist to tag [Y]|[I]|[N]
              • [Y]: a valid citation or base form
                => A new lexRecord should be added in the future Lexicon
              • [I]: a valid inflectional form
                => A new lexRecord should be added
                => The associated lexRecord might need to change from inflectional form to citation form
              • [N]: Other than above two tags, not a valid Lexicon word form for citations, spelling variants, or an inflectional form (such as plural form, past tense, etc.)
                => This list (notBaseForm.data.${YEAR}) is used to exclude exceptions for future releases (we assume an invalid base form won't become a valid base form over time).
              • During this process, LexBuilders might need to delete invalid expansions, modify records, or add new records. However, the program does not need this detailed information.

                (This is the post-process that needs to be done for the current release, before the next release)

              • Make sure "Total error: 0", or that the total equals the sum of (2. dupLexRecord + 13.newEUI)
                • The error count of 13.newEui should be 0 if the expansions of ACR and ABB are pre-processed before freezing the Lexicon. If not, it should equal the number of [Y] and [I] tags.
                • [N] tags are handled in notBaseForm.data.${YEAR}

              Ideally, LEXICON.release should be identical to LEXICON.release.3.fixCrossCheck
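Whether the input and the auto-fixed output are identical can be tested with cmp, which also follows the LEXICON.release symbolic link. A minimal sketch, with is_stable as a hypothetical helper name:

```shell
# Sketch: succeed when the validation input and its auto-fixed output are identical.
# is_stable is a hypothetical helper name.
is_stable() {
    input=$1    # LEXICON.release (may be a symbolic link)
    fixed=$2    # LEXICON.release.3.fixCrossCheck
    if cmp -s "$input" "$fixed"; then
        echo "identical: no auto-fix was applied"
    else
        echo "different: relink to the fixed copy and rerun"
        return 1
    fi
}
```

For example: is_stable LEXICON.release LEXICON.release.3.fixCrossCheck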

          • Check non-ASCII characters
            • Check whether newly appearing non-ASCII chars are legal
              • Compare to the previous year on all nonAscii.char
                • The program compares the files LEXICON.release.NonAscii.line and LEXICON.release.NonAscii.char to the previous release and sends the difference to LEXICON.release.NonAscii.Char.1.3.diff.
                • Go through all new non-ASCII characters in LEXICON.release.NonAscii.Char.1.3.diff in LEXICON.release (and compare to the last release), manually check and modify them if needed, and list all instances.
                • Send this instance list to linguists to verify and tag:
                  > non-ascii char|U+value|replacing ascii char|tag
                  • [K]: keep the instance with non-ASCII char
                  • [D]: delete the instance (with non-ASCII or replacing ASCII chars)
                  • [R]: replace with ASCII-char and keep the instance
              • Find illegal non-ASCII chars
                • Some non-ASCII Unicode characters look the same as ASCII characters. However, they are different when read by a machine and cause issues downstream.
                • Go through the LEXICON.release.NonAscii.char file to see if any illegal non-ASCII characters listed in the following table exist (use the U+value). If so, fix them.
                • For U+03BC and U+00B5: compare the counts
                • Send to linguists to fix in LB if illegal non-ASCII chars are found

                Name | Letter 1 | Letter 2 (Illegal non-ASCII) | Notes
                apostrophe | [']-(APOSTROPHE, U+0027) | [‘]-(LEFT SINGLE QUOTATION MARK, U+2018); [’]-(RIGHT SINGLE QUOTATION MARK, U+2019) | Replace illegal non-ASCII => accepted after 2021+ release
                hyphen | [-]-(HYPHEN-MINUS, U+002D) | [‑]-(NON-BREAKING HYPHEN, U+2011); [–]-(EN DASH, U+2013) | Replace illegal non-ASCII => accepted after 2021+ release
                beta | [β]-(GREEK SMALL LETTER BETA, U+03B2) | [ß]-(LATIN SMALL LETTER SHARP S, U+00DF) | Replace illegal non-ASCII
                mu/micro | [μ]-(GREEK SMALL LETTER MU, U+03BC) | [µ]-(MICRO SIGN, U+00B5) | Both could be legal. Check the records to make sure the right chars are used.
                Y/UPSILON | [Y]-(LATIN CAPITAL LETTER Y, U+0059) | [Υ]-(GREEK CAPITAL LETTER UPSILON, U+03A5) | Both could be legal. Check the records to make sure the right chars are used.
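As a cross-check on the table, the look-alike characters can also be counted directly in the Lexicon file. The sketch below is not the validation program's method: it matches the UTF-8 byte sequence of each code point with grep in the C locale, and it assumes the file is UTF-8 encoded. The function names are hypothetical, and the counts are lines containing the character, not occurrences.

```shell
# Sketch: count lines containing each look-alike non-ASCII character.
# Assumption: the Lexicon file is UTF-8 encoded.
count_char() {
    # $1 = UTF-8 bytes of the character, $2 = file
    LC_ALL=C grep -c -F -- "$1" "$2" || true
}

scan_non_ascii() {
    file=$1
    # Each entry is "code point:UTF-8 bytes as printf octal escapes".
    for entry in \
        'U+2018:\342\200\230' \
        'U+2019:\342\200\231' \
        'U+2011:\342\200\221' \
        'U+2013:\342\200\223' \
        'U+00DF:\303\237' \
        'U+00B5:\302\265' \
        'U+03BC:\316\274' \
        'U+03A5:\316\245'
    do
        code=${entry%%:*}
        bytes=$(printf "${entry#*:}")
        echo "$code $(count_char "$bytes" "$file")"
    done
}
```

Running scan_non_ascii LEXICON.release prints one line per code point, so the U+03BC and U+00B5 counts can be compared side by side.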

        => The final fixed copy is saved as LEXICON.release.log.2.5.nonAsciiFix
      5. Re-run the program with the new LEXICON.release (link) until everything is OK:
        shell> ${LEXICON}/bin/2.ValidateLexicon <year> > log.2

  • Step 2-2: Check TradeMark:
    TradeMark
    • shell> ${LEXICON}/bin/2.ValidateLexicon <year> > log.2
    • Check the word count (wc) of the output files (tradeMark.data)
      Should be 0 because there are no annotations.

  • Step 2-3: Check Irreg Base:
    • Skip - already checked in Check Content since the 2014+ release.
    • Old version of Check Irreg

  • Step 2-4: Check cross-Ref:
    • Skip - already checked in the Cross-Ref step since the 2014+ release.
    • Cross-Ref: An enhanced cross-reference check program was implemented after 2014, and thus the check was removed from LexBuild (web tool). So, this step has to be checked. If issues are found, fix them in LexBuild and go back to Step 0.
    • Old version of Check cross-Ref

Completed: Clean up files and logs: move all logs and files to ./${year}