The SPECIALIST Lexicon

Generate LEXICON in pure ASCII format

This step must be completed before generating the LEXICON tables, because LEXICON.release might need to be modified during this step.

I. Concept: Algorithm of Generating ASCII Lexicon

  1. Convert bases and spelling variants to ASCII and save the ASCII citations
    • Converts non-ASCII citations and spellVars to pure ASCII
  2. Convert to ASCII by going through the Lexicon line by line
    • Convert all lines with non-ASCII characters to pure ASCII by using the Lvg.toAscii APIs
    • Use validAsciiConversions.txt to ensure all ASCII conversions are valid
    • Delete all lines whose bases are not in the Lexicon (no EUI or EUI not in Lexicon)
  3. Clean up the ASCII LEXICON by removing duplicates
    • Delete duplicates introduced by the ASCII conversions in the steps above
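
The cleanup in step 3 can be pictured with a small shell sketch. It is not the GenerateAsciiLexicon implementation. It assumes, which the text above does not state, that every lexical record starts with a line beginning with "{base=", and it drops any line that repeats verbatim an earlier line of the same record. The function name dedupe_record_lines is hypothetical.

```shell
# Sketch only: remove lines that repeat verbatim within one record.
# Assumption: every record begins with a line starting with "{base=".
dedupe_record_lines() {
  awk '
    index($0, "{base=") == 1 { split("", seen) }  # new record: forget earlier lines
    !seen[$0]++                                   # print a line only the first time it is seen
  ' "$@"
}
```

Usage: dedupe_record_lines LEXICON.asciiLine > deduped.txt. Because the set of seen lines is reset at each record start, identical lines in different records are both kept.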

II. Pre-Process: Prepare data and files

  • link LEXICON
    • mkdir ${LEXICON_DIR}/data/${YEAR}/tables
    • cd ${LEXICON_DIR}/data/${YEAR}/tables
    • ln -sf ../data/LEXICON.release LEXICON
  • copy exceptions from the previous year
    • mkdir ${LEXICON_DIR}/data/${YEAR}/ascii
    • shell> cp -rp ${LEXICON}/data/${PRE_YEAR}/ascii/exceptions ${LEXICON}/data/${YEAR}/ascii/exceptions
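
The pre-process commands above can be collected into one function, as a minimal sketch. It assumes that LEXICON_DIR, YEAR and PRE_YEAR are set by the caller and that ${LEXICON} in the copy command names the same directory as ${LEXICON_DIR}; that equivalence is an assumption and should be checked against the local environment. The function name prepare_ascii_inputs is hypothetical.

```shell
# Sketch only: the pre-process steps as one function.
# Assumptions: LEXICON_DIR, YEAR and PRE_YEAR are set by the caller, and
# ${LEXICON} in the copy command names the same directory as ${LEXICON_DIR}.
prepare_ascii_inputs() {
  base="${LEXICON_DIR}/data/${YEAR}"
  # -p creates missing parents and does not fail if the directory exists.
  mkdir -p "${base}/tables" "${base}/ascii" || return 1
  # Same relative link as above, created from inside the tables directory.
  ( cd "${base}/tables" && ln -sf ../data/LEXICON.release LEXICON ) || return 1
  # Copy last year's exception files as the starting point for this year.
  cp -rp "${LEXICON_DIR}/data/${PRE_YEAR}/ascii/exceptions" "${base}/ascii/exceptions"
}
```

If ${base}/ascii/exceptions already exists, cp -rp places the copy inside it rather than replacing it, so the function is meant for a fresh year directory.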

    • validAsciiConversions.txt:
      • Includes all known valid ASCII conversions collected over the years.
      • This file is used in step-3 (of GenerateAsciiLexicon), in the logic described in section I.
      • In step-3, all lines are converted to pure ASCII.
      • Lines that have non-ASCII characters can be converted to ASCII if they belong to the valid ASCII conversions.
      • These conversions are sent to LEXICON.asciiLine (line-by-line conversion).
      • Conversions are removed (the cleanup in step 3 of the logic described above) if they duplicate original ASCII lines, such as irreg.
      • Some conversions are kept, such as compl and trademark.

      • All of the automatic conversions above must be verified (known) as valid conversions (for the ASCII Lexicon).
      • If a conversion is not valid, it is removed.
      • If a conversion is valid, it belongs in the Lexicon. In that case the Lexicon should normally have it already, so the converted line is a duplicate and is deleted in step-3.
      • Sometimes a conversion is valid but is not in the Lexicon yet. In that case, it should be added to the Lexicon (see section IV. Review).
      • Theoretically, all valid ASCII conversions should be deleted in the cleanup step because they are duplicates.

      • This list should be reviewed with linguists and updated annually (see section IV. Review).

      • This feature is needed, and the file needs to be updated annually.
      • Note that there is no valid ASCII conversion for base=xxx, because a record without a pure ASCII citation is deleted in the first step.

    • invalidAsciiExceptions.txt:
      • Includes all known invalid ASCII conversions collected over the years.
      • This file is used in step-4: the review ASCII reports program.
      • This file needs to be updated annually (see section IV. Review).

III. Process: Generate ASCII Lexicon

  • shell> ${LEXICON}/bin/3.GenerateAsciiLexicon <year>
    ${LVG_YEAR}
    ${LC_YEAR}
  • output directory: ${LEXICON}/data/${YEAR}/ascii/logs/
  • output files:
    • LEXICON.asciiBase: processed ASCII file after step-1: base conversion
    • LEXICON.asciiLine: processed ASCII file after step-2: line conversion
    • LEXICON.ascii: final ASCII Lexicon file for release
    • Log files (./logs/):
      • summary.rpt: summary report of ASCII conversion
        • Manually check the lexRecords that were deleted because they have no ASCII base form, using the following steps. There have been 4 known deletions in the past.
          • Run the next step: Review ASCII reports
          • Check ./reports/baseDeleteNotLex.rpt
            E0543077|base|delete|not-Lex|divorcé|divorce|N
            E0702889|base|delete|not-Lex|Pécs|Pecs|N
            E0710983|base|delete|not-Lex|GΩ|GOmega|N
            E0721571|base|delete|not-Lex|μB|muB|N
        • Send to linguists to tag (see below for tagging details)

        Log

      • LEXICON.asciiBaseLog: log file of step-1: conversion from LEXICON.release to LEXICON.asciiBase
      • LEXICON.asciiLineLog: log file of step-2: conversion from LEXICON.asciiBase to LEXICON.asciiLine
      • LEXICON.asciiLog: log file of step-3: conversion from LEXICON.asciiLine to LEXICON.ascii.
      • validAsciiConversions.log: log of valid conversions. The file "validAsciiConversions.txt" should include all the contents of this file.
        • Compare the following 2 files:
          • LEXICON.asciiLog: conversion from LEXICON.asciiLine to LEXICON.ascii.
          • validAsciiConversions.log: log of valid conversions

          => The contents (line count, order, and content) should be the same because all conversions should be valid conversions.
        • If not, examine the differences, and then fix them as follows:
          • Update validAsciiConversions.txt so that all valid ASCII conversions are known (duplicates of the original ASCII).
          • Update LEXICON.release, for example by removing duplicated acronyms. This is why the ASCII Lexicon must be generated before the tables.
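
The comparison of the two logs can be done with standard tools, as a minimal sketch. The function name same_conversion_logs is hypothetical, and the two arguments are the paths to LEXICON.asciiLog and validAsciiConversions.log in the local logs directory.

```shell
# Sketch only: succeed when the two logs are byte-for-byte identical.
# The function name is hypothetical; pass the two log paths as arguments.
same_conversion_logs() {
  if cmp -s "$1" "$2"; then
    echo "identical"
  else
    # Show what differs so the exception file or LEXICON.release can be fixed.
    diff "$1" "$2"
    return 1
  fi
}
```

Usage: same_conversion_logs LEXICON.asciiLog validAsciiConversions.log. A non-zero exit status means the logs differ and one of the two updates listed above is needed.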

    • After running this step, note that the real validation happens in the next step: Review ASCII Reports

IV. Review ASCII Reports

  • shell> ${LEXICON}/bin/4.ReviewAsciiReports <year>

    • Exceptions: invalidAsciiExceptions.txt
      • Invalid ASCII conversions
      • These invalid exceptions are used in the review steps to make sure that all deleted converted lines are known and that no line is deleted by accident (unintentional mistakes).
      • For example, in E0028609|formula, the ASCII conversion of "variants=irreg|formula|formulæ|", which is "variants=irreg|formula|formulae|", is deleted because it is an invalid irreg. However, formulae is a valid plural form from "variants=glreg".
      • This list should be reviewed with linguists.
      • This list is updated annually as described in the annual updates below.

        Exception file | Description | Action
        invalidAsciiExceptions.txt | invalid ASCII conversion that is deleted in line-by-line ASCII conversion | update

    • Annual update on exceptions:
      • go through ${LEXICON}/data/${YEAR}/ascii/out/*.out
        • baseDeleteNotLex.out
        • abbreviationDeleteNoEui.out
        • acronymDeleteNoEui.out
        • complDeleteNoEui.out
        • irregDeleteNoEui.out
        • nominalizationDeleteNoEui.out
        • spVarDeleteNotLex.out
        • trademarkDeleteNoEui.out
      • The format is:
        EUI | Type|action|Reason | non-ASCII | ASCII conversion | Tag (TBD)
        • The Type field can include "|", such as "spVar|delete|not-Lex" or "delete|no-Eui", so a line might look like it has more than 5 fields.
        • All files listed above contain the lines deleted after ASCII conversion because they have no EUI or are not in the Lexicon (base forms).
        • Ideally, no deletion should be performed if a conversion is valid.
        • So all ./out/*.out files should be empty (0), except for entries that are tagged as "C" (see below) and not yet updated in the current Lexicon.release. Those will be changed (added/modified/deleted) in LB and will be corrected (disappear) in future releases.
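
The check for non-empty files and the 5-field reading of a line can be sketched in shell as follows. Both function names are hypothetical. The field split assumes every line carries the EUI first and the non-ASCII form, the ASCII conversion and the tag as its last three "|"-separated values, as in the baseDeleteNotLex.rpt sample lines shown earlier; lines without a tag value would not split correctly.

```shell
# Sketch only: list the *.out files that still contain deletions.
# Argument: the out directory, e.g. ${LEXICON}/data/${YEAR}/ascii/out
nonempty_out_files() {
  find "$1" -name '*.out' -type f -size +0 | sort
}

# Sketch only: print the five logical fields of each line, tab-separated.
# Assumption: the first field is the EUI and the last three are the
# non-ASCII form, the ASCII conversion and the tag; everything in between
# is the Type|action|Reason field, which itself contains "|".
split_out_line() {
  awk -F'|' '{
    mid = $2
    for (i = 3; i <= NF - 3; i++) mid = mid "|" $i
    printf "%s\t%s\t%s\t%s\t%s\n", $1, mid, $(NF-2), $(NF-1), $NF
  }' "$@"
}
```

Usage: nonempty_out_files ./out shows which files need to go to the linguists, and split_out_line ./out/irregDeleteNoEui.out gives a view that is easier to scan in a spreadsheet.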

      • Send the ./out/*.out files that are not empty to lexBuilders (linguists) to tag [N|C|V]:
        Check whether the ASCII conversion of base|spVar (field 4) is legal for its type:
        • base|spVar: must be a Lexicon base form
          • Files:
            • baseDeleteNotLex.out
            • spVarDeleteNotLex.out
          • Format:
            EUI | Base | Action (delete) | Cause (not in Lexicon) | Citation | ASCII conversion

          • Tags:
            • [N]: the ASCII conversion is an invalid base form in that record.
              Do nothing in the LB.
              Add these to invalidAsciiExceptions.txt
            • [Y]: the ASCII conversion is a valid base form in that record.
              Add the ASCII conversion to the records in LB. The ASCII conversion should be the citation, and the non-ASCII base form should be a spVar.
        • abb|acr|irreg|nom|compl|trademark: could be a term that is not a base form in the Lexicon
          • [N]: if it is an illegal (invalid) conversion:
            Add these to invalidAsciiExceptions.txt
          • If it is a legal (valid) conversion:
          • [V]: if it is a legal conversion and already in the Lexicon (the conversion becomes a duplicate), no change is needed in the Lexicon:
            For example, the spVar conversion is valid but duplicates the original ASCII (unless the original lexRecord is missing that spVar, which then needs to be added/modified).
            • Only for spVar: if the base form in the Lexicon differs in case, it should be tagged as [N] (because the check is case sensitive).
            • Add these to validAsciiExceptions.txt
          • [C]: if it is a legal conversion and not in the Lexicon
            • add|change the associated lexRecord
              • if it is a new base: add/modify it in the Lexicon in LB for a future release
              • if it is a typo in spVar: fix the typo in the entry (spVar) to match the ASCII entry
              • if it is missing in spVar: add the ASCII entry (spVar) to the lexical record
              • these issues persist until the fix appears in the next release
              • if it is missing in others: add the ASCII entry (acr, abb, etc.) to the associated record

            • Add these to validAsciiExceptions.txt and update the Lexicon.release
              • For base|spVar: only update the Lexicon.release because the program will check ASCII conversion for all base forms
              • For others (irreg|acr|abb|...): update the Lexicon.release and also add them to validAsciiExceptions.txt

              or
            • Do nothing; however, the *.out won't be 0 (not recommended)
      • Rerun the program:
        • If the Lexicon was edited manually: re-run from Step-II
        • If the Lexicon was not edited manually: re-run step-III (watch out for the valid conversions) and step-IV (watch out for the number of *.out). The two issues above should disappear after updating validAsciiExceptions.txt/Lexicon.release and invalidAsciiExceptions.txt.
      • Please refer to the ASCII LEXICON generation design documents for details
      • Conversion log (see log.3, 3.GenerateAsciiLexicon ${YEAR} > log.3):

        Year | Notes
        2014 | All 88 valid conversions are deleted in step 3.
        2015 | All 90 valid conversions are deleted in step 3 (93 valid exceptions).
        2016 | All 90 valid conversions are deleted in step 3 (93 valid exceptions).
        2017 | All 94 valid conversions are deleted in step 3 (97 valid exceptions).
        2018 | All 92 valid conversions are deleted in step 3 (97 valid exceptions).
        2019 | All 95 valid conversions are deleted in step 3 (100 valid exceptions).
        2020 | All 93 valid conversions are deleted in step 3 (100 valid exceptions).
        2021 | All 100 valid conversions are deleted in step 3 (107 valid exceptions).
        2022 | All 100 valid conversions are deleted in step 3 (107 valid exceptions).
        2023 | All 100 valid conversions are deleted in step 3 (107 valid exceptions).
        2024 | All 100 valid conversions are deleted in step 3 (107 valid exceptions).

      • ./ascii/LEXICON.ascii: is the ASCII release
        shell> cp -rp LEXICON.ascii LEXICON.ascii.${YEAR}
      • ./data/LEXICON.release (LEXICON.release.log.4.nonAsciiFix) is the UTF-8 release
        shell> cp -rp LEXICON.release LEXICON.release.${YEAR}
        • Double-check whether there are non-ASCII Unicode characters in LEXICON.ascii
        • shell> cd /nfsvol/lex/Lu/Development/LVG/Components/Unicode/bin
        • shell> GetNonAsciiFromFile ${LEXICON.ascii} line char
        • shell> wc -l line
          The count must be 0 (no non-ASCII Unicode)
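
As an independent cross-check with standard tools (not a replacement for GetNonAsciiFromFile), the following sketch counts the bytes outside 7-bit ASCII in a file. The function name count_non_ascii_bytes is hypothetical.

```shell
# Sketch only: count bytes outside 7-bit ASCII in a file.
# Not the GetNonAsciiFromFile tool; a generic cross-check with tr and wc.
count_non_ascii_bytes() {
  # Delete every byte in the ASCII range (octal 000-177); what remains is
  # non-ASCII. The last tr strips the padding some wc versions add.
  LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}
```

Usage: count_non_ascii_bytes LEXICON.ascii should print 0 for a pure ASCII release. The number is a byte count, so one accented UTF-8 character contributes 2 or more.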

V. Generate ASCII tables

  • shell> ${LEXICON}/bin/10.GenerateAsciiTables <year>
    9
    • Generate all LEXICON tables from LEXICON.ascii
    • duplicatedSpellingVars should be 0 in step 5
  • shell> ${LEXICON}/bin/10.GenerateAsciiTables <year>
    10

    Add the .ascii extension to all files from the ascii directory