The SPECIALIST Lexicon

Generate LEXICON in pure ASCII format

This step must be completed before generating the LEXICON tables, because LEXICON.release might need to be modified during this step.

I. Concept: Algorithm of Generating ASCII Lexicon

  1. Convert bases and spelling variants to ASCII and save the ASCII citations
    • Convert non-ASCII citations and spellVars to pure ASCII
  2. Convert to ASCII by going through the Lexicon line by line
    • Convert all lines with non-ASCII characters to pure ASCII by using the Lvg.toAscii APIs
    • Use validAsciiConversions.txt to ensure that all ASCII conversions are valid
    • Delete all lines whose bases are not in the Lexicon (no EUI, or EUI not in the Lexicon)
  3. Clean up the ASCII LEXICON by removing duplicates
    • Delete the duplicates introduced by the ASCII conversions in the steps above
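
The conversion and cleanup steps above can be sketched in Python. This is only an illustration under stated assumptions: the real conversion is performed by the Lvg.toAscii APIs, which are not reproduced here. The names `to_ascii`, `SYMBOL_MAP`, and `dedupe` are hypothetical, the symbol map covers only characters that appear in examples in this document, and the sample record lines are made up.

```python
import unicodedata

# Hypothetical map for characters that do not decompose to ASCII on their own.
# The entries mirror examples in this document (formulæ, GΩ, μB); the real
# Lvg conversion tables are much larger.
SYMBOL_MAP = {"\u00e6": "ae", "\u03a9": "Omega", "\u03bc": "mu"}


def to_ascii(text):
    """Return a pure-ASCII form of text, or raise ValueError if none is known."""
    out = []
    for ch in unicodedata.normalize("NFKD", text):
        if ord(ch) < 128:
            out.append(ch)
        elif ch in SYMBOL_MAP:
            out.append(SYMBOL_MAP[ch])
        elif unicodedata.combining(ch):
            continue  # drop the accent mark left behind by the decomposition
        else:
            raise ValueError("no ASCII conversion for %r" % ch)
    return "".join(out)


def dedupe(lines):
    """Remove duplicated lines of one record, keeping the first occurrence."""
    seen = set()
    kept = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return kept


# Hypothetical lines of a single record: the converted spelling variant
# collides with a line that is already pure ASCII.
record = ["spelling_variant=cafe", "spelling_variant=caf\u00e9"]
ascii_record = dedupe([to_ascii(line) for line in record])
```

Duplicates are removed per record here on purpose: lines such as a category line legitimately repeat across different records, so a file-wide deduplication would be wrong.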

II. Pre-Process: Prepare data and files

  • link LEXICON
    • mkdir ${LEXICON_DIR}/data/${YEAR}/tables
    • cd ${LEXICON_DIR}/data/${YEAR}/tables
    • ln -sf ../data/LEXICON.release LEXICON
  • copy exceptions from the previous year
    • mkdir ${LEXICON_DIR}/data/${YEAR}/ascii
    • shell> cp -rp ${LEXICON}/data/${PRE_YEAR}/ascii/exceptions ${LEXICON}/data/${YEAR}/ascii/exceptions

    • validAsciiConversions.txt:
      • Includes all valid ASCII conversions known over the years.
      • This file is used in step-3 (GenerateAsciiLexicon), following the logic described in section I.
      • In step-3, all lines are converted to pure ASCII.
      • Lines that have non-ASCII characters can be converted to ASCII if they belong to the valid ASCII conversions.
      • These conversions are sent to LEXICON.asciiLine (line-by-line conversion).
      • Conversions are removed (in the cleanup, step 3 of the logic described above) if they duplicate original ASCII lines, such as irreg.
      • Some conversions are kept, such as compl and trademark.

      • All the automatic conversions above must be verified (known) as valid conversions (for the ASCII Lexicon).
      • If a conversion is not valid, it is removed.
      • If a conversion is valid, add it to the Lexicon. However, a valid conversion should already be in the Lexicon, so it should be deleted in step-3.
      • Sometimes a conversion is valid but is not in the Lexicon yet. In that case, it should be added to the Lexicon (see section IV. Review).
      • Theoretically, all valid ASCII conversions should be deleted in the cleanup step because they are duplicates.

      • This list should be reviewed with linguists and updated annually (see section IV. Review).

      • This feature is needed and the file needs to be updated annually.
      • Please note that there is no valid ASCII conversion for base=xxx, because a record without a pure ASCII citation is deleted in the first step.
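
The decision rules in the list above can be summarized in a small sketch. It rests on assumptions: the layout of validAsciiConversions.txt is not described here, so the valid conversions are passed in as a plain set of strings, and `classify_conversion` and its return labels are hypothetical names, not part of the actual program.

```python
def classify_conversion(converted, valid_conversions, record_lines):
    """Classify one converted line of a record.

    converted:         the pure-ASCII line produced by the conversion
    valid_conversions: known valid conversions (assumed to be loaded as a set)
    record_lines:      the lines of the same record that were already ASCII
    """
    if converted not in valid_conversions:
        # Not a known valid conversion: the line is removed.
        return "remove-invalid"
    if converted in record_lines:
        # Valid, but the record already has it: deleted in the cleanup step.
        return "remove-duplicate"
    # Valid but missing from the Lexicon: a candidate for the annual review.
    return "review-add"
```

In the ideal case described above, every valid conversion ends up in the duplicate branch, so nothing new reaches the review.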

    • invalidAsciiExceptions.txt:
      • Includes all invalid ASCII conversions known over the years.
      • This file is used in step-4: the review ASCII reports program.
      • This file needs to be updated annually (see section IV. Review).

III. Process: Generate ASCII Lexicon

  • shell> ${LEXICON}/bin/3.GenerateAsciiLexicon <year>
    ${LVG_YEAR}
    ${LC_YEAR}
  • output directory: ${LEXICON}/data/${YEAR}/ascii/logs/
  • output files:
    • LEXICON.asciiBase: processed ASCII file after step-1: base conversion
    • LEXICON.asciiLine: processed ASCII file after step-2: line conversion
    • LEXICON.ascii: final ASCII Lexicon file for release
    • Log files (./logs/):
      • summary.rpt: summary report of ASCII conversion
        • Manually check the lexRecords deleted for lack of an ASCII base form, using the following step. There have been 4 known deletions in the past.
          • Run the next step, Review ASCII reports (4.ReviewAsciiReports ${YEAR}), for further processing (see section IV).

          Log

        • LEXICON.asciiBaseLog: log file of step-1: conversion from LEXICON.release to LEXICON.asciiBase
        • LEXICON.asciiLineLog: log file of step-2: conversion from LEXICON.asciiBase to LEXICON.asciiLine
        • LEXICON.asciiLog: log file of step-3: conversion from LEXICON.asciiLine to LEXICON.ascii.
        • validAsciiConversions.log: log of valid conversions. The file "validAsciiConversions.txt" should include all contents of this file.
          • Manually compare the following 2 files:
            • LEXICON.asciiLog: conversion from LEXICON.asciiLine to LEXICON.ascii.
            • validAsciiConversions.log: log for valid conversion

            => The contents (line numbers, order, and content) should be the same because all conversions should be valid conversions.
          • If they differ, examine the differences and fix them as follows:
            • Update validAsciiConversions.txt => so that all valid ASCII conversions are known (duplicates of the original ASCII).
            • Update LEXICON.release, for example by removing duplicated acronyms. This is why the ASCII Lexicon must be generated before the tables.
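
The manual comparison of LEXICON.asciiLog and validAsciiConversions.log could be assisted by a short script. The following is a minimal sketch using Python's standard `difflib`; `compare_logs` is a hypothetical helper, not part of the Lexicon tooling, and it assumes both logs have been read into lists of lines.

```python
import difflib


def compare_logs(ascii_log_lines, valid_log_lines):
    """Return unified-diff lines; an empty list means the logs are identical."""
    return list(
        difflib.unified_diff(
            ascii_log_lines,
            valid_log_lines,
            fromfile="LEXICON.asciiLog",
            tofile="validAsciiConversions.log",
            lineterm="",
        )
    )
```

Because the diff is computed on ordered lists, a difference in line count, order, or content all show up as non-empty output, which matches the three properties the comparison is meant to check.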

      • Run this step; the real validation happens in the next step, Review ASCII Reports.

IV. Review ASCII Reports

      • shell> ${LEXICON}/bin/4.ReviewAsciiReports <year>

      • Exceptions: invalidAsciiExceptions.txt
        • This file lists invalid ASCII conversions.
        • These exceptions are used in the review steps to make sure that every deleted converted line is known and that no line is deleted by accident (unintentional mistakes).
          For example, in the case of "E0028609|formula", the ASCII conversion of "variants=irreg|formula|formulæ|" is "variants=irreg|formula|formulae|". This ASCII conversion is deleted because it is an invalid irreg. However, formulae is still a valid plural form, derived from "variants=glreg".
        • This list should be reviewed with linguists.
        • This list needs to be updated annually as described below.

          Exception file | Description | Action
          invalidAsciiExceptions.txt | invalid ASCII conversions that are deleted in the line-by-line ASCII conversion | update

      • Annual update on exceptions:
        • go through ${LEXICON}/data/${YEAR}/ascii/out/*.out
          • baseDeleteNotLex.out
            • Check ./reports/baseDeleteNotLex.rpt
              known deleted bases due to ASCII conversion:
              E0543077|base|delete|not-Lex|divorcé|divorce|N
              E0702889|base|delete|not-Lex|Pécs|Pecs|N
              E0710983|base|delete|not-Lex|GΩ|GOmega|N
              E0721571|base|delete|not-Lex|μB|muB|N
          • Send new deletions to the linguists to tag (see below for tagging details)
        • abbreviationDeleteNoEui.out
        • acronymDeleteNoEui.out
        • complDeleteNoEui.out
        • irregDeleteNoEui.out
        • nominalizationDeleteNoEui.out
        • spVarDeleteNotLex.out
        • trademarkDeleteNoEui.out
      • The format is:
        EUI|Type|Action|Reason|non-ASCII|ASCII conversion|Tag (TBD)
        • Type: spVar, base, etc.
        • Action: delete
        • Reason: not-Lex, no-Eui, etc.
        • All files listed above contain lines that were deleted after ASCII conversion because they have no EUI or are not in the Lexicon (not a base form).
        • Ideally, no deletion should be performed if the conversion is valid.
        • So all ./out/*.out files should be empty (0), except for entries that are tagged as "C" (see below) and not yet updated in the current Lexicon.release. Those entries will be changed (added/modified/deleted) in LB and will be corrected (disappear) in future releases.
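
For scripted review, the pipe-delimited lines of the *.out files can be split into named fields. This is a minimal sketch: `OutLine` and `parse_out_line` are hypothetical names, and accepting a missing tag (six fields) for lines that have not been tagged yet is an assumption, since the exact layout of untagged lines is not shown here.

```python
from collections import namedtuple

# Field order follows the format line above.
OutLine = namedtuple(
    "OutLine", "eui kind action reason non_ascii ascii_form tag"
)


def parse_out_line(line):
    """Split one *.out line into its fields."""
    fields = line.rstrip("\n").split("|")
    if len(fields) == 6:
        fields.append("")  # assumption: untagged lines carry no tag field
    if len(fields) != 7:
        raise ValueError("expected 6 or 7 fields, got %d" % len(fields))
    return OutLine(*fields)


rec = parse_out_line("E0543077|base|delete|not-Lex|divorc\u00e9|divorce|N")
```

With the known deletion above, `rec.kind` is `base`, `rec.ascii_form` is `divorce`, and `rec.tag` is `N`.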

      • Send each non-empty ./out/*.out file to the lexBuilders (linguists) to tag:
        Check whether the ASCII conversion of the base|spVar (field 6) is legal for its type:
        • base|spVar: must be a lexicon base form
          • Files:
            • baseDeleteNotLex.out
            • spVarDeleteNotLex.out
          • Format:
            EUI | Type (spVar|base) | Action (delete) | Reason (not in Lexicon) | non-ASCII Citation | ASCII conversion | Tag (Y|N)

          • Tags:
            • [N]: the ASCII conversion is an invalid base form in that record.
              Do nothing in the LB.
              Add these to invalidAsciiExceptions.txt
            • [Y]: the ASCII conversion is a valid base form in that record.
              Add the ASCII conversion to the record in LB. The ASCII conversion should be the citation, and the non-ASCII base form should be a spVar.
        • abb|acr|irreg|nom|compl|trademark: could be a term that is not a base form in the Lexicon
          • [N]: An illegal (invalid) conversion:
            Add these to invalidAsciiExceptions.txt
          • [V]: A legal conversion that is already in the Lexicon (the conversion becomes a duplicate); no change is needed in the Lexicon:
            For example, the spVar conversion is valid but duplicates the original ASCII (unless the original lexRecord is missing that spVar, in which case it needs to be added/modified).
            • Only for spVar: if a base form that differs only in case is in the Lexicon, it should be tagged as [N] (because the check is case sensitive).
            • Add these to validAsciiExceptions.txt
          • [C]: A legal conversion and not in Lexicon
            • Add to or change the associated lexRecord:
              • if it is a new base: add or modify it in the Lexicon in LB for a future release
              • if it is a typo in a spVar: fix the typo in the entry (spVar) to match the ASCII entry
              • if it is missing from the spVars: add the ASCII entry (spVar) to the lexical record
              • these issues persist until the fix appears in the next release
              • if it is missing elsewhere: add the ASCII entry (acr, abb, etc.) to the associated record

            • Add these to validAsciiExceptions.txt and update the Lexicon.release
              • For base|spVar: only update the Lexicon.release, because the program checks the ASCII conversion for all base forms
              • For others (irreg|acr|abb|...): update the Lexicon.release and add to validAsciiExceptions.txt

              or
            • Do nothing; however, the *.out files will then not be empty (not recommended)
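
The follow-up for each tag can be written down as a small lookup, which may help when processing many tagged lines. This sketch reflects one reading of the tagging rules above; `targets_to_update` is a hypothetical helper, and "LB" stands for a manual change in LB rather than a file.

```python
BASE_TYPES = {"base", "spVar"}


def targets_to_update(kind, tag):
    """Return what has to be updated for a tagged *.out line."""
    if tag == "N":
        # Invalid conversion: record it as a known exception.
        return ["invalidAsciiExceptions.txt"]
    if tag == "Y" and kind in BASE_TYPES:
        # Valid base form: add the ASCII conversion to the record in LB.
        return ["LB"]
    if tag == "V":
        # Valid and already in the Lexicon: only the exception list changes.
        return ["validAsciiExceptions.txt"]
    if tag == "C":
        if kind in BASE_TYPES:
            return ["Lexicon.release"]
        return ["Lexicon.release", "validAsciiExceptions.txt"]
    raise ValueError("unexpected tag %r for type %r" % (tag, kind))
```

Raising on an unknown combination is a deliberate choice in this sketch, so that a mistyped tag is noticed instead of being silently skipped.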
      • Rerun the program:
        • If the Lexicon was edited manually: re-run from step II
        • If the Lexicon was not edited manually: re-run step III (watch the validConversion) and step IV (watch the number of *.out entries). Both issues should disappear after updating validAsciiExceptions.txt/Lexicon.release and invalidAsciiExceptions.txt.
      • Please refer to the ASCII LEXICON generation design documents for details
      • Conversion log (see log.3, 3.GenerateAsciiLexicon ${YEAR} > log.3):

        Year | Notes
        2014 | All 88 valid conversions are deleted in step 3.
        2015 | All 90 valid conversions are deleted in step 3 (93 valid exceptions).
        2016 | All 90 valid conversions are deleted in step 3 (93 valid exceptions).
        2017 | All 94 valid conversions are deleted in step 3 (97 valid exceptions).
        2018 | All 92 valid conversions are deleted in step 3 (97 valid exceptions).
        2019 | All 95 valid conversions are deleted in step 3 (100 valid exceptions).
        2020 | All 93 valid conversions are deleted in step 3 (100 valid exceptions).
        2021 | All 100 valid conversions are deleted in step 3 (107 valid exceptions).
        2022 | All 100 valid conversions are deleted in step 3 (107 valid exceptions).
        2023 | All 100 valid conversions are deleted in step 3 (107 valid exceptions).
        2024 | All 100 valid conversions are deleted in step 3 (107 valid exceptions).
        2025 | All 100 valid conversions are deleted in step 3 (107 valid exceptions).
        2026 | All 100 valid conversions are deleted in step 3 (107 valid exceptions).

      • ./ascii/LEXICON.ascii is the ASCII release
        shell> cp -rp LEXICON.ascii LEXICON.ascii.${YEAR}
      • ./data/LEXICON.release (LEXICON.release.log.4.nonAsciiFix) is the UTF-8 release
        shell> cp -rp LEXICON.release LEXICON.release.${YEAR}
        • Double-check whether there are any non-ASCII Unicode characters in LEXICON.ascii
        • shell> cd /nfsvol/lex/Lu/Development/LVG/Components/Unicode/bin
        • shell> GetNonAsciiFromFile ${LEXICON.ascii} line char
        • shell> wc -l line => must be 0 (no non-ASCII Unicode)
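
As an independent cross-check of the same condition, a few lines of Python can scan a file for characters outside the ASCII range. This is a sketch, not a replacement for GetNonAsciiFromFile; `find_non_ascii` is a hypothetical helper, and reading LEXICON.ascii as UTF-8 is an assumption.

```python
def find_non_ascii(lines):
    """Return (line_number, character) pairs for every non-ASCII character."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for ch in line:
            if ord(ch) > 127:
                hits.append((number, ch))
    return hits


# Usage sketch (the file name and encoding are assumptions):
# with open("LEXICON.ascii", encoding="utf-8") as handle:
#     assert find_non_ascii(handle) == []
```

An empty result corresponds to the line count of 0 required above.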

V. Generate ASCII tables

  • shell> ${LEXICON}/bin/10.GenerateAsciiTables <year>
    9
    • Generate all LEXICON tables from LEXICON.ascii
    • duplicatedSpellingVars should be 0 in step 5
  • shell> ${LEXICON}/bin/10.GenerateAsciiTables <year>
    10

    Add the .ascii extension to all files from the ascii directory