SPECIALIST Lexicon

Generate LEXICON in pure ASCII format

This step must be completed before generating the LEXICON tables, because LEXICON.release might need to be modified during this step.

I. Concept: Algorithm of Generating ASCII Lexicon

Convert base forms and spelling variants to ASCII and save the ASCII citations
- Converts non-ASCII citations and spellVars to pure ASCII
Convert to ASCII by going through the Lexicon line by line
- Convert all lines with non-ASCII characters to pure ASCII by using the Lvg.toAscii APIs
- Use validAsciiConversions.txt to ensure all ASCII conversions are valid
- Delete all lines whose bases are not in the Lexicon (no EUI, or EUI not in the Lexicon)
Clean up the ASCII LEXICON by removing duplicates
- Delete duplicates introduced by the ASCII conversions in the steps above
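
The cleanup step above can be illustrated with a minimal Python sketch. It is not the project's implementation (the real conversion uses the Lvg.toAscii APIs, which are not reproduced here); the function name and the sample record, including its EUI and terms, are hypothetical and only show how a converted line that repeats an existing ASCII line would be dropped.

```python
def remove_duplicate_lines(record_lines):
    """Drop repeated lines within one record, keeping the first occurrence."""
    seen = set()
    kept = []
    for line in record_lines:
        if line in seen:
            continue  # a converted line that duplicates an earlier line
        seen.add(line)
        kept.append(line)
    return kept


# Hypothetical record after line-by-line conversion: the converted irreg
# line now repeats a line that was already pure ASCII.
record = [
    "{base=example",
    "entry=E0000000",
    "\tvariants=irreg|example|examplae|",
    "\tvariants=irreg|example|examplae|",
    "}",
]
cleaned = remove_duplicate_lines(record)
```

Order is preserved on purpose, so the cleaned record keeps the layout of the original one.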

II. Pre-Process: Prepare data and files

Link LEXICON
- mkdir ${LEXICON_DIR}/data/${YEAR}/tables
- cd ${LEXICON_DIR}/data/${YEAR}/tables
- ln -sf ../data/LEXICON.release LEXICON
Copy exceptions from the previous year
- mkdir ${LEXICON_DIR}/data/${YEAR}/ascii
- shell> cp -rp ${LEXICON}/data/${PRE_YEAR}/ascii/exceptions ${LEXICON}/data/${YEAR}/ascii/exceptions
- validAsciiConversions.txt:
  - Includes all known valid ASCII conversions over the years
  - This file is used in step-3 (of GenerateAsciiLexicon) in the logic described in section I.
  - In step-3, all lines are converted to pure ASCII.
  - Lines that have non-ASCII characters can be converted to ASCII if they belong to the valid ASCII conversions.
  - These conversions are sent to LEXICON.asciiLine (line-by-line conversion).
  - Conversions are removed (the cleanup in step 3 of the logic described above) if they duplicate original ASCII lines, such as irreg.
  - Some conversions are kept, such as compl and trademark.
  - All of the automatic conversions above must be verified (known) as valid conversions (for the ASCII Lexicon).
  - If a conversion is not valid, it is removed.
  - If a conversion is valid, it belongs in the Lexicon. However, a valid conversion should already be in the Lexicon, in which case it is deleted as a duplicate in step-3.
  - Sometimes a conversion is valid but is not in the Lexicon yet. In that case, it should be added to the Lexicon (see section IV. Review).
  - Theoretically, all valid ASCII conversions should be deleted in the cleanup step because they are duplicates.
  - This list should be reviewed with linguists and updated annually (see section IV. Review).
  - This feature is needed and the file needs to be updated annually.
  - Please note that there is no valid ASCII conversion for base=xxx, because a record without a pure ASCII citation is deleted in the first step.
- invalidAsciiExceptions.txt:
  - Includes all known invalid ASCII conversions over the years
  - This file is used in step-4: the review ASCII reports program
  - This file needs to be updated annually (see section IV. Review).

III. Process: Generate ASCII Lexicon

shell> ${LEXICON}/bin/3.GenerateAsciiLexicon <year>
${LVG_YEAR}
${LC_YEAR}
output directory: ${LEXICON}/data/${YEAR}/ascii/logs/

output files:

LEXICON.asciiBase: processed ASCII file after step-1: base conversion
LEXICON.asciiLine: processed ASCII file after step-2: line conversion
LEXICON.ascii: final ASCII Lexicon file for release
Log files (./logs/):
- summary.rpt: summary report of ASCII conversion
  - Manually check the lexRecords deleted for lack of an ASCII base form by following the steps below. There are 4 known deletions from past years.
    - Run the next step: Review ASCII reports (4.ReviewAsciiReports ${YEAR})
    - Check ./reports/baseDeleteNotLex.rpt
      Known deleted bases due to ASCII conversion:
      E0543077|base|delete|not-Lex|divorcé|divorce|N
      E0702889|base|delete|not-Lex|Pécs|Pecs|N
      E0710983|base|delete|not-Lex|GΩ|GOmega|N
      E0721571|base|delete|not-Lex|μB|muB|N
  - Send new deletions to the linguists to tag (see below for tagging details)
- LEXICON.asciiBaseLog: log file of step-1: conversion from LEXICON.release to LEXICON.asciiBase
- LEXICON.asciiLineLog: log file of step-2: conversion from LEXICON.asciiBase to LEXICON.asciiLine
- LEXICON.asciiLog: log file of step-3: conversion from LEXICON.asciiLine to LEXICON.ascii
- validAsciiConversions.log: log of valid conversions. The file "validAsciiConversions.txt" should include all the contents of this file.
  - Manually compare the following 2 files:
    - LEXICON.asciiLog: conversion from LEXICON.asciiLine to LEXICON.ascii
    - validAsciiConversions.log: log of valid conversions
    => The contents (number of lines, order, and content) should be the same because all conversions should be valid conversions.
  - If they differ, examine the differences, and then fix them as follows:
    - Update LEXICON.release, for example by removing duplicated acronyms. This is why the ASCII Lexicon must be generated before the tables.
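
The manual comparison of the two log files could also be scripted. The following is a minimal sketch using Python's standard difflib module; the function name is hypothetical, and it assumes both logs are plain text that can be compared line by line without any normalization.

```python
import difflib


def compare_logs(ascii_log_lines, valid_log_lines):
    """Return unified-diff lines; an empty list means the two logs match."""
    return list(
        difflib.unified_diff(
            ascii_log_lines,
            valid_log_lines,
            fromfile="LEXICON.asciiLog",
            tofile="validAsciiConversions.log",
            lineterm="",
        )
    )


def read_lines(path):
    """Read a UTF-8 text file into a list of lines without line endings."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()
```

A typical call would be compare_logs(read_lines("LEXICON.asciiLog"), read_lines("validAsciiConversions.log")); any returned lines point at conversions that need to be examined.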
Run this step; the real validation happens in the next step, Review ASCII Reports.

IV. Review ASCII Reports
shell> ${LEXICON}/bin/4.ReviewAsciiReports <year>

Exceptions: invalidAsciiExceptions.txt
- Invalid ASCII conversions
- These invalid exceptions are used in the review steps to make sure that all deleted converted lines are known and that no line is deleted by accident (unintentional mistakes).
- For example, for E0028609|formula, the ASCII conversion of "variants=irreg|formula|formulæ|", which is "variants=irreg|formula|formulae|", is deleted because it is an invalid irreg. However, formulae is a valid plural form from "variants=glreg".
- This list should be consulted with linguists.
- This list is updated annually as described in the annual updates below.
  

Exception files	Description	Action
invalidAsciiExceptions.txt	invalid ASCII conversion that is deleted in line to line ASCII conversion	update

Annual update on exceptions:

Go through ${LEXICON}/data/${YEAR}/ascii/out/*.out
- baseDeleteNotLex.out
- abbreviationDeleteNoEui.out
- acronymDeleteNoEui.out
- complDeleteNoEui.out
- irregDeleteNoEui.out
- nominalizationDeleteNoEui.out
- spVarDeleteNotLex.out
- trademarkDeleteNoEui.out
The format is:

EUI	Type|action|Reason	non-ASCII	ASCII conversion	Tag (TBD)
- The Type field can include "|", such as "spVar|delete|not-Lex" or "delete|no-Eui", so one line might look as if it has more than 5 fields.
- All files listed above contain lines deleted after ASCII conversion because they have no EUI or (for base forms) are not in the Lexicon.
- Ideally, no deletion should be performed if it is a valid conversion.
- So, all ./out/*.out files should be empty (0), except for entries that are tagged as "C" (see below) and not yet updated in the current Lexicon.release. Those will be changed (added/modified/deleted) in LB and will be corrected (disappear) in future releases.
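
Because the Type field itself contains "|", a plain split on "|" yields more than 5 pieces. The sketch below shows one way to read such a line in Python by taking the EUI from the front and the last three fields from the back. The class and function names are hypothetical, and it assumes that the tag is present as the last field and that neither term contains "|".

```python
from typing import NamedTuple


class OutLine(NamedTuple):
    eui: str
    type_action_reason: str
    non_ascii: str
    ascii_conversion: str
    tag: str


def parse_out_line(line: str) -> OutLine:
    """Split one report line into its 5 logical fields."""
    parts = line.rstrip("\n").split("|")
    if len(parts) < 5:
        raise ValueError(f"expected at least 5 fields: {line!r}")
    # Everything between the EUI and the last three fields is the Type field.
    return OutLine(
        eui=parts[0],
        type_action_reason="|".join(parts[1:-3]),
        non_ascii=parts[-3],
        ascii_conversion=parts[-2],
        tag=parts[-1],
    )


# One of the known base deletions listed earlier in this document.
parsed = parse_out_line("E0543077|base|delete|not-Lex|divorcé|divorce|N")
```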
Send each ./out/*.out file that is not empty to the lexBuilders (linguists) to tag [N|C|V]:
Check whether the ASCII conversion of base|spVar (field 4) is legal for its type:
- base|spVar: must be a lexicon base form
  - Files:
    - baseDeleteNotLex.out
    - spVarDeleteNotLex.out
  - Format:
    
    EUI	Base	Action (delete)	Cause (not in Lexicon)	Citation	ASCII conversion
  - Tags:
    - [N]: the ASCII conversion is an invalid base form in that record.
      Do nothing in the LB.
      Add these to invalidAsciiExceptions.txt
    - [Y]: the ASCII conversion is a valid base form in that record.
      Add the ASCII conversion to the record in LB. The ASCII conversion should be the citation, and the non-ASCII base form should be a spVar.
- abb|acr|irreg|nom|compl|trademark: could be a term that is not a base form in the Lexicon
  - [N]: if it is an illegal (invalid) conversion:
    Add these to invalidAsciiExceptions.txt
  - If it is a legal (valid) conversion:
  - [V]: if it is a legal conversion and already in the Lexicon (the conversion becomes a duplicate), no change is needed in the Lexicon:
    For example, the spVar conversion is valid, but it duplicates the original ASCII entry (unless the original lexRecord is missing that spVar, which then needs to be added/modified).
    - Only for spVar: if the base form with a different case is in the Lexicon, it should be tagged as [N] (because it is case sensitive).
    - Add these to validAsciiExceptions.txt
  - [C]: if it is a legal conversion and not in the Lexicon
    - add|change the associated lexRecord
      - if it is a new base: add/modify it in the Lexicon in LB for a future release
      - if it is a typo in a spVar: fix the typo in the entry (spVar) to match the ASCII entry
      - if it is missing from the spVars: add the ASCII entry (spVar) to the lexical record
      - these issues remain until the fix appears in the next release
      - if it is missing elsewhere: add the ASCII entry (acr, abb, etc.) to the associated record
    - Add these to validAsciiExceptions.txt and update the Lexicon.release
      - For base|spVar: only update the Lexicon.release because the program checks the ASCII conversion for all base forms
      - For others (irreg|acr|abb|...): update the Lexicon.release and add them to validAsciiExceptions.txt
      or
    - Do nothing; however, the *.out will not be 0 (not recommended)
Rerun the program:
- If the Lexicon was edited manually: re-run from Step-II
- If the Lexicon was not edited manually: re-run step-III (watch out for validConversion) and step-IV (watch out for the *.out counts). Both issues should disappear after updating validAsciiExceptions.txt/Lexicon.release and invalidAsciiExceptions.txt.
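
After a rerun, the remaining ./out/*.out content can be summarized with a short script. This is a minimal Python sketch under the assumption that the files are UTF-8 text with one deleted line per row; the function name is hypothetical and the directory is passed in by the caller.

```python
from pathlib import Path


def non_empty_out_files(out_dir):
    """Return {file name: line count} for every *.out file that has content."""
    counts = {}
    for path in sorted(Path(out_dir).glob("*.out")):
        with path.open(encoding="utf-8") as handle:
            line_count = sum(1 for _ in handle)
        if line_count:
            counts[path.name] = line_count
    return counts
```

An empty result means that every *.out file is empty; otherwise the listed files are the ones to send for tagging.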
Please refer to the ASCII LEXICON generation design documents for details.

Conversion log (see log.3, 3.GenerateAsciiLexicon ${YEAR} > log.3):

Year	Notes
2014	All 88 valid conversions are deleted in step 3.
2015	All 90 valid conversions are deleted in step 3 (93 valid exceptions).
2016	All 90 valid conversions are deleted in step 3 (93 valid exceptions).
2017	All 94 valid conversions are deleted in step 3 (97 valid exceptions).
2018	All 92 valid conversions are deleted in step 3 (97 valid exceptions).
2019	All 95 valid conversions are deleted in step 3 (100 valid exceptions).
2020	All 93 valid conversions are deleted in step 3 (100 valid exceptions).
2021	All 100 valid conversions are deleted in step 3 (107 valid exceptions).
2022	All 100 valid conversions are deleted in step 3 (107 valid exceptions).
2023	All 100 valid conversions are deleted in step 3 (107 valid exceptions).
2024	All 100 valid conversions are deleted in step 3 (107 valid exceptions).
2025	All 100 valid conversions are deleted in step 3 (107 valid exceptions).

./ascii/LEXICON.ascii is the ASCII release
shell> cp -rp LEXICON.ascii LEXICON.ascii.${YEAR}
./data/LEXICON.release (LEXICON.release.log.4.nonAsciiFix) is the UTF-8 release
shell> cp -rp LEXICON.release LEXICON.release.${YEAR}
- Double-check that there is no non-ASCII Unicode in LEXICON.ascii
- shell> cd /nfsvol/lex/Lu/Development/LVG/Components/Unicode/bin
- shell> GetNonAsciiFromFile ${LEXICON.ascii} line char
- shell> wc -l line
  => The count must be 0 (no non-ASCII Unicode)
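
As an independent cross-check, the same condition can be tested with a few lines of Python. This sketch is only a stand-in and does not reproduce the output format of GetNonAsciiFromFile; the function name is hypothetical.

```python
def find_non_ascii(lines):
    """Yield (line number, offending characters) for lines that are not pure ASCII."""
    for number, line in enumerate(lines, start=1):
        bad = [ch for ch in line if ord(ch) > 127]
        if bad:
            yield number, "".join(bad)


def check_file(path):
    """Return the non-ASCII findings for a UTF-8 text file as a list."""
    with open(path, encoding="utf-8") as handle:
        return list(find_non_ascii(handle))
```

An empty list from check_file("LEXICON.ascii") corresponds to the required count of 0.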

V. Generate ASCII tables

shell> ${LEXICON}/bin/10.GenerateAsciiTables <year> 9
- Generate all LEXICON tables from LEXICON.ascii
- duplicatedSpellingVars should be 0 in step 5
shell> ${LEXICON}/bin/10.GenerateAsciiTables <year> 10
Add the .ascii extension to all files in the ascii directory
