Lexical Tools

Procedures for preparing Lexical Tools data files

Scripts and programs generate the Lexical Tools data files automatically. The details are as follows:

I. Location:

  • "${LVG_COMPONENTS}/PreDataBase/bin/"

II. Inputs:

  • "${LEXICON}/data/${YEAR}/tables/"
  • "${LVG_COMPONENTS}/PreDataBase/data/${YEAR}/data/"

III. Outputs:

  • "${LVG_COMPONENTS}/PreDataBase/data/${YEAR}/data/"

IV. Detailed procedures:

  • shell> 1.LoadLexiconFiles ${YEAR}

    This script copies the original files to the dataOrg directory ($LVG_COMPONENTS/PreDataBase/data/${YEAR}/dataOrg/).

    Steps | Notes | Source | Target
    1 | Copy inflection variables file | $Lexicon/${YEAR}/tables/inflVars.data | $PreDataBase/data/${YEAR}/dataOrg/inflVars.data
    2 | Copy & modify acronyms file | $Lexicon/${YEAR}/tables/LRABR | $PreDataBase/data/${YEAR}/dataOrg/acronyms; $PreDataBase/data/${YEAR}/dataOrg/acr_exp
    3 | Copy & modify proper file | $Lexicon/${YEAR}/tables/LRPRP | $PreDataBase/data/${YEAR}/dataOrg/proper
    4 | Copy nominalization file | $Lexicon/${YEAR}/tables/LRNOM | $PreDataBase/data/${YEAR}/dataOrg/LRNOM
    5 | Copy synonyms file | None (synonym.data has its own generation script for 2017+) | $PreDataBase/data/${YEAR}/dataOrg/synonyms.data
    6 | Copy derivation file | None (derivation.data has its own generation script for 2013+) | None
    7 | Copy antonym file | None (antonym.data has its own generation script for 2022+) | None
    8 | Run above 7 steps | see above | see above

  • shell> 2.GenerateLexiconFiles ${YEAR}

    This script generates the final lvg files in the data directory from the dataOrg directory.

    Steps | Notes | Source | Target
    1 | Copy inflection variables file | $PreDataBase/data/${YEAR}/dataOrg/inflVars.data | $PreDataBase/data/${YEAR}/data/infl.data
    2 | Copy & Modify acronyms file | $PreDataBase/data/${YEAR}/dataOrg/acronyms | $PreDataBase/data/${YEAR}/data/acronym.data
    3 | Copy proper file | $PreDataBase/data/${YEAR}/dataOrg/proper | $PreDataBase/data/${YEAR}/data/properNoun.data
    4 | Copy nominalization file | $PreDataBase/data/${YEAR}/dataOrg/LRNOM | $PreDataBase/data/${YEAR}/data/nominalization.data
    5 | Copy synonyms file | $Synonyms/data/${YEAR}/outData/Results/synonyms.data.${YEAR}.release | $PreDataBase/data/${YEAR}/data/synonyms.data
    6 | Copy derivation files | $Derivation/5.All/data/${YEAR}/data/derivation.data | $PreDataBase/data/${YEAR}/data/derivation.data
    7 | Copy antonyms file | $Antonym/data/0.Antonym/${YEAR}/output/antonyms.data.${YEAR}.release | $PreDataBase/data/${YEAR}/data/antonyms.data
    8 | Run above 7 steps | see above | see above

  • shell> 3.MoveLexiconFiles ${YEAR}

    This script copies/moves the final lvg files from the data directory to the ${LVG}/data/tables directory.

    Steps | Notes | Source | Target
    1 | Copy infl.data file | $PreDataBase/data/${YEAR}/data/infl.data | ${LVG_DIR}/data/tables/infl.data
    2 | Copy acronym.data file | $PreDataBase/data/${YEAR}/data/acronym.data | ${LVG_DIR}/data/tables/acronym.data
    3 | Copy properNoun.data file | $PreDataBase/data/${YEAR}/data/properNoun.data | ${LVG_DIR}/data/tables/properNoun.data
    4 | Copy nominalization.data file | $PreDataBase/data/${YEAR}/data/nominalization.data | ${LVG_DIR}/data/tables/nominalization.data
    5 | Copy synonyms.data file | $PreDataBase/data/${YEAR}/data/synonyms.data | ${LVG_DIR}/data/tables/synonyms.data
    6 | Copy derivation.data files | $PreDataBase/data/${YEAR}/data/derivation.data | ${LVG_DIR}/data/tables/derivation.data
    7 | Copy antonyms.data files | $PreDataBase/data/${YEAR}/data/antonyms.data | ${LVG_DIR}/data/tables/antonyms.data
    8 | Run above 7 steps | see above | see above
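
    The seven copy steps above can be sketched as a single loop. This is only a minimal sketch, not the actual 3.MoveLexiconFiles script: the function name copy_lvg_tables and its two directory arguments are hypothetical, and only the seven file names are taken from the table.

```shell
#!/bin/sh
# Hypothetical helper (not part of the lvg distribution): copy the seven
# lexicon tables from a source directory to a target directory and stop
# at the first file that is missing.
copy_lvg_tables() {
    src_dir=$1
    tgt_dir=$2
    for name in infl acronym properNoun nominalization synonyms derivation antonyms; do
        src_file="$src_dir/$name.data"
        if [ ! -f "$src_file" ]; then
            echo "missing: $src_file" >&2
            return 1
        fi
        cp "$src_file" "$tgt_dir/$name.data" || return 1
    done
}

# Possible call, using the directories named in the table:
#   copy_lvg_tables "$PreDataBase/data/${YEAR}/data" "${LVG_DIR}/data/tables"
```

    Failing on the first missing file keeps a partially generated set of tables from being moved into ${LVG_DIR}/data/tables unnoticed.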

  • shell> 4.AnalyzeLvgFiles ${YEAR}

    Analyze the files to find the maximum length of each field, then check the results against the database design for each field of each table.

    Steps | Notes | Source | Table
    1 | AnalyzeInflection | ${LVG_DIR}/data/tables/infl.data | Inflection
    2 | AnalyzeAcronym | ${LVG_DIR}/data/tables/acronym.data | Acronym
    3 | AnalyzeProperNoun | ${LVG_DIR}/data/tables/properNoun.data | ProperNoun
    4 | AnalyzeNominalization | ${LVG_DIR}/data/tables/nominalization.data | Nominalization
    5 | AnalyzeSynonym | ${LVG_DIR}/data/tables/synonyms.data | LexSynonym
    6 | AnalyzeDerivation | ${LVG_DIR}/data/tables/derivation.data | Derivation
    7 | AnalyzeAntonym | ${LVG_DIR}/data/tables/antonyms.data | LexAntonym

    • Check the max. field lengths; if any exceeds the database design, change the source code in ${LVG_DIR}/loadDb/ to fit
    • Recompile after changing the source code
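
    The kind of check these Analyze programs perform can be approximated with awk. This is a minimal sketch, not the actual analysis code: the function name max_field_lengths is hypothetical, and it assumes the data files are pipe-delimited ("|"), which is not stated above. Lengths are counted as awk counts them (characters or bytes, depending on the awk implementation and locale).

```shell
#!/bin/sh
# Hypothetical helper: print "field_number max_length" for every field of a
# pipe-delimited file. The "|" separator is an assumption.
max_field_lengths() {
    awk -F'|' '
        {
            # Track the longest value seen so far in each field.
            for (i = 1; i <= NF; i++) {
                if (length($i) > max[i]) max[i] = length($i)
            }
            if (NF > nf) nf = NF
        }
        END {
            for (i = 1; i <= nf; i++) print i, max[i] + 0
        }
    ' "$1"
}

# Possible call:
#   max_field_lengths "${LVG_DIR}/data/tables/infl.data"
```

    Each printed maximum can then be compared by hand with the column size declared for the matching table.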

  • Load data from Lexicon files to Lvg database
    Load these data into HSqlDb database
    • shell> cd ${LVG_DIR}/loadDb/bin
    • shell> 2.LoadDb ${YEAR}
    • choose Db (HSqlDb)
      PS. make sure the property value "readonly=false" in ${LVG_DIR}/data/HSqlDb/lvg${YEAR}.properties
    • choose tables option 11) to load Lexicon tables (1 ~ 7)
    • Change the property value back to "readonly=true" in ${LVG_DIR}/data/HSqlDb/lvg${YEAR}.properties
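
    Switching the readonly property before and after the load can be scripted. The sketch below is an illustration only: the function name set_readonly is hypothetical, and it assumes the property sits on a line of its own in the form readonly=<value>.

```shell
#!/bin/sh
# Hypothetical helper: set the "readonly" property in a properties file.
# Usage: set_readonly <properties-file> <true|false>
# Assumes a line of the form "readonly=<value>" already exists in the file.
set_readonly() {
    props=$1
    value=$2
    tmp_file=$(mktemp) || return 1
    # Rewrite through a temporary file so that no sed -i extension is needed.
    sed "s/^readonly=.*/readonly=$value/" "$props" > "$tmp_file" && cat "$tmp_file" > "$props"
    rm -f "$tmp_file"
    # Succeed only if the property now has the requested value.
    grep -q "^readonly=$value\$" "$props"
}

# Possible calls around the load:
#   set_readonly "${LVG_DIR}/data/HSqlDb/lvg${YEAR}.properties" false
#   ... run the load ...
#   set_readonly "${LVG_DIR}/data/HSqlDb/lvg${YEAR}.properties" true
```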

  • Generate canonical data
    Generate canonical data for luiNorm
    • Make sure the above files have been reloaded into the Db on ${LVG_DEV}
    • Make sure lvg has been recompiled (ant dist) on ${LVG_DEV}
      => So that the following data are generated by the latest lvg

    • Generate atoms.data (get it from OCCS before 2013-)
      shell> cd ${META_DIR}/bin
      shell> 2.GetAtoms
      ${PREV_YEAR}AA
      1
      2
      3

      => Total difference no (must be 0): 0

    • shell> cd $LVG_Components/CanonGenerator/bin
    • shell> 0.ModifyAtoms ${YEAR}

      Steps | Notes | Source | Target
      1 | Prepare directories and files | $META/data/${PRE_YEAR}AA/outputs/atoms.data | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org
          shell> cd ${CANON_GEN}/data/
          shell> mkdir ${YEAR}
          shell> cd ${YEAR}
          shell> mkdir dataOrg
          shell> mkdir data
          shell> mkdir output
      2 | Get ENG entry from atoms.org file | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org.ENG
      3 | Get SPA entry from atoms.org file | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org.SPA
      4 | Generate atoms.data file | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org.ENG | $CANON_GEN/data/${YEAR}/dataOrg/atoms.data
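
      Steps 2 and 3 above split atoms.org by language. A minimal sketch of such a filter follows. The layout of atoms.org is not described here, so everything about it is an assumption: the sketch treats the file as pipe-delimited and takes the position of the language field as an argument. The function name filter_by_language is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: keep only the lines whose language field equals the
# given code. The "|" separator and the field position are assumptions.
# Usage: filter_by_language <file> <field-number> <language-code>
filter_by_language() {
    awk -F'|' -v col="$2" -v lang="$3" '$col == lang' "$1"
}

# Possible calls, if the language code were in field 2:
#   filter_by_language atoms.org 2 ENG > atoms.org.ENG
#   filter_by_language atoms.org 2 SPA > atoms.org.SPA
```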

    • Update variable ${LVG_DIR} in ${LVG_DEV_DIR}/data/config/lvg.properties (can't be AUTO_MODE)
      => Must use development LVG because of the updated DB for infl.data...
    • shell> cd ${CANON_GEN}/data/
    • shell> rm -rf HSqlDb
      => It uses HsqlDb to save base|inflVars|canonical form
      => move HSqlDb to HsqlDb.${YEAR} after it is done

    • shell> 1.RunCanonAll ${YEAR}
      	
      --------------------------------------
      Which Program ?
      --------------------------------------
      1) Generate terms list
      2) Generate words list
      3) Generate unique words list
      4) Generate base forms list
      5) Generate unique base forms list
      6) Generate canoncal forms
      7) Check non-ASCII canon
      8) All (default)
      9) Generate canoncal forms from test
      ----------
      8
      	
      	

      Steps | Notes | Source | Target
      0 | Prepare directories and files | |
      1 | Get terms list | ${LVG_DIR}/data/tables/infl.data; $CANON_GEN/data/${YEAR}/dataOrg/atoms.data | $CANON_GEN/data/${YEAR}/data/termList.data
      2 | Get words list | $CANON_GEN/data/${YEAR}/data/termList.data | $CANON_GEN/data/${YEAR}/data/wordList.data
      3 | Sort and unify words list | $CANON_GEN/data/${YEAR}/data/wordList.data | $CANON_GEN/data/${YEAR}/data/uniqueWordList.data
      4 | Get base forms of unique words list | $CANON_GEN/data/${YEAR}/data/uniqueWordList.data | $CANON_GEN/data/${YEAR}/data/baseList.data
      5 | Combine bases (spelling variants) from infl.vars with baseList.data; normalize non-ASCII characters; sort and unify bases list | $CANON_GEN/data/${YEAR}/data/baseList.data | $CANON_GEN/data/${YEAR}/data/uniqueBaseList.data
      6 | Generate canonical forms | $CANON_GEN/data/${YEAR}/data/uniqueBaseList.data | $CANON_GEN/data/${YEAR}/data/canonical.data
          => Use ${LVG_DIR}/lib/jdbcDrivers/HSqlDb/hsqldb.jar
          => Make sure the size of "varchar(110)" is big enough in:
             • "base varchar(110)" in CanonDbBaseForms.CreateBaseTable( );
             • "base varchar(110)" in CanonDbCanon.CreateCanonTable( );
             • "inflection varchar(110)" in CanonDbInflection.CreateInflectionTable( );
          => If not, it will show as SQLException: data exception ...; then modify the source code to fit.
      7 | Check/modify non-ASCII in canonical forms | $CANON_GEN/data/${YEAR}/data/canonical.data | $CANON_GEN/data/${YEAR}/data/notKnownUnicode.data; $CANON_GEN/data/${YEAR}/data/nonAscii.data
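
      Steps 2 and 3 above (words list, then sorted unique words list) follow a common shell pattern. The sketch below only illustrates that pattern and is not the CanonGenerator code: the function name unique_words is hypothetical, and it assumes one term per line with words separated by whitespace, whereas the real tokenizer may also split on punctuation.

```shell
#!/bin/sh
# Hypothetical sketch of steps 2 and 3: put one word on each line, drop
# empty lines, then sort and remove duplicates. Splitting on whitespace
# only is an assumption.
unique_words() {
    tr -s '[:space:]' '\n' < "$1" | sed '/^$/d' | LC_ALL=C sort -u
}

# Possible call:
#   unique_words termList.data > uniqueWordList.data
```

      LC_ALL=C makes the sort order byte-based, so the result does not depend on the locale of the machine running it.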

    This must be run on lexdev (which has a large amount of memory). Other machines (lexdev01) take more than 1 day (too slow).

    Lvg Release | Processes | Computer | HSqlDb version | Run-time | Canonical size
    2010 | Step-6 | lexdev | HSqlDb.2.0.0.0 | ~60 min. | 1,173,712
    2012 | Step-6 | lexdev | HSqlDb.2.2.5 | ~140 min. | 1,395,720
    2015 | Step-6 | lexdev | HSqlDb.2.3.2 | ~40 min. | 1,744,398
    2017 | Step-6 | lexdev1 | HSqlDb.2.3.4 | ~30 min. | 1,921,878
    2018 | Step-6 | lexdev1 | HSqlDb.2.3.4 | ~38 min. | 2,044,325
    2019 | Step-6 | lexdev | HSqlDb.2.4.1 | ~45 min. | 2,163,739
    2020 | Step-6 | lexdev | HSqlDb.2.5.0 | ~50 min. | 2,241,434
    2021 | Step-6 | lexdev | HSqlDb.2.5.1 | ~45 min. | 2,332,096
    2022 | Step-6 | lexdev | HSqlDb.2.5.1 | ~45 min. | 2,410,325
    2023 | Step-6 | lexdev | HSqlDb.2.7.0 | ~50 min. | 2,476,202
    2024 | Step-6 | lexdev | HSqlDb.2.7.2 | ~40 min. | 1,897,973

  • shell> 5.Generate2Files ${YEAR}

    Generate lvg files by running lvg.
    The lvg used is the one in ${DEV_DIR}.
    Make sure the variable ${LVG_DIR} in the lvg config file (lvg.properties) is set to the full path of lvg (not AUTO_MODE).
    shell> cd ${RPRE_DATABASE}/bin
    shell> 5.Generate2Files <year>

    Steps | Notes | Source | Target | Run Time
    1 | Generate fruitful variants | ${LVG_DIR}/data/tables/infl.data | $PreDataBase/data/${YEAR}/data/fruitful.data | 2 hr.
    2 | Generate AntiNorm | ${LVG_DIR}/data/tables/infl.data | $PreDataBase/data/${YEAR}/data/antiNorm.data | 1 hr.
    3 | Copy canonical data | $CanonGenerator/data/${YEAR}/data/canonical.data | $PreDataBase/data/${YEAR}/data/canonical.data | 2 hr.

    PS. GenerateAntiNorm requires recompiling with the new lvg${YEAR}dist.jar

  • shell> 6.Move2Files ${YEAR}

    This script copies/moves the final lvg-generated files from the data directory to the ${LVG_DIR}/data/tables directory.

    Steps | Notes | Source | Target
    1 | Copy fruitful.data file | $PreDataBase/data/${YEAR}/data/fruitful.data | ${LVG_DIR}/data/tables/fruitful.data
    2 | Copy antiNorm.data file | $PreDataBase/data/${YEAR}/data/antiNorm.data | ${LVG_DIR}/data/tables/antiNorm.data
    3 | Copy canonical.data file | $PreDataBase/data/${YEAR}/data/canonical.data | ${LVG_DIR}/data/tables/canonical.data

  • shell> 7.Analyze2Files ${YEAR}

    Analyze the files to find the maximum length of each field, then check the results against the database design for each field of each table.

    Steps | Notes | Source | Table
    1 | AnalyzeFruitful | ${LVG_DIR}/data/tables/fruitful.data | Fruitful
    2 | AnalyzeAntiNorm | ${LVG_DIR}/data/tables/antiNorm.data | AntiNorm
    3 | AnalyzeCanon | ${LVG_DIR}/data/tables/canonical.data | Canonical

  • Load data from 2 files to Lvg database
    Load these data into HSqlDb database
    • shell> cd ${LVG_DIR}/loadDb/bin
    • shell> LoadDb ${YEAR}
    • choose Db (HSqlDb & MySql)
      PS. make sure the property value "readonly=false" in ${LVG_DIR}/data/HSqlDb/lvg${YEAR}.properties
    • choose tables option 11) to load 2 tables

    • After it is done, change the property value back to "readonly=true" in ${LVG_DIR}/data/HSqlDb/lvg${YEAR}.properties