LexBuild

Crossing References Analysis

After cross references are checked (see crossing references check), the following files are generated:

File          Description
newTerm.data  list of possible new terms to be added to LEXICON
dup.data      list of possible duplicated lexical records
acr.data      list of problems in acronym crossing references
abb.data      list of problems in abbreviation crossing references
nom.data      list of problems in nominalization crossing references

Each of the above files is further categorized, analyzed, and filtered (against the reported exceptions) by type of issue, and the results are stored in separate files under one directory for each of the 5 categories above. The details are as follows:

  • newTerm.data
    This file is further split into three files under the directory ${CROSS_REF_OUT_DIR}/newTerm/:

    File             Exception Filter  Descriptions
    abbNewTerm.data  None              Abbreviation expansions:
    • If new terms: add to LEXICON
    • If typo: modify in LEXICON
    acrNewTerm.data  None              Acronym expansions:
    • If new terms: add to LEXICON
    • If typo: modify in LEXICON
    nomNewTerm.data  None              Nominalizations:
    • If new terms: add to LEXICON, should be symmetric
    • If typo: modify in LEXICON

  • dup.data
    This file is further split into three files under the directory ${CROSS_REF_OUT_DIR}/dup/:

    File             Exception Filter  Descriptions
    abbDup.data      DupExceptions     Abbreviation expansions:
    • If duplicated: delete/combine records, keep the EUI that is referenced
    • If not: report and add to exception list
    acrDup.data      DupExceptions     Acronym expansions:
    • If duplicated: delete/combine records, keep the EUI that is referenced
    • If not: report and add to exception list
    nomDup.data      DupExceptions     Nominalizations:
    • If duplicated: delete/combine records, keep the EUI that is referenced
    • If not: report and add to exception list

    DupExceptions:

    • Input file: ${CROSS_REF_OUT_DIR}/exceptions/dupException.data
    • Information is stored in a Hashtable:
      key (String)  values (HashSet<String>)
      EUI           List of group IDs
    • If two EUIs are exceptions (not duplicated), they are listed in dupException.data and should share the same group ID in the Hashtable.
    • Print out entries with more than 2 EUIs (3 or more) in dupException.data. Pay special attention to them.
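
    The group-ID lookup described above can be sketched as follows. This is a minimal illustration, not the LexBuild code: the class name, method names, the "G1, G2, ..." group-ID scheme, and the sample EUIs are all assumptions, and the layout of dupException.data is not specified here, so the sketch takes each exception group as a ready-made list of EUIs.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Hashtable;
import java.util.List;

// Sketch only: names, group-ID scheme, and sample EUIs are illustrative
// assumptions, not the LexBuild implementation.
class DupExceptionsSketch {
    // key: EUI, value: IDs of the exception groups that EUI belongs to
    private final Hashtable<String, HashSet<String>> groupIdsByEui = new Hashtable<>();
    private int nextGroupId = 1;

    // Registers one group of EUIs that are known NOT to be duplicates.
    // Returns true when the group holds 3 or more distinct EUIs, so the
    // caller can print it for special attention.
    boolean addGroup(List<String> euis) {
        String groupId = "G" + nextGroupId++;
        for (String eui : euis) {
            groupIdsByEui.computeIfAbsent(eui, k -> new HashSet<>()).add(groupId);
        }
        return new HashSet<>(euis).size() > 2;
    }

    // Two EUIs form a known exception when they share at least one group ID.
    boolean isException(String eui1, String eui2) {
        HashSet<String> ids1 = groupIdsByEui.get(eui1);
        HashSet<String> ids2 = groupIdsByEui.get(eui2);
        if (ids1 == null || ids2 == null) {
            return false;
        }
        return !Collections.disjoint(ids1, ids2);
    }

    public static void main(String[] args) {
        DupExceptionsSketch dup = new DupExceptionsSketch();
        dup.addGroup(Arrays.asList("E0000001", "E0000002"));
        if (dup.addGroup(Arrays.asList("E0000003", "E0000004", "E0000005"))) {
            System.out.println("Group with 3 or more EUIs: check carefully");
        }
        System.out.println(dup.isException("E0000001", "E0000002"));
    }
}
```

    Keeping a set of group IDs per EUI, rather than a single ID, lets one EUI take part in several exception groups.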

  • acr.data & abb.data
    These two files are in the same format and are further split into nine files each, under the directories ${CROSS_REF_OUT_DIR}/abb/ and ${CROSS_REF_OUT_DIR}/acr/, respectively. The following table uses abb to illustrate.

    File                 Exception Filter  Auto Fix  Descriptions
    noBaseFieldAbb.data  None              No        Check abbreviation expansion:
    • If it exists: add expansion|EUI to source record
    • If not: remove abbreviation
    noEuiFieldAbb.data   None              Yes       Check suggested EUI:
    • If it is an expansion: add suggested EUI to source record (auto fix)
    • If not:
      • add new record for the abbreviation expansion
      • do nothing*
    noEuisFieldAbb.data  None              No        Check suggested EUIs:
    • If it is an expansion: add suggested EUI(s) to source record
    • If not:
      • add new record for the abbreviation expansion
      • do nothing*
    noRecFoundAbb.data   None              No        Check abbreviation expansions:
    • If typo: fix it
    • If wrong case: fix it
    • If not base form: change to base form
    • Else:
      • add new record for the expansion
      • delete expansion
      • do nothing*
    noEuiFoundAbb.data   None              No        Check abbreviation expansions:
    • If typo: fix it
    • If wrong case: fix it
    • If not base form: change to base form
    • Else:
      • add new record for the expansion
      • delete expansion
      • do nothing*
    wrongEuiAbb.data     None              Yes       Check abbreviation expansions:
    • If typo, wrong case, or not base form: fix abbreviation expansion
    • Else: replace abb EUI with the suggested EUI in source record (auto fix)
    wrongEuisAbb.data    None              No        Check abbreviation expansions:
    • If typo, wrong case, or not base form: fix abbreviation expansion
    • Else: replace abb EUI with one of the suggested EUIs in source record
    checkEuiAbb.data     dupExceptions,    No        Check suggested EUIs:
                         abbExceptions,
                         acrExceptions
    • If duplicated, delete/combine duplicated records
    • If not,
      • report suggested EUIs to dupExceptions
      • report and add abb|expansion to abbException

    Check abbreviation expansion|EUI:
    • If OK, do nothing*
    • If wrong, change to one of suggested EUIs
    euiNullAbb.data      None              No        Check abbreviation expansion|EUI:
    • Change EUI to one of suggested EUIs
    • remove abbreviation

    * Please note that if the message type is WARNING, "do nothing" must be one of the optional actions.

    AbbExceptions (AcrExceptions):

    • Input file: ${CROSS_REF_OUT_DIR}/exceptions/abbException.data
    • Information is stored in a Hashtable:
      key (String)      values (HashSet<String>)
      abbreviation EUI  List of expansion EUIs
    • The goal of abbExceptions is to filter out (not print) records with a legitimate abbreviation|abbreviation expansion pair when there are multiple suggested EUIs. The algorithm is:
      • If suggested EUIs are in dupExceptions (not duplicated records)
        and
      • If abb EUI|expansion EUI is in abbExceptions (legitimate abb|abb expansion)
      Then filter it out and do not include this case in the printed report (checkEuiAbb.data).
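
    The two-condition filter above can be sketched as follows. This is a hedged illustration under stated assumptions: the class and method names are hypothetical, the sample EUIs are made up, and "suggested EUIs are in dupExceptions" is read here as "all suggested EUIs share at least one dupExceptions group ID", which is one plausible interpretation rather than a confirmed rule.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Hashtable;
import java.util.List;

// Sketch only: hypothetical names, not the LexBuild implementation.
class AbbExceptionFilterSketch {
    // Assumption: the suggested EUIs count as "in dupExceptions" when every
    // one of them is present and they all share at least one group ID.
    static boolean allInOneDupGroup(List<String> suggestedEuis,
                                    Hashtable<String, HashSet<String>> dupExceptions) {
        HashSet<String> common = null;
        for (String eui : suggestedEuis) {
            HashSet<String> ids = dupExceptions.get(eui);
            if (ids == null) {
                return false;
            }
            if (common == null) {
                common = new HashSet<>(ids);
            } else {
                common.retainAll(ids);
            }
            if (common.isEmpty()) {
                return false;
            }
        }
        return common != null; // false for an empty list of suggestions
    }

    // True when the record should be left out of checkEuiAbb.data.
    static boolean isFilteredOut(String abbEui, String expansionEui,
                                 List<String> suggestedEuis,
                                 Hashtable<String, HashSet<String>> dupExceptions,
                                 Hashtable<String, HashSet<String>> abbExceptions) {
        HashSet<String> legitExpansions = abbExceptions.get(abbEui);
        boolean legitPair = legitExpansions != null && legitExpansions.contains(expansionEui);
        return legitPair && allInOneDupGroup(suggestedEuis, dupExceptions);
    }

    public static void main(String[] args) {
        Hashtable<String, HashSet<String>> dup = new Hashtable<>();
        dup.put("E0000002", new HashSet<>(Arrays.asList("G1")));
        dup.put("E0000003", new HashSet<>(Arrays.asList("G1")));
        Hashtable<String, HashSet<String>> abb = new Hashtable<>();
        abb.put("E0000001", new HashSet<>(Arrays.asList("E0000002")));
        System.out.println(isFilteredOut("E0000001", "E0000002",
                Arrays.asList("E0000002", "E0000003"), dup, abb));
    }
}
```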

  • nom.data
    This file is further split into fifteen files under the directory ${CROSS_REF_OUT_DIR}/nom/:

    File                 Exception Filter    Auto Fix  Descriptions
    noBaseFieldNom.data  None                No        Check nominalization:
    • If it exists: add nominalization|cat|EUI to source record
    • If not: remove nominalization
    noCatFieldNom.data   None                No        Check nominalization & suggested cats:
    • If it exists: add nominalization|cat|EUI to source record
    • If not: add new record for the nominalization
    noEuiFieldNom.data   None                Yes       Check suggested EUI:
    • If it is a nominalization: add suggested EUI to source record (auto fix)
    • If not:
      • add new record for the nominalization
      • do nothing*
    noEuisFieldNom.data  None                No        Check suggested EUIs:
    • If it is a nominalization: add suggested EUI(s) to source record
    • If not:
      • add new record for the nominalization
      • do nothing*
    noRecFoundNom.data   None                No        Check nominalizations:
    • If typo: fix it
    • If wrong case: fix it
    • If not base form: change to base form
    • Else:
      • add new record for the nominalization
      • delete nominalization
      • do nothing*
    noEuiFoundNom.data   None                No        Check nominalizations:
    • If typo: fix it
    • If wrong case: fix it
    • If not base form: change to base form
    • Else:
      • add new record for the nominalization
      • delete nominalization
      • do nothing*
    wrongCatNom.data     None                No        Check nominalizations:
    • If typo, wrong case, or not base form: fix nominalization
    • Else, check suggested cat|EUI:
      • If OK: replace nom cat|EUI with the suggested cat|EUI
      • Else:
        • add new record for the nominalization
        • delete nominalization
        • do nothing
    wrongCatsNom.data    None                No        Check nominalization:
    • If typo, wrong case, or not base form: fix nominalization
    • Else, check suggested cats|EUIs:
      • If OK: replace nom cat|EUI with one of the suggested cat|EUI pairs
      • Else:
        • add new record for the nominalization
        • delete nominalization
        • do nothing
    checkCatNom.data     multiNomExceptions  No        Check suggested EUIs:
    • If duplicated, delete/combine duplicated records
    • If not, check if nominalization
      • If yes,
        • add to source record
        • report suggested EUIs to multiNomExceptions
      • If not,
        • do nothing*
        • report suggested EUIs to notNomExceptions
    catNullNom.data      None                No        Check nominalization:
    • Change cat to one of suggested cats
    • remove nominalization
    wrongEuiNom.data     None                Yes       Check nominalization:
    • If typo, wrong case, or not base form: fix nominalization
    • Else: replace nom cat|EUI with the suggested cat|EUI in source record (auto fix)
    wrongEuisNom.data    None                No        Check nominalizations:
    • If typo, wrong case, or not base form: fix nominalization
    • Else: replace nom cat|EUI with one of the suggested cat|EUI pairs in source record
    checkEuiNom.data     multiNomExceptions  No        Check suggested EUIs:
    • If duplicated, delete/combine duplicated records
    • If not,
      • report and add base|nominalization to multiNomException
      • report suggested EUIs to dupExceptions (if the dup check has not already taken care of them)

    Check nominalization|EUI:
    • If OK, do nothing*
    • If wrong, change to one of suggested EUIs
    euiNullNom.data      None                No        Check nominalization|EUI:
    • Change EUI to one of suggested EUIs
    • remove nominalization
    notSymNom.data       None                No        Check target record:
    • If it exists, add nominalization to target record
    • If not,
      • add a new target record of nominalization
      • do nothing*

    * Please note that if the message type is WARNING, "do nothing" must be one of the optional actions.

    MultiNomExceptions:

    • Input file: ${CROSS_REF_OUT_DIR}/exceptions/multiNomException.data
    • Information on multiple nominalizations (of a base) is stored in a Hashtable:
      key (String)  values (HashSet<String>)
      base EUI      List of nominalization EUIs

    NotNomExceptions:

    • Input file: ${CROSS_REF_OUT_DIR}/exceptions/notNomException.data
    • Information on a base and the EUIs that are not its nominalizations is stored in a Hashtable:
      key (String)  values (HashSet<String>)
      base EUI      List of EUIs that are not nominalizations of the base
    • These exceptions may be reduced if duplicated records are eliminated.
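
    Both exception tables above map a base EUI to a set of EUIs, so they could be built by the same loader. The sketch below is illustrative only: the layout of the exception files is not described here, so it assumes one pipe-delimited "base EUI|other EUI" pair per line. That format, the class name, and the sample EUIs are assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Hashtable;
import java.util.List;

// Sketch only: the "key|value" line format is an assumption about the
// exception files, not a documented layout.
class ExceptionTableLoaderSketch {
    // Builds key -> set of values from lines such as "E0000010|E0000011".
    static Hashtable<String, HashSet<String>> load(List<String> lines) {
        Hashtable<String, HashSet<String>> table = new Hashtable<>();
        for (String rawLine : lines) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.split("\\|");
            if (fields.length < 2) {
                continue; // skip lines without a key|value pair
            }
            table.computeIfAbsent(fields[0], k -> new HashSet<>()).add(fields[1]);
        }
        return table;
    }

    public static void main(String[] args) {
        Hashtable<String, HashSet<String>> multiNom = load(Arrays.asList(
                "E0000010|E0000011",
                "E0000010|E0000012"));
        System.out.println(multiNom.get("E0000010").size());
    }
}
```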

    The goal of multiNomExceptions and notNomExceptions is to filter out (not print) records with a legitimate base|nominalization pair when there are multiple suggested EUIs. The algorithms are:

    • checkCatNom.data:
      • If base EUI|nominalization EUI is in multiNomExceptions (base to multiple noms)
        or
      • If base EUI|suggested EUI is in notNomExceptions (EUIs are not noms)
      Then filter it out and do not include this case in the printed report (checkCatNom.data).

    • checkEuiNom.data:
      • If base EUI|nominalization EUIs is in multiNomExceptions (base to multiple noms)
      Then filter it out and do not include this case in the printed report (checkEuiNom.data).
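
    The two filters above can be sketched as follows. This is a minimal illustration with hypothetical class and method names and made-up EUIs. Two points are assumptions rather than documented behavior: for checkCatNom.data, a single suggested EUI found in notNomExceptions is treated as enough to filter the record, and for checkEuiNom.data, every nominalization EUI must be listed under the base in multiNomExceptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Hashtable;
import java.util.List;

// Sketch only: hypothetical names, not the LexBuild implementation.
class NomExceptionFilterSketch {
    // True when table maps key to a set that contains value.
    static boolean hasPair(Hashtable<String, HashSet<String>> table, String key, String value) {
        HashSet<String> values = table.get(key);
        return values != null && values.contains(value);
    }

    // checkCatNom.data: filter when base|nom is a known multi-nom pair, OR
    // (assumption) when any suggested EUI is a known non-nominalization of the base.
    static boolean filterCheckCatNom(String baseEui, String nomEui, List<String> suggestedEuis,
                                     Hashtable<String, HashSet<String>> multiNomExceptions,
                                     Hashtable<String, HashSet<String>> notNomExceptions) {
        if (hasPair(multiNomExceptions, baseEui, nomEui)) {
            return true;
        }
        for (String suggested : suggestedEuis) {
            if (hasPair(notNomExceptions, baseEui, suggested)) {
                return true;
            }
        }
        return false;
    }

    // checkEuiNom.data: filter when (assumption) all nominalization EUIs are
    // listed under the base in multiNomExceptions.
    static boolean filterCheckEuiNom(String baseEui, List<String> nomEuis,
                                     Hashtable<String, HashSet<String>> multiNomExceptions) {
        if (nomEuis.isEmpty()) {
            return false;
        }
        for (String nomEui : nomEuis) {
            if (!hasPair(multiNomExceptions, baseEui, nomEui)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Hashtable<String, HashSet<String>> multiNom = new Hashtable<>();
        multiNom.put("E0000010", new HashSet<>(Arrays.asList("E0000011", "E0000012")));
        Hashtable<String, HashSet<String>> notNom = new Hashtable<>();
        notNom.put("E0000010", new HashSet<>(Arrays.asList("E0000013")));
        System.out.println(filterCheckCatNom("E0000010", "E0000099",
                Arrays.asList("E0000013"), multiNom, notNom));
        System.out.println(filterCheckEuiNom("E0000010",
                Arrays.asList("E0000011", "E0000012"), multiNom));
    }
}
```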