The SPECIALIST Lexicon

LMW Candidates from Expansions of Abb/Acr in Lexicon

I. Introduction

The SPECIALIST Lexicon includes expansions of abbreviations and acronyms. These expansions are good LMW candidates. They are cross-referenced (with an EUI) if they exist in the Lexicon. Expansions without cross-ref EUIs are either:

  • Invalid LMWs (base)
    • a noun with a postmodifying prepositional phrase rather than a single NP, such as "law(s) of articulation", which cannot be a LexBuild base. Example: condition on discharge|COD|E0453760
    • chemical names that are more like formulas than like words, such as 1-oleoyl-2-acetyl-sn-glycerol|OAG|E0698010
    • names of studies, which are considered too ephemeral as terms, such as acquired immunodeficiency syndrome test|AIDS test|E0776477
  • Valid LMWs that are to be added to the Lexicon: these are used as candidates.

Through 2019, the validation and candidate generation processes were done during the Lexicon release. The corrections (adding valid LMWs) were done in post-processing, and the records received updated cross-ref EUIs in the following release. Since 2020, this process has been migrated to pre-processing, before the Lexicon is frozen.

II. Models

Implemented in ${LMW_DIR}/LexCandidates/GetAbbAcrExpansions.java

  • From a LEXICON file, retrieve all abb/acr expansions
    • has cross-ref EUI:
      • [TAG_Y]: correct CR EUI vs. expansion (case sensitive)
      • [TAG_I]: Incorrect CR EUI vs. expansion
        • [|NO_EUI]: EUI is not in the Lexicon: deleted records
        • [baseSet]: Expansion is not base of CR EUI, could be modified records, requires manual fixes
    • no cross-ref EUI:
      • [TAG_P]: expansions are known exceptions (in the abb\acr_expansion_no_EUI_exception list). Some expansions have the same spelling as an LMW but are themselves invalid LMWs. For example, for E0688694|lin|noun, the expansion 'lines' is not the plural of the noun line, but a gene name.
      • [TAG_M]: expansions are LMWs (inflVars, lowercased) with multiple matched EUIs; add the matched EUIs and send the file to a linguist to tag:
        • file: abbAcrExpansions.data.hasEui.M
        • Format:
          EUI | POS | Citation | expansion | matched EUI | ACR or ABB | found matching EUI | Tag
      • [TAG_E]: expansions are LMWs (inflVars, lowercased) with exactly 1 matched EUI; add the matched EUI and send the file to a linguist to tag:
        • file: abbAcrExpansions.data.hasEui.E
        • Format:
          EUI | POS | Citation | expansion | matched EUI | ACR or ABB | found matching EUI | Tag

          Tags:
        • [C]: correct; the expansion is an invalid LMW and should not have a cross-ref EUI. No fix in LexBuild.
          => Add to ${LMW_DIR}/data/${YEAR}/inData/abbAcrExpansions.data.hasEui.Exception.${YEAR}
        • [Y]: if the matched EUI is correct, manually add EUI to the lexRecord in the LexBuild.
          => update in the LB and the Lexicon release
        • [- EUI: E0xxxxxxx]: the expansion is a valid LMW, but the matched EUI is not correct; add the correct EUI to the end of the line. Also, fix the record in LexBuild.
      • [TAG_N]: expansions are known invalid base forms (lowercased)
      • [TAG_C]: others, candidate list (sent to linguist)
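
The decision tree above can be summarized in a short Java sketch. This is not the code in GetAbbAcrExpansions.java: the class name, method signature, and the map/set lookups are hypothetical, the EUIs and terms in main are made up, and the order of the no-cross-ref checks (P, then M/E, then N, then C) is only assumed from the order of the list.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the tagging logic; names and data structures are
// assumptions, not taken from GetAbbAcrExpansions.java.
class ExpansionTagSketch {

    /**
     * Returns the tag for one abb/acr expansion.
     *
     * @param expansion    expansion text as found in the lexRecord
     * @param crEui        cross-ref EUI, or null when the record has none
     * @param euiToBases   EUI -> base forms of that record (assumed lookup)
     * @param inflVarToEui lowercased inflVar -> EUIs it belongs to (assumed lookup)
     * @param exceptions   known no-EUI exceptions
     * @param notBases     known invalid base forms, lowercased
     */
    static String tag(String expansion, String crEui,
                      Map<String, Set<String>> euiToBases,
                      Map<String, Set<String>> inflVarToEui,
                      Set<String> exceptions, Set<String> notBases) {
        if (crEui != null) {
            Set<String> bases = euiToBases.get(crEui);
            if (bases == null) {
                return "TAG_I|NO_EUI";      // EUI is not in the Lexicon
            }
            // Case-sensitive comparison against the bases of the CR EUI.
            return bases.contains(expansion) ? "TAG_Y" : "TAG_I|baseSet";
        }
        if (exceptions.contains(expansion)) {
            return "TAG_P";                 // known exception
        }
        String lower = expansion.toLowerCase(Locale.ROOT);
        int matches = inflVarToEui.getOrDefault(lower, Set.of()).size();
        if (matches > 1) return "TAG_M";    // multiple matched EUIs
        if (matches == 1) return "TAG_E";   // exactly one matched EUI
        if (notBases.contains(lower)) return "TAG_N";
        return "TAG_C";                     // candidate for the linguist
    }

    public static void main(String[] args) {
        // Made-up EUIs and terms, for illustration only.
        Map<String, Set<String>> euiToBases = Map.of("E0000001", Set.of("heart rate"));
        Map<String, Set<String>> inflVarToEui = Map.of("heart rate", Set.of("E0000001"));
        Set<String> none = Set.of();
        System.out.println(tag("heart rate", "E0000001", euiToBases, inflVarToEui, none, none));
        System.out.println(tag("Heart Rate", null, euiToBases, inflVarToEui, none, none));
        System.out.println(tag("some new term", null, euiToBases, inflVarToEui, none, none));
    }
}
```

Returning the tag as a string keeps the sketch close to the bracketed labels used in the list; the real program may represent tags differently.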

III. Processes

  • Source directory: ${LMW_DIR}/sources/LexCandidates
  • Input Data directory (${IN_DIR}): ${LMW_DIR}/data/${YEAR}/inData/
  • Current Data directory (${CUR_DIR}): ${LMW_DIR}/data/current/
  • Out Data directory (${OUT_DIR}): ${LMW_DIR}/data/${YEAR}/outData/12.LexCandidates
  • Program: ${LMW_DIR}/bin/12.LexAbbAcrCand <YEAR>
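
As a rough illustration of the directory layout listed above, the following Java sketch derives the three data directories from an LMW root and a year. Only the relative layout comes from the list; the class name, field names, and the example root "/lmw" are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: only the relative layout is taken from the list above.
class LmwDirsSketch {
    final Path inDir;   // ${IN_DIR}
    final Path curDir;  // ${CUR_DIR}
    final Path outDir;  // ${OUT_DIR}

    LmwDirsSketch(String lmwDir, String year) {
        Path data = Paths.get(lmwDir, "data");
        inDir = data.resolve(year).resolve("inData");
        curDir = data.resolve("current");
        outDir = data.resolve(year).resolve("outData").resolve("12.LexCandidates");
    }

    public static void main(String[] args) {
        // "/lmw" and "2020" are placeholder values.
        LmwDirsSketch dirs = new LmwDirsSketch("/lmw", "2020");
        System.out.println(dirs.inDir);
        System.out.println(dirs.curDir);
        System.out.println(dirs.outDir);
    }
}
```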

    Step | Description | Inputs | Outputs | Notes
    Pre-Process:
    0
    • Update the latest valid and invalid LMW list
    • Update candidates
    • ${LMW_DIR}/bin/00.CandidateList, steps 1-4
      => Linked to the latest Lexicon and inflVars from LexBuild daily backup
    Process:
    1
    • Generate candidate list from Abb/Acr expansions
    • GetAbbAcrExpansions.java
    • ${IN_DIR}/LEXICON (input)
    • ${IN_DIR}/inflVars.data (valid LMWs)
    • ${CUR_DIR}/notBase.data.current (needs to be updated at step 10)
    • ${OUT_DIR}/abbAcrExpansions.data.hasEui.Exception.${YEAR} (modified from the previous year)
    • abbAcrExpansions.tag (all tags)
    • abbAcrExpansions.invEui (the cross-ref EUI is invalid)
    • abbAcrExpansions.hasEui (no cross-ref EUI, but, expansion matches EUIs)
    • abbAcrExpansions.rpt (summary report)
    • abbAcrExpansions.data.cand (candidate list)
      => manual copy to ./Cand/abbAcrExpansions.data.cand.${YEAR}
      => Link to ./Stats/abbAcrExpansions.data.cand.${YEAR}
      => first, go to step 10 to gen candidate list
      => then, repeat steps 0-2 until abbAcrExpansions.data.cand is empty (0)
    2
    • Split the invalid cross-ref EUI file (invEui) and the file of expansions with no cross-ref EUI that match EUIs (hasEui)
    • abbAcrExpansions.data.invEui
    • abbAcrExpansions.data.hasEui
    • abbAcrExpansions.data.invEui.NO_EUI
      => Send to linguist to tag [D]
      • [D]: if the CR of the expansion is a deleted record (invalid LMW), the cross-ref EUI should be manually removed.
      • Others: the expansion is a valid LMW. This case might require changing the expansion to its citation form, restoring the deleted record, or creating a new lexRecord and modifying the CR EUI, etc.

      => update ${LEX_CHECK}/data/File/notBaseForm.data.${YEAR}
      • this file should be empty after the update (notBaseForm.data)
    • abbAcrExpansions.data.invEui.WRONG_CIT
      => wrong citation, after fixed, it should be empty
    • abbAcrExpansions.data.hasEui.E
      => Exceptions, expansion has 1 matched EUI
      => Send to linguist to tag:
      • [C]: correct; the expansion is an invalid LMW and should not have a cross-ref EUI. No fix in LB.
      • [Y]: if the suggested matched EUI is correct, manually add the EUI to the lexRecord in LB.
      • [- EUI: E0xxxxxxx]: the expansion is a valid LMW, but the suggested matched EUI is not correct; add the correct EUI to the end of the line. Also, fix the record in LB.
    • abbAcrExpansions.data.hasEui.M
      => Exceptions, expansion has multiple matched EUIs
      => Send to linguist to tag:
      • [C]: correct; the expansion should not have a cross-ref EUI (even though the spelling is a valid base)
        => add to abbAcrExpansions.data.hasEui.Exception.${YEAR}
      • [Y]: if one of the matched EUIs is correct (need to update the Lexicon in LexBuild)
      • EUI: add the correct EUI; this might require updating the cross-ref EUI, modifying the expansion, or adding a new record to the Lexicon (if the expansion is an LMW)
    Post-Process:
    10
    • Auto-tag candidate list
    • CandidateUtil.FilterTagCandFile
    • ${STATS_DIR}/abbAcrExpansions.data.cand.${YEAR}
    • ${CAND_DIR}/inflVars.data.current (valid LMWs)
    • ${CAND_DIR}/totalTerms.all.base.no (invalid LMWs)
    • abbAcrExpansions.data.cand.${YEAR}.autoTag (all tags)
    • abbAcrExpansions.data.cand.${YEAR}.rmYesNo
      After updates completed, this file must be empty (wc=0)
    • abbAcrExpansions.data.cand.${YEAR}.rmYesTagNo
      => Before the update, this file is used as the candidate list sent to the linguist
      • No tag
      • if the expansion is a valid LMW, add to Lexicon, add CR-EUI to the expansion
    • notBaseFormUpdate.data.${YEAR}
      • cd ./Stats
      • flds 4,2 abbAcrExpansions.data.cand.${YEAR}.rmYesTagNo.${YEAR} > notBaseFormUpdate.data.${YEAR}
      • Append notBaseFormUpdate.data.${YEAR} to ${LexCheck}/data/Files/notBaseForm.data.${YEAR}
    After the candidate list is completed:
    • Add/Link candidates to ${Candidates}/1.LexiconAbbAcrExpansion/abbAcrExpansions.data.cand.${YEAR}
    • Run 00.CandidateList, steps 1-4
      This step updates the valid and invalid LMW lists, and thus updates the candidates.
    • Rerun steps 1-2 until *.cand is empty (0). Because candidates that are LMWs are now in the Lexicon, and invalid LMWs are tagged as invalid automatically (by the updated totalTerms.all.base.no from 00.CandidateList), no new candidates should be found.
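
The stopping condition for the rerun loop (the candidate file is empty) can be checked with a few lines of Java. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, the file path is passed on the command line rather than taken from the actual scripts, and whitespace-only lines are ignored, which is an assumption beyond the wc=0 check described above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical check for "*.cand = 0"; not part of the LMW scripts.
class CandidateFileCheckSketch {

    // Counts candidate lines, skipping empty or whitespace-only lines (assumption).
    static long countCandidates(List<String> lines) {
        return lines.stream().filter(l -> !l.trim().isEmpty()).count();
    }

    // True when the candidate file has no remaining candidates.
    static boolean isDone(Path candFile) {
        try {
            return countCandidates(Files.readAllLines(candFile)) == 0;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Usage: java CandidateFileCheckSketch.java <path to abbAcrExpansions.data.cand>
        if (args.length != 1) {
            System.err.println("usage: CandidateFileCheckSketch <candFile>");
            return;
        }
        Path candFile = Paths.get(args[0]);
        System.out.println(isDone(candFile)
            ? "No candidates left; steps 1-2 do not need to be rerun."
            : "Candidates remain; tag them and rerun steps 1-2.");
    }
}
```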