The SPECIALIST Lexicon

Precision for New LMW Candidates from (ACR) Model

I. Introduction

The matchers produce a list of LMW (LexMultiWord) candidates. This list is sent to linguists, who add the valid candidates to the Lexicon. An algorithm has been developed to calculate the precision of the candidate list automatically during this process, where precision = no. of valid candidates / no. of candidates = retrieved relevant / retrieved.

II. Format and Tag

Ideally, no tag is needed because the tag can be retrieved from the Lexicon. However, we would like linguists to tag the LMW candidates manually, as follows, to ensure the precision is accurate. For example, without manual tagging, a candidate that linguists overlook would be treated as an invalid MW; with manual tagging, any missed tag can be identified.

  • LMW candidates format:
    LMW candidate*Tag

    * LMW candidates are lowercase core-terms

  • Tag:

    Candidates are automatically assigned one of three tags, as follows. The algorithm checks for valid first, then invalid; TBD covers any candidate that meets neither condition. Filter.Lexicon is used for the checks, so matching includes exact match, exact match after lowercasing, match after removing leading and trailing punctuation, etc.

    Tag      Description                                            Notes
    -------  -----------------------------------------------------  --------------------
    yes, y   known valid LMW from Lexicon
               • inflvars.data
    no, n    known invalid LMW from previous tag
               • notBaseForm.data (from Lexicon - valid expansion)
               • notMwFromEndWord.data
               • notMwFromSpVar.data
               • notMwFromCuiTerm.data
    tbd      to be done (untagged candidates)
               • LMW candidates file
    o        not a valid acronym expansion (invalid MW)             legacy tag, not used
    e        a valid expansion that exists in Lexicon (valid MW)    legacy tag, not used
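
    The tagging order described above (valid first, then invalid, otherwise TBD) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the actual Filter.Lexicon code: the `normalize` and `auto_tag` names and the sample terms are hypothetical, and matching is reduced to lowercasing plus stripping leading and trailing punctuation.

```python
import string


def normalize(term):
    """Hypothetical stand-in for the Filter.Lexicon matching rules:
    lowercase the term and strip leading/trailing punctuation and spaces."""
    return term.lower().strip(string.punctuation + string.whitespace)


def auto_tag(candidate, valid, invalid):
    """Return 'yes', 'no', or 'tbd' for one LMW candidate.

    The valid set is checked first, then the invalid set; a candidate
    that matches neither is left as 'tbd' for the linguists.
    """
    key = normalize(candidate)
    if key in valid:
        return "yes"
    if key in invalid:
        return "no"
    return "tbd"


# Illustrative data only (not taken from the Lexicon files).
valid_terms = {normalize(t) for t in ["blood pressure", "heart rate"]}
invalid_terms = {normalize(t) for t in ["of the patient"]}

for cand in ["Blood Pressure", "(of the patient)", "magnetic resonance"]:
    print(cand, "->", auto_tag(cand, valid_terms, invalid_terms))
```

    Because the valid set is consulted first, a term that appears in both sets is tagged "yes" in this sketch.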

    The tagging results are used to update the invalidMw file. The following checks are used:

    • yes: should be in Lexicon
      =>lexAccessLb -n -i:yes -o:yes.out
      =>fgrep "|No Result Found-" yes.out |wc -l ... should be 0
    • no: should not be in Lexicon
      =>lexAccessLb -n -i:no -o:no.out
      =>fgrep "|No Result Found-" no.out |wc -l ... should equal the number of lines in no.out
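
    Both checks above come down to counting the lines that contain the `|No Result Found-` marker in each output file. A minimal Python sketch of that count follows. It assumes `yes.out` and `no.out` have already been produced by the commands above and that each queried term yields one output line, which is what the size comparison for `no.out` implies. All function names are hypothetical.

```python
NO_RESULT = "|No Result Found-"  # marker string used in the fgrep checks


def count_no_result(lines):
    """Count non-empty lines that contain the no-result marker."""
    return sum(1 for line in lines if line.strip() and NO_RESULT in line)


def check_yes(lines):
    """'yes' terms must all be in the Lexicon: no marker lines expected."""
    return count_no_result(lines) == 0


def check_no(lines):
    """'no' terms must all be absent: every non-empty line has the marker."""
    total = sum(1 for line in lines if line.strip())
    return count_no_result(lines) == total


def read_lines(path):
    """Read an output file such as yes.out or no.out into a list of lines."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()
```

    For example, `check_yes(read_lines("yes.out"))` and `check_no(read_lines("no.out"))` should both return `True` when the tagged files are consistent with the Lexicon.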

III. Process to get precision on new LMWs from candidate list

  • Candidates from MEDLINE n-gram.${NGRAM_YEAR}
  • Stats from LEXICON.${INIT_LEX_YEAR}

Step  Description
0     Prepare valid and invalid files:
        • Initial set:
          • valid: inflVars.data.${YEAR}
          • invalid: invalidMwXXX.data.${YEAR-1}
        • Final set (current):
          • valid: inflVars.data.current (latest)
          • invalid: invalidMwForXXX.data.final (current)
            => invalidMwFromXXX.data.${YEAR} must be updated until tagging of TBD is completed
            => invalidMwFromXXX.data.${YEAR} updates invalidMwForXXX.data.${YEAR}
            => invalidMwForXXX.data.final links to the latest invalidMwForXXX.data.${YEAR}
1     Run TagXXX to auto-tag until no TBD remains in the final data:
        • Candidates: TBD from the initial data, sent to linguists to
          • add LMWs to the Lexicon
          • tag yes|no
        • Update invalidMwFromXXX.data.current from the tagged file ("no" tag)
          • Rerun this step until no TBD remains in the final data
          • Link invalidMwForXXX.data.final to the latest invalidMwForXXX.data.${YEAR}
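
The rerun-until-no-TBD loop in step 1 can be sketched in Python as below. This is only an illustration of the control flow: `run_tagging_rounds` is a hypothetical name, the valid and invalid sets stand in for the inflVars and invalidMw files, and the `review` callback stands in for the linguists' manual work.

```python
def run_tagging_rounds(candidates, valid, invalid, review):
    """Re-run auto-tagging until no candidate is left as TBD.

    `review` is a hypothetical callback standing in for the linguists:
    it receives the current TBD candidates and returns a dict mapping
    each candidate it has decided on to 'yes' or 'no'.
    """
    valid, invalid = set(valid), set(invalid)  # work on copies
    rounds = 0
    while True:
        tbd = [c for c in candidates if c not in valid and c not in invalid]
        if not tbd:
            return valid, invalid, rounds
        # Keep only decisions about candidates that are actually TBD.
        decided = {t: tag for t, tag in review(tbd).items() if t in tbd}
        if not decided:
            raise RuntimeError("TBD candidates remain but none were tagged")
        for term, tag in decided.items():
            # 'yes' terms join the valid set; 'no' terms join the invalid set.
            (valid if tag == "yes" else invalid).add(term)
        rounds += 1


# Illustrative run: the reviewer decides one candidate per round.
answers = {"a b": "yes", "c d": "no", "e f": "no"}
final_valid, final_invalid, n_rounds = run_tagging_rounds(
    ["a b", "c d", "e f", "g h"], {"g h"}, set(),
    lambda tbd: {tbd[0]: answers[tbd[0]]},
)
print(sorted(final_valid), sorted(final_invalid), n_rounds)
```

Each round moves at least one TBD candidate into the valid or invalid set, so the loop ends once every candidate has been decided.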

  • Precision is calculated once the number of TBD in the final data is 0:
    • All "yes" from "TBD" are already in the Lexicon (inflVars.data.current)
    • All "no" from "TBD" are updated in invalidMwXXX.data.${NEXT_YEAR} (current)
    • precision (of new candidates) = no. of "yes" from "TBD" / total no. of "TBD"

    • Precision should improve over the years (after the first release in 2014) because more invalidMw entries are collected.
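
The precision formula above can be written as a short Python function. This is a minimal sketch: the function name and the sample tags are hypothetical, and the input is assumed to be the final yes/no tag of every candidate that was TBD in the initial run.

```python
def new_candidate_precision(final_tags):
    """Precision = no. of 'yes' / total no. of initially TBD candidates.

    `final_tags` holds the final tag ('yes' or 'no') of each candidate
    that was TBD in the initial run.
    """
    if not final_tags:
        raise ValueError("no TBD candidates to evaluate")
    if any(tag not in ("yes", "no") for tag in final_tags):
        raise ValueError("every TBD candidate must be tagged yes or no first")
    return sum(tag == "yes" for tag in final_tags) / len(final_tags)


tags = ["yes", "no", "yes", "yes"]  # illustrative tags only
print(f"precision = {new_candidate_precision(tags):.2f}")  # prints "precision = 0.75"
```

The function rejects input that still contains untagged candidates, matching the condition that precision is only meaningful once no TBD remains.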