SPECIALIST Lexicon

Precision for New LMW Candidates from (ACR) Model

I. Introduction

The results from the matchers generate a list of LMW (LexMultiWord) candidates. This list is sent to linguists, who add the valid terms to the Lexicon. An algorithm has been developed to calculate the precision automatically in this process, where precision = no. of valid candidates / no. of candidates = retrieved-relevant / retrieved.

II. Format and Tag

Ideally, no tag is needed because the tag can be retrieved from the Lexicon. However, we would like linguists to manually add a tag to each LMW candidate, as follows, to ensure the precision is correct. For example, if a linguist forgets to tag a candidate, it would be counted as an invalid MW without manual tagging. With manual tagging, any missed tag can be identified.

LMW candidates format:

LMW candidate* Tag

* LMW candidates are lowercase core-terms

Tag:

Three types of tag are assigned automatically, as listed below. The algorithm checks for valid first, then invalid. Candidates that meet neither condition are tagged TBD. Filter.Lexicon is used for checking, so matching includes exact match, exact match after lowercasing, match after removing leading and trailing punctuation, etc.

Tag	Description	Notes
yes y	known valid LMW from Lexicon	inflvars.data
no n	known invalid LMW from previous tag	notBaseForm.data (from Lexicon - valid expansion), notMwFromEndWord.data, notMwFromSpVar.data, notMwFromCuiTerm.data
tbd	to be done (untagged candidates)	LMW candidates file
o	not a valid acronym expansion (invalid MW)	legacy tag, not used
e	a valid expansion that exists in Lexicon (valid MW)	legacy tag, not used
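
The tagging order described above (valid first, then invalid, otherwise TBD) can be sketched as follows. This is a minimal illustration only, not the actual Filter.Lexicon code: the function names `match_keys` and `auto_tag` are hypothetical, the valid and invalid sets are assumed to be already loaded from the files listed in the table, and only the three matching variants named in the text are covered (the "etc." is not).

```python
import string


def match_keys(term: str) -> list[str]:
    """Return the forms tried for matching: exact, lowercased,
    and lowercased with leading/trailing punctuation removed."""
    stripped = term.strip(string.punctuation + string.whitespace)
    return [term, term.lower(), stripped.lower()]


def auto_tag(candidate: str, valid: set[str], invalid: set[str]) -> str:
    """Tag one candidate: check valid first, then invalid, else tbd."""
    keys = match_keys(candidate)
    if any(key in valid for key in keys):
        return "yes"
    if any(key in invalid for key in keys):
        return "no"
    return "tbd"
```

Because valid is checked first, a term that appears in both sets is tagged "yes" in this sketch.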

The tagging results are used to update the invalidMw file. The following checks are used:

yes: should be in Lexicon
=>lexAccessLb -n -i:yes -o:yes.out
=>fgrep "|No Result Found-" yes.out |wc -l ... should be 0
no: should not be in Lexicon
=>lexAccessLb -n -i:no -o:no.out
=>fgrep "|No Result Found-" no.out |wc -l ... should equal the number of lines in no.out
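
The two checks above can also be expressed as a small script. This is a sketch under assumptions: it relies only on the "|No Result Found-" marker shown in the fgrep commands, it assumes one output line per input term, and the function names are hypothetical.

```python
NOT_FOUND_MARKER = "|No Result Found-"


def count_not_found(lines: list[str]) -> int:
    """Count output lines that carry the not-found marker (like fgrep | wc -l)."""
    return sum(1 for line in lines if NOT_FOUND_MARKER in line)


def yes_check_passes(lines: list[str]) -> bool:
    """All "yes" terms should be in Lexicon: no not-found lines expected."""
    return count_not_found(lines) == 0


def no_check_passes(lines: list[str]) -> bool:
    """No "no" term should be in Lexicon: every non-empty line is not-found."""
    non_empty = [line for line in lines if line.strip()]
    return count_not_found(non_empty) == len(non_empty)
```

To use it on the real files, read them first, e.g. `yes_check_passes(open("yes.out").read().splitlines())`.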

III. Process to get precision on new LMWs from candidate list

Candidates from MEDLINE n-gram.${NGRAM_YEAR}
Stats from LEXICON.${INIT_LEX_YEAR}


Step 0: Prepare valid and invalid files:

Initial set:
- valid: inflVars.data.${YEAR}
- invalid: invalidMwXXX.data.${YEAR-1}
Final set (current):
- valid: inflVars.data.current (latest)
- invalid: invalidMwForXXX.data.final (current)
  => invalidMwFromXXX.data.${YEAR} needs to be updated until tagging on TBD is completed
  => invalidMwFromXXX.data.${YEAR} updates invalidMwForXXX.data.${YEAR}
  => invalidMwForXXX.data.final links to the latest invalidMwForXXX.data.${YEAR}

Step 1: Run TagXXX to auto-tag until no TBD remains in the final data:

Candidates: TBD from the initial data, sent to linguists to
- add LMWs to the Lexicon
- tag yes|no
Update invalidMwFromXXX.data.current from the tagged file ("no" tag)
- Rerun this step until no TBD remains in the final data
- Link invalidMwForXXX.data.final to the latest invalidMwForXXX.data.${YEAR}
Precision is calculated once the number of TBD in the final data is 0:
- All "yes" from "TBD" are already in the Lexicon (inflVars.data.current)
- All "no" from "TBD" are updated in invalidMwXXX.data.${NEXT_YEAR} (current)
- precision (of new candidates) = no. of "yes" from "TBD" / total no. of "TBD"
- Precision should improve over the years (after the first release in 2014) because more invalidMw entries are collected.
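
The precision formula above can be sketched as follows. This is an illustration only: the function name is hypothetical, the input is assumed to be the list of final tags for the candidates that were initially TBD, and "y"/"n" are assumed to be short forms of "yes"/"no" as suggested by the tag table.

```python
from collections import Counter


def precision_of_new_candidates(final_tags: list[str]) -> float:
    """Precision = no. of "yes" from TBD / total no. of TBD.

    final_tags holds the final tag of each candidate that started as TBD.
    """
    counts = Counter(tag.strip().lower() for tag in final_tags)
    if counts["tbd"] > 0:
        # Precision is only meaningful once no TBD remains.
        raise ValueError("tagging is not complete: TBD candidates remain")
    yes_count = counts["yes"] + counts["y"]
    no_count = counts["no"] + counts["n"]
    total = yes_count + no_count
    if total == 0:
        raise ValueError("no tagged candidates to compute precision from")
    return yes_count / total
```

For example, if 3 of 4 initially untagged candidates end up tagged "yes", the precision of new candidates is 3/4 = 0.75.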
