The SPECIALIST Lexicon

Inclusive Filter: EndWord pattern

I. Introduction

N-grams (terms) that end with certain words, such as "syndrome", "acid", or "disease", have a high probability of being valid multiwords. These n-grams are retrieved as LMW candidates. The most frequent endWords (the top 33+) are retrieved from the Lexicon, excluding numbers (1, 2, 3, I, II) and single-character words (A, B). This EndWord list is used as the default to test this matcher.
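The selection of frequent endWords described above can be sketched as follows. This is a minimal illustration, not the actual EndWordAnalysis.java: the class name `EndWordStats`, the method `topEndWords`, the sample terms, and the Roman-numeral rule (any word made only of I, V, X) are assumptions introduced for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: rank the last words of multiword terms by frequency.
final class EndWordStats {
    // Assumption: "numbers" means digit strings or Roman numerals built from I, V, X.
    private static boolean isNumberLike(String word) {
        return word.matches("\\d+") || word.matches("[IVX]+");
    }

    // Returns up to topN endWords, most frequent first (ties broken alphabetically).
    static List<String> topEndWords(List<String> terms, int topN) {
        Map<String, Integer> freq = new HashMap<>();
        for (String term : terms) {
            String[] words = term.trim().split("\\s+");
            if (words.length < 2) {
                continue; // only multiwords contribute an endWord
            }
            String endWord = words[words.length - 1];
            if (endWord.length() == 1 || isNumberLike(endWord)) {
                continue; // exclude single characters (A, B) and numbers (1, 2, I, II)
            }
            freq.merge(endWord, 1, Integer::sum);
        }
        List<String> ranked = new ArrayList<>(freq.keySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(freq.get(b), freq.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return ranked.subList(0, Math.min(topN, ranked.size()));
    }

    public static void main(String[] args) {
        // Made-up sample terms, not taken from the Lexicon.
        List<String> terms = List.of(
            "Down syndrome", "Marfan syndrome", "Turner syndrome",
            "folic acid", "acetic acid", "heart disease",
            "hepatitis B", "factor II", "type 1", "aspirin");
        System.out.println(topEndWords(terms, 2)); // prints [syndrome, acid]
    }
}
```

In this sample, "hepatitis B", "factor II", and "type 1" are dropped by the exclusion rules, and "aspirin" is skipped because it is a single word.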

II. Procedure
The following steps are used to find valid multiwords from n-grams by the endWord pattern:

  • Dir: ${MEDLINE_WRODS}/bin
  • Program:

    shell> 10.MatcherEndWord

    preProcess - Get EndWord from Lexicon

    Step 1
      Description: Analyze EndWord pattern in LMW (only multiwords)
        • EndWordAnalysis.java
      Inputs:
        • ./03.LeadEndTerm/lexWords.data
      Outputs:
        • EndWord.1.analysis.stats
          => Use this to get the top end word list
          • flds 1 EndWord.1.analysis.stats > EndWord.1.analysis.stats.1
          • Manually remove digits (1, 2, I, II, III, etc.) and single characters (A, B, C, etc.)
          • Select the top ${N} (45, 50, 55, etc.) as endWords.top${N}.data.${YEAR}
          • Link endWords.top${N}.data.${YEAR} to endWords.top.data
            => This is used to generate candidates in 09.MatcherCui option-34
        • EndWord.1.analysis.detail
      Notes: Find high-frequency endWords in the Lexicon
        • Get the stats on endWords for all LexTerms
        • Get the list of all Lexicon terms by the rank of their endWord

    Step 2
      Description: EndWordMatcher
        • MatcherEndWord.java
      Inputs:
        • N/A
      Outputs:
        • N/A
      Notes: Check if a term contains an endWord
        • Used as unit test
        • No need to run for generating the LMW candidate list

    Step 3
      Description: Test Matcher-EndWord in Lexicon
        • TestMatcherEndWord.java
      Inputs:
        • ./03.LeadEndTerm/lexWords.data
        • ./inData/matcherEndWord.data
      Outputs:
        • EndWord.2.test.stats
        • EndWord.2.test.detail
      Notes: Similar to Step 1, but including single words
        • No need to run for generating the LMW candidate list

    Process

    Step 10
      Description: Apply Matcher-EndWord on nGram
        • ApplyMatcherEndWord.java
      Inputs:
        • nGram option
          • distilledNGram.${YEAR}
          • n-gram.${YEAR}
        • ./inData/matcherEndWord.data
        • endWord option (ALL_END_WORD or a single endWord)
      Outputs:
        ./Distilled/${END_WROD}
        ./Whole/${END_WROD}
        • nGram.endWord.${YEAR}.apply.core.${END_WROD}
        • nGram.endWord.${YEAR}.apply.detail.${END_WROD}
        • nGram.endWord.${YEAR}.apply.out.${END_WROD}
        • nGram.endWord.${YEAR}.apply.stats.${END_WROD}
      Notes: Get n-grams that match the specified endWords
        • Case sensitive
        • The endWords [syndrome] and [Center] are under evaluation

    Step 11
      Description: Tag and get Stats for LMW candidates
        • TagEndWordMw.java
      Inputs:
        • nGram.EndWord.${YEAR}.apply.core.${END_WROD}
      Outputs:
        • nGram.endWord.${YEAR}.apply.core.${END_WROD}.tag
        • nGram.endWord.${YEAR}.apply.core.${END_WROD}.tbd
        • nGram.endWord.${YEAR}.apply.core.${END_WROD}.stats
      Notes:
        • Send this TBD-list to linguists to tag yes|no
        • yes: add valid MW to Lexicon
        • no: update to notMwFromEndWord.data.${YEAR}

    Step 12
      Description: Sort by reversed string - LMW candidates
        • SortByReversedStr.java
      Inputs:
        • nGram.endWord.${YEAR}.apply.core.${END_WROD}.tbd
      Outputs:
        • nGram.endWord.${YEAR}.apply.core.${END_WROD}.tbd.rSort
      Notes: Sort the results by the reversed string (last character first)
        • Suggested by Lynn, easier for tagging.
        • Rerun Steps 11-12 until no "tbd" entries remain in Step 11
        • The result is used as the LMW candidate list
        • Get Stats (precision and recall)
        • Compare stats between the whole and distilled n-gram sets
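To illustrate the reversed-string sort in Step 12, here is a small sketch. It is not the actual SortByReversedStr.java; the class name `ReversedSort`, its methods, and the sample terms are hypothetical. Sorting on the reversed string groups terms that share an ending, which is presumably why it makes tagging easier.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: order terms by their reversed string (last character first).
final class ReversedSort {
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // Returns a new list; the input list is left unchanged.
    static List<String> sortByReversed(List<String> terms) {
        List<String> sorted = new ArrayList<>(terms);
        sorted.sort(Comparator.comparing(ReversedSort::reverse));
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up sample candidates.
        List<String> tbd = List.of(
            "Down syndrome", "Medical Center", "Marfan syndrome", "Cancer Center");
        // prints [Marfan syndrome, Down syndrome, Medical Center, Cancer Center]
        System.out.println(sortByReversed(tbd));
    }
}
```

Terms ending in "syndrome" end up adjacent, as do terms ending in "Center", so a tagger can review each ending as one block.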

III. End_Word List

For now, only two endWords are selected for tagging. We will test more from the high-frequency list derived from Step 2.

  • syndrome
  • Center
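A case-sensitive endWord check against these two endWords might look like the sketch below. It is an illustration only and not the actual MatcherEndWord.java; the class `EndWordMatcherSketch`, its methods, and the sample terms are assumptions. It treats the endWord as the last space-separated token of a term.

```java
import java.util.Set;

// Hypothetical sketch: case-sensitive test of whether a term ends with an endWord.
final class EndWordMatcherSketch {
    // Returns the last space-separated token of the term.
    static String endWordOf(String term) {
        String trimmed = term.trim();
        return trimmed.substring(trimmed.lastIndexOf(' ') + 1);
    }

    // True if the term's last token is exactly (case-sensitively) in the endWord set.
    // Note: this sketch does not require the term to have more than one word.
    static boolean matches(String term, Set<String> endWords) {
        return endWords.contains(endWordOf(term));
    }

    public static void main(String[] args) {
        Set<String> endWords = Set.of("syndrome", "Center");
        System.out.println(matches("Down syndrome", endWords));  // prints true
        System.out.println(matches("Down Syndrome", endWords));  // prints false
        System.out.println(matches("Medical Center", endWords)); // prints true
        System.out.println(matches("medical center", endWords)); // prints false
    }
}
```

Because the comparison is case sensitive, "Syndrome" and "center" do not match the listed endWords.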