The SPECIALIST Lexicon

Distilled MEDLINE N-Gram Set

I. Introduction

The MEDLINE n-gram set includes many invalid LMWs that most NLP research does not need. LSG developed a set of exclusive filters to remove these invalid LMWs. Applying the filters removes about 2/3 of the n-grams in the MEDLINE n-gram set release. The resulting filtered set is called the distilled MEDLINE n-gram set.

II. Precision and Recall

The distilled MEDLINE n-gram set has higher precision than the MEDLINE n-gram set and a similar recall in terms of valid multiwords. LSG tests the accuracy of every exclusive filter by applying it to the Lexicon (valid LMWs). The minimum passing rate is 99.99%. In other words, the filters remove invalid LMWs while removing almost no valid LMWs. A simple calculation follows:

  • The n-gram set includes N valid LMWs (TP0) and M invalid LMWs (FP0)
  • A series of filters removes X valid LMWs (TP1) and Y invalid LMWs (FP1)
  • The distilled n-gram set has (N-X) valid LMWs (TP2) and (M-Y) invalid LMWs (FP2)
  • If the accuracy test result is very high (99.99%), then
    • X is a very small number (almost 0)
    • Y is a large number (almost 2/3 of N+M)

  • The precision
    = (retrieved and relevant)/(total retrieved)
    = TP2/(TP2+FP2)
    = (N-X)/(N-X + M-Y)
    ≈ N/(N+M-Y) (> N/(N+M))

  • The recall
    = (retrieved and relevant)/(total relevant)
    = TP2/(TP2 + FN0)
    = (N-X)/(N-X + FN0)
    ≈ N/(N + FN0) (= TP0/(TP0 + FN0))
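The calculation above can be checked numerically. The following is a minimal sketch that evaluates the precision and recall expressions exactly as written above. The function names and all counts (N=1000, M=9000, X=1, Y=6666, FN0=250) are hypothetical and chosen only for illustration; they are not measured values.

```shell
#!/bin/sh
# precision N M X Y: (N-X) / ((N-X) + (M-Y))
precision() {
  awk -v n="$1" -v m="$2" -v x="$3" -v y="$4" \
    'BEGIN { printf "%.4f\n", (n - x) / ((n - x) + (m - y)) }'
}

# recall N X FN0: (N-X) / ((N-X) + FN0), as written in the text
recall() {
  awk -v n="$1" -v x="$2" -v fn="$3" \
    'BEGIN { printf "%.4f\n", (n - x) / ((n - x) + fn) }'
}

# Hypothetical counts, for illustration only
precision 1000 9000 0 0      # before filtering (X=0, Y=0): 0.1000
precision 1000 9000 1 6666   # after filtering:             0.2997
recall 1000 0 250            # before filtering:            0.8000
recall 1000 1 250            # after filtering:             0.7998
```

With these assumed counts, precision roughly triples while recall changes only in the fourth decimal place.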

III. Conclusion

The distilled MEDLINE n-gram Set vs. MEDLINE n-gram Set

  • All exclusive filters have an accuracy rate above 99.99% (tested on the Lexicon)
  • Smaller data set (about 1/3 of the original size)
  • Better precision
  • Similar recall
  • Can be used as a baseline for further analysis

IV. Release Processes

  • Dir: ${MULTIWORD_DIR}
  • Script: manually add the n-gram number for ${YEAR} to ${MULTIWORD_DIR}/bin/05.ApplyFilters
    shell>cd ${MULTIWORDS}/data/${YEAR}/outData/02.NGram/nGrams
    shell>wc -l nGramSet.${YEAR}.30

    Year  nGram Number
    2014  17,023,819
    2015  18,148,692
    2016  19,325,338
    2017  21,963,037
    2018  23,171,133
    2019  24,666,816
    2020  26,310,808
    2021  28,103,252
    2022  30,090,771
    2023  32,107,061

    The nGram number is required to compute the correct pass rate (percentage). Update this number for the associated year in the 05.ApplyFilters script before generating the distilled n-gram set and updating the log file.
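The count for a new year can be obtained and formatted like the other table entries with a small helper. This is a minimal sketch: count_ngrams and with_commas are hypothetical helper names, not part of the release scripts, and the sketch assumes one n-gram per line, as implied by the wc -l command above.

```shell
#!/bin/sh
# Count the lines (one n-gram per line) in an n-gram file.
count_ngrams() {
  wc -l < "$1" | tr -d ' '
}

# Insert thousands separators, e.g. 17023819 -> 17,023,819
with_commas() {
  printf '%s\n' "$1" | sed -e :a -e 's/\(.*[0-9]\)\([0-9]\{3\}\)/\1,\2/' -e ta
}

# Example use (MULTIWORDS and YEAR must already be set):
# n=$(count_ngrams "${MULTIWORDS}/data/${YEAR}/outData/02.NGram/nGrams/nGramSet.${YEAR}.30")
# printf '%s  %s\n' "${YEAR}" "$(with_commas "$n")"
```

The unformatted number is presumably what the 05.ApplyFilters script needs; the comma-separated form is only for the table.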
  • Input Data:
    Set up all of the following files before running the program (05.ApplyFilters).
    In practice, we test all filters against the latest Lexicon (04.TestFilters) by setting up the following files before proceeding to 05.ApplyFilters.
    For the Lead-End-Term files: run Steps 1-4 of 03.LeadEndTerm ${YEAR} to update the data. Reusing the previous year's data is also acceptable.
    • n-gram.${YEAR} (Step 1):
      shell>mkdir ${MULTIWORD_DIR}/data/${YEAR}/outData/05.ApplyFilters
      shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/05.ApplyFilters
      shell>ln -sf ../02.NGram/nGrams/nGramSet.${YEAR}.30 nGram.${YEAR}
    • NRVAR (Step 13):
      shell>cd ${MULTIWORD_DIR}/data/${YEAR}/inData
      shell>ln -sf nfsvol/lex/Lu/Backup/Releases/UMLS/${YEAR}_AA_release/LEX/NUMBERS/NRVAR NRVAR
    • stopWords.data (Step 14):
      shell>cd ${MULTIWORD_DIR}/data/${YEAR}/inData
      shell>cp -p ../../${PREV_YEAR}/inData/stopWords.data.${PREV_YEAR} stopWords.data.${YEAR}
      shell>ln -sf ./stopWords.data.${YEAR} stopWords.data
    • unit.data (Step 24):
      shell>cp -p ../../${PREV_YEAR}/inData/unit.data.${PREV_YEAR} unit.data.${YEAR}
      shell>ln -sf ./unit.data.${YEAR} unit.data
    • invalidLeadTerms.data.abs (Step 30):
      shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/03.LeadEndTerm
      shell>cat invalidLeadTerms.data invalidLeadTerms.data.append > invalidLeadTerms.data.${YEAR}
      or
      shell>cp -p ../../../${PREV_YEAR}/outData/03.LeadEndTerm/invalidLeadTerms.data.${PREV_YEAR} invalidLeadTerms.data.${YEAR}
      shell>ln -sf ./invalidLeadTerms.data.${YEAR} invalidLeadTerms.data.abs
    • invalidEndTerms.data.abs (Step 31):
      shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/03.LeadEndTerm
      shell>mv invalidEndTerms.data invalidEndTerms.data.${YEAR}
      or
      shell>cp -p ../../../${PREV_YEAR}/outData/03.LeadEndTerm/invalidEndTerms.data.${PREV_YEAR} invalidEndTerms.data.${YEAR}
      shell>ln -sf ./invalidEndTerms.data.${YEAR} invalidEndTerms.data.abs
    • invalidLeadEndTermCandidates.data (Step 32):
      shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/03.LeadEndTerm
      shell>cp -p ../../../${PREV_YEAR}/outData/03.LeadEndTerm/invalidLeadEndTermCandidates.data .
      This file may be unchanged if you run 03.LeadEndTerm.
    • validLeadTerms.data.pat (Step 33):
      shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/03.LeadEndTerm
      shell>cp -p ../../../${PREV_YEAR}/outData/03.LeadEndTerm/validLeadTerms.data.pat.${PREV_YEAR} validLeadTerms.data.pat.${YEAR}
      shell>ln -sf ./validLeadTerms.data.pat.${YEAR} validLeadTerms.data.pat
    • validEndTerms.data.pat (Step 34):
      shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/03.LeadEndTerm
      shell>cp -p ../../../${PREV_YEAR}/outData/03.LeadEndTerm/validEndTerms.data.pat.${PREV_YEAR} validEndTerms.data.pat.${YEAR}
      shell>ln -sf ./validEndTerms.data.pat.${YEAR} validEndTerms.data.pat
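Before running 05.ApplyFilters, it may help to confirm that every input file listed above is in place. The following is a minimal sketch of such a check. check_inputs is a hypothetical helper, not part of the release scripts, and it assumes that unit.data lives in inData alongside stopWords.data.

```shell
#!/bin/sh
# check_inputs BASE YEAR
# Prints one "missing: ..." line per absent input file under
# BASE/data/YEAR and returns non-zero if any file is absent.
check_inputs() {
  base=$1; year=$2; missing=0
  for f in \
    "outData/05.ApplyFilters/nGram.${year}" \
    "inData/NRVAR" \
    "inData/stopWords.data" \
    "inData/unit.data" \
    "outData/03.LeadEndTerm/invalidLeadTerms.data.abs" \
    "outData/03.LeadEndTerm/invalidEndTerms.data.abs" \
    "outData/03.LeadEndTerm/invalidLeadEndTermCandidates.data" \
    "outData/03.LeadEndTerm/validLeadTerms.data.pat" \
    "outData/03.LeadEndTerm/validEndTerms.data.pat"
  do
    # -e follows symbolic links, so a dangling link counts as missing
    if [ ! -e "${base}/data/${year}/${f}" ]; then
      echo "missing: ${f}"
      missing=1
    fi
  done
  return "$missing"
}

# Example use (MULTIWORD_DIR and YEAR must already be set):
# check_inputs "${MULTIWORD_DIR}" "${YEAR}" && echo "all inputs present"
```

Because most inputs are symbolic links, the check follows each link to its target, so a link to a file that was never copied is also reported.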
  • Run Program:
    • shell>cd ${MULTIWORDS}/bin
      shell>05.ApplyFilters ${YEAR}
      1
      10-14
      20-25
      30-34
      40

      or

    • shell>cd 05.ApplyFiltersAll
    • shell>runApplyFilersAll ${YEAR}
  • Output Data:
    • Dir: /${MULTIWORD}/data/${YEAR}/outData/05.ApplyFilters
    • ApplyFilters.rpt (use this file to update release log file below)
      => shell> cp -p ApplyFilters.rpt ApplyFilters.rpt.${YEAR}
    • nGram.${YEAR}.${STEP}.${NAME}
    • nGram.${YEAR}.${STEP}.${NAME}.exp
    • nGram.${YEAR}.${STEP}.${NAME}.trap

    • Use nGram.${YEAR}.34.invEndTermPat (the last filtered one) for the distilled n-gram set
    • Distribute it
      =>shell> cp -p nGram.${YEAR}.34.invEndTermPat ../02.NGram/nGrams/distilledNGram.${YEAR}
    • Back it up
      =>shell> gtar -czvf distilledNGram.${YEAR}.tgz distilledNGram.${YEAR}
    • Get Core files
      =>shell> cd ${MULTIWORDS}/bin
      =>shell> 06.NGramUtil ${YEAR}
      20 for the MEDLINE nGram set
      21 for the distilled nGram set
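As a sanity check on the output, the size of the distilled set can be compared with the original n-gram set; per the introduction, roughly 1/3 of the n-grams should remain. The following is a minimal sketch. retained_pct is a hypothetical helper, not part of the release scripts, and it assumes one n-gram per line in both files.

```shell
#!/bin/sh
# retained_pct DISTILLED ORIGINAL
# Prints the percentage of lines (n-grams) in ORIGINAL that remain in DISTILLED.
retained_pct() {
  kept=$(wc -l < "$1" | tr -d ' ')
  total=$(wc -l < "$2" | tr -d ' ')
  awk -v k="$kept" -v t="$total" \
    'BEGIN { if (t == 0) exit 1; printf "%.2f\n", 100 * k / t }'
}

# Example use (MULTIWORDS and YEAR must already be set):
# cd "${MULTIWORDS}/data/${YEAR}/outData/02.NGram/nGrams"
# retained_pct "distilledNGram.${YEAR}" "nGramSet.${YEAR}.30"
```

A value far from about 33 would suggest checking the nGram number and the input files before distributing the set.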

V. Release Logs

VI. Run the Test data on the Lexicon

  • Must run 03.LeadEndTerm ${YEAR} for lexWords.data
  • Run 04.TestFilters ${YEAR} to update the test result for each filter