The SPECIALIST Lexicon

Total Data Set Files and Usage

The procedure for adding LMWs is:

  • Generate N-gram set from MEDLINE (or any other corpus)
  • Generate candidate lists from n-gram set
  • Linguists manually tag/add terms from the candidate lists
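
The first two steps above can be illustrated with a minimal Python sketch. It is not the MULTIWORDS implementation: the function names, the whitespace tokenization, and the frequency threshold (min_count) are assumptions made only for illustration, and the real candidate lists are produced by dedicated matchers rather than by a simple count cutoff.

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Return all n-word sequences in tokens as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(sentences, max_n=3):
    """Count every 1..max_n word n-gram over an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        # Assumption: lowercase and split on whitespace as a stand-in tokenizer.
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            counts.update(word_ngrams(tokens, n))
    return counts

def candidate_list(counts, min_count=2):
    """Hypothetical filter: keep multiword n-grams seen at least min_count times."""
    return sorted(t for t, c in counts.items() if " " in t and c >= min_count)
```

For example, candidate_list(ngram_counts(["heart attack risk", "heart attack"])) keeps only "heart attack", the one multiword n-gram that occurs twice.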

It would be very useful to collect all manually tagged terms. They can be used for:

  • A final filter to exclude previously tagged terms
    • Use all inflVars to filter out valid LMWs (terms already in the Lexicon)
    • Use the collected invalid LMWs to tag candidates as invalid, as a reference for linguists (some invalid terms might become valid).
      => Please note that the above two data sets change whenever the Lexicon is updated, a new candidate list is completed, or a new not base/LMW file is updated in LexCheck
  • Training/test data sets for deep learning models

These manually tagged data are collected from two sources:

  • Program: ${MULTIWORDS}/bin/00.CandidateList
    3
  • Data directory: ${MULTIWORDS}/data/Candidate/
  • Algorithm:
    • Get total data from
      • All previous candidate lists
      • All not base/LMW files from LexCheck
    • Use the latest Lexicon (InflVars.data) to tag valid/invalid LMWs
      • valid LMWs: totalData.data.yes
      • invalid LMWs: totalData.data.no
    • Use the tagged results to tag new candidate lists:
      • valid LMWs: inflVars.data
      • invalid LMWs: totalData.data.no
      • TBD (new Candidate list): others
  • Output files:
    • totalData.data.*

      Date        Notes                          Total Candidate  Valid LMWs          Invalid LMWs
                                                 totalData.data   totalData.data.yes  totalData.data.no
      2018-11-15  2.MNSMatcherParAcr, 2017       31004            16331 (52.67%)      14673 (47.32%)
      2019-01-03  2.MNSMatcherParAcr, 2018       31806            16924 (53.21%)      14882 (46.79%)
      2019-05-20  3.DMNSMatcherCuiEndWord, 2017  33751            16924 (50.14%)      16827 (49.86%)
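
The Algorithm steps above can be sketched in Python as follows. This is a minimal illustration, not the code behind 00.CandidateList: the function names are hypothetical, terms are assumed to be already loaded into plain lists of strings, and matching is assumed to be a case-insensitive exact comparison.

```python
def tag_total_data(total_terms, lexicon_terms):
    """Split the collected terms into valid LMWs (found in the Lexicon)
    and invalid LMWs (not found), mirroring the .yes and .no files."""
    # Assumption: case-insensitive exact match against the Lexicon terms.
    lexicon = {t.lower() for t in lexicon_terms}
    valid, invalid = [], []
    for term in total_terms:
        (valid if term.lower() in lexicon else invalid).append(term)
    return valid, invalid

def tag_candidates(candidates, lexicon_terms, invalid_terms):
    """Tag each new candidate as 'valid', 'invalid', or 'TBD'."""
    lexicon = {t.lower() for t in lexicon_terms}
    invalid = {t.lower() for t in invalid_terms}
    tagged = {}
    for term in candidates:
        key = term.lower()
        if key in lexicon:
            tagged[term] = "valid"    # already in the Lexicon
        elif key in invalid:
            tagged[term] = "invalid"  # previously tagged as invalid
        else:
            tagged[term] = "TBD"      # goes on the new candidate list
    return tagged
```

Under these assumptions, only the terms tagged "TBD" need review by linguists. The Lexicon check runs first, so a term that was once tagged invalid but has since been added to the Lexicon is reported as valid.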