The SPECIALIST Lexicon

LMW Candidates from WordNet

I. Introduction

WordNet is used to enhance the Lexicon with multiwords, derivations, synonyms, and antonyms. WordNet 3.0 and JWI (Java WordNet Interface) are used for this development.

II. Models

Several models are implemented in ${LMW_DIR}/WordNetMw/*.java:

  • Pre-Process:
    • file dir: ${WORD_NET_DIR}/data/outData/data/Output/
      File | Word count | Description
      Words from WordNet
      WnWords.data.3.0 | 156,584 | Words from synset, root, unique in spelling and POS
      WnIndexWords.data.3.0 | 155,287 |
      • Indexed words, unique lowercased spelling of all words
      • SW: 90,956, MW: 64,331
      • noun: 117,798, verb: 11,529, adj: 21,479, adv: 4,481
      Derivations from WordNet
      WnDPairs.data.3.0 | 42,475 | derivations
      • include suffixD (16,427) and zeroD (4,769)
      • prefixD are not included
      • include multiwords
      • dPairs are not completely symmetric
      Synonyms from WordNet
      WnSPairs.data.3.0 | 315,312 |
      Antonyms from WordNet
      WnAPairs.data.3.0 | 12,248 |
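The note that dPairs are not completely symmetric can be checked mechanically. The sketch below is illustrative only: it assumes a hypothetical line format of `word1|pos1|word2|pos2`, which is not the documented layout of WnDPairs.data.3.0, and the class name, method name, and sample pairs are made up for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: the "word1|pos1|word2|pos2" line format is an assumption,
// not the documented layout of WnDPairs.data.3.0.
class DPairSymmetrySketch {

    // Returns the pairs whose reverse (word2|pos2|word1|pos1) is missing.
    static List<String> findAsymmetric(List<String> pairs) {
        Set<String> seen = new HashSet<>(pairs);
        List<String> asymmetric = new ArrayList<>();
        for (String pair : pairs) {
            String[] f = pair.split("\\|");
            if (f.length != 4) {
                continue; // skip malformed lines
            }
            String reverse = f[2] + "|" + f[3] + "|" + f[0] + "|" + f[1];
            if (!seen.contains(reverse)) {
                asymmetric.add(pair);
            }
        }
        return asymmetric;
    }

    public static void main(String[] args) {
        // Made-up sample pairs, for illustration only.
        List<String> pairs = List.of(
            "break down|verb|breakdown|noun",
            "breakdown|noun|break down|verb",
            "take off|verb|takeoff|noun");
        System.out.println(findAsymmetric(pairs));
    }
}
```

In this sample only the third pair is reported, because its reverse entry is absent.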
  • Model-1, word candidates from words in WordNet (TBD):
    • status: sent: TBD, tag: TBD
    • Dir: ${LMW_DIR}/sources/WordNetMw/
    • file dir: ${LMW_DIR}/data/2021/outData/13.WordNet/
    • Algorithm:
      Step | Algorithm | Out file
      - | words from WordNet | WnWords.data
      0 | lexicon filter: filter out words that are in the Lexicon |
      • candFromWn.data.pass.0.lexicon (76,076)
      • candFromWn.data.trap.0.lexicon (80,508)
      1 | general filters: filter out invalid words: pipe, punc, digit, number, stopword |
      • candFromWn.data.pass.1.general (75,596)
      2 | pattern filters: filter out invalid words: parAcr, indArt, colon, disChar, disPunc, incomplete, measure |
      • candFromWn.data.pass.2.pattern (75,493)
      3 | single words: filter out decade, ordinal, Roman, no CUI |
      • WordNetCand.sw.cuiGram.2021 (1,991)
        Cand list, TBD: filter out SW from derivations, synonyms, antonyms.
      4 | multiwords: filter out Ilt, Iet, Let, Vlt, Vet, no CUI |
      • WordNetCand.mw.cui.2021 (9,022)
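Steps 0 to 2 above behave like a chain of pass/trap filters: each word is trapped by the first filter it fails, and only the survivors move on. The following is a minimal sketch of that idea, not the actual implementation. The filter names mirror the file suffixes above, but the predicates are hypothetical, simplified stand-ins (an in-memory lexicon set, a pipe check, and an all-digit check), since the real filter rules are not described here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Sketch only: filter names and predicates are illustrative assumptions.
class WordFilterPipelineSketch {

    // Runs the named filters in order. A word is "trapped" by the first
    // filter it fails; words that pass every filter end up under "pass".
    static Map<String, List<String>> run(List<String> words,
            Map<String, Predicate<String>> filters) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String name : filters.keySet()) {
            out.put("trap." + name, new ArrayList<>());
        }
        out.put("pass", new ArrayList<>());
        for (String word : words) {
            String bucket = "pass";
            for (Map.Entry<String, Predicate<String>> e : filters.entrySet()) {
                if (!e.getValue().test(word)) {
                    bucket = "trap." + e.getKey();
                    break;
                }
            }
            out.get(bucket).add(word);
        }
        return out;
    }

    // Hypothetical, simplified stand-ins for steps 0 and 1.
    static Map<String, Predicate<String>> demoFilters(Set<String> lexicon) {
        Map<String, Predicate<String>> filters = new LinkedHashMap<>();
        // Step 0: keep only words that are not already in the Lexicon.
        filters.put("0.lexicon", w -> !lexicon.contains(w.toLowerCase()));
        // Step 1: reject words containing a pipe or made up of digits only.
        filters.put("1.general", w -> !w.contains("|")
                && !w.chars().allMatch(Character::isDigit));
        return filters;
    }

    public static void main(String[] args) {
        Set<String> lexicon = Set.of("heart attack");
        List<String> words = List.of("heart attack", "1984", "zoom lens");
        System.out.println(run(words, demoFilters(lexicon)));
    }
}
```

Keeping a separate trap bucket per filter matches the pass/trap file pairs listed above and makes it easy to audit what each step removed.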
  • Model-2: Verb Complement: [verb + prep]
    [verb + prep] is a valid word with POS of verb in WordNet, but it is recorded as a verb complement in the Lexicon. Generate [verb + prep|verb] as verb complement candidates for LexBuild.
    • status: sent: 5/24/2021, tag: TBD
    • Dir: ${LMW_DIR}/sources/WordNetMw/
    • file dir: ${LMW_DIR}/data/2021/outData/13.WordNet/Cand
    • Algorithm (GenVerbComplement.java):
      Step | Algorithm | Out file
      - | words from WordNet | WnWords.data
      1 | get multiwords from WordNet; POS must be verb; multiword must match the pattern [verb + " " + prep] | verbComplement.data (1,708)
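The [verb + " " + prep] test in step 1 can be sketched as below. This is not GenVerbComplement.java. The preposition list is a deliberately small hypothetical sample, the verb list is passed in by the caller, and the output format (`multiword|verb`) is one assumed reading of [verb + prep|verb].

```java
import java.util.Optional;
import java.util.Set;

// Sketch only: the preposition list and the output format are assumptions.
class VerbComplementSketch {

    // Hypothetical, deliberately small preposition list for illustration.
    static final Set<String> PREPS = Set.of(
        "about", "at", "by", "for", "from", "in", "into", "of", "off",
        "on", "out", "over", "to", "up", "with");

    // Returns "verb prep|verb" when a WordNet verb entry consists of exactly
    // a known verb, one space, and a preposition; otherwise returns empty.
    static Optional<String> toCandidate(String word, String pos,
            Set<String> knownVerbs) {
        if (!"verb".equals(pos)) {
            return Optional.empty();
        }
        String[] tokens = word.split(" ");
        if (tokens.length != 2 || !knownVerbs.contains(tokens[0])
                || !PREPS.contains(tokens[1])) {
            return Optional.empty();
        }
        return Optional.of(word + "|verb");
    }

    public static void main(String[] args) {
        Set<String> verbs = Set.of("give", "take");
        System.out.println(toCandidate("give up", "verb", verbs));
        System.out.println(toCandidate("give up the ghost", "verb", verbs));
        System.out.println(toCandidate("give up", "noun", verbs));
    }
}
```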
  • Model-3: Word candidates from Zero Derivations in WordNet
    • status: sent: Done, tag: Done
    • Dir: ${DERIVATION_DIR}/6.WordNetD/
    • file dir: ${DERIVATION_DIR}/6.WordNetD/data/2021/outData/Cand
    • Algorithm:
      Step | Algorithm | Out file
      0 | Categorize dPairs into zeroD, suffixD, prefixD, Others |
      • WnDPairs.1.unique.data.ZD.2021 (4,769)
      • WnDPairs.1.unique.data.SD.2021 (16,427)
      • WnDPairs.1.unique.data.PD.2021 (2)
        => none of these are legitimate prefixD
      • WnDPairs.1.unique.data.OD.2021 (68)
        => none of these are legitimate 1-step derivations; most are semantically related words
      1 | retrieve those whose base|POS is not in the Lexicon |
      • WnDPairs.1.unique.data.ZD.2021.word (1,420)
      2 | apply combined filters (Lexicon, general, pattern, single-word, lead-end word); filter out verb complements; apply CUI filters |
      • WnDPairs.ZD.wordCand.cand.2021 (349)
      • WnDPairs.ZD.wordCand.trapByPostFilter.noCui.2021 (836)
    • Results & Conclusion:

      Zero derivations in WordNet are used to retrieve lexical multiword candidates. Precision is the proportion of candidates that are valid multiwords, calculated on lowercased spelling without POS. The candidates generated by this model have high precision (97.05%).

      This study further divided the candidates into two groups: those with and those without UMLS CUIs (concept unique identifiers). Precision is 100% for candidates with CUIs and 95.53% for candidates without. In theory, every multiword carries meaning by itself and should therefore map to a CUI. Interestingly, precision differs only slightly between the two groups. Our inference is that the UMLS does not have complete concept coverage of all terms.

      In conclusion,

      • Applying CUIs as a filter for multiword retrieval reduces recall considerably (a 65% reduction in this model, from 100% to 35%) and improves precision only slightly (a 2.95% increase in this model, from 97.05% to 100%). Overall, it decreases F1, so the CUI filter is not recommended for multiword retrieval from a thesaurus.
      • WordNet is a good resource for enhancing multiword coverage in the SPECIALIST Lexicon.
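The F1 claim can be checked directly from the precision and recall figures quoted above. The snippet below is a small worked calculation using the standard F1 formula (the harmonic mean of precision and recall); the class and method names are illustrative.

```java
// Worked check of the F1 comparison, using the figures quoted above.
class F1Sketch {

    // Harmonic mean of precision and recall; 0 when both are 0.
    static double f1(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }

    public static void main(String[] args) {
        double without = f1(0.9705, 1.00); // no CUI filter
        double with = f1(1.00, 0.35);      // CUI filter applied
        System.out.printf("F1 without CUI filter: %.4f%n", without);
        System.out.printf("F1 with CUI filter:    %.4f%n", with);
    }
}
```

F1 is about 0.985 without the CUI filter and about 0.519 with it, which is consistent with the recommendation not to use the filter.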

      Algorithm for this Model – multiwords from zero derivations in WordNet:

      • Retrieve derivations from WordNet 3.0
      • Categorize dPairs to zeroD, suffixD, prefixD, Others
      • Retrieve zeroDs whose base|POS is not in the Lexicon (filter out words in the Lexicon)
      • Apply combined filters: Lexicon, general, pattern, single-word, lead-end word
      • Filter out verb complements
      • Apply CUI filters
      • Generate candidates
      • Calculate performance based on unique lowercased spelling, without POS

      Candidate files:

      • WnDPairs.ZD.wordCand.cand.2021
      • WnDPairs.ZD.wordCand.trapByPostFilter.noCui.2021
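The categorization step (zeroD, suffixD, prefixD, Others) can be illustrated with a crude spelling-based heuristic. This is an assumption-laden sketch, not the rules used to produce the files above. It treats zeroD as same spelling with a different POS, prefixD as one word being the other plus leading characters, and suffixD as two words sharing a leading stem of at least `minStem` characters. The `minStem` parameter and all names are hypothetical.

```java
// Sketch only: a spelling-based heuristic, not the actual categorization rules.
class DPairCategorySketch {

    enum Category { ZERO_D, SUFFIX_D, PREFIX_D, OTHER }

    // Heuristic assumptions:
    //  - zeroD:   same spelling, different POS
    //  - prefixD: one word is the other plus leading characters
    //  - suffixD: the words share a leading stem of at least minStem chars
    static Category categorize(String w1, String pos1, String w2, String pos2,
            int minStem) {
        String a = w1.toLowerCase();
        String b = w2.toLowerCase();
        if (a.equals(b)) {
            return pos1.equals(pos2) ? Category.OTHER : Category.ZERO_D;
        }
        String shorter = a.length() <= b.length() ? a : b;
        String longer = a.length() <= b.length() ? b : a;
        if (!shorter.isEmpty() && longer.endsWith(shorter)) {
            return Category.PREFIX_D;
        }
        int common = 0;
        while (common < shorter.length()
                && shorter.charAt(common) == longer.charAt(common)) {
            common++;
        }
        return common >= minStem ? Category.SUFFIX_D : Category.OTHER;
    }

    public static void main(String[] args) {
        System.out.println(categorize("run", "verb", "run", "noun", 3));
        System.out.println(categorize("kind", "adj", "kindness", "noun", 3));
        System.out.println(categorize("happy", "adj", "unhappy", "adj", 3));
        System.out.println(categorize("go", "verb", "went", "verb", 3));
    }
}
```

A heuristic like this over-generates, so its prefixD and Others buckets would still need manual review, in line with the notes on the PD and OD files above.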
    • Model-4: Word candidates from Suffix Derivations in WordNet
      • status: sent: TBD, tag: TBD
      • Dir: ${DERIVATION_DIR}/6.WordNetD/
      • file dir: ${DERIVATION_DIR}/6.WordNetD/data/2021/outData/Cand/
      • Algorithm:
        Step | Algorithm | Out file
        0 | Categorize dPairs into zeroD, suffixD, prefixD, Others |
        • WnDPairs.1.unique.data.ZD.2021 (4,769)
        • WnDPairs.1.unique.data.SD.2021 (16,427)
        • WnDPairs.1.unique.data.PD.2021 (2)
          => none of these are legitimate prefixD
        • WnDPairs.1.unique.data.OD.2021 (68)
          => none of these are legitimate 1-step derivations; most are semantically related words
        1 | retrieve single words (not needed) |
        • WnDPairs.1.unique.data.SD.2021.word (4,886)
        2 | see log.LMW_SD.2021 |
        Algorithm:
        • input: WnDPairs.1.unique.data.SD.2021 (16,427 pairs, 26,514 words)
        • apply combined filters (4,619):
          • Lexicon:
          • Not Base Form:
          • Not LMW:
          • general filters
          • pattern filters
          • single-word
          • lead-end term
          • lead-term pattern
          • end-term pattern
        • filter out word candidates from previous WordNet models (4,619):
          • verb complements
          • zeroD
        • CUI filters (4,619):
          • CUI: WnDPairs.SD.wordCand.cand.2021 (756)
          • No CUI: WnDPairs.SD.wordCand.trapByPostFilter.noCui.2021 (3,722)
          • Prev-Models: 141 (zeroD & Verb Complement)
        Out files:
        • WnDPairs.SD.wordCand.cand.2021 (756)
        • WnDPairs.SD.wordCand.trapByPostFilter.noCui.2021 (3,722)
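The last part of step 2 splits the 4,619 filtered candidates into three disjoint groups: those already produced by previous models, those with a CUI, and those without. The sketch below shows one way such a split could work. The bucket names, the in-memory sets, and the sample terms are illustrative assumptions, not the structures used by the actual programs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch only: bucket names and the in-memory sets are illustrative
// assumptions, not the structures used by the actual programs.
class CandidatePartitionSketch {

    // Splits filtered candidates into three disjoint buckets:
    // already produced by earlier models, with a CUI, and without a CUI.
    static Map<String, List<String>> partition(List<String> candidates,
            Set<String> previousModels, Set<String> termsWithCui) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        out.put("prevModels", new ArrayList<>());
        out.put("cui", new ArrayList<>());
        out.put("noCui", new ArrayList<>());
        for (String cand : candidates) {
            String key = cand.toLowerCase();
            if (previousModels.contains(key)) {
                out.get("prevModels").add(cand);
            } else if (termsWithCui.contains(key)) {
                out.get("cui").add(cand);
            } else {
                out.get("noCui").add(cand);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample terms, for illustration only.
        Map<String, List<String>> out = partition(
            List.of("give up", "blood clot", "zoom lens"),
            Set.of("give up"), Set.of("blood clot"));
        System.out.println(out);
        // Counts reported for this model: 141 + 756 + 3,722 = 4,619.
        System.out.println(141 + 756 + 3722);
    }
}
```

Because the buckets are disjoint, their sizes must add up to the input size, which matches the counts reported under the CUI filters (141 + 756 + 3,722 = 4,619).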
      • Results:
    • Model-5: Word candidates from synonyms in WordNet

      TBD

    • Model-6: Word candidates from antonyms in WordNet

      TBD

III. Processes

  • Source directory: ${LMW_DIR}/sources/LexCandidates
  • Data directory: ${LMW_DIR}/data/${YEAR}/outData/12.LexCandidates
  • Program: ${LMW_DIR}/bin/12.LexAbbAcrCand <YEAR>

IV. References

  • Miller, George A. (1995). WordNet: A Lexical Database for English. Communications of the ACM Vol. 38, No. 11: 39-41.
  • Fellbaum, Christiane (ed.) (1998). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
  • Fellbaum, Christiane (2005). WordNet and wordnets. In: Brown, Keith et al. (eds.), Encyclopedia of Language and Linguistics, Second Edition. Oxford: Elsevier, 665-670.
  • Finlayson, Mark Alan (2014). Java Libraries for Accessing the Princeton Wordnet: Comparison and Evaluation. In H. Orav, C. Fellbaum, & P. Vossen (eds.), Proceedings of the 7th International Global WordNet Conference (GWC 2014), pp. 78-85. Tartu, Estonia.