The SPECIALIST Lexicon

Lexicon Test - Establish the Gold Standard

Introduction

The SPECIALIST Lexicon is a suitable corpus for testing the spVar model. It includes spelling variants of base forms as well as inflectional spelling variants.

Model (GetGoldStdFromLex.java)

  • Inputs:
    • inflVars.data:
      inflVar|cat|infl|EUI|base|citation
    • LRSPL:
      EUI|SpVar|citation
    • inflSpVars.data:
      inflSpVar
  • Outputs:
    • goldStd.data
      inflVar|spVar tag

      where:

      • inflVar: lowercased inflVar, unique
      • spVar tag: true|false
    • Lex.terms.out (all terms from Lexicon)
  • Algorithm:
    • Go through inflVars.data
    • Tag true if the EUI is in the EUI set of base spVars (from LRSPL)
    • Tag true if the term is an inflSpVar (from inflSpVars.data)
    • If an inflVar exists in multiple lexRecords (EUIs), it is tagged true if at least one of them has spVars
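
The tagging steps above can be sketched roughly as follows. This is not the code of GetGoldStdFromLex.java; it is a minimal illustration under several assumptions: inflVars.data lines are pipe-delimited with the inflVar in the first field and the EUI in the fourth, the EUI set from LRSPL and the term set from inflSpVars.data have already been loaded, and the inflSpVars set is already lowercased. The class name GoldStdSketch and the method name tag are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the gold-standard tagging step (not the original program).
final class GoldStdSketch {

    // Assumed line layout: inflVar|cat|infl|EUI|base|citation
    private static final int INFL_VAR_FIELD = 0;
    private static final int EUI_FIELD = 3;

    /**
     * Returns a map from lowercased inflVar to its spVar tag.
     * baseSpVarEuis: EUIs that have base spelling variants (assumed to come from LRSPL).
     * inflSpVars: lowercased inflectional spelling variants (assumed to come from inflSpVars.data).
     */
    static Map<String, Boolean> tag(List<String> inflVarLines,
                                    Set<String> baseSpVarEuis,
                                    Set<String> inflSpVars) {
        Map<String, Boolean> goldStd = new LinkedHashMap<>();
        for (String line : inflVarLines) {
            String[] fields = line.split("\\|", -1);
            if (fields.length <= EUI_FIELD) {
                continue; // skip lines that do not match the assumed layout
            }
            String inflVar = fields[INFL_VAR_FIELD].toLowerCase(Locale.ROOT);
            String eui = fields[EUI_FIELD];

            // True if the record has base spVars or the term itself is an inflSpVar.
            boolean isSpVar = baseSpVarEuis.contains(eui) || inflSpVars.contains(inflVar);

            // An inflVar seen under several EUIs stays true once any record tags it true.
            goldStd.merge(inflVar, isSpVar, Boolean::logicalOr);
        }
        return goldStd;
    }
}
```

Keying the map on the lowercased inflVar keeps each output term unique, and merging with a logical OR covers the case where the same inflVar appears in multiple lexRecords.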

  • What is missing:
    The following spelling variants are missed by this program (false negatives). These missing spVars are not included in the gold standard (final submission) for the AIAM.2016 multiword paper.