The SPECIALIST Lexicon

Multiwords

I. Introduction

Many researchers recognize the importance of multiwords for NLP. Their main points are summarized below:

  • A multiword is defined as "idiosyncratic interpretations that cross word boundaries (or spaces)". New MWEs are added in new-technology and medical domains more often than new single words (Sag et al., 2002:2).
  • A multiword is also called a multiword expression (MWE).
  • A multiword is a word (it has a part of speech and a meaning) that includes a space.
  • The number of multiwords in a speaker's lexicon is of the same order of magnitude as the number of single words (Jackendoff, 1997: p. 156). This seems to be an underestimate for the biomedical domain.
  • Multiwords are used extensively in the biomedical domain.
  • Multiwords are an essential ingredient of, and play a key role in, the success of NLP tasks.
  • Multiwords constitute a key problem that must be resolved in order for linguistically precise NLP to succeed.
  • Precise recognition of word boundaries and identification of multiwords benefit disambiguation and improve the accuracy of information extraction.
  • MWEs are easily mastered by native speakers, but their interpretation poses a major challenge for computational systems because of their flexible and heterogeneous nature.
  • Morphology rules cannot be applied directly to MWEs (Green, 2013). Example: "part of speech" (plural "parts of speech").
  • The incorporation of MWE knowledge has been shown to improve task accuracy for a range of NLP applications, including dependency parsing (Nivre and Nilsson 2004), supertagging (Blunsom and Baldwin 2006), sentence generation (Hogan et al. 2007), machine translation (Carpuat and Diab 2010), and shallow parsing (Korkontzelos and Manandhar 2010). (Green, 2013)

  • More so than single words, MWEs are a critical problem in NLP.

II. Definition Of LexMultiword

Multiwords are words that happen to be spelled with a space. Single words are words without a space. The Lexicon includes both single words and multiwords. For example:

Single Word:
  • saw
  • ice-cream
  • clubfoot
  • club-foot

Multiword:
  • club foot
  • ice cream
  • hot dog
  • drop-foot gait
  • Horner's syndrome
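
The space criterion can be sketched in a few lines of Python. This is a minimal illustration, not part of the SPECIALIST Lexicon tools: the function name `is_multiword` is hypothetical, and the check assumes the term is already known to be a lexicon entry (a space alone does not turn a phrase into a multiword).

```python
def is_multiword(term: str) -> bool:
    """Return True if a lexicon entry is spelled with at least one space.

    Assumption: `term` is already a lexicon entry. Hyphens and apostrophes
    do not count as word separators; only whitespace does.
    """
    return len(term.split()) > 1


entries = ["saw", "ice-cream", "clubfoot", "club-foot",
           "club foot", "ice cream", "hot dog", "drop-foot gait",
           "Horner's syndrome"]

# Partition the example entries by the space criterion.
single_words = [e for e in entries if not is_multiword(e)]
multiwords = [e for e in entries if is_multiword(e)]
```

Under this criterion "ice-cream" and "club-foot" count as single words, while "ice cream" and "club foot" count as multiwords.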

The multiwords discussed in this project are those in the SPECIALIST Lexicon, called LexMultiwords. A LexMultiword must have:

  • Part of speech
  • Inflection
  • A special unit of lexical meaning by itself

III. LexMultiword and Phrase

  • A multiword is not a phrase. For example: "in house" is a multiword. "in the east" is a phrase.
  • A phrase is a group of words that makes sense, but not complete sense. It is a group of related words without a subject and a verb.
  • Multiwords are recorded in the Lexicon, whereas phrases are not.

  • About 30% of the SPECIALIST Lexicon consists of multiwords. There is a lot of room for multiwords to grow.
  • Many fundamental NLP tools (such as Lexical Tools and MetaMap) use the Lexicon as a corpus and as a source of term expressions (multiwords).
  • Many NLP tasks (such as ClinicalTrials.gov, machine translation, and information extraction) use the Lexicon and its applications.

IV. MWE Characteristics

  • Arbitrariness and Institutionalisation
    The arbitrary character is the most challenging property.
    salt and pepper ? pepper and salt [Smadja, 1993]
  • Frequency: 50-70% of the Lexicon [Jackendoff 1997, Krieger and Finatto 2004, Ramisch 2009]
  • Limited lexical, syntactic and semantic variability
    [Sag et al., 2002]
    • syntactic: extragrammaticality (by and large, kingdom come), lexicalization
    • semantic: non-compositionality, non-substitutability, no word-for-word translation, domain-specificity/idiomaticity

V. MWE Classifications (Sag 2002)

  • lexicalized phrases (have partially idiosyncratic syntax or semantics)
    • fixed phrases (fixed expressions)
      [by and large], [in short], [kingdom come], [every which way], [ad hoc], [Palo Alto], [rock 'n' roll], [in vitro]
    • semi-fixed expressions
      • non-Decomposable Idioms
        [kick the bucket], [trip the light fantastic], [shoot the breeze], [hot dog]
      • Compound Nominals
        • right-headed
          [car park]
        • left-headed
          [attorney general], [part of speech], [congressman at large], [pain in the neck]
      • Proper Names
        [San Francisco], [San Francisco 49ers]
    • syntactically-flexible expressions
      • verb particle constructions
        [look up], [write up], [brush up], [call up]
      • Decomposable Idioms (can have a compositional reading)
        [spill the beans], [let the cat out of the bag], [get the ball rolling], [sweep under the rug], [storm in a teacup], [beat around the bush], [radar footprint]
      • Light Verbs (support verbs) constructions
        [make a mistake], [give a demo]
  • Institutionalized phrases (simple decomposable - syntactically and semantically compositional)
    [traffic light], [kindle excitement], [motor car]
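
The right-headed and left-headed compound nominals listed above inflect on different words, which is one reason single-word morphology rules do not carry over directly to MWEs. The following is a rough sketch under simplifying assumptions: the helper `pluralize_compound` and its `head_index` parameter are hypothetical, the head position is supplied by hand rather than detected, and the plural rule is the naive "add s".

```python
def pluralize_compound(mwe: str, head_index: int) -> str:
    """Pluralize a compound nominal by inflecting only its head word.

    Assumptions: tokens are separated by single spaces, `head_index` is
    supplied by the caller, and the head takes a regular "-s" plural.
    """
    tokens = mwe.split(" ")
    tokens[head_index] = tokens[head_index] + "s"
    return " ".join(tokens)


# Right-headed: the last word is the head.
right = pluralize_compound("car park", -1)

# Left-headed: the first word is the head.
left = [pluralize_compound(m, 0)
        for m in ("attorney general", "part of speech", "pain in the neck")]
```

A real lexicon would store the inflected forms (or the head position) with each entry instead of relying on a rule like this.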

VI. MWE Morphosyntactic Classes (Typology)

  • Nominal Expression
    • Nominal compounds
      • [traffic light], [soup pot cover], [Russian roulette], [bulletproof vest], [degree of freedom], [part of speech], [dry run], [food for thought]
      • Generally denote a specific concept for which there is no equivalent single-word formulation.
      • Noun compounds - when the nominal compound is composed exclusively of nouns (Nakov 2013)
        [wine glass], [liver cell line], [olive oil], [laser printer], [telephone booth], [post office]
    • Proper names (denote a very specific named entity)
      Not all proper names are MWEs: [Google], [Paris]
      Entity recognition
      • city: [Porto Alegre]
      • institution: [United Nations]
      • Person: [Alan Turing]
    • Multiword terms (nominal compounds used in a specified domain to denote a specific concept)

    [big deal] - idiom
  • Verb Expression (compound verbs)
    • Phrasal verbs
      [agree with], [rely on], [give up], [take off], [bring about]
    • Light verbs (the verb does not contribute much meaning)
      [take a shower], [make a presentation], [take a nap]
  • Adverbial Expression
    [upside down], [second hand] - idiom, [at stake]
  • Adjectival Expression
    [on fire], [in the buff]

VII. MWE Computational Classifications (orthogonal types)

  • Fixed expression (word-with-spaces approach)
    [in short], [with respect to], [by and large], [kingdom come]
  • Idioms (non-compositional semantics, hard to identify without help of semantic resources)
    nominal expression: [dead end], [dry run], [bird brain]
    verbal expression: [put in place], [shoot the breeze], [spill the beans], [pull strings], [make a face]
    adjectival expression: [all ears], [on the same wavelength]
  • "true" collocations (institutionalised phrases that are fully compositional both syntactically and semantically, but co-occur more often than expected by chance)
    [traffic light]
    [strong coffee]
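
One common way to quantify "co-occurring more often than expected by chance" is pointwise mutual information (PMI), which compares the observed probability of a word pair with the product of the individual word probabilities. The source does not name a specific measure, so PMI is an assumption here, and the toy corpus and the function name `bigram_pmi` are purely illustrative.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI} for every adjacent word pair in `tokens`.

    PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) ), with probabilities
    estimated by relative frequency. Positive values mean the pair occurs
    more often than independence would predict.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy corpus, invented for illustration only.
corpus = ("the traffic light turned red and the traffic light turned green "
          "while the strong coffee cooled near the light switch").split()
pmi = bigram_pmi(corpus)
```

In this toy corpus `pmi[("traffic", "light")]` is higher than `pmi[("the", "light")]`, reflecting that "traffic light" is the stronger association. On real data, PMI is usually combined with a minimum frequency threshold because it overrates rare pairs.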

VIII. MWE Research

  • Acquisition (discovery) - finding new MWEs (lexBuilding)
  • Identification (detection) - annotating MWEs in applications
  • Interpretation - finding the semantic meaning
  • Disambiguation - finding the semantic meaning according to the text
    [English teachers], [look up the tower]
    • compositionally (literally)
      noun compounds
    • non-compositionally (figuratively)
      idioms
  • Applications
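
Identification can be approximated with the word-with-spaces view: treat each known MWE as a single lexical unit and greedily match the longest entry at each position. The sketch below is an illustration under stated assumptions, not the method used by any SPECIALIST tool; the `annotate_mwes` function, its `max_len` parameter, and the tiny lexicon are hypothetical.

```python
def annotate_mwes(tokens, mwe_lexicon, max_len=4):
    """Greedy longest-match annotation of known MWEs in a token list.

    Returns a list of (text, is_mwe) pairs. Matching is case-insensitive and
    exact, so inflected or discontinuous variants are missed.
    """
    known = {m.lower() for m in mwe_lexicon}
    result = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate.lower() in known:
                result.append((candidate, True))
                i += n
                matched = True
                break
        if not matched:
            result.append((tokens[i], False))
            i += 1
    return result


lexicon = ["by and large", "traffic light", "part of speech"]
sentence = "By and large the traffic light works".split()
annotated = annotate_mwes(sentence, lexicon)
```

Exact string matching of this kind cannot resolve the disambiguation cases listed above, such as [look up the tower], where the same word sequence may or may not be an MWE depending on context.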

IX. MWE Workshops

  • SIGLEX-MWE workshops, held in conjunction with conferences such as ACL, COLING, EACL, NAACL, and LREC.