Lexical Tools

Prefix Computer Programs

A set of computer programs retrieves prefix word|word pairs from the LEXICON and validates them for use as derivations. The programs are run annually for each lvg release.

  1. Get all base forms from LEXICON (inflVars.data)
    • Program: GetBaseForms.java
    • Input:./dataOrg/inflVars.data
    • Output:./data/bases.data
    • Descriptions
      • go through all lines (inflectional variants) in the file "inflVars.data"
      • retrieve the base forms (infl = 1)
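
A minimal sketch of this filtering step is shown below. It is not the code of GetBaseForms.java. It assumes that each line of inflVars.data is '|'-separated, that the inflected term is the first field, and that the inflection code is the third field. The class name, field positions, and sample lines are hypothetical.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Sketch only: field positions are assumptions, not taken from GetBaseForms.java.
class BaseFormSketch {
    static final int TERM_FIELD = 0; // assumed position of the inflected term
    static final int INFL_FIELD = 2; // assumed position of the inflection code

    // Returns the sorted, de-duplicated terms of all lines whose inflection code is "1".
    static Set<String> getBaseForms(List<String> lines) {
        Set<String> bases = new TreeSet<>();
        for (String line : lines) {
            // limit -1 keeps trailing empty fields
            String[] fields = line.split("\\|", -1);
            if (fields.length > INFL_FIELD && fields[INFL_FIELD].equals("1")) {
                bases.add(fields[TERM_FIELD]);
            }
        }
        return bases;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the assumed layout.
        List<String> sample = List.of(
            "operative|adj|1|E0000001|operative|operative",
            "operatives|noun|8|E0000002|operative|operative",
            "preoperative|adj|1|E0000003|preoperative|preoperative");
        for (String base : getBaseForms(sample)) {
            System.out.println(base);
        }
    }
}
```

In a real run the lines would be read from ./dataOrg/inflVars.data and the result written to ./data/bases.data.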

  2. Retrieve and validate prefix words|words
    • Program: GetPrefixWordsFromFile.java
    • Input:
      • ./dataOrg/prefix.data
      • ./data/bases.data
      • ./data/prefix.tag.data
    • Output:./data/prefixWords.meta.data
    • Descriptions
      • get prefixes from a file (./dataOrg/prefix.data)
      • get base forms from a file (./data/bases.data)
      • get prefix tags from a file (./data/prefix.tag.data)

      • Find all pairs of prefix words|words in LEXICON:
        • go through all prefixes from the sorted prefixes list
        • add a prefix word|word pair to prefixWordList if:
          • the prefix word is a base form
          • the word is also a base form
      • validate all pairs of prefix words|words in prefixWordList
        • go through all pairs of prefixWord|words in prefixWordList
          • print tag ("yes" or "no") to ./data/prefixWords.meta.data
          • print "tbd" if no tag found
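
The pairing and tagging logic above can be sketched as follows. This is an illustration, not the code of GetPrefixWordsFromFile.java. It assumes that prefixes are stored without a trailing hyphen, that the word is the prefix word with the prefix removed, that known tags are keyed by "prefixWord|word", and that each output line has the form "prefixWord|word|tag". All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch only: file formats and names are assumptions.
class PrefixPairSketch {
    // Finds "prefixWord|word" pairs where both the prefix word and the
    // remainder after stripping the prefix are base forms.
    static List<String> findPairs(Set<String> prefixes, Set<String> bases) {
        List<String> pairs = new ArrayList<>();
        Set<String> sortedBases = new TreeSet<>(bases);
        for (String prefix : new TreeSet<>(prefixes)) { // sorted prefix list
            for (String base : sortedBases) {
                if (base.length() > prefix.length() && base.startsWith(prefix)) {
                    String word = base.substring(prefix.length());
                    if (bases.contains(word)) {
                        pairs.add(base + "|" + word);
                    }
                }
            }
        }
        return pairs;
    }

    // Appends the known tag ("yes" or "no") to each pair, or "tbd" if none exists.
    static List<String> tagPairs(List<String> pairs, Map<String, String> tags) {
        List<String> tagged = new ArrayList<>();
        for (String pair : pairs) {
            tagged.add(pair + "|" + tags.getOrDefault(pair, "tbd"));
        }
        return tagged;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        Set<String> prefixes = Set.of("pre");
        Set<String> bases = Set.of("preoperative", "operative", "press");
        Map<String, String> tags = Map.of("preoperative|operative", "yes");
        for (String line : tagPairs(findPairs(prefixes, bases), tags)) {
            System.out.println(line);
        }
    }
}
```

The nested loop is quadratic in the worst case. A sorted set with range queries would scale better, but the simple form keeps the two membership conditions visible.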

  3. Generate various reports from ./data/prefixWords.meta.data by tag
    • Program: GeneratePrefixFiles.java
    • Input:
      • ./data/prefixWords.meta.data
    • Output:
      • ./data/prefix.tbd.data
      • ./data/prefixWords.data
      • ./data/prefix.newTag.data
    • Descriptions
      • go through all pairs of tagged prefixWord|words in prefixWords.meta.data
        • send all "tbd" tags to prefix.tbd.data
        • send all "yes" and "no" tags to prefix.newTag.data
        • send all "yes" tags to prefixWords.data
        • check for invalid tags
        • check all comment lines
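
The routing by tag can be sketched as below. This is not the code of GeneratePrefixFiles.java. It assumes that the tag is the last '|'-separated field of each line, that comment lines start with '#', and that lines are copied unchanged to the output files. The source does not say how comment lines are checked, so this sketch simply skips them. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: line format and comment handling are assumptions.
class TagSplitSketch {
    final List<String> tbd = new ArrayList<>();     // destined for prefix.tbd.data
    final List<String> newTag = new ArrayList<>();  // destined for prefix.newTag.data
    final List<String> yes = new ArrayList<>();     // destined for prefixWords.data
    final List<String> invalid = new ArrayList<>(); // lines with an unrecognized tag

    // Routes one line of prefixWords.meta.data according to its tag.
    void add(String line) {
        if (line.startsWith("#")) {
            return; // assumed comment marker; comments are skipped here
        }
        String tag = line.substring(line.lastIndexOf('|') + 1);
        switch (tag) {
            case "tbd":
                tbd.add(line);
                break;
            case "yes":
                yes.add(line);
                newTag.add(line);
                break;
            case "no":
                newTag.add(line);
                break;
            default:
                invalid.add(line);
        }
    }

    public static void main(String[] args) {
        // Made-up sample lines.
        TagSplitSketch split = new TagSplitSketch();
        split.add("preoperative|operative|yes");
        split.add("precision|cision|no");
        split.add("prepay|pay|tbd");
        System.out.println("yes=" + split.yes.size()
            + " newTag=" + split.newTag.size()
            + " tbd=" + split.tbd.size()
            + " invalid=" + split.invalid.size());
    }
}
```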

  4. Validate results:
    • Program: 2.GetPrefixWords
    • Input:
      • ./data/prefix.tag.data
      • ./data/prefix.newTag.data
      • ./data/prefixWords.data.new
    • Output:
      • ./data/prefix.tag.data.noComment.sort
      • ./data/prefix.newTag.data.all.sort
    • Descriptions
      • Remove all comment lines from prefix.tag.data
        • fgrep -v '#' prefix.tag.data > prefix.tag.data.noComment
        • sort -u prefix.tag.data.noComment > prefix.tag.data.noComment.sort
      • Combine the results with the new prefix words (to be added in a future release)
        • cat prefix.newTag.data prefixWords.data.new > prefix.newTag.data.all
        • sort -u prefix.newTag.data.all > prefix.newTag.data.all.sort
      • Compare the input and result tagged files
        • diff prefix.tag.data.noComment.sort prefix.newTag.data.all.sort > prefix.tag.diff
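
For readers who prefer to run the same check without the shell tools, the comparison can be approximated in Java. This is a sketch under assumptions, not part of the 2.GetPrefixWords script. It treats each file as a list of lines, drops every line containing '#' (as fgrep -v '#' does), and compares the two files as sorted sets. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Sketch only: an in-memory stand-in for fgrep -v, sort -u, and diff.
class TagDiffSketch {
    // Sorts and de-duplicates the lines, optionally dropping lines that contain '#'.
    static SortedSet<String> normalize(List<String> lines, boolean dropComments) {
        SortedSet<String> result = new TreeSet<>();
        for (String line : lines) {
            if (dropComments && line.contains("#")) {
                continue;
            }
            result.add(line);
        }
        return result;
    }

    // Returns the lines present in 'left' but missing from 'right'.
    static SortedSet<String> onlyIn(SortedSet<String> left, SortedSet<String> right) {
        SortedSet<String> difference = new TreeSet<>(left);
        difference.removeAll(right);
        return difference;
    }

    public static void main(String[] args) {
        // Made-up sample lines standing in for the two tagged files.
        SortedSet<String> oldTags = normalize(
            List.of("# reviewed tags", "preoperative|operative|yes"), true);
        SortedSet<String> newTags = normalize(
            List.of("preoperative|operative|yes", "prepay|pay|yes"), false);
        System.out.println("only in old: " + onlyIn(oldTags, newTags));
        System.out.println("only in new: " + onlyIn(newTags, oldTags));
    }
}
```

Both difference sets being empty corresponds to an empty prefix.tag.diff.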

  5. Usage for (future) releases:
    • update inflVars.data from new release of LEXICON
    • update prefix.data
    • update prefixWords.data.new (for new prefix words that are not in this release)

    • ./bin/1.GetBaseForms ${YEAR}
    • ./bin/2.GetPrefixWords ${YEAR}
      • Check the line count of prefix.tag.diff (should be 0)
      • prefixWords.data (to be added to derivations.data)
      • prefix.tbd.data (send to linguists for validations)