Lexical Tools

Derivations - Prefix

I. What are prefix derivations
A prefix is placed at the beginning of a base word to form another word. It usually changes the meaning but rarely changes the part of speech (exceptions include en- and be-).

II. Prefix list
We collected the most common prefixes for derivations from the following sources:

The derivational prefix list in the current Lexical Tools includes 149 unique prefixes and is subject to annual updates.
According to Merriam-Webster.com, a word element that is always and only used as a prefix or suffix is called a prefix or suffix; otherwise, it is a combining form. Both prefixes and combining forms are included in our prefix list.

III. Prefix derivation pairs in LEXICON

All base forms are retrieved from the inflectional variants list, keeping only entries whose inflection is base. These base forms include citations and spelling variants. Prefix derivation pairs are then retrieved by computer programs if both "prefix + base" and "base" exist. In lvg.2012, there are 114,902 prefix derivation pairs found in LEXICON for the 142 prefixes. Three types of prefix derivation pairs are found by this program, as shown in the following example:

Three types of prefix derivation pairs:
prefix: non

  • prefix: nonsignificant|significant
  • prefix and a dash: non-significant|significant
  • prefix and a space: non significant|significant
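The three surface forms above can be generated mechanically from a prefix and a base. The following is a minimal Python sketch for illustration only; the function name `prefix_variants` is hypothetical and is not part of Lexical Tools (whose programs are written in Java).

```python
def prefix_variants(prefix, base):
    """Return the three candidate surface forms of prefix + base."""
    return [
        prefix + base,        # prefix only:        nonsignificant
        prefix + "-" + base,  # prefix and a dash:  non-significant
        prefix + " " + base,  # prefix and a space: non significant
    ]


# Print each candidate in the "prefix+base|base" form used above
for form in prefix_variants("non", "significant"):
    print(form + "|significant")
```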

IV. Processes

  • Prepare input files
    • ${DERIVATIONS}/Prefix/data/${YEAR}/dataOrg/inflVars.data
      The latest inflVars.data from lexicon.${YEAR}
    • ${DERIVATIONS}/Prefix/data/${YEAR}/dataOrg/prefixList.data
      The list of all prefixes. The format is:
      prefix|meaning|examples|status
    • ${DERIVATIONS}/Prefix/data/${YEAR}/dataOrg/prefix.tag.txt
      A manually tagged file for prefix derivations. The baseline of this file is the previous year's tag file; the tagged prefix.tbd.data file is then added to it. The format of this file is:
      prefix|prefix+base|category-1|EUI-1|base|category-2|EUI-2|tag
      where tag: yes|no
    • ${DERIVATIONS}/Prefix/data/${YEAR}/dataOrg/prefix.new.data
      The list of all new prefix pairs from "prefix.tag.txt" that come from new lexRecords. This file is used to validate the program and results. The format is:
      prefix|prefix+base|category-1|EUI-1|base|category-2|EUI-2|tag
    • ${DERIVATIONS}/Prefix/data/${YEAR}/dataOrg/LEXICON
      The latest LEXICON from lexicon.${YEAR}. This file is used to check and analyze the results.
  • Run the program
    shell>cd ${DERIVATIONS}/Prefix/bin
    shell>GetPrefixD ${YEAR}
    10

    The following iterative steps are needed:
    • send the "prefix.tbd.data" file to linguists to tag
      • derivation tag: tag yes|no for each prefixD pair
      • negation tag: tag N|O if valid dPair for class-B prefixes (a-, an-, de-, dys-, in-, under-)
    • add tagged "prefix.tbd.data" to "prefix.tag.txt"
    • update "prefix.new.data" if new lexRecord is added
    • rerun the program until:
      • no error in step 5
      • no difference in step 6

  • Process overview

V. Program Details (GetPrefixD)

  1. Generate bases of prefix derivations from LEXICON (inflVars.data)
    • Descriptions:
      Retrieve all legal base forms (base and spelling variants) from LEXICON. By definition, the inflection must be base (=1).
    • Input files:
      • inflVars.data: all bases (citations and spelling variants) for prefix derivations
    • Output files:
      • bases.data: all legal bases for prefix derivations
        base|category|inflection (1)|EUI
    • Associated Java files:
      • GetBaseForms.java
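The filtering in step 1 can be sketched as follows. This is an illustrative Python sketch, not the actual GetBaseForms.java; the function name `get_base_forms`, the assumed input layout (term|category|inflection|EUI as the first four fields), and the sample records with placeholder EUIs are all assumptions.

```python
def get_base_forms(lines):
    """Keep only records whose inflection field is base (1).

    Assumed input layout (hypothetical): term|category|inflection|EUI[|...]
    Output layout: base|category|inflection|EUI
    """
    bases = []
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 4:
            continue  # skip malformed lines
        term, category, inflection, eui = fields[:4]
        if inflection == "1":  # by definition, base forms have inflection 1
            bases.append("|".join([term, category, inflection, eui]))
    return bases


# Made-up sample records; the EUIs and the non-base inflection code are placeholders
sample = [
    "cat|noun|1|E0000001",
    "cats|noun|8|E0000001",
    "malformed line",
]
print(get_base_forms(sample))
```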
  2. Retrieve possible prefix derivation pairs from base list
    • Descriptions:
      Retrieve all possible derivation pairs from legal bases. A prefix pair is retrieved if it matches the pattern prefix+base and both "prefix+base" and "base" exist in LEXICON.
      Please see step 7 for extra options.
    • Input files:
      • bases.data: all legal bases for prefix derivations
      • prefixList.data: prefixes list
    • Output files:
      • prefixD.raw.data: raw data of possible prefix derivation pairs
        prefix|prefix+base|category-1|EUI-1|base|category-2|EUI-2
      • prefixD.rawNo.data: the number of prefix pairs found for each prefix, sorted in descending order. This file is used for analysis.
    • Associated Java files:
      • GetPrefixFromBaseFile.java
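The retrieval rule in step 2 (a pair is kept when both "prefix+base" and "base" are legal bases) can be sketched as below. This is a Python illustration under stated assumptions, not the actual GetPrefixFromBaseFile.java: the function name `get_prefix_pairs` is hypothetical, the input records are assumed to follow the four-field bases.data layout, and the sample EUIs are placeholders.

```python
from collections import defaultdict


def get_prefix_pairs(base_records, prefixes):
    """Return raw candidate pairs as
    prefix|prefix+base|category-1|EUI-1|base|category-2|EUI-2."""
    # Index base records by term; a term may have several (category, EUI) entries
    by_term = defaultdict(list)
    for record in base_records:
        term, category, _inflection, eui = record.split("|")
        by_term[term].append((category, eui))

    pairs = []
    for term, entries in by_term.items():
        for prefix in prefixes:
            # Three types: prefix, prefix and a dash, prefix and a space
            for sep in ("", "-", " "):
                head = prefix + sep
                if not term.startswith(head):
                    continue
                base = term[len(head):]
                if base not in by_term:
                    continue  # "base" must also exist as a legal base
                for cat1, eui1 in entries:
                    for cat2, eui2 in by_term[base]:
                        pairs.append("|".join(
                            [prefix, term, cat1, eui1, base, cat2, eui2]))
    return sorted(pairs)


# Made-up records with placeholder EUIs
records = [
    "significant|adj|1|E0000001",
    "nonsignificant|adj|1|E0000002",
    "non-significant|adj|1|E0000003",
]
for pair in get_prefix_pairs(records, ["non"]):
    print(pair)
```

Note that this sketch pairs every category of "prefix+base" with every category of "base"; whether the real program restricts categories is not stated here.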
  3. Get prefix derivations meta tagged file
    • Descriptions:
      Go through all pairs in "prefixD.raw.data" and add tag information (from prefixD.tag.txt):
      • yes: if tagged as "yes" in prefixD.tag.txt
      • no: if tagged as "no" in prefixD.tag.txt
      • tbd: if not tagged in prefixD.tag.txt

      Please note that not all prefix derivation pairs retrieved from LEXICON (step 2) are valid derivation pairs. To validate derivational variants, we define an eight-field (pipe-separated) format for tagging the prefix derivation pairs:

      prefix|prefix+base|category-1|EUI-1|base|category-2|EUI-2|tag

      Examples

      an|ana|adv|E0008740|a|noun|E0598106|no
      an|anaplastic|adj|aplastic|adj|no
      ana|anabiotic|adj|E0008744|biotic|adj|E0013104|no
      

      The first line is not a valid derivational pair because "ana" is obviously not derived from "a". The second line ("anaplastic" and "aplastic") is not a valid derivational pair either. The correct one should be:

      ana|anaplastic|adj|E0008830|plastic|adj|E0048247|yes
      

      The third line is not a valid derivational pair because "anabiotic" is derived from "anabiosis".

      To ensure high accuracy of derivations, experienced domain experts (linguists) validate all prefix derivation pairs retrieved from LEXICON.

    • Input files:
      • prefixD.raw.data: raw data of possible prefix derivation pairs
      • prefixD.tag.txt: tag file of prefix derivation pairs
        prefix|prefix+base|category-1|EUI-1|base|category-2|EUI-2|tag
    • Output files:
      • prefixD.meta.data: meta file of tagged prefix derivation pairs
        prefix|prefix+base|category-1|EUI-1|base|category-2|EUI-2|tag
    • Associated Java files:
      • GetPrefixMetaFile.java
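The tagging in step 3 amounts to a lookup of each raw pair in the tag file, with "tbd" as the default. The following is a minimal Python sketch, not the actual GetPrefixMetaFile.java; the function name `add_tags` is hypothetical, and the sample lines reuse the examples above (the first pair is left untagged here purely to show the "tbd" case).

```python
def add_tags(raw_pairs, tag_lines):
    """Append yes|no|tbd to each seven-field raw pair.

    tag_lines use the eight-field format:
    prefix|prefix+base|category-1|EUI-1|base|category-2|EUI-2|tag
    """
    tags = {}
    for line in tag_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        key, _, tag = line.rpartition("|")  # split off the last field
        tags[key] = tag
    # Pairs that are not in the tag file are marked "tbd"
    return [pair + "|" + tags.get(pair, "tbd") for pair in raw_pairs]


raw = [
    "an|ana|adv|E0008740|a|noun|E0598106",
    "ana|anaplastic|adj|E0008830|plastic|adj|E0048247",
    "ana|anabiotic|adj|E0008744|biotic|adj|E0013104",
]
tagged = [
    "# comment line",
    "ana|anaplastic|adj|E0008830|plastic|adj|E0048247|yes",
    "ana|anabiotic|adj|E0008744|biotic|adj|E0013104|no",
]
for line in add_tags(raw, tagged):
    print(line)
```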
  4. Split prefix derivations meta file
    • Descriptions:
      Split "prefixD.meta.data" into the following files according to the tag:
      • prefixD.yes.data: if tag is "yes", prefix & tag removed
      • prefixD.no.data: if tag is "no", prefix & tag removed
      • prefixD.tbd.data: if tag is "tbd", prefix & tag removed
      • prefixD.yesNo.data: if tag is "yes" or "no", prefix & tag kept
      • prefixD.tbt.data: if tag is "tbd" and the prefix is an existing prefix (not TBD), tag removed, prefix kept (this file is sent to linguists for tagging)
    • Input files:
      • prefixD.meta.data: meta file of tagged prefix derivation pairs
      • prefixList.data: prefixes list
    • Output files:
      • prefixD.yes.data: valid prefixD pairs
        prefix+base|category-1|EUI-1|base|category-2|EUI-2
      • prefixD.no.data: not used, just for reference
      • prefixD.tbd.data: prefixD does not have a tag
      • prefixD.tbt.data: need to tag this file, add to prefix.tag.txt, and rerun the program
      • prefixD.yesNo.data: used for validating the results (step 6)
        prefix|prefix+base|category-1|EUI-1|base|category-2|EUI-2|tag
    • Associated Java files:
      • SplitMetaFile.java
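The split in step 4 can be sketched as below. This is an illustrative Python sketch rather than the actual SplitMetaFile.java; the function name `split_meta` and the `active_prefixes` parameter (standing in for "prefixes whose status in prefixList.data is not TBD") are assumptions, and the third sample line is marked "tbd" only for demonstration.

```python
def split_meta(meta_lines, active_prefixes):
    """Split eight-field meta lines into the five output lists of step 4."""
    out = {"yes": [], "no": [], "tbd": [], "yesNo": [], "tbt": []}
    for line in meta_lines:
        fields = line.split("|")
        prefix, tag = fields[0], fields[-1]
        pair = "|".join(fields[1:-1])  # prefix & tag removed
        if tag in ("yes", "no"):
            out[tag].append(pair)
            out["yesNo"].append(line)  # prefix & tag kept
        elif tag == "tbd":
            out["tbd"].append(pair)
            if prefix in active_prefixes:
                # tag removed, prefix kept: to be tagged by linguists
                out["tbt"].append("|".join(fields[:-1]))
    return out


meta_sample = [
    "ana|anaplastic|adj|E0008830|plastic|adj|E0048247|yes",
    "ana|anabiotic|adj|E0008744|biotic|adj|E0013104|no",
    "an|ana|adv|E0008740|a|noun|E0598106|tbd",
]
files = split_meta(meta_sample, active_prefixes={"an", "ana"})
for name, lines in files.items():
    print(name, lines)
```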
  5. Add negation tag
    • Descriptions:
      Add negation tag (N|O) to all prefixD:
      • Auto-tag N|O for class-N & class-O prefixes
      • Get negation tag from (prefixD.tag.txt) for class-B prefixes
      • Make sure no prefixD has negation tag as B at the end
    • Input files:
      • prefixD.yes.data: valid prefixD pairs
      • prefixList.data: prefixes (with class-N, class-O, & class-B information)
      • prefixD.tag.txt: negation tag for class-B prefixD pairs
    • Output files:
      • prefixD.yes.data.${YEAR}: valid prefixD pairs with negation tag
        prefix+base|category-1|EUI-1|base|category-2|EUI-2|negation tag
    • Associated Java files:
      • AddNegationTagToFile.java
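The negation tagging rule in step 5 can be sketched as follows. This is a hedged Python illustration, not the actual AddNegationTagToFile.java: the function name `negation_tag`, the way the prefix is passed in, and the sample class assignments for "non" and "re" are assumptions (only the class-B prefixes are listed in this document).

```python
def negation_tag(prefix, pair, prefix_class, manual_tags):
    """Return the negation tag (N or O) for one valid prefixD pair.

    prefix_class: prefix -> "N", "O", or "B" (assumed to come from prefixList.data)
    manual_tags:  pair -> "N" or "O", tagged by linguists for class-B prefixes
    """
    cls = prefix_class[prefix]
    if cls in ("N", "O"):
        return cls  # auto-tag for class-N and class-O prefixes
    tag = manual_tags.get(pair)
    if tag not in ("N", "O"):
        # No pair may keep "B" at the end, so a missing manual tag is an error
        raise ValueError("class-B pair still needs a negation tag: " + pair)
    return tag


# Illustrative class assignments; "de" is class-B per the list above,
# the classes given for "non" and "re" are assumptions
classes = {"non": "N", "re": "O", "de": "B"}
manual = {"decode|code": "N"}
print(negation_tag("non", "nonsignificant|significant", classes, manual))
print(negation_tag("de", "decode|code", classes, manual))
```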
  6. Check difference on original and result tag files
    • Descriptions:
      Compare the resulting tagged file with the original tagged file:
      • original tagged file:
        • prefix.tag.txt
        • remove comment lines (line starts with #)
        • uSort the file (unify and sort)
      • resulting tagged file:
        • prefixD.yesNo.data: tagged prefix derivation pairs that are in current LEXICON
        • prefixD.new.data: prefix derivation pairs that are not in current LEXICON
        • uSort the file (unify and sort)
      • above two files should be the same
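The comparison in step 6 can be sketched as below. This is a minimal Python sketch under assumptions: the function names `usort` and `diff_tag_files` are hypothetical, and comment removal is applied to both sides for simplicity.

```python
def usort(lines):
    """Drop blank and comment lines (#), remove duplicates, and sort."""
    kept = {line.strip() for line in lines
            if line.strip() and not line.lstrip().startswith("#")}
    return sorted(kept)


def diff_tag_files(original, yes_no, new):
    """Return (only_in_original, only_in_result); both are empty when the files agree."""
    a = set(usort(original))
    b = set(usort(yes_no + new))
    return sorted(a - b), sorted(b - a)


original = [
    "# tag file",
    "ana|anaplastic|adj|E0008830|plastic|adj|E0048247|yes",
    "ana|anabiotic|adj|E0008744|biotic|adj|E0013104|no",
    "ana|anabiotic|adj|E0008744|biotic|adj|E0013104|no",  # duplicate line
]
yes_no = ["ana|anaplastic|adj|E0008830|plastic|adj|E0048247|yes"]
new = ["ana|anabiotic|adj|E0008744|biotic|adj|E0013104|no"]
print(diff_tag_files(original, yes_no, new))
```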
  7. Retrieve possible prefix derivation pairs from base list with options
    • Descriptions:
      This is the same process as step 2 with four options:
      • all: same as step 2. Retrieve all possible derivation pairs from legal bases. A prefix pair is retrieved if it matches the pattern prefix+base and both "prefix+base" and "base" exist in LEXICON
      • tbd: retrieve all untagged possible prefix derivation pairs (the tag is "tbd")
      • done: retrieve all tagged possible prefix derivation pairs (the tag is "yes" or "no")
      • prefix: retrieve all untagged possible prefix derivation pairs by specifying the "prefix"
  8. Analyze the number of prefix derivation pairs
    • Descriptions:
      Compute the following statistics on prefixD pairs:
      • Total yes (number & %)
      • Total no (number & %)
      • Total TBD (number & %)

      These are reported for all prefixes together and for each type ([prefix], [-prefix], [ prefix]).
    • Input files:
      • prefixD.meta.data: meta file of tagged prefix derivation pairs
    • Output files:
      • prefixD.tagNo.rpt: analysis report file
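The counts and percentages in step 8 can be sketched as below. This is an illustrative Python sketch, not the actual analysis program; the function names `pair_type`, `tag_stats`, and `percent`, and the group labels used as dictionary keys, are assumptions.

```python
from collections import Counter


def pair_type(prefix, derived, base):
    """Classify a pair as prefix only, prefix and a dash, or prefix and a space."""
    if derived == prefix + base:
        return "prefix"
    if derived == prefix + "-" + base:
        return "dash"
    if derived == prefix + " " + base:
        return "space"
    return "other"


def tag_stats(meta_lines):
    """Count tags overall ("all") and per pair type from eight-field meta lines."""
    counts = Counter()
    for line in meta_lines:
        f = line.split("|")
        prefix, derived, base, tag = f[0], f[1], f[4], f[-1]
        counts[("all", tag)] += 1
        counts[(pair_type(prefix, derived, base), tag)] += 1
    return counts


def percent(counts, group, tag):
    """Percentage of one tag within a group; 0.0 for an empty group."""
    total = sum(n for (g, _), n in counts.items() if g == group)
    return 100.0 * counts[(group, tag)] / total if total else 0.0


meta_lines = [
    "ana|anaplastic|adj|E0008830|plastic|adj|E0048247|yes",
    "ana|anabiotic|adj|E0008744|biotic|adj|E0013104|no",
]
stats = tag_stats(meta_lines)
print(stats[("all", "yes")], percent(stats, "all", "yes"))
```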
  9. Analyze prefix derivation pairs source
    • Descriptions:
      Analyze valid prefix derivation pairs based on:
      • the pattern of prefix+base|base
      • whether the two categories differ
      • whether it is an abbreviation
      • whether it is an acronym
    • Input files:
      • prefixD.yes.data: valid prefix derivation pairs
      • LEXICON: the latest LEXICON
    • Output files:
      • prefixD.analyze.rpt: analysis report file