Lexical Tools

Derivations - Prefix

I. What are prefix derivations
A prefix is placed at the beginning of a base word to form another word. It usually changes the meaning of the base word but rarely changes its part of speech (exceptions include en- and be-).

II. Prefix list
We collected the most common prefixes for derivations from the following sources:

Wikipedia - English prefixes
English Club - prefixes
About.com - Common Prefix in English
Prefix work by Bienvenido Ortiz in 2003 Summer
Merriam-webster.com

The derivational prefix list in the current Lexical Tools includes 149 unique prefixes and is subject to annual updates.
According to Merriam-webster.com, a word element that is always and only used as a prefix or suffix is called a prefix or suffix. Otherwise, it is a combining form. Both prefixes and combining forms are included in our prefix list.

III. Prefix derivation pairs in LEXICON

All base forms are retrieved from the inflectional variants list where the inflection is base. These base forms include citations and spelling variants. Prefix derivation pairs are then retrieved by computer programs if both "prefix + base" and "base" exist. In lvg.2012, there are 114,902 prefix derivation pairs found in LEXICON for the 142 prefixes. Three types of prefix derivation pairs are found by this program, as shown in the following example:

Three types of prefix derivation pair:
prefix: non

prefix: nonsignificant|significant
prefix and a dash: non-significant|significant
prefix and a space: non significant|significant
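
The three surface forms above can be generated mechanically from a prefix and a base. The following Python sketch is illustrative only; the function name `candidate_forms` is hypothetical and is not part of Lexical Tools.

```python
def candidate_forms(prefix, base):
    """Return the three surface forms considered for a prefix and a base."""
    return {
        "prefix": prefix + base,                    # nonsignificant
        "prefix and a dash": prefix + "-" + base,   # non-significant
        "prefix and a space": prefix + " " + base,  # non significant
    }


# Reproduce the "non" example from the list above.
for kind, term in candidate_forms("non", "significant").items():
    print(f"{kind}: {term}|significant")
```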

IV. Processes

Prepare input files
- ```
${DERIVATIONS}/Prefix/data/${YEAR}/dataOrg/inflVars.data
```
  The latest inflVars.data from lexicon.${YEAR}
- ```
${DERIVATIONS}/Prefix/data/${YEAR}/dataOrg/prefixList.data
```
  The list of all prefixes. The format is:
  
  prefix meaning examples status
- ```
${DERIVATIONS}/Prefix/data/${YEAR}/dataOrg/prefix.tag.txt
```
  A manually tagged file for prefix derivations. The baseline of this file is the previous year's tag file, to which the tagged prefix.tbd.data file is then added. The format of this file is:
  
  prefix prefix+base category-1 EUI-1 base category-2 EUI-2 tag
  where tag: yes|no
- ```
${DERIVATIONS}/Prefix/data/${YEAR}/dataOrg/prefix.new.data
```
  The list of all new prefix pairs from "prefix.tag.txt" that are from new lexRecords. This file is used to validate the program and results. The format is:
  
  prefix prefix+base category-1 EUI-1 base category-2 EUI-2 tag
- ```
${DERIVATIONS}/Prefix/data/${YEAR}/dataOrg/LEXICON
```
  The latest LEXICON from lexicon.${YEAR}. This file is used to check and analyze the results.
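
The tag file record described above can be read into named fields before further processing. The sketch below assumes the eight fields are pipe-separated, as in the tagged examples later in this document; `TagRecord` and `parse_tag_line` are hypothetical names, not part of the GetPrefixD program.

```python
from typing import NamedTuple


class TagRecord(NamedTuple):
    prefix: str
    derived: str       # prefix+base
    derived_cat: str   # category-1
    derived_eui: str   # EUI-1
    base: str
    base_cat: str      # category-2
    base_eui: str      # EUI-2
    tag: str           # yes|no


def parse_tag_line(line):
    """Split one pipe-separated tag line and check its shape."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
    record = TagRecord(*fields)
    if record.tag not in ("yes", "no"):
        raise ValueError(f"tag must be yes or no: {line!r}")
    return record


record = parse_tag_line("ana|anaplastic|adj|E0008830|plastic|adj|E0048247|yes")
print(record.prefix, record.derived, record.base, record.tag)
```

Rejecting lines with the wrong field count early is a design choice of this sketch; it keeps malformed records out of the later comparison steps.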
Run the program
```
shell>cd ${DERIVATIONS}/Prefix/bin
shell>GetPrefixD ${YEAR}
10
```
The following iterative steps are needed:
- send the "prefix.tbd.data" file to linguists to tag
  - derivation tag: tag yes|no for each prefixD pair
  - negation tag: tag N|O for each valid dPair of a class-B prefix (a-, an-, de-, dys-, in-, under-)
- add tagged "prefix.tbd.data" to "prefix.tag.txt"
- update "prefix.new.data" if new lexRecord is added
- rerun the program until:
  - no error in step 5
  - no difference in step 6
Process overview

V. Program Details (GetPrefixD)

Generate bases of prefix derivations from LEXICON (inflVars.data)
- Descriptions:
  Retrieve all legal base forms (base and spelling variants) from LEXICON. By definition, the inflection must be base (=1).
- Input files:
  - inflVars.data: all bases (citations and spelling variants) for prefix derivations
- Output files:
  - bases.data: all legal bases for prefix derivations
    
    base category inflection (1) EUI
- Associated Java files:
  - GetBaseForms.java
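
As a rough illustration of this step, the sketch below keeps only records whose inflection is base. It assumes each input line starts with four pipe-separated fields (term, category, inflection, EUI) and that base is coded as `1`; the real layout of inflVars.data may differ, and `get_base_forms` is a hypothetical name, not the logic of GetBaseForms.java.

```python
def get_base_forms(lines):
    """Keep (term, category, inflection, EUI) records whose inflection is base."""
    bases = []
    for line in lines:
        fields = line.strip().split("|")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        term, category, inflection, eui = fields[:4]
        if inflection == "1":  # assumed code for "base"
            bases.append((term, category, inflection, eui))
    return bases


sample = [
    "plastic|adj|1|E0048247",
    "anaplastic|adj|1|E0008830",
    # Hypothetical non-base record: placeholder inflection code and EUI.
    "plastics|noun|8|E0000000",
]
for record in get_base_forms(sample):
    print("|".join(record))
```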
Retrieve possible prefix derivation pairs from base list
- Descriptions:
  Retrieve all possible derivation pairs from the legal bases. A pair is retrieved when it matches the pattern prefix+base and both prefix+base and base exist in LEXICON.
  See step 7 for extra options.
- Input files:
  - bases.data: all legal bases for prefix derivations
  - prefixList.data: prefixes list
- Output files:
  - prefixD.raw.data: raw data of possible prefix derivation pairs
    
    prefix prefix+base category-1 EUI-1 base category-2 EUI-2
  - prefixD.rawNo.data: the distribution of the number of prefix pairs found, sorted in descending order. This file is used for analysis.
- Associated Java files:
  - GetPrefixFromBaseFile.java
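
The retrieval in this step can be pictured as a lookup over the base list. The sketch below is a simplified assumption of how such a search might work, not the logic of GetPrefixFromBaseFile.java: it checks the plain, dash, and space forms and pairs every matching entry regardless of category. The name `find_prefix_pairs` is hypothetical.

```python
from collections import defaultdict


def find_prefix_pairs(prefixes, bases):
    """bases: iterable of (term, category, EUI) tuples.

    Returns 7-field tuples:
    (prefix, prefix+base, category-1, EUI-1, base, category-2, EUI-2).
    """
    by_term = defaultdict(list)
    for term, category, eui in bases:
        by_term[term].append((category, eui))

    pairs = []
    for prefix in prefixes:
        for term, derived_entries in by_term.items():
            for joiner in ("", "-", " "):
                head = prefix + joiner
                if not term.startswith(head):
                    continue
                base = term[len(head):]
                if base not in by_term:
                    continue  # the remainder is not a known base
                for cat1, eui1 in derived_entries:
                    for cat2, eui2 in by_term[base]:
                        pairs.append((prefix, term, cat1, eui1, base, cat2, eui2))
    return pairs


bases = [
    ("ana", "adv", "E0008740"),
    ("a", "noun", "E0598106"),
    ("anabiotic", "adj", "E0008744"),
    ("biotic", "adj", "E0013104"),
]
for pair in find_prefix_pairs(["an", "ana"], bases):
    print("|".join(pair))
```

A purely mechanical search like this also returns pairs such as `ana` and `a`, which is why the retrieved pairs are treated only as candidates.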
Get prefix derivations meta tagged file
- Descriptions:
  Go through all pairs in "prefixD.raw.data" and add tag information (from prefix.tag.txt):
  - yes: if tagged as "yes" in prefixD.tag.txt
  - no: if tagged as "no" in prefixD.tag.txt
  - tbd: if not tagged in prefixD.tag.txt
  Please note that not all prefix derivation pairs retrieved from LEXICON (step 2) are valid derivation pairs. We define an eight-field (pipe-separated) format for tagging the prefix derivation pairs to validate derivational variants:
  
  prefix prefix+base category-1 EUI-1 base category-2 EUI-2 tag
  
  Examples
```
an|ana|adv|E0008740|a|noun|E0598106|no
an|anaplastic|adj|aplastic|adj|no
ana|anabiotic|adj|E0008744|biotic|adj|E0013104|no
```
  The first line is not a valid derivational pair because "ana" is clearly not derived from "a". The second line is not a valid derivational pair either ("anaplastic" is not derived from "aplastic"). The correct pair is:
```
ana|anaplastic|adj|E0008830|plastic|adj|E0048247|yes
```
  The third line is not a valid derivational pair because "anabiotic" is derived from "anabiosis".
  To ensure high accuracy of derivations, experienced domain experts (linguists) validate all prefix derivation pairs retrieved from LEXICON.
- Input files:
  - prefixD.raw.data: raw data of possible prefix derivation pairs
  - prefixD.tag.txt: tag file of prefix derivation pairs
    
    prefix prefix+base category-1 EUI-1 base category-2 EUI-2 tag
- Output files:
  - prefixD.meta.data: meta file of tagged prefix derivation pairs
    
    prefix prefix+base category-1 EUI-1 base category-2 EUI-2 tag
- Associated Java files:
  - GetPrefixMetaFile.java
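
The tagging rule of this step (yes, no, or tbd) amounts to a dictionary lookup keyed on the seven pair fields. The sketch below assumes pipe-separated tag lines and comment lines starting with `#`; `tag_pairs` is a hypothetical name, not the logic of GetPrefixMetaFile.java.

```python
def tag_pairs(raw_pairs, tag_lines):
    """Append yes|no|tbd to each 7-field raw pair tuple."""
    tags = {}
    for line in tag_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        *pair, tag = line.split("|")
        tags[tuple(pair)] = tag
    # Pairs that are missing from the tag file are marked "tbd".
    return [pair + (tags.get(pair, "tbd"),) for pair in raw_pairs]


raw_pairs = [
    ("an", "ana", "adv", "E0008740", "a", "noun", "E0598106"),
    ("ana", "anabiotic", "adj", "E0008744", "biotic", "adj", "E0013104"),
    ("ana", "anaplastic", "adj", "E0008830", "plastic", "adj", "E0048247"),
]
tag_lines = [
    "# the anabiotic pair is left untagged here to show the tbd case",
    "an|ana|adv|E0008740|a|noun|E0598106|no",
    "ana|anaplastic|adj|E0008830|plastic|adj|E0048247|yes",
]
for record in tag_pairs(raw_pairs, tag_lines):
    print("|".join(record))
```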
Split prefix derivations meta file
- Descriptions:
  Split "prefixD.meta.data" into five files according to the tag:
  - prefixD.yes.data: if tag is "yes", prefix & tag removed
  - prefixD.no.data: if tag is "no", prefix & tag removed
  - prefixD.tbd.data: if tag is "tbd", prefix & tag removed
  - prefixD.yesNo.data: if tag is "yes" or "no", keep prefix & tag
  - prefixD.tbt.data: if tag is "tbd" and the prefix is an existing prefix (not TBD), tag removed, keep prefix (this file is sent to linguists for tagging)
- Input files:
  - prefixD.meta.data: meta file of tagged prefix derivation pairs
  - prefixList.data: prefixes list
- Output files:
  - prefixD.yes.data: valid prefixD pairs
    
    prefix+base category-1 EUI-1 base category-2 EUI-2
  - prefixD.no.data: not used, just for reference
  - prefixD.tbd.data: prefixD pairs that do not have a tag
  - prefixD.tbt.data: the file to be tagged, added to prefix.tag.txt, and followed by a rerun of the program
  - prefixD.yesNo.data: used for validating the results (step 6)
    
    prefix prefix+base category-1 EUI-1 base category-2 EUI-2 tag
- Associated Java files:
  - SplitMetaFile.java
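
The split can be sketched as a single pass over the meta records. This is a minimal illustration, not the logic of SplitMetaFile.java; it assumes 8-field tuples and reads "prefix is existing (not TBD)" as membership in a set of confirmed prefixes. The names `split_meta` and `known_prefixes` are hypothetical.

```python
def split_meta(meta_records, known_prefixes):
    """Distribute 8-field meta records into the five output groups."""
    out = {"yes": [], "no": [], "tbd": [], "yesNo": [], "tbt": []}
    for record in meta_records:
        prefix, *pair, tag = record
        if tag not in ("yes", "no", "tbd"):
            raise ValueError(f"unexpected tag: {record}")
        out[tag].append(tuple(pair))           # prefix & tag removed
        if tag in ("yes", "no"):
            out["yesNo"].append(tuple(record))  # keep prefix & tag
        elif prefix in known_prefixes:
            out["tbt"].append((prefix, *pair))  # keep prefix, tag removed
    return out


meta = [
    ("an", "ana", "adv", "E0008740", "a", "noun", "E0598106", "no"),
    ("ana", "anaplastic", "adj", "E0008830", "plastic", "adj", "E0048247", "yes"),
    ("ana", "anabiotic", "adj", "E0008744", "biotic", "adj", "E0013104", "tbd"),
]
groups = split_meta(meta, known_prefixes={"an", "ana"})
for name, records in groups.items():
    print(name, len(records))
```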
Add negation tag
- Descriptions:
  Add negation tag (N|O) to all prefixD:
  - Auto-tag N|O for class-N & class-O prefixes
  - Get negation tag from (prefixD.tag.txt) for class-B prefixes
  - Make sure no prefixD has negation tag as B at the end
- Input files:
  - prefixD.yes.data: valid prefixD pairs
  - prefixList.data: prefixes (with class-N, class-O, & class-B information)
  - prefixD.tag.txt: negation tag for class-B prefixD pairs
- Output files:
  - prefixD.yes.data.${YEAR}: valid prefixD pairs with negation tag
    
    prefix+base category-1 EUI-1 base category-2 EUI-2 negation tag
- Associated Java files:
  - AddNegationTagToFile.java
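
The three rules of this step can be sketched as follows. The example is a simplified assumption: pairs are reduced to (prefix, prefix+base, base), the prefix is carried along so that its class can be looked up, and the class assignments for "non" and the manual tag value are made up for illustration (only "de" is named as class-B in this document). `add_negation_tags` is a hypothetical name, not the logic of AddNegationTagToFile.java.

```python
def add_negation_tags(pairs, prefix_class, manual_tags):
    """Append a negation tag (N or O) to every pair.

    prefix_class maps prefix -> "N", "O", or "B".
    manual_tags maps a pair -> "N" or "O" for class-B prefixes.
    """
    tagged = []
    for pair in pairs:
        cls = prefix_class[pair[0]]
        if cls in ("N", "O"):
            negation = cls                    # auto-tag class-N and class-O
        else:
            negation = manual_tags.get(pair)  # class-B needs a manual tag
        if negation not in ("N", "O"):
            # No pair may be left with B (or nothing) at the end.
            raise ValueError(f"unresolved negation tag for {pair}")
        tagged.append(pair + (negation,))
    return tagged


# Illustrative inputs only; class and tag values are assumptions.
prefix_class = {"non": "N", "de": "B"}
pairs = [("non", "nonsignificant", "significant"), ("de", "deactivate", "activate")]
manual_tags = {("de", "deactivate", "activate"): "N"}
for record in add_negation_tags(pairs, prefix_class, manual_tags):
    print("|".join(record))
```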
Check difference on original and result tag files
- Descriptions:
  Compare the resulting tagged file against the original tagged file:
  - original tagged file:
    - prefix.tag.txt
    - remove comment lines (line starts with #)
    - uSort the file (unify and sort)
  - resulting tagged file:
    - prefixD.yesNo.data: tagged prefix derivation pairs that are in current LEXICON
    - prefixD.new.data: prefix derivation pairs that are not in current LEXICON
    - uSort the file (unify and sort)
  - above two files should be the same
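
The comparison above can be sketched as a "unify and sort" helper followed by a set difference. This is an illustrative assumption about how the check might be done; `usort` and `diff_tag_files` are hypothetical names, and the inputs are lists of lines rather than real files.

```python
def usort(lines):
    """Drop blank and comment (#) lines, remove duplicates, and sort."""
    cleaned = {line.strip() for line in lines}
    return sorted(l for l in cleaned if l and not l.startswith("#"))


def diff_tag_files(original, yes_no, new):
    """Return (missing, extra) between the original and resulting tag lines."""
    expected = set(usort(original))
    actual = set(usort(list(yes_no) + list(new)))
    return sorted(expected - actual), sorted(actual - expected)


original = [
    "# tag file",
    "an|ana|adv|E0008740|a|noun|E0598106|no",
    "ana|anaplastic|adj|E0008830|plastic|adj|E0048247|yes",
    "ana|anaplastic|adj|E0008830|plastic|adj|E0048247|yes",  # duplicate
]
yes_no = ["ana|anaplastic|adj|E0008830|plastic|adj|E0048247|yes"]
new = ["an|ana|adv|E0008740|a|noun|E0598106|no"]

missing, extra = diff_tag_files(original, yes_no, new)
print("missing:", missing)
print("extra:", extra)
```

Both lists being empty corresponds to "no difference" in this step.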
Retrieve possible prefix derivation pairs from base list with options
- Descriptions:
  This is the same process as step 2 with four options:
  - all: same as step 2. Retrieve all possible derivation pairs from the legal bases. A pair is retrieved when it matches the pattern prefix+base and both prefix+base and base exist in LEXICON
  - tbd: retrieve all untagged possible prefix derivation pairs (the tag is "tbd")
  - done: retrieve all tagged possible prefix derivation pairs (the tag is "yes" or "no")
  - prefix: retrieve all untagged possible prefix derivation pairs by specifying the "prefix"
Analyze the number of prefix derivation pairs
- Descriptions:
  Analyze the number statistics of prefixD pairs:
  - Total yes (number & %)
  - Total no (number & %)
  - Total TBD (number & %)
  for all prefixes and for each type ([prefix], [-prefix], [ prefix])
- Input files:
  - prefixD.meta.data: meta file of tagged prefix derivation pairs
- Output files:
  - prefixD.tagNo.rpt: analysis report file
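
The counts and percentages in this report can be sketched with a counter per pair type. The example below is an assumption about the computation, not the actual report generator; `pair_type` and `tag_stats` are hypothetical names, and the type labels "plain", "dash", and "space" stand in for the three types listed above.

```python
from collections import Counter, defaultdict


def pair_type(prefix, derived):
    """Classify a pair by what follows the prefix in prefix+base."""
    rest = derived[len(prefix):]
    if rest.startswith("-"):
        return "dash"
    if rest.startswith(" "):
        return "space"
    return "plain"


def tag_stats(meta_records):
    """Return {group: {tag: (count, percent)}} for all pairs and each type."""
    by_group = defaultdict(Counter)
    for prefix, derived, *_, tag in meta_records:
        for group in ("all", pair_type(prefix, derived)):
            by_group[group][tag] += 1
    report = {}
    for group, counts in by_group.items():
        total = sum(counts.values())
        report[group] = {
            tag: (counts[tag], round(100 * counts[tag] / total, 1))
            for tag in ("yes", "no", "tbd")
        }
    return report


# Shortened records (prefix, prefix+base, base, tag) for illustration only.
meta = [
    ("non", "nonsignificant", "significant", "yes"),
    ("non", "non-significant", "significant", "yes"),
    ("an", "ana", "a", "no"),
    ("ana", "anabiotic", "biotic", "tbd"),
]
for group, stats in tag_stats(meta).items():
    print(group, stats)
```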
Analyze prefix derivation pairs source
- Descriptions:
  Analyze valid prefix derivation pairs based on:
  - the pattern of prefix+base|base
  - whether the two categories differ
  - whether the term is an abbreviation
  - whether the term is an acronym
- Input files:
  - prefixD.yes.data: valid prefix derivation pairs
  - LEXICON: the latest LEXICON
- Output files:
  - prefixD.analyze.rpt: analysis report file