The SPECIALIST Lexicon

Derivations Procedures - suffixD

Generate suffixD pairs in derivation table:

I. Directory: ${DERIVATION}/3.suffixD

II. Input Files (./data/${YEAR}/dataOrg/):
shell> ${SUFFIX_D}/bin/GetSuffixD ${YEAR}
Option 0 (prepare directories and files):

  • link LEXICON to LEXICON.${YEAR} (from ${LEXICON_DIR}/LEXICON.release)
  • link inflVars.data to inflVars.data.${YEAR} (from ${LEXICON_DIR})
  • link bases.data from prefixD/data/ (Complete step-1 in prefixD first)
  • link sdRules.data to sdRules.data.${YEAR} (from ${PREV_YEAR} and new rules)
  • link suffixD.tag.txt to suffixD.tag.txt.${YEAR} (copy from previous year)

  • touch/create suffixD.meta.data.conflict.tag.data for init phase

  • Must complete nomD first for auto-tag program to work
  • Must run prefixD step-1 first to get bases.data
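
The linking steps above can be sketched as a small shell function. This is only an illustration, not the actual option-0 logic of GetSuffixD: the function name link_suffixd_inputs and its arguments are hypothetical, it assumes the year-stamped files already exist in the dataOrg directory, and it leaves out bases.data because that link depends on the prefixD layout.

```shell
# Hypothetical helper (not part of GetSuffixD): create the links in dataOrg.
# Usage: link_suffixd_inputs YEAR DATA_ORG_DIR
link_suffixd_inputs() {
    year="$1"
    data_org="$2"
    (
        cd "$data_org" || exit 1
        for f in LEXICON inflVars.data sdRules.data suffixD.tag.txt; do
            # Each link points at the year-stamped copy, e.g. LEXICON -> LEXICON.${YEAR}
            ln -sf "$f.$year" "$f"
        done
        # Empty placeholder needed for the init phase
        touch suffixD.meta.data.conflict.tag.data
    )
}
```

The subshell keeps the caller's working directory unchanged, and the function returns non-zero if the directory cannot be entered.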

III. Final file for allD (release)

  • ${TAR_DIR}/suffixD.yes.data.${YEAR}

IV. Summary of GetSuffixD

Step | Description and Program | Input | Output | Notes
0
  • Prepare directories and files
  • Input: See section II.
  • Output: See section II.
  • 3.suffixD/data/${YEAR}/dataOrg
    • LEXICON
    • inflVars.data
    • bases.data
    • sdRules.data
    • suffixD.tag.txt
1
  • Retrieve std-raw suffixD pairs
  • GetSuffixDRawFromBaseFile.java
  • ${SRC_DIR}:
    • bases.data
    • sdRules.data
  • suffixD.raw.data.fromBase.all
  • sdRules.rawNo.rpt
  • Must complete prefixD Step-1 to get bases.data
  • Need to rerun from this step if there are new SD-Rules involved
    • Add new SD-Rules to ./dataOrg/sdRules.data.${YEAR}
    • Get sd-pair (TBD) for each new sdRules
    • Send TBD to linguist to tag [yes|no] from the following steps
    • Save new tag result to ./dataOrg/newRuleTag/
    • Add new tag result to ./dataOrg/suffixD.tag.txt.${YEAR}
2
  • Combine with nomD.S file (raw)
  • CheckWithNomDFile.java
  • ${NOM_TAR_DIR}:
    • nomD.yes.S.data.${YEAR}

  • ${TAR_DIR}:
    • suffixD.raw.data.fromBase
  • suffixD.raw.data.fromNomD
  • suffixD.raw.data
  • Must link suffixD.raw.data.fromBase to suffixD.raw.data.fromBase.all to run this step
3
  • Add tags to suffixD meta file
  • GetSuffixDMetaFile.java
  • DPairTagList.java
  • ${NOM_TAR_DIR}:
    • nomD.yes.S.data.${YEAR}

  • ${SRC_DIR}:
    • suffixD.tag.txt (suffixD.tag.txt.${YEAR}.uSort)

  • ${TAR_DIR}:
    • suffixD.raw.data
  • suffixD.meta.data
  • suffixD.meta.data.conflict

  • 1. Read and fix sdPair tags from tag file
    • Remove duplicate and conflict tags from ./dataOrg/suffixD.tag.txt
    • Use uSort (shell> sort -u suffixD.tag.txt > suffixD.tag.txt.usort)
    • => After uSort, the duplicated tag no. should equal the conflict tag no. (exact duplicates are removed by sort -u).
    • Go through the duplicated tag no. and the conflict tag no. and fix them until both are 0
    • Conflict tags (same pair, different tags) need to be fixed: send them to linguists to re-tag.
  • 2. Read and add sdPair tags from nomD file
    • Ignore the long list of duplicated tags (between manual tags and nomD tags) in log.3
    • Check and fix the total conflict tag no. (conflicts between nomD and the expert's tags)
  • 3. Verify and fix conflict tags from spVars
    • The file (suffixD.meta.data.conflict) lists suffixD tag conflicts caused by SpVars between two records
    • Ideally, all suffixD tags should be consistent among SpVars between records
    • In the 1st run (before adding tags for the annual update), no conflict should exist. That is, skip Step-9 and go to Step-4 for the 1st run.
    • The suffixD.meta.data.conflict should be empty (except for 1 known exception)
    • There is a known exception (since 2014+):
      1|E0056852|E0234312|both
      # 20092|space|noun|E0056852|spacey|adj|E0234312|no
      # 38379|space|noun|E0056852|spacy|adj|E0234312|yes
      

      => This known exception was corrected in 2023+ and changed to yes.
    • If not empty, send to linguists to tag (yes|no|both) on the EUI lines:
      • yes: all suffixD tags among SpVars between records are valid
      • no: all suffixD tags among SpVars between records are invalid
      • both: suffixD tags among SpVars between records include valid and invalid (exception)
    • Run the next step (9) to resolve conflicts and update the results to suffixD.tag.txt automatically, then re-run Step 3 until all exceptions are known
    • make sure:
      • Empty line no = 0
      • Invalid tag no = 0
      • conflict (yes|no) tag no = 0
      • none (tbd) tag no = 0
    • If all conflict exceptions are known (fixed), go to step-4
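
Item 1 of the notes above (sort -u, then compare the duplicate and conflict counts) can be illustrated with standard tools. This is a minimal sketch, assuming each line of the tag file is pipe-delimited with the tag as its last field. That layout and the function name list_conflict_tags are assumptions made for illustration; they are not the documented format of suffixD.tag.txt or the logic of GetSuffixDMetaFile.java.

```shell
# Hypothetical helper: print the sd-pair keys that carry more than one tag.
# Assumes lines look like <pair fields>|<tag>, with the tag in the last field.
list_conflict_tags() {
    sort -u "$1" |
    awk '{
        key = $0
        sub(/\|[^|]*$/, "", key)   # strip the last field (the assumed tag)
        seen[key]++
    }
    END {
        # After sort -u, a key seen twice must have two different tags
        for (k in seen)
            if (seen[k] > 1) print k
    }' |
    sort
}
```

Because sort -u has already removed exact duplicates, any key printed here has at least two different tags and would need to go to the linguists. An empty output means no conflict remains under these assumptions.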
9
  • Auto-fix suffixD.tag.txt
  • FixConflictDPairTags.java
  • ${SRC_DIR}:
    • suffixD.tag.txt.${YEAR}
    • suffixD.meta.data.conflict.tag.data
  • ${SRC_DIR}:
    • suffixD.tag.txt.${YEAR}.fixDPair
  • Make sure the linguists' tagging results are saved in ./dataOrg/suffixD.meta.data.conflict.tag.data
  • Manually examine ./dataOrg/suffixD.tag.txt.${YEAR}.fixDPair
  • If suffixD.tag.txt.${YEAR}.fixDPair passes the examination, move it to suffixD.tag.txt.${YEAR}, then re-run Step-3.
4
  • Split suffixD meta file (yes|no|tbd)
  • SplitSuffixDMetaFile.java
  • ${TAR_DIR}:
    • suffixD.meta.data
  • suffixD.yes.data
  • suffixD.no.data
  • suffixD.tbd.data
  • suffixD.tbd.data.sort (sent to linguists)
  • suffixD.yesNo.data
  • Make sure suffixD.tbd.data(.sort) is empty. If not, send it to linguists to tag:
    • Tag suffixD: (yes|no)
      • valid suffixD: yes
      • invalid suffixD: no
  • Append (update) these newly tagged sd-pairs to ./dataOrg/suffixD.tag.txt and rerun Steps 3~4
    • Add [tbd] where tags are missing so that Step-3 passes.
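
The split into yes, no, and tbd files can be approximated with awk. This is a sketch under the assumption that the tag is the last pipe-delimited field of each line in suffixD.meta.data, as in the two commented records listed under the known exception in Step 3. The function name split_by_tag is hypothetical, and the sketch does not reproduce SplitSuffixDMetaFile.java (for example, it does not write suffixD.yesNo.data or the sorted tbd file).

```shell
# Hypothetical helper: split a meta file by the tag in its last field.
# Usage: split_by_tag META_FILE OUT_DIR
split_by_tag() {
    meta="$1"
    out_dir="$2"
    # Start from empty outputs so each file exists even with no matches
    : > "$out_dir/suffixD.yes.data"
    : > "$out_dir/suffixD.no.data"
    : > "$out_dir/suffixD.tbd.data"
    awk -F'|' -v dir="$out_dir" '
        $NF == "yes" { print >> (dir "/suffixD.yes.data"); next }
        $NF == "no"  { print >> (dir "/suffixD.no.data");  next }
        # Any other tag value is treated as still to be decided
                     { print >> (dir "/suffixD.tbd.data") }
    ' "$meta"
}
```

After a run, a test such as [ -s suffixD.tbd.data ] shows whether anything still has to be sent to the linguists.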
4a
  • Clean up tags on tagged file
  • CleanUpDPairTagList.java
  • ${SRC_DIR}:
    • suffixD.tag.txt
  • ${SRC_DIR}:
    • suffixD.tag.txt.cleanUp
  • Go to the end of the log.4a file and re-run this step until:
    • duplicate = 0. If not, replace suffixD.tag.data with suffixD.tag.data.cleanUp
    • conflict = 0. If not, send conflicts (from log.5a) to linguists to re-tag. Do NOT replace suffixD.tbd.data with suffixD.tbd.data.cleanUp until conflict = 0
    • diff = 0. If not, replace suffixD.tbd.data with suffixD.tbd.data.cleanUp
  • Then, rerun Steps 3~4 until it is empty
5
  • Verify dType on suffixD.yes.data
  • DType.java
  • ${ALL_SRC_DIR}:
    • LRSPL
    • dTypeStr.data

  • ${TAR_DIR}:
    • suffixD.yes.data
  • suffixD.yes.data.type
  • suffixD.yes.data.type.Z
  • suffixD.yes.data.type.S
  • suffixD.yes.data.type.P
  • suffixD.yes.data.type.ZS
  • suffixD.yes.data.type.SS
  • suffixD.yes.data.type.PS
  • suffixD.yes.data.type.U
  • Make sure the unknown dType (|U|) file from suffixD is empty
  • Must finish all new SD-rules (if any) before proceeding with this step
6
  • Automatically add negation tag [N|O], ~less$ is [N], others are [O]
    then sort uniquely
  • AddNegationTagToFile.java
  • DPairTagList.java
  • ${TAR_DIR}:
    • suffixD.yes.data
  • suffixD.yes.data.${YEAR}
  • suffixD.yes.data.${YEAR}.conflict
  • The conflict file (suffixD.yes.data.${YEAR}.conflict) lists all inconsistent suffixD tags between SpVars in two records
    • Send conflicts to linguists to tag (N|O|B) on EUI lines
    • In the past, there were no both (B) cases in suffixD
    • Manually update the results to suffixD.tag.txt
    • Rerun Steps 3~6 until no unknown conflict (both) exists.
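
The automatic negation rule in this step (a word matching ~less$ gets [N], everything else gets [O]) is simple enough to show directly. This is a minimal sketch: the function name negation_tag is hypothetical, and since the text does not state which field of a suffixD record the rule is applied to, the sketch works on a single word only.

```shell
# Hypothetical helper: print N for a word ending in "less", O otherwise.
negation_tag() {
    case "$1" in
        *less) echo N ;;   # matches the ~less$ pattern
        *)     echo O ;;   # every other word
    esac
}
```

Only a trailing "less" counts here, so a word that merely contains the string elsewhere is tagged O.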
7
  • Check affix on suffixD.yes.data.${YEAR}
  • CheckDerivationByAffix6.java
  • ${ALL_SRC_DIR}:
    • LRSPL

  • ${SRC_DIR}:
    • suffixD.tagYes.txt

  • ${TAR_DIR}:
    • suffixD.yes.data.${YEAR}
  • suffixD.pattern3.rpt
  • copy ${SRC_DIR}/suffixD.tagYes.txt.${PREV_YEAR} ${SRC_DIR}/suffixD.tagYes.txt.${YEAR}
  • suffixD.pattern3.rpt must be empty.
  • This rpt lists all potentially invalid dPairs by checking the 1st and last 3 characters of the affix.
  • If it is not empty, send it to linguists to tag (Yes|No):
    • invalid dPair (No): add to suffixD.tagNo.txt (not used!). This should not happen!
    • valid dPair (Yes): add to suffixD.tagYes.txt, then rerun Step 7
8
  • Steps 1 ~ 7
  • Input: See above
  • Output: See above
  • Notes: Not recommended!
Other options
11
  • Get stats for SD-rule
    ALL
  • GetSdRuleStatsFromTaggedSuffixD.java
  • ${SRC_DIR}:
    • sdRules.data
  • ${TAR_DIR}:
    • suffixD.meta.data
  • sdRules.stats.rpt
  • sdRules.stats.detail.rpt
Only Use for LVG SD-Rules
  • Used for analysis in finding the optimal Sd-Rules set. Please refer to the design documents (SD-Rules evaluation/optimization) of Lexical Tools
12
  • Get the HTML files
    ALL
  • GetSdRuleListHtmlFile.java
  • ${SRC_DIR}:
    • sdRules.data
  • ${TAR_DIR}:
    • suffixD.meta.data
  • ${HTML_DIR}:
    • suffixDRules.html
    • SD-Examples
    • SD-Exceptions
Copy to ${LEXICON_WEB} for the annual Sd-Rules updates
  • SD-Examples
  • SD-Exceptions
  • suffixDRules.html

V. Processes Details:

  • shell>cd ${DERIVATION}/suffixD/bin
  • shell>GetSuffixD ${YEAR}

    1: Retrieve std-raw suffixD pairs or
    => generate:

    • ./data/sdRules.rawNo.rpt
    • ./data/suffixD.raw.data.fromBase.all

    2: Check/integrate with nomD.S file (raw)
    => ln -s ./suffixD.raw.data.fromBase.all suffixD.raw.data.fromBase
    => generate:

    • ./data/suffixD.raw.data.fromNomD
    • ./data/suffixD.raw.data (= suffixD.raw.data.fromBase + suffixD.raw.data.fromNomD, has comment line #)

    3: Add tags to suffixD meta file (meta)
    => generate ./data/suffixD.meta.data (comment lines # are removed from raw)

    • Make sure there is no duplicated tag in ./dataOrg/suffixD.tag.txt
    • Program automatically tags nomD.S as valid suffixD pairs
    • Duplicated dPairs are OK (from nomD)
    • Correct all conflict dPairs (from nomD)
      => verify with linguists

    3.1: Verify suffixD meta file (meta)
    => Check consistency on derivational tag between 2 records with SpVars
    => generate ./data/suffixD.meta.data.conflict

    • All conflict EUI pairs need to be manually reviewed and then update the tag in ./dataOrg/suffixD.tag.txt
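
The consistency check can be pictured as grouping records by EUI pair and flagging the pairs that carry both yes and no. The sketch below assumes the eight-field layout of the two commented records shown with the known exception in section IV (id, word, category, EUI, word, category, EUI, tag). The function name list_eui_conflicts is hypothetical, and this is not the logic of the actual program.

```shell
# Hypothetical helper: list EUI pairs tagged both yes and no.
# Assumed fields: id|word1|cat1|EUI1|word2|cat2|EUI2|tag
list_eui_conflicts() {
    awk -F'|' '
        /^#/        { next }                 # skip comment lines
        $8 == "yes" { yes[$4 "|" $7] = 1 }   # remember pairs tagged yes
        $8 == "no"  { no[$4 "|" $7] = 1 }    # remember pairs tagged no
        END {
            for (k in yes)
                if (k in no) print k
        }
    ' "$1" | sort
}
```

Given the two space/spacey/spacy records listed with the known exception (without the leading #), the function prints E0056852|E0234312.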

    4: Split suffixD meta file (yes|no|tbd)
    => generates

    • ./data/suffixD.yes.data
    • ./data/suffixD.no.data
    • ./data/suffixD.tbd.data (should be 0 if the annual update is completed)
      => send to linguists to tag for this annual update, then add the updates to ./dataOrg/suffixD.tag.txt.${YEAR}

    • Duplicated SD pairs are normal because they are generated from parent-child candidate SD-rules.

    5: Verify dType on suffixD.yes.data
    => generates:

    • ./data/suffixD.yes.data.type
      • ./data/suffixD.yes.data.type.Z (must be 0)
      • ./data/suffixD.yes.data.type.P (must be 0)
      • ./data/suffixD.yes.data.type.S (= suffixD.yes.data)
      • ./data/suffixD.yes.data.type.ZS (must be 0)
      • ./data/suffixD.yes.data.type.PS (must be 0)
      • ./data/suffixD.yes.data.type.SS (should be 0)
      • ./data/suffixD.yes.data.type.U (must be 0)
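
The "must be 0" checks on the dType files can be automated with a few lines of shell. This is a minimal sketch, assuming "must be 0" means the file contains no lines and that all type files sit in one directory. The function name check_dtype_files is hypothetical. The .SS file is left out because it is only marked "should be 0".

```shell
# Hypothetical helper: report dType files that must be empty but are not.
# Usage: check_dtype_files DATA_DIR   (returns 1 if any file has content)
check_dtype_files() {
    dir="$1"
    status=0
    for t in Z P ZS PS U; do
        f="$dir/suffixD.yes.data.type.$t"
        # -s is true only for an existing, non-empty file;
        # a missing file is therefore treated like an empty one
        if [ -s "$f" ]; then
            echo "not empty: $f"
            status=1
        fi
    done
    return "$status"
}
```

The return code makes it usable as a gate before moving on to the negation-tag step.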

    6: Add negation tag (N|O), sort -u: for annual suffixD
    => generate ./data/suffixD.yes.data.${YEAR}

    10
    => generate ./data/

    7: Get stats for sd-Rule from suffixD.tag.txt. Use this option to generate all suffixD pairs for a specified suffix (check the suffixD.rawNo.rpt)

  • Send data/suffixD.tbd.data to linguists for tagging:
    • derivation: yes|no
  • Re-run this process until all suffixD pairs are tagged (0 in suffixD.tbd.data)
  • The final suffixD is in ${DERIVATION}/suffixD/data/${YEAR}/data/suffixD.yes.data.${YEAR}

Please refer to the derivation design documents in Lexical Tools for details.