SPECIALIST Lexicon

Derivations Procedures - orgD

Retrieve and verify dPairs from the original Lexical Tools DM.DB and add them to the derivation table. This should be done after nomD, prefixD, suffixD, and zeroD. All orgD that have EUIs (in the Lexicon), are valid dPairs, and are not already included in our system are added to the final derivation table. We don't expect many valid dPairs from orgD because only orgD involving new LexRecords will be added. However, this procedure requires many manual updates (in steps 2, 5, 6, 7, 8, and 81).

I. Directory:

${DERIVATION}/0.orgD

II. Input Files (./data/${YEAR}/dataOrg/):
shell> ${ORG_D}/bin/GetOrgD ${YEAR}
0

Copy following 5 files from ${PREV_YEAR}:
- convers.fct
- dm.fct
- etc.fct
- nomiz.fct
- pd.fct
Copy ${SRC_DIR}/orgD.yes.data.{YEAR} from previous year
=>ln -sf ${SRC_DIR}/orgD.yes.data.final to orgD.yes.data.{YEAR}

III. Final files for allD (release)

${TAR_DIR}/orgD.yes.${YEAR}

IV. Summary of GetOrgD

Step	Description and Program	Input	Output	Notes
0	Prepare directories and files	See section II.	See section II.	0.orgD/data/${YEAR}/dataOrg convers.fct dm.fct etc.fct nomiz.fct pd.fct orgD.yes.data.final	0

1	Get all dPairs from Original Lvg Facts	${SRC_DIR}: dm.fct etc.fct convers.fct nomiz.fct pd.fct	orgD.raw.data	Must run step 0 first to copy the 5 original files!	1
2	Reformat to pure dPair file: remove comments, uSort, empty line	${TAR_DIR}: orgD.raw.data	orgD.yes.data	Manually remove the empty (1st) line	2
3	Add EUI (to orgD.yes.data) `AddEuiToOrgD.java`	${SRC_DIR}: LEXICON.${YEAR} orgD.yes.data.final	orgD.yes.data.final.allEui orgD.yes.data.final.noEui orgD.yes.data.final.yesEui	Copy and link ./dataOrg/orgD.yes.data.final from the previous year (step 0). Make sure the number of allEui dPairs = yesEui + noEui.	3
4	Add dType to valid orgD with EUI (in Lexicon) `DType.java`	${TAR_DIR}: orgD.yes.data.final.yesEui ${ALL_SRC_DIR}: LRSPL dTypeStr.data	orgD.yes.data.final.yesEui.type orgD.yes.data.final.yesEui.type.Z orgD.yes.data.final.yesEui.type.S orgD.yes.data.final.yesEui.type.P orgD.yes.data.final.yesEui.type.ZS orgD.yes.data.final.yesEui.type.SS orgD.yes.data.final.yesEui.type.PS orgD.yes.data.final.yesEui.type.U	Go through steps 5 ~ 8 to take care of types of Z, P, U, S.	4
5	Add known tags to type Z - zeroD `AppendFieldSeparator.java` `GetZeroDMetaFile.java` `SplitZeroDMetaFile.java`	${TAR_DIR}: orgD.yes.data.final.yesEui.type.Z ${ZERO_SRC_DIR}: zeroD.tag.txt ${NOM_TAR_DIR}: nomD.yes.Z.data.${YEAR}	orgD.yes.data.final.yesEui.type.Z.raw orgD.yes.data.final.yesEui.type.Z.meta orgD.yes.data.final.yesEui.type.Z.yes.data orgD.yes.data.final.yesEui.type.Z.no.data orgD.yes.data.final.yesEui.type.Z.tbd.data Manually update orgD.yes.data.final.yesEui.type.Z.tbd.data.yes.${YEAR} => See details in the next column	All dPairs in orgD.yes.data.final.yesEui.type.Z.tbd.data must be tagged. This file should be empty. If it is not empty (apart from dPairs known from the past), send them to linguists to tag. Add the tags to orgD.yes.data.final.yesEui.type.Z.tbd.data.tag. Manually retrieve the valid (yes) dPairs of the above file to orgD.yes.data.final.yesEui.type.Z.tbd.data.yes.${YEAR}. In the 2015 release, orgD.yes.data.final.yesEui.type.Z.tbd.data.yes.${YEAR} is empty. If the tbd file is empty (no updates), no new [yes] tag is added, so the tbd.data.yes.${YEAR} file should be empty too. => shell> touch orgD.yes.data.final.yesEui.type.Z.tbd.data.yes.${YEAR} These valid zeroD will be added to orgD.yes.${YEAR} in step 9.	5
6	Add known tags to type P - prefixD `AppendFieldSeparator.java` `GetPrefixMetaFile.java` `SplitPrefixDMetaFile.java`	${TAR_DIR}: orgD.yes.data.final.yesEui.type.P ${PREFIX_SRC_DIR}: prefixD.tag.txt prefixList.data	orgD.yes.data.final.yesEui.type.P.raw orgD.yes.data.final.yesEui.type.P.meta orgD.yes.data.final.yesEui.type.P.yes.data orgD.yes.data.final.yesEui.type.P.no.data orgD.yes.data.final.yesEui.type.P.tbd.data orgD.yes.data.final.yesEui.type.P.tbt.data prefixD.yesNo.data Manually update orgD.yes.data.final.yesEui.type.P.tbd.data.yes.${YEAR} => See details in the next column	All dPairs in orgD.yes.data.final.yesEui.type.P.tbd.data must be tagged. If it is not empty (apart from the 1 known exception from the past), send them to linguists to tag. In the 2015+ releases, there is 1 known tbd (invalid) prefixD from the past: `motor neuron|noun|E0354096|neuron|noun|E0042456|no` => motor is not a prefix; "motor neuron" is a compound. Add the linguists' tags to orgD.yes.data.final.yesEui.type.P.tbd.data.tag. Manually retrieve the valid (yes) dPairs of the above file to orgD.yes.data.final.yesEui.type.P.tbd.data.yes.${YEAR}. In the 2015 release, orgD.yes.data.final.yesEui.type.P.tbd.data.yes.${YEAR} is empty. If no [yes] tag is added, => shell> touch orgD.yes.data.final.yesEui.type.P.tbd.data.yes.${YEAR} These valid prefixD will be added to orgD.yes.${YEAR} in step 9.	6
7	Add known tags to type U - unknown type `AppendFieldSeparator.java` => The program can't identify the dType of these orgD	${TAR_DIR}: orgD.yes.data.final.yesEui.type.U ${PREV_TAR_DIR}: orgD.yes.data.final.yesEui.type.U.raw	orgD.yes.data.final.yesEui.type.U.raw orgD.yes.data.final.yesEui.type.U.raw.old orgD.yes.data.final.yesEui.type.U.raw.new Manually copy and update orgD.yes.data.final.yesEui.type.U.yes.${YEAR} => See details in the next column	orgD.yes.data.final.yesEui.type.U.raw.new should be empty (0). If not, send it to linguists to tag (yes|no) => The differences (new) are the orgD from new EUIs. Copy orgD.yes.data.final.yesEui.type.U.yes.${YEAR} from the previous year. Manually add negation|dType|prefix for the new valid orgD to orgD.yes.data.final.yesEui.type.U.yes.${YEAR}.	7
8	Add known tags to type S - suffixD `AppendFieldSeparator.java` `GetSuffixDMetaFile.java` `SplitSuffixDMetaFile.java`	${TAR_DIR}: orgD.yes.data.final.yesEui.type.S ${SUFFIX_SRC_DIR}: suffixD.tag.txt ${NOM_TAR_DIR}: nomD.yes.S.data.${YEAR}	orgD.yes.data.final.yesEui.type.S.raw orgD.yes.data.final.yesEui.type.S.meta orgD.yes.data.final.yesEui.type.S.yes.data orgD.yes.data.final.yesEui.type.S.no.data orgD.yes.data.final.yesEui.type.S.yesNo.data => Already known in suffixD (duplicates) orgD.yes.data.final.yesEui.type.S.tbd.data => Need to be tagged for these suffixD from new Lexicon updates	All dPairs in orgD.yes.data.final.yesEui.type.S.tbd.data must be tagged. If it is not empty, continue with steps 81-83 to complete the suffixD with TBD tags in orgD. In step 81, the file is split into .old and .new => old: most of them were tagged in previous years (known from the past) => new: new orgD from the updated Lexicon, which need to be sent to linguists to tag	8
81	Find new suffix TBD orgD and manually complete the tag file `Subset1Way.java`	${PREV_YEAR_ORG_TAR_DIR}: orgD.yes.data.final.yesEui.type.S.tbd.data ${ORG_TAR_DIR}: orgD.yes.data.final.yesEui.type.S.tbd.data	orgD.yes.data.final.yesEui.type.S.tbd.data.old orgD.yes.data.final.yesEui.type.S.tbd.data.new Manually copy (from previous year) and update orgD.yes.data.final.yesEui.type.S.tag.data.${YEAR}; see details in the next column	The new TBD suffix orgD file (orgD.yes.data.final.yesEui.type.S.tbd.data.new) should be empty (0) => these come from updates of the new Lexicon, SpVar, or nominalizations => even if it is not empty, it should be very small. If not empty, send to linguists to tag (yes|no). Manually copy orgD.yes.data.final.yesEui.type.S.tag.data.${YEAR} from the previous year. Manually add the tagging results (yes|no) for suffixD to orgD.yes.data.final.yesEui.type.S.tag.data.${YEAR}. Go to step 82.	81
82	Add tags (yes|no) to suffix TBD orgD file. Split tagged file (yes|no|tbd|yesNo) `GetSuffixDMetaFile.java` `SplitSuffixDMetaFile.java`	${NOM_TAR_DIR}: nomD.yes.S.data.${YEAR} ${ORG_TAR_DIR}: orgD.yes.data.final.yesEui.type.S.tbd.data orgD.yes.data.final.yesEui.type.S.tag.data.${YEAR} => Copy from previous year => updated from tagged result of Step-81	orgD.yes.data.final.yesEui.type.S.tag.data orgD.yes.data.final.yesEui.type.S.tag.data.yes orgD.yes.data.final.yesEui.type.S.tag.data.no orgD.yes.data.final.yesEui.type.S.tag.data.yesNo orgD.yes.data.final.yesEui.type.S.tag.data.tbd	Make sure there are no conflicting tags from nomD when adding tags (check log.82) => this might happen due to new nominalizations => send conflicts to linguists to confirm the tag is [yes], or tag [yes|no]. Make sure the number of tbd is 0 (orgD.yes.data.final.yesEui.type.S.tag.data.tbd). If not, check orgD.yes.data.final.yesEui.type.S.tag.data.${YEAR} in step 81 and rerun steps 81 ~ 82.	82
83	Finalize suffix OrgD: auto add negation (N|O), dType|prefix (S|None) `AddNegationTagToFile.java` `GenerateSuffixDTable.java`	${ORG_TAR_DIR}: orgD.yes.data.final.yesEui.type.S.tag.data.yes	orgD.yes.data.final.yesEui.type.S.tag.data.yes.negation orgD.yes.data.final.yesEui.type.S.tag.data.yes.${YEAR}	orgD.yes.data.final.yesEui.type.S.tbd.data.yes.${YEAR} is valid suffixD from TBD orgD. This final file is used in Step 9.	83
9	Combine Z, S, P, U (Steps 5-8) to orgD.yes.${YEAR}	${TAR_DIR}: Must run steps 5-8 to get the following files. orgD.yes.data.final.yesEui.type.Z.tbd.data.yes.${YEAR} orgD.yes.data.final.yesEui.type.P.tbd.data.yes.${YEAR} orgD.yes.data.final.yesEui.type.S.tbd.data.yes.${YEAR} orgD.yes.data.final.yesEui.type.U.yes.${YEAR}	orgD.yes.${YEAR}	Must run through steps 4 ~ 8 (81-83) first!	9
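
Steps 7 and 81 both compare this year's file with the previous year's file and split it into .old and .new. The comparison might be sketched as below. This is only an illustration in Python, not the logic of `Subset1Way.java`: it assumes that an exact whole-line match decides whether a dPair is already known, and the name `split_old_new`, the word "neuronlike", and the EUI E9999999 are made up for the example.

```python
def split_old_new(current, previous):
    """Split this year's lines into those seen last year (old) and the rest (new).

    Assumption: a dPair counts as "old" only if the identical line
    appeared in the previous year's file.
    """
    seen = set(previous)
    old = [line for line in current if line in seen]
    new = [line for line in current if line not in seen]
    return old, new


previous_year = ["motor neuron|noun|E0354096|neuron|noun|E0042456"]
this_year = [
    "motor neuron|noun|E0354096|neuron|noun|E0042456",
    # Invented dPair and EUI, standing in for an orgD from a new LexRecord.
    "neuron|noun|E0042456|neuronlike|adj|E9999999",
]

old, new = split_old_new(this_year, previous_year)
print(len(old), "old;", len(new), "new (to send to linguists)")
```

Only the `new` list needs tagging; an empty `new` list corresponds to the "should be empty (0)" check in the Notes column.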

V. Process details:

shell>cd ${DERIVATION}/0.orgD/bin
shell>GetOrgD ${YEAR}
1) Combine Original Lvg Fact dPairs from 5 files
Combine above five files:
=> Generate ./data/${YEAR}/data/orgD.raw.data
2) Reformat: remove comments, uSort, empty line
Reformat: remove comments, uSort, empty lines:
=> Generate ./data/${YEAR}/data/orgD.yes.data
=> Remove 1st (empty) line in ./data/${YEAR}/data/orgD.yes.data
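The reformatting in step 2 can be sketched as follows. This is a minimal Python illustration rather than the GetOrgD implementation: it assumes that comment lines start with `#` and that uSort means a unique sort. The name `reformat_dpairs` and the sample lines are invented for the example.

```python
def reformat_dpairs(raw_lines):
    """Drop comments and empty lines, then return the unique lines, sorted.

    Assumption: a comment is any line whose first non-blank character is '#'.
    """
    cleaned = set()
    for line in raw_lines:
        text = line.strip()
        if not text or text.startswith("#"):
            continue
        cleaned.add(text)
    return sorted(cleaned)


# Invented sample input; the real layout of the .fct files may differ.
raw = [
    "# facts copied from dm.fct",
    "",
    "neuron|noun|neuronal|adj",
    "cure|verb|curable|adj",
    "neuron|noun|neuronal|adj",
]
print(reformat_dpairs(raw))
```

Because empty lines are filtered before sorting, this sketch leaves no empty first line to remove afterwards.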
3) Add EUI to orgD.yes.data.final (use E0000000 for no EUI)
Add EUI to dPairs in orgD.yes.data, use E0000000 if no EUI found
=> Prepare:
- cd ./data/${YEAR}/dataOrg/
- link LEXICON.${YEAR} to /nfsvol/lex/Lu/Development/Lexicon/data/${YEAR}/data/LEXICON.release
- link orgD.yes.data.final to ../../${PRE_YEAR}/data/orgD.yes.data.${PRE_YEAR}
=> Generate:
- ./data/${YEAR}/data/orgD.yes.data.final.all
- ./data/${YEAR}/data/orgD.yes.data.final.noEui
  => Send to linguists to add in new record with E0000000
- ./data/${YEAR}/data/orgD.yes.data.final.yesEui
  => go to step 4.
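A rough Python sketch of step 3, including the "all = yes + no" check from the summary table. It is not the logic of `AddEuiToOrgD.java`: it assumes each input dPair has the form `base1|cat1|base2|cat2` and that a (base, category) pair maps to a single EUI. The function `add_eui`, the dictionary lookup, and the word "neuronlike" are hypothetical.

```python
NO_EUI = "E0000000"  # placeholder used in step 3 when no EUI is found


def add_eui(dpairs, lexicon):
    """Add an EUI after each base|category and split dPairs into yes/no lists.

    Assumptions: each dPair is 'base1|cat1|base2|cat2', and `lexicon`
    maps (base, category) to one EUI string.
    """
    all_eui, yes_eui, no_eui = [], [], []
    for dpair in dpairs:
        base1, cat1, base2, cat2 = dpair.split("|")
        eui1 = lexicon.get((base1, cat1), NO_EUI)
        eui2 = lexicon.get((base2, cat2), NO_EUI)
        line = "|".join([base1, cat1, eui1, base2, cat2, eui2])
        all_eui.append(line)
        # A dPair goes to the noEui list if either side has no EUI.
        if NO_EUI in (eui1, eui2):
            no_eui.append(line)
        else:
            yes_eui.append(line)
    return all_eui, yes_eui, no_eui


lexicon = {("motor neuron", "noun"): "E0354096", ("neuron", "noun"): "E0042456"}
dpairs = ["motor neuron|noun|neuron|noun", "neuron|noun|neuronlike|adj"]

all_eui, yes_eui, no_eui = add_eui(dpairs, lexicon)
# Sanity check: the number of all dPairs must equal yes + no.
assert len(all_eui) == len(yes_eui) + len(no_eui)
print(yes_eui)
print(no_eui)
```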
4) Add dType to orgD.yes.data.final.yesEui
Add dType (zeroD, suffixD, or prefixD) to the dPairs in orgD.yes.data.final.yesEui:
generate:
- orgD.yes.data.final.yesEui.type.P
  => go to step 5
- orgD.yes.data.final.yesEui.type.S
  => go to step 6
- orgD.yes.data.final.yesEui.type.Z
  => go to step 7
- orgD.yes.data.final.yesEui.type.PS
  => ignore because PDs by SpVars are excluded
- orgD.yes.data.final.yesEui.type.SS
  => ignore because SDs by SpVars are excluded
- orgD.yes.data.final.yesEui.type.ZS
  => ignore because ZDs by SpVars are excluded
- orgD.yes.data.final.yesEui.type.U
  => Manually review and assign dTag and dType
  - If valid dPairs:
    => manually add to P, S, Z
    => add to ${ALL_D}/data/${YEAR}/dataOrg/dTypeStr.data
  - If invalid dPairs
    => add to ./dataOrg/orgD.tag.txt
5) Add tag to prefixD: orgD.yes.data.final.yesEui.type.P
Generates:
- orgD.yes.data.final.yesEui.type.P.raw
- orgD.yes.data.final.yesEui.type.P.meta
  - orgD.yes.data.final.yesEui.type.P.yes.data
  - orgD.yes.data.final.yesEui.type.P.no.data
  - orgD.yes.data.final.yesEui.type.P.tbd.data
    => send to linguists, then add all "yes" dPairs to prefixD
  - orgD.yes.data.final.yesEui.type.P.tbt.data
6) Add tag to suffixD: orgD.yes.data.final.yesEui.type.S
Generates:
- orgD.yes.data.final.yesEui.type.S.raw
- orgD.yes.data.final.yesEui.type.S.meta
  - orgD.yes.data.final.yesEui.type.S.yesNo.data
  - orgD.yes.data.final.yesEui.type.S.yes.data
  - orgD.yes.data.final.yesEui.type.S.no.data
  - orgD.yes.data.final.yesEui.type.S.tbd.data
    => send to linguists, then add all "yes" dPairs to suffixD
7) Add tag to zeroD: orgD.yes.data.final.yesEui.type.Z
Generates:
- orgD.yes.data.final.yesEui.type.Z.raw
- orgD.yes.data.final.yesEui.type.Z.meta
  - orgD.yes.data.final.yesEui.type.Z.yes.data
  - orgD.yes.data.final.yesEui.type.Z.no.data
  - orgD.yes.data.final.yesEui.type.Z.tbd.data
    should be empty (0) because all zeroZ.raw are generated automatically.
    Some of these might be acronyms or abbreviations, which are not legal in zeroD.
    => send to linguists, then add all "yes" dPairs to zeroD
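
The tagging steps above share one pattern: dPairs with a known tag go to the yes or no file, and the rest go to the tbd file for linguists. A minimal Python sketch of that split follows. It assumes the known tags are available as a mapping from a dPair line to "yes" or "no", it leaves out the yesNo file, and `split_by_tag` and the "neuronlike" dPair with EUI E9999999 are invented for the example.

```python
def split_by_tag(dpairs, known_tags):
    """Partition dPairs into yes, no, and tbd lists using already-known tags.

    Assumption: `known_tags` maps a dPair line to 'yes' or 'no'; a dPair
    with no known tag (or any other value) is to be determined (tbd).
    """
    result = {"yes": [], "no": [], "tbd": []}
    for dpair in dpairs:
        tag = known_tags.get(dpair)
        bucket = tag if tag in ("yes", "no") else "tbd"
        result[bucket].append(dpair)
    return result


known_tags = {
    # Known invalid prefixD: "motor neuron" is a compound, not a prefix derivation.
    "motor neuron|noun|E0354096|neuron|noun|E0042456": "no",
}
dpairs = [
    "motor neuron|noun|E0354096|neuron|noun|E0042456",
    "neuron|noun|E0042456|neuronlike|adj|E9999999",  # invented, untagged
]

split = split_by_tag(dpairs, known_tags)
print(split["tbd"])  # the dPairs to send to linguists
```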

Ideally, all orgD should be generated automatically by our new derivation generation processes by adding more prefixes (for prefixD) and more SD candidate rules (for suffixD). No new zeroD should be found because our system should cover all possible zeroD (please note that acronyms and abbreviations can't be zeroD). In the 2014 release, we manually verified orgD and added them to the derivational table. Please see the 2014 reports on orgD for details.

Please refer to derivation design documents in Lexical Tools for details.
