CSpell

Generating Gold Standard

Directory:
- Top directory: ${C_SPELL}/PostProcess
- Binary directory: ${C_SPELL}/PostProcess/bin
- Data directory: ${C_SPELL}/PostProcess/data/Brat/NewTestRevised/
- Brat directory: ${C_SPELL}/PostProcess/data/Brat/NewTestRevised/bratRev
Programs:
- ${C_SPELL}/PostProcess/bin/PostBratNewTest
  2
  - 20 (real-word)
  - 21 (non-word)
Algorithm:
- Read the Brat annotation files from (./Brat) and convert them to ArrayList<BratTagObj> bratTagList
- Auto-generate correction (NewTarTxt) for ToMerge and ToSplitOnPunc
- Generate correction (NewTarTxt) for correctable Informal tags, i.e. those whose Annotation notes:
  - do not start with "ABB: ", because an abbreviation is considered correct spelling
  - do not start with "ACR: ", because an acronym is considered correct spelling
  - do not include "'", such as I'm (because a contraction is considered correct spelling)
- Generate the gold standard by replacing each tagged span with its correction string (NewTarTxt) in each file
  Handle span-related tags (contain, multiTag, overlap) first. The idea is that each span should have only one correction, even if it is tagged multiple times:
  - Find the correction (NewTarTxt) for container Btos and key Btos (in multiTag).
  - Correct only container or key multiTag Btos; skip containee and non-key multiTag Btos.
  Core Algorithm
  - Check whether the tag type is one of: ToSplit, ToSplitOnPunc, ToMerge, Misspell, Informal, or RealWord (RealWord is used only for RealWord Included):
    - Case 1: Bto is in span-contain
      - Container: Replace with the corrected string
      - Containee: No change. The corrected string is already changed in its container.
      - Use the bto.ToUidStr() as key for finding container and containee
        fileId|tagId|type|startPos|endPos|srcTxt|orgTarTxt|
      - In CorrectionMap, the key is the container with the corrected NewTarTxt, and the value is the list of the container and all its containees
      - The first element of the value list is the container with the original NewTarTxt
      - All corrections for containers are calculated in Utility, in one of three cases:
        case 1: same start position
        case 2: same end position
        case 3: in the middle (no container has multiple containees in this data set)
    - Case 2: Bto is multi-Tag
      - keyBto: Replace with the corrected string
      - Non-keyBto: No change. The corrected string is already changed in the keyBto.
      - Use the bto.ToReportStr() as key for keyBto
        fileId|tagId|type|startPos|endPos|srcTxt|orgTarTxt|newTarTxt|
      - All corrections for keyBtos are calculated in Utility, using the following priority:
        RealWord -> Misspell -> Informal -> ToMerge
    - Case 3: Span-Overlap
      => No overlap issues were found in this data set, so span-overlap is not handled in the code.
    - Case 4: Normal Correction
      => All other tags, which are not span-related, are already sorted by start position in descending order (largest first), so replacing one span does not shift the positions of the spans still to be corrected
  - The Utility Tools are used to provide all information needed (such as the corrected NewTarTxt) for Span (Contain, Multi-Tag, and Overlap) and RealWord tags.
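
The steps above can be illustrated with a minimal Java sketch. It is not the PostBratNewTest source: the Tag class is a hypothetical stand-in for BratTagObj, the names GoldStandardSketch, isCorrectableInformal, applyCorrections, and the skip flag are assumptions made for illustration, and end positions are assumed to be exclusive character offsets. It shows the Informal filter on annotation notes and the replacement of spans from the largest start position down, skipping containees and non-key multiTags.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; Tag is a hypothetical stand-in for BratTagObj.
class GoldStandardSketch {
    static final class Tag {
        final int startPos;      // inclusive character offset
        final int endPos;        // exclusive character offset (assumption)
        final String newTarTxt;  // corrected string
        final boolean skip;      // true for a containee or non-key multiTag

        Tag(int startPos, int endPos, String newTarTxt, boolean skip) {
            this.startPos = startPos;
            this.endPos = endPos;
            this.newTarTxt = newTarTxt;
            this.skip = skip;
        }
    }

    // An Informal tag is correctable only if its annotation note is not an
    // abbreviation ("ABB: "), an acronym ("ACR: "), or a contraction (has "'").
    static boolean isCorrectableInformal(String note) {
        return !note.startsWith("ABB: ")
            && !note.startsWith("ACR: ")
            && !note.contains("'");
    }

    // Replace each tagged span with its correction, largest start position
    // first, so offsets of spans not yet processed remain valid.
    static String applyCorrections(String text, List<Tag> tags) {
        List<Tag> sorted = new ArrayList<>(tags);
        sorted.sort((a, b) -> Integer.compare(b.startPos, a.startPos));
        StringBuilder out = new StringBuilder(text);
        for (Tag t : sorted) {
            if (t.skip) {
                continue; // already covered by its container or keyBto
            }
            out.replace(t.startPos, t.endPos, t.newTarTxt);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Tag> tags = new ArrayList<>();
        tags.add(new Tag(2, 7, "have a", false));    // ToSplit: "havea"
        tags.add(new Tag(8, 17, "headache", false)); // ToMerge container: "head ache"
        tags.add(new Tag(13, 17, "ache", true));     // containee, same end position
        System.out.println(applyCorrections("i havea head ache", tags));
        // prints "i have a headache"
    }
}
```

Working from the end of the text toward the beginning keeps every remaining start and end position valid, even when a correction is longer or shorter than the span it replaces.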

Results & Stats:

Total question files: 224
Total tokens: 16,707

Total Brat tags: 1,946

Tag Type	Correction Text	Number
ToSplit	Annotation Notes	164
ToSplitOnPunc	Auto-generated	320
ToMerge	Auto-generated	27
Misspell	Annotation Notes	438
Informal	Annotation Notes (only correctable)	413 (ABB: 102; ACR: 107; ': 106; Correctable: 98)
RealWord	Annotation Notes	223 (= 1195 + 14 - 986)

Punctuation	N/A	246
WordExists	N/A	79
Unknown	N/A	21
Garbage	N/A	15
Oov	N/A	0

Total corrections for Non-Word Only: 986
Total corrections for Real-Word Included: 1195 (+ 14 RealWord-Containees)
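
The figures above can be cross-checked with a few lines of Java. This is only an arithmetic sketch: the class and method names (TagStatsCheck, tagCounts, total, informalTotal, realWordTotal) are hypothetical, and the numbers are copied from the table and totals in this section.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Arithmetic check of the reported statistics; all names are illustrative.
class TagStatsCheck {
    // Per-type Brat tag counts, copied from the table above.
    static Map<String, Integer> tagCounts() {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("ToSplit", 164);
        counts.put("ToSplitOnPunc", 320);
        counts.put("ToMerge", 27);
        counts.put("Misspell", 438);
        counts.put("Informal", 413);
        counts.put("RealWord", 223);
        counts.put("Punctuation", 246);
        counts.put("WordExists", 79);
        counts.put("Unknown", 21);
        counts.put("Garbage", 15);
        counts.put("Oov", 0);
        return counts;
    }

    // Sum of all per-type counts.
    static int total(Map<String, Integer> counts) {
        int sum = 0;
        for (int n : counts.values()) {
            sum += n;
        }
        return sum;
    }

    // Informal breakdown: ABB + ACR + contraction (') + correctable.
    static int informalTotal() {
        return 102 + 107 + 106 + 98;
    }

    // RealWord tags: Real-Word Included corrections plus RealWord
    // containees, minus Non-Word Only corrections.
    static int realWordTotal() {
        return 1195 + 14 - 986;
    }

    public static void main(String[] args) {
        System.out.println(total(tagCounts())); // prints 1946
        System.out.println(informalTotal());    // prints 413
        System.out.println(realWordTotal());    // prints 223
    }
}
```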