CSpell

Generating Gold Standard

  • Directory:
    • Top directory: ${C_SPELL}/PostProcess
    • Binary directory: ${C_SPELL}/PostProcess/bin
    • Data directory: ${C_SPELL}/PostProcess/data/Brat/NewTestRevised/
    • Brat directory: ${C_SPELL}/PostProcess/data/Brat/NewTestRevised/bratRev

  • Programs:
    • ${C_SPELL}/PostProcess/bin/PostBratNewTest
      2
      • 20 (real-word)
      • 21 (non-word)
  • Algorithm:
    • Read in the Brat annotation files from (./Brat) and convert them to ArrayList<BratTagObj> bratTagList
    • Auto-generate correction (NewTarTxt) for ToMerge and ToSplitOnPunc
    • Generate correction (NewTarTxt) for correctable Informal tags, i.e. those whose annotation notes:
      • do not start with "ABB: ", because an abbreviation is considered a correct spelling
      • do not start with "ACR: ", because an acronym is considered a correct spelling
      • do not include "'", such as I'm, because a contraction is considered a correct spelling
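The three rules above can be sketched as a single predicate. This is a minimal illustration, not the CSpell implementation: the class name InformalFilter, the method isCorrectable, the handling of empty notes, and the sample notes are all assumptions made for the example.

```java
// Hypothetical helper; not taken from the CSpell source.
class InformalFilter {
    // Returns true when an Informal tag's annotation note counts as
    // correctable under the three rules listed above.
    static boolean isCorrectable(String note) {
        if (note == null || note.isEmpty()) {
            return false; // assumption: a tag without a note has no correction
        }
        return !note.startsWith("ABB: ")   // abbreviation: treated as correct
            && !note.startsWith("ACR: ")   // acronym: treated as correct
            && !note.contains("'");        // contraction: treated as correct
    }

    public static void main(String[] args) {
        // Invented sample notes, for illustration only.
        String[] notes = {"ABB: appointment", "ACR: MRI", "I'm", "please"};
        for (String note : notes) {
            System.out.println(note + " -> " + isCorrectable(note));
        }
    }
}
```

Only notes that pass all three checks would yield a NewTarTxt; the rest leave the source text unchanged.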

    • Generate the gold standard by substituting the correction string (NewTarTxt) into each file
      Handle span-related tags (contain, multiTag, overlap) first. The idea is that each span should receive only one correction, even if it carries multiple tags:
      • Find the correction (NewTarTxt) for container and key (in multiTag) Btos.
      • Correct only the container or key multiTag; skip containees and non-key multiTags.

      Core Algorithm

      • Check whether the tag type is one of: ToSplit, ToSplitOnPunc, ToMerge, Misspell, Informal, or RealWord (RealWord is only used for Real-Word Included)

        • Case 1: Bto is in span-contain
          • Container: Replace with the corrected string
          • Containee: No change. The correction has already been applied through its container.

          • Use bto.ToUidStr() as the key for finding containers and containees:
            fileId|tagId|type|startPos|endPos|srcTxt|orgTarTxt|
          • In CorrectionMap, the key is the container with the corrected NewTarTxt, while the value is the list of the container and all its containees
          • The first element of the value list is the container with the original NewTarTxt
          • All corrections for containers are calculated in Utility, covering the following cases:
            • case 1: same start position
            • case 2: same end position
            • case 3: in the middle (no container has multiple containees in this data set)
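As a rough illustration of the key layout and the three position cases above, the sketch below builds a UID string in the documented field order and classifies where a containee sits inside its container. The class SpanTag, its field names, the enum Position, and the sample tags are hypothetical; the way Utility actually computes the container's corrected text is not shown here.

```java
// Hypothetical stand-in for a Brat tag object (Bto); names are assumptions.
class SpanTag {
    final String fileId, tagId, type, srcTxt, orgTarTxt;
    final int startPos, endPos;

    SpanTag(String fileId, String tagId, String type,
            int startPos, int endPos, String srcTxt, String orgTarTxt) {
        this.fileId = fileId;
        this.tagId = tagId;
        this.type = type;
        this.startPos = startPos;
        this.endPos = endPos;
        this.srcTxt = srcTxt;
        this.orgTarTxt = orgTarTxt;
    }

    // Key layout from the text: fileId|tagId|type|startPos|endPos|srcTxt|orgTarTxt|
    String toUidStr() {
        return fileId + "|" + tagId + "|" + type + "|" + startPos + "|"
            + endPos + "|" + srcTxt + "|" + orgTarTxt + "|";
    }
}

class ContainCases {
    enum Position { SAME_START, SAME_END, MIDDLE }

    // Classifies the containee's position; assumes its span lies strictly
    // inside the container's span (identical spans would be multi-tag).
    static Position classify(SpanTag container, SpanTag containee) {
        if (containee.startPos == container.startPos) {
            return Position.SAME_START;
        }
        if (containee.endPos == container.endPos) {
            return Position.SAME_END;
        }
        return Position.MIDDLE;
    }

    public static void main(String[] args) {
        // Invented tags, for illustration only.
        SpanTag container = new SpanTag("1", "T1", "ToMerge", 10, 19, "any thign", "anything");
        SpanTag containee = new SpanTag("1", "T2", "Misspell", 14, 19, "thign", "thing");
        System.out.println(containee.toUidStr());
        System.out.println(classify(container, containee));
    }
}
```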

        • Case 2: Bto is multi-Tag
          • keyBto: Replace with the corrected string
          • not keyBto: No change. The correction has already been applied through the keyBto.

          • Use bto.ToReportStr() as the key for the keyBto:
            fileId|tagId|type|startPos|endPos|srcTxt|orgTarTxt|newTarTxt|
          • All corrections for keyBtos are calculated in Utility, using the following priority:
            RealWord -> misspell -> informal -> merge
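One way to picture the priority rule is a small selector that picks the key tag type among the tags sharing a span. This is a sketch under assumptions: the class KeyTagPicker is hypothetical, "merge" is assumed to refer to the ToMerge tag type, and tag types outside the listed priority are assumed to rank last.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; not taken from the CSpell source.
class KeyTagPicker {
    // Priority from the text, highest first. Mapping "merge" to the
    // ToMerge tag type is an assumption.
    static final List<String> PRIORITY =
        Arrays.asList("realword", "misspell", "informal", "tomerge");

    // Lower rank means higher priority; unlisted types rank last (assumption).
    static int rank(String type) {
        int idx = PRIORITY.indexOf(type.toLowerCase(Locale.ROOT));
        return idx < 0 ? PRIORITY.size() : idx;
    }

    // Returns the tag type that should act as the key for a multi-tagged span,
    // or null when the list is empty.
    static String pickKeyType(List<String> typesOnSameSpan) {
        String best = null;
        for (String type : typesOnSameSpan) {
            if (best == null || rank(type) < rank(best)) {
                best = type;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(pickKeyType(Arrays.asList("Informal", "Misspell")));
        System.out.println(pickKeyType(Arrays.asList("ToMerge", "RealWord")));
    }
}
```

Only the tag returned as key would be replaced with its corrected string; the other tags on the same span are skipped.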

        • Case 3: Span-Overlap
          => No overlap issue was found in this data set, so span-overlap handling is excluded from the code.

        • Case 4: Normal Correction
          => All other tags, which are not span-related, are already sorted by start position in descending order (largest first)
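Processing tags from the largest start position downward is a common way to keep the offsets of earlier tags valid while later text changes length. The sketch below shows that idea only; the classes ReverseOrderReplace and Edit are hypothetical, offsets are assumed to be start-inclusive and end-exclusive, and the edits are assumed not to overlap.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper; not taken from the CSpell source.
class ReverseOrderReplace {
    // One correction: replace text[start, end) with newTxt.
    static final class Edit {
        final int start;
        final int end;
        final String newTxt;

        Edit(int start, int end, String newTxt) {
            this.start = start;
            this.end = end;
            this.newTxt = newTxt;
        }
    }

    // Applies non-overlapping edits from the largest start position to the
    // smallest, so a replacement never shifts the offsets of pending edits.
    static String apply(String text, List<Edit> edits) {
        List<Edit> sorted = new ArrayList<>(edits);
        sorted.sort(Comparator.comparingInt((Edit e) -> e.start).reversed());
        StringBuilder sb = new StringBuilder(text);
        for (Edit e : sorted) {
            sb.replace(e.start, e.end, e.newTxt);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Invented sentence and corrections, for illustration only.
        List<Edit> edits = new ArrayList<>();
        edits.add(new Edit(2, 5, "have"));
        edits.add(new Edit(8, 21, "question about"));
        System.out.println(apply("I hav a questionabout it", edits));
    }
}
```

Applying the same edits in ascending order would require adjusting every later offset by the length difference of each earlier replacement.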
      • The Utility tools are used to provide all the information needed (such as the corrected NewTarTxt) for Span (Contain, Multi-Tag, and Overlap) and RealWord tags.
  • Results & Stats:
    • Total question files: 224
    • Total tokens: 16,707
    • Total Brat tags: 1,946

      Tag Type      Correction Text                       Number
      ToSplit       Annotation Notes                      164
      ToSplitOnP    Auto-generated                        320
      ToMerge       Auto-generated                        27
      Misspell      Annotation Notes                      438
      Informal      Annotation Notes (only correctable)   413 (ABB: 102; ACR: 107; ': 106; Correctable: 98)
      RealWord      Annotation Notes                      223 (= 1195 + 14 - 986)
      Punctuation   N/A                                   246
      WordExists    N/A                                   79
      Unknown       N/A                                   21
      Garbage       N/A                                   15
      Oov           N/A                                   0
  • Total corrections for Non-Word Only: 986
  • Total corrections for Real-Word Included: 1195 (+ 14 RealWord-Containees)
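The figures above can be cross-checked with a few lines of arithmetic. The class StatsCheck is a hypothetical name for this check; the numbers are copied from the table and totals above.

```java
// Hypothetical consistency check over the reported statistics.
class StatsCheck {
    static int sum(int... values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Per-tag counts, in table order (ToSplit ... Oov).
        int allTags = sum(164, 320, 27, 438, 413, 223, 246, 79, 21, 15, 0);
        // Informal breakdown: ABB, ACR, contractions ('), correctable.
        int informal = sum(102, 107, 106, 98);
        // RealWord tags derived from the two correction totals.
        int realWord = 1195 + 14 - 986;

        System.out.println(allTags);   // 1946, the total number of Brat tags
        System.out.println(informal);  // 413, the Informal row
        System.out.println(realWord);  // 223, the RealWord row
    }
}
```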