CSpell

Generating Gold Standard

  • Directory:
    • Top directory: ${C_SPELL}/PostProcess
    • Binary directory: ${C_SPELL}/PostProcess/bin
    • Data directory: ${C_SPELL}/PostProcess/data/Brat/NewTestRevised/
    • Brat directory: ${C_SPELL}/PostProcess/data/Brat/NewTestRevised/bratRev

  • Programs:
    • ${C_SPELL}/PostProcess/bin/PostBratNewTest
      2
      • 20 (real-word)
      • 21 (non-word)
  • Algorithm:
    • Read in the Brat annotation files from (./Brat) and convert them to ArrayList<BratTagObj> bratTagList
    • Auto-generate the correction (NewTarTxt) for ToMerge and ToSplitOnPunc
    • Generate the correction (NewTarTxt) for correctable Informal tags, i.e. those whose annotation notes:
      • do not start with "ABB: ", because an abbreviation is considered a correct spelling
      • do not start with "ACR: ", because an acronym is considered a correct spelling
      • do not include "'", such as I'm, because a contraction is considered a correct spelling
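The three exclusion rules for Informal tags can be sketched as a single predicate. This is only an illustration: the class name `InformalFilterSketch` and the method `isCorrectable` are hypothetical, and treating an empty note as not correctable is an assumption, not something stated above.

```java
import java.util.List;
import java.util.stream.Collectors;

final class InformalFilterSketch {
    // Hypothetical helper: true if an Informal tag's annotation note
    // should be turned into a correction (NewTarTxt).
    static boolean isCorrectable(String note) {
        // Assumption: a missing or empty note gives nothing to correct with.
        if (note == null || note.isEmpty()) {
            return false;
        }
        return !note.startsWith("ABB: ")   // abbreviation: treated as correct spelling
            && !note.startsWith("ACR: ")   // acronym: treated as correct spelling
            && !note.contains("'");        // contraction such as I'm: treated as correct spelling
    }

    public static void main(String[] args) {
        List<String> notes = List.of("ABB: mg", "ACR: MRI", "I'm", "because");
        List<String> kept = notes.stream()
            .filter(InformalFilterSketch::isCorrectable)
            .collect(Collectors.toList());
        System.out.println(kept); // prints [because]
    }
}
```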

    • Generate the gold standard by substituting the correction string (NewTarTxt) into each file
      Handle span-related tags (contain, multiTag, overlap) first. The idea is that each span should receive only one correction, even if it is tagged multiple times:
      • Find the correction (NewTarTxt) for container Btos and key (multiTag) Btos.
      • Correct only the container or the key multiTag; skip containees and non-key multiTags.

      Core Algorithm

      • Check whether the tag type is one of ToSplit, ToSplitOnPunc, ToMerge, Misspell, Informal, or RealWord (used only for Real-Word Included):

        • Case 1: Bto is in span-contain
          • Container: Replace with the corrected string
          • Containee: No change. The correction is already applied through its container.

          • Use bto.ToUidStr() as the key for finding the container and containee:
            fileId|tagId|type|startPos|endPos|srcTxt|orgTarTxt|
          • In CorrectionMap, the key is the container with the corrected NewTarTxt, while the value is the list of the container and all its containees
          • The first element of the value list is the container with the original NewTarTxt
          • All corrections for containers are calculated in Utility, according to where the containee sits within the container:
            • case 1: same start position
            • case 2: same end position
            • case 3: in the middle (no container has multiple containees in this data set)
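As a rough illustration of Case 1, the sketch below builds a key in the documented ToUidStr field order and classifies a containee into the three position cases. Everything here is hypothetical (`SpanContainSketch`, `toUidStr`, `classify`, the sample values); it does not reproduce the actual Utility code, and it assumes start-inclusive, end-exclusive character offsets.

```java
final class SpanContainSketch {
    enum Position { SAME_START, SAME_END, MIDDLE }

    // Hypothetical key builder following the documented field order:
    // fileId|tagId|type|startPos|endPos|srcTxt|orgTarTxt|
    static String toUidStr(String fileId, String tagId, String type,
                           int startPos, int endPos,
                           String srcTxt, String orgTarTxt) {
        return String.join("|", fileId, tagId, type,
            String.valueOf(startPos), String.valueOf(endPos),
            srcTxt, orgTarTxt) + "|";
    }

    // Classify where a containee [eStart, eEnd) sits inside its container [cStart, cEnd).
    static Position classify(int cStart, int cEnd, int eStart, int eEnd) {
        if (eStart < cStart || eEnd > cEnd) {
            throw new IllegalArgumentException("containee is not inside the container");
        }
        if (eStart == cStart) {
            return Position.SAME_START;   // case 1
        }
        if (eEnd == cEnd) {
            return Position.SAME_END;     // case 2
        }
        return Position.MIDDLE;           // case 3
    }

    public static void main(String[] args) {
        // Sample values are invented for illustration only.
        System.out.println(toUidStr("12", "T3", "Misspell", 10, 17, "diabtes", "diabetes"));
        System.out.println(classify(10, 20, 12, 15));
    }
}
```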

        • Case 2: Bto is multi-Tag
          • keyBto: Replace with the corrected string
          • not keyBto: No change. The correction is already applied through the keyBto.

          • Use bto.ToReportStr() as the key for the keyBto:
            fileId|tagId|type|startPos|endPos|srcTxt|orgTarTxt|newTarTxt|
          • All corrections for keyBtos are calculated in Utility, using the following priority:
            RealWord -> Misspell -> Informal -> ToMerge
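One way to picture the keyBto priority is as a ranking over tag types, where the highest-priority type on a span becomes the key. The sketch below is an assumption about how such a selection could look, not the Utility implementation; `MultiTagKeySketch`, `rank`, and `pickKeyType` are hypothetical names, and types outside the priority list are simply ranked last.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class MultiTagKeySketch {
    // Priority order as documented: RealWord, then Misspell, Informal, ToMerge.
    private static final List<String> PRIORITY =
        List.of("RealWord", "Misspell", "Informal", "ToMerge");

    // Lower rank means higher priority; unknown types go last (assumption).
    static int rank(String type) {
        int idx = PRIORITY.indexOf(type);
        return idx < 0 ? PRIORITY.size() : idx;
    }

    // Pick the tag type that would act as the key among tags on the same span.
    static Optional<String> pickKeyType(List<String> tagTypes) {
        return tagTypes.stream().min(Comparator.comparingInt(MultiTagKeySketch::rank));
    }

    public static void main(String[] args) {
        System.out.println(pickKeyType(List.of("Informal", "Misspell")).get()); // prints Misspell
    }
}
```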

        • Case 3: Span-Overlap
          => No overlap issue was found in this data set, so span-overlap is excluded from the code.

        • Case 4: Normal Correction
          => All other tags that are not span-related are already sorted in descending order of start position (largest first)
      • The Utility Tools are used to compute all the information needed (such as the corrected NewTarTxt) for Span (Contain, Multi-Tag, and Overlap) and RealWord.
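Case 4 processes tags from the largest start position down. A likely reason, stated here as an assumption, is that replacing text from the end of a file toward its beginning leaves the offsets of the not-yet-processed tags unchanged. The sketch below shows that idea with hypothetical names (`DescendingReplaceSketch`, `Edit`, `apply`) and assumes start-inclusive, end-exclusive offsets and non-overlapping spans.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class DescendingReplaceSketch {
    // Hypothetical stand-in for a tag: a span plus its correction (NewTarTxt).
    static final class Edit {
        final int start;
        final int end;
        final String newTarTxt;

        Edit(int start, int end, String newTarTxt) {
            this.start = start;
            this.end = end;
            this.newTarTxt = newTarTxt;
        }
    }

    // Apply non-overlapping edits, largest start position first,
    // so earlier offsets stay valid after each replacement.
    static String apply(String text, List<Edit> edits) {
        List<Edit> sorted = new ArrayList<>(edits);
        sorted.sort(Comparator.comparingInt((Edit e) -> e.start).reversed());
        StringBuilder sb = new StringBuilder(text);
        for (Edit e : sorted) {
            sb.replace(e.start, e.end, e.newTarTxt);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Edit> edits = List.of(
            new Edit(2, 5, "have"),
            new Edit(6, 13, "diabetes"));
        System.out.println(apply("i hav diabtes", edits)); // prints i have diabetes
    }
}
```

Had the edits been applied in ascending order instead, the first replacement would lengthen the text by one character and the second span (6 to 13) would no longer line up with "diabtes".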
  • Results & Stats:
    • Total question files: 224
    • Total tokens: 16,707
    • Total Brat tags: 1,946

      Tag Type      Correction Text                      Number
      ------------  -----------------------------------  ------
      ToSplit       Annotation Notes                     164
      ToSplitOnP    Auto-generated                       320
      ToMerge       Auto-generated                       27
      Misspell      Annotation Notes                     438
      Informal      Annotation Notes (only correctable)  413 (ABB: 102; ACR: 107; ': 106; Correctable: 98)
      RealWord      Annotation Notes                     223 (= 1195 + 14 - 986)
      Punctuation   N/A                                  246
      WordExists    N/A                                  79
      Unknown       N/A                                  21
      Garbage       N/A                                  15
      Oov           N/A                                  0
  • Total corrections for Non-Word Only: 986
  • Total corrections for Real-Word Included: 1195 (+ 14 RealWord-Containees)
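The figures above can be cross-checked with a little arithmetic: the per-type counts should add up to the total number of Brat tags, the Informal breakdown to 413, and the RealWord formula to 223. The class below is only a convenience sketch with hypothetical names; the numbers are copied from the table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TagStatsCheck {
    // Per-type tag counts, copied from the results table.
    static Map<String, Integer> tagCounts() {
        Map<String, Integer> m = new LinkedHashMap<>();
        m.put("ToSplit", 164);
        m.put("ToSplitOnP", 320);
        m.put("ToMerge", 27);
        m.put("Misspell", 438);
        m.put("Informal", 413);
        m.put("RealWord", 223);
        m.put("Punctuation", 246);
        m.put("WordExists", 79);
        m.put("Unknown", 21);
        m.put("Garbage", 15);
        m.put("Oov", 0);
        return m;
    }

    static int totalTags() {
        return tagCounts().values().stream().mapToInt(Integer::intValue).sum();
    }

    // Informal breakdown: ABB + ACR + apostrophe + correctable.
    static int informalTotal() {
        return 102 + 107 + 106 + 98;
    }

    // RealWord count as given in the table: 1195 + 14 - 986.
    static int realWordCount() {
        return 1195 + 14 - 986;
    }

    public static void main(String[] args) {
        System.out.println(totalTags());      // prints 1946
        System.out.println(informalTotal());  // prints 413
        System.out.println(realWordCount());  // prints 223
    }
}
```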