CSpell

Computer-Aided Revision

A set of computer-aided programs was developed to validate and revise the reconciled Brat annotation data. They are described as follows:

  • Directory:
    • Top directory: ${C_SPELL}/PostProcess
    • Binary directory: ${C_SPELL}/PostProcess/bin
    • Data directory: ${C_SPELL}/PostProcess/data/Brat/NewTestRevised/
    • Brat directory: ${C_SPELL}/PostProcess/data/Brat/NewTestRevised/brat

  • Programs and Algorithm:
    • ${C_SPELL}/PostProcess/bin/PostBratNewTest
      2
      • 1
        • Convert tags from Brat format to pipe-separated format
        • Auto-derive corrected text (tarTxt) for Merge, ToSplitOnPunct, and correctable Informal tags (not ABB:, ACR:, or containing "'")
        • Send output to ./tags/*.ann.rpt
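
The conversion in step 1 can be pictured with a minimal sketch. It assumes the standard Brat standoff layout (tab-separated text-bound "T" lines and "#" annotator-note lines) and assumes the correction is stored in the annotator note. The function name brat_to_pipe and the output field order (id|type|start|end|srcTxt|tarTxt) are hypothetical and not taken from PostBratNewTest.

```python
def brat_to_pipe(ann_text):
    """Convert Brat standoff lines into pipe-separated records (illustrative only)."""
    tags = {}   # tag id -> [type, start, end, source text]
    notes = {}  # tag id -> note text (assumed here to hold the correction)
    for line in ann_text.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # not a text-bound tag or a note
        ident, info, text = parts[0], parts[1], parts[2]
        fields = info.split(" ")
        if ident.startswith("T"):
            if len(fields) != 3:
                continue  # discontinuous spans are skipped in this sketch
            tag_type, start, end = fields
            tags[ident] = [tag_type, start, end, text]
        elif ident.startswith("#") and len(fields) == 2:
            # "AnnotatorNotes T1" means the note is attached to tag T1
            notes[fields[1]] = text
    # Hypothetical field order: id|type|start|end|srcTxt|tarTxt
    return ["|".join([tid] + tags[tid] + [notes.get(tid, "")]) for tid in tags]


sample = (
    "T1\tMisspelling 10 17\tdiabets\n"
    "#1\tAnnotatorNotes T1\tdiabetes\n"
    "T2\tToMerge 20 28\tany more\n"
)
records = brat_to_pipe(sample)
```

In this sketch a tag without a note gets an empty tarTxt field; the auto-derivation of tarTxt described above is not modeled.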

      • 2
        • Retrieve all tags from step-1 and send to ./reports/Tag*.rpt
        • Check all tags
          Tags and their check items:
          ToSplit
          • Check if there is no correction
          • Check if it is a legitimate split
          • Check if newTarTxtLc is a valid term
          ToSplitOnPunct
          • Check if there is no correction
          • Check if it is a legitimate split after punctuation
          • Check special cases
          • Check if newTarTxtLc is a valid term
          ToMerge
          • Check if there is no correction
          • Check if it is a legitimate merge operation
          • Check if newTarTxtLc is a valid term
          Misspelling
          • Check if there is a correction
          • Check if it is a misspelling
          • Check if newTarTxtLc is a valid term
          Informal
          • Check if there is a correction
          • Check if it is a correctable informal type
          • Check if newTarTxtLc is a valid term
          RealWord
          • Check if there is a correction
          • Check if srcTxtLc consists of legitimate real words
          • Check if newTarTxtLc is a valid term
          OutOfVocabulary
          • Should not exist
          WordExists
          • Check if srcTxt is a valid term
          Punctuation
          • Not checked
          Garbage
          • Not checked
          Unknown
          • Not checked
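
As an illustration of how one row of these checks might be coded, here is a rough sketch for the Misspelling tag. The exact rules used by the program are not given here, so everything in the block is an assumption: "there is a correction" is taken to mean a non-empty target that differs from the source, "is a misspelling" is taken to mean the lowercased source is not in the dictionary, and the function name check_misspelling and the set-based lexicon are hypothetical.

```python
def check_misspelling(src_txt, tar_txt, lexicon):
    """Return a list of problems found for a Misspelling tag (assumed rules)."""
    problems = []
    src_lc, tar_lc = src_txt.lower(), tar_txt.lower()
    # Assumed meaning of "there is a correction": target exists and differs
    if not tar_lc or tar_lc == src_lc:
        problems.append("no correction")
    # Assumed meaning of "is a misspelling": source is not a dictionary term
    if src_lc in lexicon:
        problems.append("source is a valid term, not a misspelling")
    # The corrected text should be a valid dictionary term
    if tar_lc and tar_lc not in lexicon:
        problems.append("target is not a valid term")
    return problems


lexicon = {"diabetes", "dog"}  # stand-in for the real dictionary
ok = check_misspelling("diabets", "diabetes", lexicon)
bad = check_misspelling("dog", "dog", lexicon)
```

An empty list means the tag passed all three checks; any other result would go to a report for manual review.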

        • Real-Word Checks
          • Check if srcTxt consists of real words
          • Check if it is multi-tagged as RealWord
          • Check if it is a correctable Informal
          • Send to ./reports/realWordErr.rpt
          • ./reports/realWordErr.rpt.ok lists the acceptable exceptions

        From our experience, two types of errors are commonly seen in spelling annotation.

        • Words that should carry a real-word tag are tagged solely with other tags (misspelling, merge, split, etc.). It is very hard for annotators to identify all real words, such as abbreviations, acronyms, and proper nouns. Consumer data are not sensitive to case (UPPER CASE, lowercase, Mixed Case) or punctuation, which makes it even harder for annotators to identify real words. Using a computer-aided program (with the Lexicon as the default dictionary) to identify real words is a good way to resolve this issue.
        • Spelling errors are left untagged. This issue was taken care of during the generation of this test set. A computer-aided program is used to tag all words that are not in the Lexicon as OOV. All OOV tags must be evaluated and changed by annotators during the annotation process. By the end of annotation, no OOV tags should remain.
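
The OOV pre-tagging idea can be sketched as follows. This is only an illustration under stated assumptions: the tokenizer (a simple regular expression for alphabetic words with an optional apostrophe part), the lowercase lookup, and the function name tag_oov are hypothetical, and a small Python set stands in for the Lexicon.

```python
import re

# Assumed tokenization: alphabetic words, optionally with one apostrophe part
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def tag_oov(text, lexicon):
    """Return (start, end, word) for every word not found in the lexicon."""
    oov = []
    for match in WORD_RE.finditer(text):
        word = match.group()
        if word.lower() not in lexicon:  # case-insensitive lookup (assumption)
            oov.append((match.start(), match.end(), word))
    return oov


lexicon = {"i", "have", "and", "pain"}  # stand-in for the real Lexicon
oov_tags = tag_oov("I have diabets and pain", lexicon)
```

Each returned span would become an OOV tag for annotators to review; an empty result means every word was found in the dictionary.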

      • 3

        Check Brat tag spans - the purpose of this check is to ensure that the gold standard is generated correctly for the contain, multi-tag, and overlap cases, for both non-words and real words

        • Check Bto span - contain and multi-tag
          • Contain is when a Bto (containee) is inside another Bto (container)
          • Multi-tag is when two Btos have the same span
          • Retrieve Btos in a contain case (both containers and containees) that have multi-tags
          • Retrieve Btos in a contain case (both containers and containees) that have a RealWord Bto (by the span)
        • Check Bto span - overlap
          • Overlap is when the spans of two Btos overlap each other
          • Find Btos with overlapping spans to make sure the gold standard is generated correctly
          • RealWord: ./reports/checkBtoSpans.rpt.overlap.realWord
            => Make sure the corrections are within the overlap range
          • NonWord: ./reports/checkBtoSpans.rpt.overlap.nonWord
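
The three span relations defined in step 3 can be sketched as a pairwise classification. This is an illustrative sketch, not the program's actual code: it assumes each Bto is reduced to a (start, end) character span with an exclusive end, and it treats "overlap" as any intersection that is neither an identical span nor a containment. The names span_relation and check_spans are hypothetical.

```python
from itertools import combinations


def span_relation(a, b):
    """Classify two (start, end) spans; end offsets are exclusive."""
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end <= b_start or b_end <= a_start:
        return "disjoint"
    if (a_start, a_end) == (b_start, b_end):
        return "multi-tag"  # two tags on exactly the same span
    if a_start <= b_start and b_end <= a_end:
        return "contain"    # a is the container, b the containee
    if b_start <= a_start and a_end <= b_end:
        return "contain"    # b is the container, a the containee
    return "overlap"        # spans intersect without containment


def check_spans(tags):
    """Group tag-id pairs by span relation; tags maps id -> (start, end)."""
    report = {"multi-tag": [], "contain": [], "overlap": []}
    for (id_a, span_a), (id_b, span_b) in combinations(tags.items(), 2):
        relation = span_relation(span_a, span_b)
        if relation in report:
            report[relation].append((id_a, id_b))
    return report


tags = {"T1": (0, 10), "T2": (2, 5), "T3": (0, 10), "T4": (8, 14), "T5": (20, 25)}
report = check_spans(tags)
```

Comparing every pair is quadratic in the number of tags, which is likely acceptable for the handful of tags in a single annotated document; sorting spans by start offset would be one way to reduce the work for larger inputs.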

  • Revision Logs