CSpell

Ensemble Gold Standard Revision

I. Introduction

Annotations were corrected based on the Ensemble factor analysis results, and the gold standard was modified accordingly. This revised gold standard is used as the training set to evaluate performance while developing CSpell. The original and revised Brat annotations are summarized below:

| Item | Original | Revised | Notes |
|---|---|---|---|
| Files | 472 | 471 | Removed 1 non-English file |
| Tokens | 25186 | 24837 | |
| Brat tags | 1052 | 1055 | Revised total = 793 + 196 + 47 + 19 |
| NonWord tags (Misspelling, ToSplit, ToMerge, Punctuation) | 834 | 793 | Includes non-word, real-word, ND, and multiple tags: some misspellings are ND or multiple corrections; most punctuation tags are ND corrections, a few are real-word or multiple corrections; some ToMerge and ToSplit tags are real-word or multiple |
| RealWord tags | 154 | 196 | Added 42 RealWord tags |
| Grammatical tags | 47 | 47 | Not used in the CSpell training set |
| Overlap tags | 17 | 19 | Includes real-word and multiple tags |
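
The tag totals in the summary can be cross-checked with a few lines of arithmetic. The sketch below copies the counts from the summary; the dictionary layout and function names are illustrative only and are not part of any CSpell tool.

```python
# Tag counts copied from the summary of original vs. revised Brat annotations.
# The data layout and helper names here are illustrative assumptions.
counts = {
    "NonWord":     {"original": 834, "revised": 793},
    "RealWord":    {"original": 154, "revised": 196},
    "Grammatical": {"original": 47,  "revised": 47},
    "Overlap":     {"original": 17,  "revised": 19},
}
reported_total = {"original": 1052, "revised": 1055}


def total(version):
    """Sum all tag categories for the given version."""
    return sum(row[version] for row in counts.values())


def training_total(version):
    """Total minus grammatical tags, which the training set does not use."""
    return total(version) - counts["Grammatical"][version]


for version in ("original", "revised"):
    # Each category sum should match the reported number of Brat tags.
    assert total(version) == reported_total[version]
    print(version, total(version), training_total(version))
```

For the revised annotation this gives 1055 Brat tags, of which 1008 remain once the 47 grammatical tags are excluded.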

II. Revision Results (143)

  • Remove non-English file - 11199.txt (11199.ann)
  • Not in Brat annotation - incorrect spelling
    • 46 instances found (detail logs)
      • 45 instances are new tags (terms that need to be corrected)
      • 1 instance is updated from original Brat tags

  • In Brat annotation - correct spelling for original terms
    • 49 instances found (detail logs)
      • 9 instances are deleted from original Brat tags (the original term is correct)
      • 40 new instances (of RealWord) are added to original Brat tags

  • In Brat annotation - incorrect spelling for corrected terms
    • 6 instances found (detail logs)
      • 4 instances are updated (corrected terms changed) from original Brat tags
      • 2 new instances (of RealWord) are added to original Brat tags

  • In Brat annotation - terms with multiple tags (overlapping positions)
    • 35 instances of overlap tags with RealWord found in Ensemble nonWord goldStd (detail logs)
      • 33 instances are updated (from RealWord)
      • 1 instance is deleted
      • 1 new instance is added

  • In Brat annotation - extra spaces for original and correct terms
    • 8 instances found (detail logs)
      This fix is needed because the Ensemble performance evaluation program does not handle extra spaces consistently.
      • 5 instances are updated in the original Brat, for original terms with extra spaces (removed)
      • 3 instances are updated in the original Brat, for corrected terms with extra spaces (removed)
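
Terms carrying extra spaces could be flagged automatically before a manual fix like the one above. The following is a minimal sketch, assuming "extra space" means leading, trailing, or repeated internal whitespace; the source does not define it precisely, and the function names are hypothetical.

```python
import re


def normalize_spaces(term: str) -> str:
    """Collapse whitespace runs to one space and trim both ends.

    Assumption: this is what "extra spaces removed" means; the revision
    log itself does not spell out the rule.
    """
    return re.sub(r"\s+", " ", term).strip()


def has_extra_space(term: str) -> bool:
    """True if the term would change under normalize_spaces."""
    return term != normalize_spaces(term)


# Made-up example terms, not taken from the actual annotation files.
examples = ["diagnosed ", " any  one", "blood pressure"]
flagged = [t for t in examples if has_extra_space(t)]
```

Applying the same normalization to both the original and the corrected term keeps the comparison independent of how an evaluation program treats stray whitespace.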

III. Revision Program

Generate gold standard from:

  • Original Brat annotation
  • Manually Revised annotation file (from above revision log)
  • Please note that the Brat tags include 47 grammatical tags, which are not used in the training set gold standard. Thus, the total count of tags is 1008 (= 1055 - 47).
  • Some tags overlap and require a special algorithm when generating the gold standard.
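
The two notes above (dropping grammatical tags and handling overlapping tags) can be illustrated with a small sketch over Brat standoff text-bound annotations, which have the form `ID<TAB>Label start end<TAB>text`. This is not the actual revision program: the label strings (`Grammatical`, `Misspelling`, and so on), the function names, and the sample lines are assumptions, and the sketch only detects overlaps; how the real program resolves them is not described here.

```python
from typing import NamedTuple


class Tag(NamedTuple):
    tag_id: str
    label: str
    start: int
    end: int
    text: str


def parse_ann(ann_text):
    """Parse text-bound ('T') lines of a Brat .ann file into Tag records."""
    tags = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, notes, and other annotation kinds
        tag_id, span, text = line.split("\t", 2)
        label, offsets = span.split(" ", 1)
        # Discontinuous spans ("0 5;10 15") are reduced to first start / last end.
        parts = offsets.replace(";", " ").split()
        tags.append(Tag(tag_id, label, int(parts[0]), int(parts[-1]), text))
    return tags


def drop_grammatical(tags, label="Grammatical"):
    """Remove grammatical tags; the label string is an assumed name."""
    return [t for t in tags if t.label != label]


def find_overlaps(tags):
    """Return pairs of tags whose character spans overlap."""
    ordered = sorted(tags, key=lambda t: (t.start, t.end))
    overlaps = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                break  # sorted by start, so no later tag can overlap a
            overlaps.append((a, b))
    return overlaps


# Made-up sample annotation, for illustration only.
sample = (
    "T1\tMisspelling 0 5\tdiagn\n"
    "T2\tRealWord 0 5\tdiagn\n"
    "T3\tGrammatical 10 14\twere\n"
    "T4\tToMerge 20 27\tany one\n"
)
kept = drop_grammatical(parse_ann(sample))
overlap_pairs = find_overlaps(kept)
```

In this sample, one grammatical tag is dropped and the two tags sharing the span 0 to 5 are reported as an overlapping pair that would need special handling.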