CSpell

Issues with the Original Ensemble Gold Standard

Some issues found in the Ensemble gold standard data set:

  • The gold standard generated from the brat annotation is incomplete:
    • In the non-word gold standard, there are 7 annotations that do not have a corrected spelling. The original program does not change the annotated text of these annotations at all.
      	- Warning: no tarTxt: 62|T1|ToMerge|173|182|T_OK|life long|
      	- Warning: no tarTxt: 14514|T9|ToMerge|334|341|T_OK|my selp|
      	- Warning: no tarTxt: 16823|T2|ToMerge|36|46|T_OK|After noon|
      	- Warning: no tarTxt: 18203|T1|ToSplit|60|71|T_OK|PTHrPeptide|
      	- Warning: no tarTxt: 11665|T1|Misspelling|90|96|T_OK|Btensl|
      	- Warning: no tarTxt: 15759|T22|Misspelling|366|376|T_OK|depresants|
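
A check of this kind can be sketched as follows. This is a minimal illustration, not the original program: it assumes each gold-standard record is pipe-delimited as file|tag|type|start|end|status|source text|target text (a layout inferred from the warning lines above), and the names `parse_record` and `find_missing_targets` are hypothetical.

```python
# Sketch only: the field layout is inferred from the warning lines above.
FIELDS = ("file", "tag", "type", "start", "end", "status", "src", "tar")

def parse_record(line):
    """Split one pipe-delimited record into a dict (assumed 8 fields)."""
    parts = line.rstrip("\n").split("|")
    # Pad so that a record with no target text still yields a "tar" key.
    parts += [""] * (len(FIELDS) - len(parts))
    return dict(zip(FIELDS, parts))

def find_missing_targets(lines):
    """Return the records whose corrected spelling (target text) is empty."""
    records = (parse_record(line) for line in lines if line.strip())
    return [r for r in records if not r["tar"].strip()]

sample = [
    "62|T1|ToMerge|173|182|T_OK|life long|",
    "11665|T1|Misspelling|90|96|T_OK|Btensl|",
    "23|T5|Misspelling|255|258|T_C_T4|plz|please",
]
for rec in find_missing_targets(sample):
    print("Warning: no tarTxt:", rec["file"], rec["tag"], rec["src"])
```

Only the first two sample records are reported, because the third one carries a target text ("please").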
      	
    • A new program was developed to regenerate the gold standard and check the algorithm. Comparing its output with the original gold standard shows 5 files with 6 differences:
      • Unicode: 4 files with 5 differences (11199.txt, 12085.txt, 12624.txt, 13090.txt)
        => These are false positives (FP): the diff does not handle Unicode well and the baseline does not use UTF-8.
      • extra space: 1 file (73.txt)
    • nonWord.diff.txt
  • The program used to calculate precision, recall, and F1 does not appear to work entirely correctly. Here are some observed issues (using non-word as the example):
    • There are 851 annotation tags from brat for the non-word gold standard (misspelling, merge, split, punctuation). However, the program reports only 814 total relevant (TP + FN).

    • Using 2.txt as an example, there are 4 differences:
      • 2|T2|Punctuation|1423|1432|Thank-you|Thank you
        => Contained in another tag:
        2|T20|ToSplit|1414|1432|anorexia?Thank-you|anorexia? Thank you

      • 2|T12|ToSplit|831|844|anorexia?8) |anorexia? 8)
      • 2|T10|Misspelling|773|781|year?(in|year? (in
      • 2|T9|ToSplit|701|712|anorexia?6)|anorexia? 6)
        => Annotations containing '?' appear not to be counted (possible bug?)

      ================= Not Included ==================
      2|T12|ToSplit|831|844|anorexia?8)  |anorexia? 8)
      2|T10|Misspelling|773|781|year?(in|year? (in
      2|T9|ToSplit|701|712|anorexia?6)|anorexia? 6)
      ================= Included by other tag T20 =========
      2|T2|Punctuation|1423|1432|Thank-you|Thank you
      ================== TP ==================
      2|T11|ToSplit|791|796|7)How|7) How
      2|T18|ToSplit|1257|1263|14)Who|14) Who
      2|T13|ToSplit|910|915|9)Can|9) Can
      2|T4|ToSplit|433|440|1)Where|1) Where
      2|T17|ToSplit|1203|1209|13)Why|13) Why
      2|T15|ToSplit|1054|1061|11)What|11) What
      2|T14|ToSplit|978|984|10)How|10) How
      2|T5|ToSplit|477|483|2)When|2) When
      2|T19|ToSplit|1352|1358|one(or|one (or
      2|T16|ToSplit|1137|1143|12)Are|12) Are
      2|T8|ToSplit|675|681|5)What|5) What
      2|T7|ToSplit|617|623|4)What|4) What
      2|T6|ToSplit|536|541|3)Why|3) Why
      ==================== FN ==================
      2|T3|Misspelling|107|116|year-long|yearlong
      2|T20|ToSplit|1414|1432|anorexia?Thank-you|anorexia? Thank you
      2|T1|Misspelling|311|323|MedicinePlus|MedlinePlus
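
The 2.txt breakdown above can be tallied as a quick sanity check. This is a minimal sketch that uses only the counts in the listing; the variable names are illustrative. No FP count for 2.txt is given here, so only recall is shown, and the two bounds assume the three uncounted annotations would end up as either all FN or all TP once counted.

```python
# Counts for 2.txt, taken from the listing above.
not_included = 3   # T9, T10, T12 (annotations containing '?')
contained = 1      # T2, contained in T20
tp = 13
fn = 3

total_tags = not_included + contained + tp + fn
reported_relevant = tp + fn                  # what the program counts
expected_relevant = total_tags - contained   # a contained tag is counted once

print(total_tags, reported_relevant, expected_relevant)  # 20 16 19

# Recall as reported, then hypothetical bounds if the 3 uncounted
# annotations were included (assumption: all FN, or all TP).
recall_reported = tp / reported_relevant
recall_low = tp / expected_relevant
recall_high = (tp + not_included) / expected_relevant
print(round(recall_reported, 4), round(recall_low, 4), round(recall_high, 4))
```

For this single file, the program sees 16 relevant annotations where 19 would be expected, which is the same kind of gap as the 814 versus 834 totals.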
      

    • The non-word gold standard should have 834 total relevant:
      • number of misspelling tags: 436
      • number of split tags: 312
      • number of merge tags: 45
      • number of punctuation tags: 58

      • number of tags duplicated by containment, that is, contained in another non-word tag (containment in a real-word or grammatical tag is not counted): 17
        	2|T2|Punctuation|1423|1432|T_C_T20|Thank-you|Thank you
        	23|T5|Misspelling|255|258|T_C_T4|plz|please
        	11186|T19|Misspelling|522|525|T_C_T5|pls|please
        	11186|T9|Misspelling|360|367|T_C_T16|SEGMENS|SEGMENTS
        	11243|T4|Misspelling|51|55|T_C_T1|neef|need
        	11243|T2|Misspelling|42|51|T_C_T1|menimgtis|meningitis
        	12235|T1|Misspelling|80|88|T_C_T8|treatmet|treatment
        	13347|T7|Misspelling|137|140|T_C_T3|plz|please
        	14514|T7|Misspelling|337|341|T_C_T9|selp|self
        	15759|T22|Misspelling|366|376|T_C_T1|depresants|
        	16481|T8|Misspelling|421|425|T_C_T5|anyt|any
        	17170|T5|ToMerge|138|143|T_C_T8|i 'll|. i'll
        	17170|T4|Misspelling|130|136|T_C_T8|"|"
        	17740|T12|Misspelling|429|437|T_C_T6|treament|treatment
        	17757|T9|Punctuation|692|695|T_C_T8|etc|etc.
        	18341|T9|Punctuation|168|171|T_C_T4|etc|etc.
        	18341|T3|ToMerge|122|134|T_C_T7|cryo surgery|cryosurgery
        	

      • number of non-word tags: 436 + 312 + 45 + 58 - 17 = 834 (not 814; is that number from the baseline?)
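
The containment rule and the arithmetic above can be sketched as follows. This is an illustration under stated assumptions, not the program used to produce the list: the function `contained_tags` is hypothetical, a tag is treated as a duplicate when its character span lies inside a strictly larger span in the same file, and the tag-type filter (excluding real-word and grammatical containers) is left out for brevity.

```python
def contained_tags(tags):
    """Return {tag_id: containing_tag_id} for spans that lie inside another span.

    `tags` is a list of (tag_id, start, end) tuples for a single file.
    """
    result = {}
    for tag_id, start, end in tags:
        for other_id, o_start, o_end in tags:
            if other_id == tag_id:
                continue
            inside = o_start <= start and end <= o_end
            strictly_smaller = (end - start) < (o_end - o_start)
            if inside and strictly_smaller:
                result[tag_id] = other_id
                break
    return result

# Spans from 2.txt listed earlier: T2 (1423-1432) lies inside T20 (1414-1432).
tags_2 = [("T2", 1423, 1432), ("T20", 1414, 1432), ("T3", 107, 116)]
print(contained_tags(tags_2))  # {'T2': 'T20'}

# Tag counts from the list above.
counts = {"misspelling": 436, "split": 312, "merge": 45, "punctuation": 58}
duplicates = 17
total = sum(counts.values())
print(total, total - duplicates)  # 851 834
```

The first total (851) matches the number of brat annotation tags, and subtracting the 17 contained tags gives the expected 834 total relevant.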