CSpell

Factor Analysis Results

I. Error Types

  • Tokens in the Brat annotation data (spelling errors to be corrected) and tokens not in the Brat annotation data (not spelling errors, should not be corrected) are tested by a set of computer programs. Each type of error is identified and coded in the program for further processing, as shown below:

    Correction Type | Details
    PreCorrection
    • B1.1. PreCorr (T)
    • B1.2. PreCorr (F)
    Dictionary-based Correction
    • Spelling detector
    • Candidates
    • Ranking
    • B2.1. DicCorr (T)
    • B2.2. DicCorr (F)
      • B2.2.1. Not detect, real-word (error tag)
      • B2.2.2. Not detect, spelling error (non-word)
      • B2.2.3. Detect, not candidates by edit-distance
      • B2.2.4. Detect, not candidates by suggestion Dic
      • B2.2.5. Detect, not candidates by multi-corrections
      • B2.2.6. Detect, candidates, wrong (not top) rank
      • B2.2.7. Detect, candidates, wrong top rank
    Combination | TBD

  • Tokens not in Brat annotation data (Correct spelling, no need to be corrected)

    Correction Type | Details
    Not in checkDic, Not Correct
    • A2.2.1. Not in checkDic, corrected wrong, by dictionary
    • A2.2.2. Not in checkDic, corrected wrong, by preCorrection

II. Analysis Results

The results on the baseline data are shown below:

  • PreCorrection (365, 43.8175%) :
    • T: 332 (90.9589%)
    • F: 33 (9.0411%)

  • Dictionary-based correction:
    • * Lexicon.E: uses Lexicon, with Aa, unit, and Mw (includes spVar and Pn)
    • ** Combo1: uses Lexicon.E, replacing suggDic with the baseline (eng_med.dic)
    • *** Combo2: uses Lexicon.E+Medline, replacing suggDic with the baseline (eng_med.dic)

    Performance (by Baseline program)

    Method | TP | Ret. | Rel. | Precision | Recall | F1
    --- | --- | --- | --- | --- | --- | ---
    Jazzy | 498 | 2606 | 814 | 0.19 | 0.62 | 0.29
    Baseline | 548 | 845 | 814 | 0.65 | 0.67 | 0.66
    Medline | 524 | 809 | 814 | 0.65 | 0.64 | 0.64
    Lexicon | 535 | 829 | 814 | 0.65 | 0.66 | 0.65
    Lexicon.E* | 534 | 814 | 814 | 0.66 | 0.66 | 0.66
    Combo1** | 543 | 737 | 814 | 0.74 | 0.67 | 0.70
    Combo2*** | 529 | 695 | 814 | 0.76 | 0.65 | 0.70

    Results by error type

    Issue | Error type | Jazzy | Baseline | Medline | Lexicon | Lexicon.E* | Combo1** | Combo2***
    --- | --- | --- | --- | --- | --- | --- | --- | ---
    Tagged terms (833), should be corrected | B2.1. DicCorr (T) | 227 (48.5043%) | 232 (49.5726%) | 205 (43.8034%) | 234 (50.0000%) | 235 (50.2137%) | 226 (48.2906%) | 210 (44.8718%)
     | B2.2. DicCorr (F) | 241 (51.4957%) | 236 (50.4274%) | 263 (56.1966%) | 234 (50.0000%) | 233 (49.7863%) | 242 (51.7094%) | 258 (55.1282%)
    Tag issue: re-check the annotation | B2.2.1. Not detect, real-word (error tag) | 36 (7.6923%) | 49 (10.4701%) | 43 (9.1880%) | 50 (10.6838%) | 50 (10.6838%) | 50 (10.6838%) | 50 (10.6838%)
    Detection issue: Check dictionary + exception algorithm | B2.2.2. Not detect, spelling error (non-word) | 20 (4.2735%) | 54 (11.5385%) | 76 (16.2393%) | 57 (12.1795%) | 57 (12.1795%) | 57 (12.1795%) | 85 (18.1624%)
    Candidate issue: edit distance + phonetic + Suggesting dictionary | B2.2.3. Detect, not candidates by edit-distance | 37 (7.9060%) | 34 (7.2650%) | 29 (6.1966%) | 32 (6.8376%) | 32 (6.8376%) | 32 (6.8376%) | 28 (5.9829%)
     | B2.2.4. Detect, not candidates by suggestion Dic | 79 (16.8803%) | 11 (2.3504%) | 19 (4.0598%) | 17 (3.6325%) | 20 (4.2735%) | 15 (3.2051%) | 15 (3.2051%)
     | B2.2.5. Detect, not candidates by multi-corrections | 2 (0.4274%) | 6 (1.2821%) | 13 (2.7778%) | 5 (1.0684%) | 5 (1.0684%) | 6 (1.2821%) | 6 (1.2821%)
    Ranking issue: in candidate list | B2.2.6. Detect, Candidates, wrong (not top) rank | 62 (13.2479%) | 75 (16.0256%) | 77 (16.4530%) | 65 (13.8889%) | 57 (12.1795%) | 75 (16.0256%) | 69 (14.7436%)
     | B2.2.7. Detect, Candidates, wrong top rank | 5 (1.0684%) | 7 (1.4957%) | 6 (1.2821%) | 8 (1.7094%) | 12 (2.5641%) | 7 (1.4957%) | 5 (1.0684%)
    Valid word (not-tagged), but not in checkDic, corrected wrong | A2.2.1. Not in checkDic, corrected wrong, by Dic | 1912 (7.8287%) | 139 (0.5691%) | 121 (0.4954%) | 143 (0.5855%) | 137 (0.5609%) | 70 (0.2866%) | 51 (0.2088%)
     | A2.2.2. Not in checkDic, corrected wrong, by Pre | 41 (0.1679%) | 33 (0.1351%) | 27 (0.1106%) | 31 (0.1269%) | 31 (0.1269%) | 31 (0.1269%) | 26 (0.1065%)

    Summary

    Summary | Sum of | Jazzy | Baseline | Medline | Lexicon | Lexicon.E* | Combo1** | Combo2***
    --- | --- | --- | --- | --- | --- | --- | --- | ---
    Check Dic | B2.2.2+A2.2.1+A2.2.2 | 1973 | 226 | 224 | 231 | 225 | 158 | 162
    Sugg Dic | B2.2.3+B2.2.4+B2.2.5 | 118 | 51 | 61 | 54 | 57 | 53 | 49
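
The precision, recall, and F1 figures in the Performance rows can be recomputed from the TP|Ret.|Rel. counts. The following is a minimal sketch, assuming that Ret. is the number of corrections a method returned and Rel. is the number of tagged errors that should be corrected (the usual retrieved/relevant reading); the function name `precision_recall_f1` is hypothetical and is not part of CSpell.

```python
def precision_recall_f1(tp, retrieved, relevant):
    """Return (precision, recall, F1) from raw counts.

    Assumes tp = true positives, retrieved = corrections made,
    relevant = errors that should have been corrected.
    """
    precision = tp / retrieved
    recall = tp / relevant
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the Baseline and Combo1 columns of the table.
for name, counts in {"Baseline": (548, 845, 814), "Combo1": (543, 737, 814)}.items():
    p, r, f = precision_recall_f1(*counts)
    print(f"{name}: P={p:.2f} R={r:.2f} F1={f:.2f}")
# Baseline: P=0.65 R=0.67 F1=0.66
# Combo1: P=0.74 R=0.67 F1=0.70
```

Recomputed values can differ from the table in the last digit for some columns (for example, Jazzy recall comes out as 0.61 rather than 0.62), presumably because of how the original figures were rounded.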

  • Edit Distance:
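    Edit distance | Instances | Percentage | Accumulated percentage
    --- | --- | --- | ---
    1 | 317 | 67.74% | 67.74%
    2 | 110 | 23.50% | 91.24%
    3 | 24 | 5.13% | 96.37%
    4 | 8 | 1.71% | 98.08%
    5 | 6 | 1.28% | 99.36%
    6 | 2 | 0.43% | 99.79%
    7 | 1 | 0.21% | 100.00%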
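
The accumulated percentages in the edit distance table follow directly from the instance counts, which sum to 468. Below is a short sketch of that arithmetic; the name `cumulative_coverage` is hypothetical and is used only for illustration.

```python
from itertools import accumulate

# Instance counts per edit distance, copied from the table above.
INSTANCES = {1: 317, 2: 110, 3: 24, 4: 8, 5: 6, 6: 2, 7: 1}


def cumulative_coverage(instances):
    """Map each edit distance to the percentage of instances at or below it."""
    distances = sorted(instances)
    total = sum(instances.values())
    running = accumulate(instances[d] for d in distances)
    return {d: 100 * count / total for d, count in zip(distances, running)}


coverage = cumulative_coverage(INSTANCES)
for distance, pct in coverage.items():
    print(f"edit distance <= {distance}: {pct:.2f}%")
```

Read this way, a candidate generator limited to edit distance 2 would reach roughly 91% of these instances, and the remaining distances add only a few percent each.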