Training Set
II. Description
We combined the training set and the test set from the Ensemble method into a single training set for developing CSpell. This combined training set is summarized as follows:
Statistic | Value |
---|---|
Consumer health questions | 471* |
Tokens | 24,837 |
Annotation tags | 1,008 |
Instances of non-word corrections | 774 |
Instances of real-word corrections | 964 |
Word count per question | 5 - 328 |
Average word count per question | 52.49 |
Errors per question | 0 - 27 |
Average errors per question | 2.14 |
Error rate (errors per token) | 0.04 (= 964/24,837) |
*One question (11199.txt) was removed from the Ensemble method data because it contains too many non-English words.
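The two derived rows in the summary can be re-computed from the raw counts. The sketch below is only a sanity check, not part of CSpell. It assumes that the average errors per question is the number of annotation tags divided by the number of questions (this reproduces 2.14, but the source does not state the formula), and it takes the error-rate numerator, 964, directly from the table. All variable names are illustrative.

```python
# Counts copied from the summary table above.
questions = 471
tokens = 24_837
annotation_tags = 1_008
error_rate_numerator = 964  # numerator shown in the table's error-rate row

# Assumption: average errors per question = annotation tags / questions.
avg_errors_per_question = annotation_tags / questions

# Error rate as written in the table: 964 / 24,837.
error_rate = error_rate_numerator / tokens

print(f"Average errors per question: {avg_errors_per_question:.2f}")
print(f"Error rate (errors per token): {error_rate:.2f}")
```

Both values round to the figures reported in the table (2.14 and 0.04).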
III. Distribution of Errors in the Training Set
Count | Minimum | Maximum | Average |
---|---|---|---|
Character | 34 | 1,985 | 296.37 |
Word | 5 | 328 | 52.49 |
Error Tag | 0 | 27 | 2.14 |
Correction needed | non-word | real-word | ND | Multiple | Total |
---|---|---|---|---|---|
Spelling | 348 | 153 | 113 | N/A | 614 |
Merge | 10 | 38 | 0 | N/A | 48 |
Split | 24 | 10 | 281 | N/A | 315 |
Multiple | N/A | N/A | N/A | 31 | 31 |
Total | 382 | 201 | 394 | 31 | 1,008 |
Percentage | 37.90% | 19.94% | 39.09% | 3.08% | 100.00% |
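The totals and percentages in the table follow from the per-category counts. As a minimal sketch (the dictionary layout and variable names are illustrative and not taken from CSpell), they can be re-derived like this:

```python
# Per-category counts copied from the table (N/A cells are omitted).
counts = {
    "Spelling": {"non-word": 348, "real-word": 153, "ND": 113},
    "Merge": {"non-word": 10, "real-word": 38, "ND": 0},
    "Split": {"non-word": 24, "real-word": 10, "ND": 281},
}
multiple = 31  # the "Multiple" row has a single count in its own column

# Column totals: sum each error type over Spelling, Merge, and Split.
columns = ["non-word", "real-word", "ND"]
column_totals = {c: sum(row[c] for row in counts.values()) for c in columns}
column_totals["Multiple"] = multiple

# Row totals and the grand total.
row_totals = {name: sum(row.values()) for name, row in counts.items()}
grand_total = sum(column_totals.values())

# Share of each column in the grand total, formatted as in the table.
percentages = {c: f"{100 * n / grand_total:.2f}%" for c, n in column_totals.items()}

print(column_totals)
print(row_totals)
print(grand_total)
print(percentages)
```

The computed column totals (382, 201, 394, 31), the grand total (1,008), and the percentages agree with the Total and Percentage rows above.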
where:
IV. Other Components
V. Performance Tests