CSpell

Training Set

Training Set, 215 KB
- 5.3 MB
- OrgData.471: the 471 original health-related questions that consumers asked NLM
- GoldStd-NonWord: non-word gold standard
- GoldStd-RealWord: real-word gold standard
Training Set (Brat format), 107 KB

II. Description

We combined the training set and the test set from the Ensemble method into a single training set for developing CSpell. This training set is summarized as follows:

Summary statistics:

Consumer health questions	471*
Tokens	24,837
Annotation tags	1,008
Instances of non-word corrections	774
Instances of real-word corrections	964
Word count per question	5 - 328
Average word count per question	52.49
Errors per question	0 - 27
Average errors per question	2.14
Error rate (errors per token)	0.04 (= 964/24,837)

*One question (11199.txt) was removed from the Ensemble method data because it contains too many non-English words.
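
The derived figures in the summary statistics above can be rechecked with a few lines of Python. This is only a sketch: the constant names are made up for illustration, and it assumes that the average of 2.14 errors per question is the number of annotation tags divided by the number of questions, which matches the reported value but is not stated explicitly on this page. The error rate uses the 964/24,837 ratio given in the table.

```python
# Values copied from the summary statistics table.
QUESTIONS = 471
TOKENS = 24_837
ANNOTATION_TAGS = 1_008
REAL_WORD_CORRECTIONS = 964

# Assumption: average errors per question = annotation tags / questions.
avg_errors_per_question = ANNOTATION_TAGS / QUESTIONS

# Error rate as written in the table: 964 / 24,837.
error_rate = REAL_WORD_CORRECTIONS / TOKENS

print(f"Average errors per question: {avg_errors_per_question:.2f}")
print(f"Error rate (errors per token): {error_rate:.2f}")
```

Both printed values agree with the table after rounding to two decimal places.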

III. Distribution of Errors in the Training Set

Stats on file size and error tags

Count	Minimum	Maximum	Average
Character	34	1985	296.37
Word	5	328	52.49
Error Tag	0	27	2.14

Error types and corrections

Correction needed	non-word	real-word	ND	Multiple	Total
Spelling	348	153	113	N/A	614
Merge	10	38	0	N/A	48
Split	24	10	281	N/A	315
Multiple	N/A	N/A	N/A	31	31

Total	382	201	394	31	1008
Percentage	37.90%	19.94%	39.09%	3.08%	100.00%

where:

ND: errors that do not need a dictionary for correction
Multiple: errors that combine several types and require multiple corrections
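
As a consistency check, the totals and percentages in the error-type table can be recomputed from the individual cells. The following is a minimal sketch, not part of CSpell itself; the names `COUNTS`, `row_totals`, `column_totals`, and `percentages` are hypothetical, and N/A cells are represented as `None`.

```python
# Cell counts copied from the error-type table; None marks an N/A cell.
COLUMNS = ["non-word", "real-word", "ND", "Multiple"]
COUNTS = {
    "Spelling": [348, 153, 113, None],
    "Merge": [10, 38, 0, None],
    "Split": [24, 10, 281, None],
    "Multiple": [None, None, None, 31],
}


def row_totals(counts):
    """Sum each correction type across the error categories, skipping N/A."""
    return {
        name: sum(v for v in row if v is not None)
        for name, row in counts.items()
    }


def column_totals(counts):
    """Sum each error category across the correction types, skipping N/A."""
    totals = [0] * len(COLUMNS)
    for row in counts.values():
        for i, value in enumerate(row):
            if value is not None:
                totals[i] += value
    return dict(zip(COLUMNS, totals))


def percentages(col_totals):
    """Express each column total as a percentage of the grand total."""
    grand_total = sum(col_totals.values())
    return {
        name: round(100 * value / grand_total, 2)
        for name, value in col_totals.items()
    }


rows = row_totals(COUNTS)
cols = column_totals(COUNTS)
pcts = percentages(cols)

print("Row totals:", rows)
print("Column totals:", cols)
print("Grand total:", sum(cols.values()))
print("Percentages:", pcts)
```

The row totals, column totals, and grand total of 1008 reproduce the table. Because each percentage is rounded separately, the four rounded values add up to 100.01 rather than exactly 100.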

IV. Other Components

V. Performance Tests