CSpell

Ensemble Performance

This page describes the initial performance tests on the Ensemble method (from Dr. Halil).

The source code of Ensemble Spelling Correction used as the baseline (for development and comparison) performs slightly better than what was reported in the paper, for the following reasons:

  • The source code was enhanced (by Dr. Halil) after the paper was submitted
  • The data were updated after the paper was submitted
  • All 472 question files are used (instead of only the 100 test questions)

  • To run different test modes:
    • Comment out the relevant code in LinearWeightedEnsembleSpellCorrection.java and recompile to switch the test between preProcess, orthographic, corpus, context, and ensemble.

    • Real-word correction: update correctRealWordErrors in chqa.properties
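
The real-word switch above is a single key in chqa.properties. A minimal Python sketch of flipping such a key between runs follows; it is not part of the CSpell source. The helper name set_property is hypothetical, and the values "true" and "false" are an assumption about what correctRealWordErrors accepts.

```python
import re


def set_property(text, key, value):
    """Return properties-file text with `key` set to `value`.

    Replaces the first `key=...` (or `key: ...`) line; appends the key
    if it is not present. Hypothetical helper, for illustration only.
    """
    # [ \t]* (not \s*) so the match never runs past the end of the line.
    pattern = re.compile(
        rf"^([ \t]*{re.escape(key)}[ \t]*[=:][ \t]*).*$", re.MULTILINE
    )
    if pattern.search(text):
        return pattern.sub(lambda m: m.group(1) + value, text, count=1)
    sep = "" if (not text or text.endswith("\n")) else "\n"
    return f"{text}{sep}{key}={value}\n"


# Example content; "false"/"true" are assumed values, not confirmed by the page.
original = "a=1\ncorrectRealWordErrors=false\nb=2\n"
updated = set_property(original, "correctRealWordErrors", "true")
print(updated)
```

Editing the file by hand achieves the same thing; the sketch only helps when the setting is switched repeatedly across test runs.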

The results for all 472 files are listed in the following tables (tested on lexdev):

  • I. Training set + Test Set (472)

    Type       Option                      TP      FP      FN      Retrieved  Relevant  Precision  Recall  F-1     Run time
    Non-word   PreProcess Only             289     58      525     347        814       0.8329     0.3550  0.4978  87 min.
    Non-word   W/ Orthographic similarity  495     329     319     824        814       0.6007     0.6081  0.6044  82 min.
    Non-word   W/ Corpus Frequency         361     449     453     810        814       0.4457     0.4435  0.4446  83 min.
    Non-word   W/ Context Similarity       350     457     464     807        814       0.4337     0.4300  0.4318  80 min.
    Non-word   All (Ensemble)              531     294     283     825        814       0.6436     0.6523  0.6480  80 min.
    Real-word  All (Ensemble)

  • II. Training set (372)
    Type       Option                      TP      FP      FN      Retrieved  Relevant  Precision  Recall  F-1     Run time
    Non-word   PreProcess Only             221     53      416     274        637       0.8066     0.3469  0.4852  80 min.
    Non-word   W/ Orthographic similarity  388     267     249     655        637       0.5924     0.6091  0.6006  71 min.
    Non-word   W/ Corpus Frequency         278     363     359     641        637       0.4337     0.4364  0.4351  72 min.
    Non-word   W/ Context Similarity       268     371     369     639        637       0.4194     0.4207  0.4201  70 min.
    Non-word   All (Ensemble)              413     243     224     656        637       0.6296     0.6484  0.6388  70 min.
    Real-word  All (Ensemble)

  • III. Test Set (100)
    Type       Option                      TP      FP      FN      Retrieved  Relevant  Precision  Recall  F-1     Run time
    Non-word   PreProcess Only             68      5       109     73         177       0.9315     0.3842  0.5440  10 min.
    Non-word   W/ Orthographic similarity  107     62      70      169        177       0.6331     0.6045  0.6185  10 min.
    Non-word   W/ Corpus Frequency         83      86      94      169        177       0.4911     0.4689  0.4798  10 min.
    Non-word   W/ Context Similarity       83(82)  85(86)  94(95)  168        177       0.4940     0.4689  0.4812  10 min.
    Non-word   All (Ensemble)              117     52      60      169        177       0.6923     0.6610  0.6763  10 min.
    Real-word  All (Ensemble)
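
The derived columns in the tables above are consistent with the standard definitions: Retrieved = TP + FP, Relevant = TP + FN, Precision = TP / Retrieved, Recall = TP / Relevant, and F-1 as the harmonic mean of Precision and Recall. A minimal Python sketch (not part of the CSpell source; the function name prf is hypothetical) reproduces the non-word ensemble row of table I from its TP, FP, and FN counts:

```python
def prf(tp, fp, fn):
    """Compute (precision, recall, F-1) from true/false positive and false negative counts."""
    retrieved = tp + fp   # corrections the system made
    relevant = tp + fn    # corrections in the gold standard
    precision = tp / retrieved if retrieved else 0.0
    recall = tp / relevant if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Table I, Non-word, All (Ensemble): TP=531, FP=294, FN=283
p, r, f = prf(531, 294, 283)
print(f"{p:.4f} {r:.4f} {f:.4f}")  # prints "0.6436 0.6523 0.6480"
```

The same function can be applied to any other row to check its reported figures.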