CSpell

Ensemble Source Code Analysis

The ensemble spelling correction (by Halil) is used as the baseline for this project. The review status and proposed plan for the original source code are as follows:

No. | Original Java Code | Notes | Module | Status | Plan
1. SpellingPreProcessor.java
  • Preprocesses the input text (contractions, punctuation, splitting digits, etc.)
  • Decomposition:
    • PreProcXml.java (refactoring with modifications)
    • PreProcContractions.java (rewrite)
    • PreProcSentence.java (redesign)
    • PreProcSplit.java (refactoring with modifications)
Module: PreProcessor
Status: 1st review
Plan: Rewrite or refactoring
2. DictionaryBasedSpellChecker.java
  • Uses Jazzy for the dictionary and suggestions
  • Planned to be replaced by a better mechanism
Module: Dictionary
Status: 1st review
Plan: Rewrite
3. SpellingCandidateGenerator.java
  • Uses an exhaustive (slow) mechanism to get candidates
  • The method getLevenshteinEdits() is used to get all candidates from the dictionary
  • Does not use jazzySpellChecker
  • Decomposition:
    • EditDistance.java (refactoring)
    • OverLapUtil.java (refactoring)
    • MergeUtil.java (refactoring)
    • SplitUtil.java (rewrite for better performance and bug fixes)
Module: Candidate
Status: 1st review
Plan: Rewrite or refactoring
4. CorpusFrequencyCounts.java
  • Gets the frequency score
  • A bug was found in getUnigramScore()
Module: Ranking
Status: 1st review
Plan: Rewrite
5. Word2Vector.java
  • Used for the word embedding algorithm (contextual similarity)
Module: Ranking
Status: No review
Plan: TBD
6. SpellCorrectionEvaluator.java
  • The code is usable as is.
  • A rewrite is suggested for simplicity and speed.
  • Decomposition:
    • Span.java
    • TokenSpan.java
    • TokenSpanUtil.java
    • CoreNLPWrapper.java
    • FileUtils.java
Module: Evaluator
Status: 1st review
Plan: Rewrite
7. diff_match_patch.java
  • Library code for comparing two texts
  • Useful as is; a rewrite is planned for simplicity, maintainability, and speed
Module: Evaluator
Status: No review
Plan: Rewrite
8. SpellCorrection.java
  • Interface; might not be needed
Module: System
Status: 1st review
Plan: Remove or redesign
9. LinearWeightedEnsembleSpellCorrection.java
  • Add a new class for configurable settings
    • ESpellCorrection.java (process multiple files)
  • Decomposition:
    • EnsembleSpellCorrectText.java (correct 1 text file)
      • EnsembleSpellPreProcess.java
        • EnsembleSpellPreProcessObj.java
        • EnsembleSpellSpans.java (convert input text to spanText)
      • EnsembleSpellProcess.java
        • EnsembleSpellCorrectSentence.java (correct a sentence)
          • EnsembleSpellFindCandidates.java (find candidates for an instance)
            • EnsembleSpellCandidates.java (find candidates for a token)
            • EnsembleSpellMergeCandidates.java (find merge candidates for a token)
          • EnsembleSpellFindRanking.java (find the best ranked candidate for an instance)
Module: System
Status: 1st review
Plan: Rewrite
10. JazzySpellCorrection.java
  • Uses ASpell (Jazzy) to correct text
Module: System
Status: No review; completed
Plan: Remove
11. ESpellCorrection.java
  • Uses ESpell to correct text
Module: System
Status: No review; completed
Plan: Remove

where,

  • Remove: delete the code (no longer used)
  • Refactoring: bring the code into compliance with the coding standard
  • Revise: modify the current code to meet the coding standard
  • Rewrite: change the algorithm, signature, or name for the individual file
  • Redesign: change the architecture design, data structure, algorithm, ...
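
The candidate and ranking steps reviewed in the table (items 3, 4 and 9) can be illustrated with a minimal sketch. It assumes that the exhaustive mechanism compares the input token with every dictionary word by Levenshtein distance, and that the ensemble ranks candidates by a linear weighted sum of normalized scores. All class, method and parameter names below are hypothetical; this is not the original getLevenshteinEdits() or ranking code, and a contextual score (as in Word2Vector.java) would simply be one more weighted term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from the original source code.
final class EnsembleSketch {
    // Dynamic-programming Levenshtein distance (insert, delete, substitute).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[b.length()];
    }

    // Exhaustive search: every dictionary word is compared with the token,
    // so the cost grows with the dictionary size (the "slow" part).
    static List<String> findCandidates(String token, Iterable<String> dictionary, int maxEdits) {
        List<String> out = new ArrayList<>();
        for (String word : dictionary) {
            if (levenshtein(token, word) <= maxEdits) out.add(word);
        }
        return out;
    }

    // Linear weighted score; both components are assumed to lie in [0, 1].
    static double score(String token, String cand, Map<String, Double> freq,
                        double wEdit, double wFreq) {
        int maxLen = Math.max(token.length(), cand.length());
        double editSim = maxLen == 0 ? 1.0 : 1.0 - (double) levenshtein(token, cand) / maxLen;
        return wEdit * editSim + wFreq * freq.getOrDefault(cand, 0.0);
    }

    // Returns the best ranked candidate, or the token itself if there is none.
    static String best(String token, Iterable<String> dictionary, Map<String, Double> freq,
                       int maxEdits, double wEdit, double wFreq) {
        String best = token;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String cand : findCandidates(token, dictionary, maxEdits)) {
            double s = score(token, cand, freq, wEdit, wFreq);
            if (s > bestScore) { bestScore = s; best = cand; }
        }
        return best;
    }
}
```

With a dictionary of "diabetes", "diabetic" and "dialysis" and a higher frequency score for "diabetes", calling best("diabetis", ...) with maxEdits = 1 keeps the first two words as candidates and ranks "diabetes" first. Replacing the full dictionary scan with an indexed lookup is the kind of change the "Rewrite" plan for the Candidate module points toward.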

Plan Summary:

  • Improve from research code to production code:

    Columns: Performance | Maintenance | Thread Safe | Distributable | Configurable
    OO Design: X X
    Coding standard: X X
    Limited comment: X
    Static global variable: X
    Algorithm (exhaustive): X
    Package dependency: X X X

  • Systematic approach for data acquisition:
    • Dictionary
    • Frequency
    • Contextual
    • Lemma
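
As an illustration of the "Static global variable" and "Configurable" entries in the plan summary, one common way to move from research code toward production code is to replace static globals with an immutable configuration object that is passed to the classes that need it. The sketch below assumes settings are loaded from java.util.Properties; the class names and the property keys maxEdits and frequencyWeight are hypothetical and do not come from the original source.

```java
import java.util.Properties;

// Hypothetical sketch; class names and property keys are illustrative only.
final class SpellConfigSketch {
    // Final fields make instances immutable, so one instance can be
    // shared between threads without synchronization.
    private final int maxEdits;
    private final double frequencyWeight;

    SpellConfigSketch(Properties props) {
        // Defaults apply when a key is missing from the properties.
        this.maxEdits = Integer.parseInt(props.getProperty("maxEdits", "2"));
        this.frequencyWeight = Double.parseDouble(props.getProperty("frequencyWeight", "0.5"));
    }

    int maxEdits() { return maxEdits; }
    double frequencyWeight() { return frequencyWeight; }
}

// A component receives its settings through the constructor
// instead of reading static global variables.
final class CorrectorSketch {
    private final SpellConfigSketch config;

    CorrectorSketch(SpellConfigSketch config) { this.config = config; }

    // True if a candidate's edit distance is within the configured limit.
    boolean withinBudget(int editDistance) {
        return editDistance <= config.maxEdits();
    }
}
```

Two correctors built from different Properties objects can then run side by side in the same JVM with different settings, which is not possible when the settings live in static fields.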