CSpell

Real-word Spelling (1-To-1)

This page describes the processes for real-word spelling (1-to-1) detection and correction.

I. Processes

  • Detector:
    RealWord1To1Detector.java
    • The word was not corrected previously in the CSpell pipeline.
    • real-word: a valid word (in checkDic)
    • Not an exception: digit, punctuation, digit/punctuation, URL, email, empty string, measurement, properNoun, abbreviation/acronym
    • The word has a context score
    • Word WC >= 65 (configurable: CS_DETECTOR_RW_1TO1_WORD_MIN_WC)
    • Word length >= 2 (configurable: CS_DETECTOR_RW_1TO1_WORD_MIN_LENGTH)
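    The detector checks above can be combined into a single predicate. The following is a minimal sketch; the method and parameter names are illustrative, not the actual RealWord1To1Detector.java API:

    ```java
    // Minimal sketch of the real-word 1-to-1 detector checks; names are
    // illustrative, not the actual RealWord1To1Detector.java API.
    public class RealWord1To1DetectorSketch {
        static final int MIN_WC = 65;     // CS_DETECTOR_RW_1TO1_WORD_MIN_WC
        static final int MIN_LENGTH = 2;  // CS_DETECTOR_RW_1TO1_WORD_MIN_LENGTH

        // A word is flagged for real-word 1-to-1 detection only when it is a
        // valid dictionary word (checkDic), is not one of the exception types,
        // has a word2vec context score, and passes the WC and length thresholds.
        public static boolean isDetect(String word, boolean inCheckDic,
                boolean isException, boolean hasContextScore, int wordWc) {
            return inCheckDic
                && !isException
                && hasContextScore
                && wordWc >= MIN_WC
                && word.length() >= MIN_LENGTH;
        }

        public static void main(String[] args) {
            // "weather" (valid word, WC 100) is a detection target.
            System.out.println(isDetect("weather", true, false, true, 100)); // true
            // A single-letter word is too short.
            System.out.println(isDetect("a", true, false, true, 100));       // false
        }
    }
    ```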
  • Candidates:
    RealWord1To1Candidates.java
    • Max. length of real-word <= 10 (configurable: CS_CAN_RW_1TO1_WORD_MAX_LENGTH)
      Real-word 1-to-1 candidates are only generated for words shorter than this value, to prevent over-generation and slow performance. If this number is too small, recall decreases (with faster speed).
    • Generate all possible candidates, as in the non-word correction
    • Filter out invalid candidates (IsValid1To1Cand)
      => Ideally, we only correct a real-word with candidates that are very similar to the inWord, i.e., that look (orthographic) and sound (phonetic) alike. If we loosen this restriction, real-word correction relies mainly on the context score (word2vec). In this version, our corpus for word2vec is relatively small, so it generates too much noise [FP] and results in low precision and F1. This looks- and sounds-alike restriction also helps (a little) with run-time performance (fewer context-score calculations in ranking).
      • in suggDic (valid word)
      • has context score (word2Vec)
      • WC >= 1 (has word count, configurable: CS_CAN_RW_1TO1_CAND_MIN_WC)
      • length >= 2 (configurable: CS_CAN_RW_1TO1_CAND_MIN_LENGTH)
      • candidate is not an inflectional variant of inWord
        In this version, we do not correct grammar, so no inflectional variants (such as plural nouns, third-person singular verbs, etc.) are corrected.
      • Heuristic rules of looks and sounds alike:
        • sounds alike: both phonetic codes of double metaphone and refined soundex must be the same
          • same double metaphone code (pmDist = 0)
          • same refined soundex code (prDist = 0)
        • look alike: small edit distance with similar sounds
          • leadDist + endDist + lengthDist + pmDist + prDist < 3
          • editDist + pmDist + prDist < 4
          • phonetic codes for double metaphone (pmDist = 0)
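      The two heuristic rules above can be written as two predicates over the distance values. This is a minimal sketch; the distance variable names follow the rules above, but the class and methods are illustrative, not the actual IsValid1To1Cand code:

      ```java
      // Minimal sketch of the looks-alike / sounds-alike heuristics; the
      // distance names follow the rules above, but the methods are illustrative.
      public class SimilarityRulesSketch {
          // Sounds alike: both phonetic codes must match exactly
          // (double metaphone and refined soundex distances are zero).
          public static boolean soundsAlike(int pmDist, int prDist) {
              return pmDist == 0 && prDist == 0;
          }

          // Looks alike: small combined edit distances together with an
          // identical double-metaphone code.
          public static boolean looksAlike(int leadDist, int endDist, int lengthDist,
                  int editDist, int pmDist, int prDist) {
              return (leadDist + endDist + lengthDist + pmDist + prDist < 3)
                  && (editDist + pmDist + prDist < 4)
                  && (pmDist == 0);
          }

          public static void main(String[] args) {
              // Identical phonetic codes pass the sounds-alike rule.
              System.out.println(soundsAlike(0, 0)); // true
              // A different refined soundex code fails it.
              System.out.println(soundsAlike(0, 1)); // false
          }
      }
      ```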

    • Key size of the HashMap used to store real-word 1-to-1 candidates in memory: 1,000,000,000 (configurable: CS_CAN_RW_1TO1_CAND_MAX_KEY_SIZE)
      Run-time performance is slow because there are many real-words and candidates, and generating all possible candidates on the fly is expensive. To resolve this issue, we save generated candidates (values) keyed by real-word (key) in memory (in a HashMap). Our test showed the elapsed time improved from 25+ min. to 3.5 min. on the training set. This is because:
      • many real-words are repeated
      • the candidates of a given real-word are always the same
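      The caching scheme can be sketched with a plain HashMap. The names below are illustrative, and generate() is a placeholder for the actual on-the-fly candidate generation:

      ```java
      import java.util.Collections;
      import java.util.HashMap;
      import java.util.HashSet;
      import java.util.Map;
      import java.util.Set;

      // Sketch of caching generated candidates keyed by real-word; generate()
      // is a placeholder for the real candidate generation and filtering.
      public class CandCacheSketch {
          private final Map<String, Set<String>> cache = new HashMap<>();
          private final long maxKeySize;  // CS_CAN_RW_1TO1_CAND_MAX_KEY_SIZE

          public CandCacheSketch(long maxKeySize) {
              this.maxKeySize = maxKeySize;
          }

          // Because the same real-words repeat and a real-word's candidates
          // never change, a cache hit replaces expensive generation with a lookup.
          public Set<String> getCandidates(String realWord) {
              Set<String> cands = cache.get(realWord);
              if (cands == null) {
                  cands = generate(realWord);
                  if (cache.size() < maxKeySize) {  // bound the key size
                      cache.put(realWord, cands);
                  }
              }
              return cands;
          }

          private Set<String> generate(String realWord) {
              // Placeholder: the real code enumerates all 1-to-1 edits and
              // filters them with IsValid1To1Cand.
              return new HashSet<>(Collections.singleton(realWord + "-cand"));
          }

          public static void main(String[] args) {
              CandCacheSketch cache = new CandCacheSketch(1_000_000_000L);
              Set<String> first = cache.getCandidates("weather");
              Set<String> second = cache.getCandidates("weather");
              System.out.println(first == second); // true: second call is a cache hit
          }
      }
      ```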
  • Ranker:
    RankRealWord1To1ByCSpell.java
    • Find the top-ranked candidate
      Sort the candidates by CSpellScoreRw1To1Comparator.java:
      • OrthographicScoreComparator
        The top-ranked candidate (highest orthographic score) must also have the highest of the following scores in the candidate list:
      • FrequencyScore
      • EditDistSimilarityScore
      • PhoneticSimilarityScore
      • OverlapSimilarityScore
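      The comparator cascade can be sketched as below. This sketch uses only two of the scores and illustrative field names; the real CSpellScoreRw1To1Comparator also compares edit-distance, phonetic, and overlap similarity scores:

      ```java
      import java.util.ArrayList;
      import java.util.Arrays;
      import java.util.Comparator;
      import java.util.List;

      public class Rw1To1RankSketch {
          // Minimal candidate record; field names are illustrative.
          static class Cand {
              final String word;
              final double orthoScore;  // orthographic score
              final double freqScore;   // frequency score
              Cand(String word, double orthoScore, double freqScore) {
                  this.word = word;
                  this.orthoScore = orthoScore;
                  this.freqScore = freqScore;
              }
          }

          // Highest orthographic score first; ties broken by frequency score.
          static final Comparator<Cand> BY_CSPELL_SCORE =
              Comparator.comparingDouble((Cand c) -> c.orthoScore).reversed()
                        .thenComparing(
                            Comparator.comparingDouble((Cand c) -> c.freqScore).reversed());

          public static Cand topCand(List<Cand> cands) {
              List<Cand> sorted = new ArrayList<>(cands);
              sorted.sort(BY_CSPELL_SCORE);
              return sorted.get(0);
          }

          public static void main(String[] args) {
              List<Cand> cands = Arrays.asList(
                  new Cand("bowel", 0.90, 0.002),
                  new Cand("bowls", 0.85, 0.010));
              System.out.println(topCand(cands).word); // bowel
          }
      }
      ```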
    • Validate the top ranked candidate
      Use context score to validate the top ranked candidate (IsTopCandValid):
      • context radius = 2 (configurable, CS_RW_1TO1_CONTEXT_RADIUS)
      • Set RealWord_1To1_Confidence_Factor = 0.0 (configurable: CS_RANKER_RW_1TO1_C_FAC) as a strict restriction to avoid false-positive candidates
      • orgScore < 0
        • & topScore > 0
          • Context Score Check (on min., distance, and ratio)
            • Min: topScore > rw1To1CandMinCs (0.00, configurable: CS_RANKER_RW_1TO1_CAND_MIN_CS)
            • Min: orgScore > rw1To1WordMinCs (-0.085, configurable: CS_RANKER_RW_1TO1_WORD_MIN_CS)
            • Dist: topScore - orgScore > rw1To1CandCsDist (0.085, configurable: CS_RANKER_RW_1TO1_CAND_CS_DIST)
            • Ratio: (topScore/-orgScore) > rw1To1CandCsFactor (0.1, configurable: CS_RANKER_RW_1TO1_CAND_CS_FAC)
          • Frequency Score Check (on min., distance, and ratio)
            • Min: topFScore > rw1To1CandMinFs (0.0006, configurable: CS_RANKER_RW_1TO1_CAND_MIN_FS)
            • Dist: topFScore > orgFScore or (orgFScore - topFScore) < rw1To1CandFsDist (0.02, configurable: CS_RANKER_RW_1TO1_CAND_FS_DIST)
            • Ratio: (topFScore/orgFScore) > rw1To1CandFsFactor (0.035, configurable: CS_RANKER_RW_1TO1_CAND_FS_FAC)
        • & topScore < 0 & topScore * RealWord_1To1_Confidence_Factor > orgScore
      • orgScore > 0
        • & topScore * RealWord_1To1_Confidence_Factor > orgScore
          => Never happens because RealWord_1To1_Confidence_Factor is 0.0
      • orgScore = 0
        • No real-word 1-to-1 correction, because such words are excluded by the detector (no word2Vec information on the inspected word)
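    The validation decision tree above, transcribed into code: this sketch covers only the context-score branches (the frequency-score checks are applied in addition in the real pipeline), and the method name is illustrative:

    ```java
    // Sketch of the context-score part of the top-candidate validation,
    // transcribed from the decision tree above; constants are the
    // configuration defaults.
    public class TopCandValidSketch {
        static final double CAND_MIN_CS = 0.00;    // CS_RANKER_RW_1TO1_CAND_MIN_CS
        static final double CAND_CS_DIST = 0.085;  // CS_RANKER_RW_1TO1_CAND_CS_DIST
        static final double CAND_CS_FAC = 0.1;     // CS_RANKER_RW_1TO1_CAND_CS_FAC
        static final double WORD_MIN_CS = -0.085;  // CS_RANKER_RW_1TO1_WORD_MIN_CS
        static final double C_FACTOR = 0.0;        // CS_RANKER_RW_1TO1_C_FAC

        public static boolean isTopCandValid(double orgScore, double topScore) {
            if (orgScore < 0.0) {
                if (topScore > 0.0) {
                    // Context-score checks on min., distance, and ratio.
                    return topScore > CAND_MIN_CS
                        && (topScore - orgScore) > CAND_CS_DIST
                        && (topScore / -orgScore) > CAND_CS_FAC
                        && orgScore > WORD_MIN_CS;
                }
                // Per the outline: topScore * C_FACTOR > orgScore.
                return topScore * C_FACTOR > orgScore;
            }
            if (orgScore > 0.0) {
                // Never true with C_FACTOR = 0.0.
                return topScore * C_FACTOR > orgScore;
            }
            // orgScore == 0: excluded by the detector (no word2vec information).
            return false;
        }

        public static void main(String[] args) {
            System.out.println(isTopCandValid(-0.05, 0.10)); // true: passes all checks
            System.out.println(isTopCandValid(0.20, 0.50));  // false: orgScore > 0
        }
    }
    ```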
  • Corrector:
    OneToOneCorrector.java
    • Update the focused (inspected) token with the top ranked candidate.
    • Update process history to real-word-1To1

II. Development Tests

Tested different real-word 1-to-1 factors on the revised gold standard (with real-word errors included) from the training set. Each test takes about 3~5 min. (depending on the computer and memory size).

  • Detector (check on focus token):
    Function      Min. Length  Min. WC  Raw data (TP|Found|Gold)  Performance (Precision|Recall|F1)
    NW (All)      N/A          N/A      607|777|964               0.7812|0.6297|0.6973
    NW + RW_1To1  1            65       612|786|964               0.7786|0.6349|0.6994
    NW + RW_1To1  2            65       612|786|964               0.7786|0.6349|0.6994
    NW + RW_1To1  3            65       612|786|964               0.7786|0.6349|0.6994
    NW + RW_1To1  4            65       612|786|964               0.7786|0.6349|0.6994
    NW + RW_1To1  5            65       611|783|964               0.7803|0.6338|0.6995
    NW + RW_1To1  6            65       609|781|964               0.7798|0.6317|0.6980
    NW + RW_1To1  7            65       608|778|964               0.7815|0.6307|0.6980
    NW + RW_1To1  8            65       607|777|964               0.7812|0.6297|0.6973
    NW + RW_1To1  2            1        612|786|964               0.7786|0.6349|0.6994
    NW + RW_1To1  2            10       612|786|964               0.7786|0.6349|0.6994
    NW + RW_1To1  2            65       612|786|964               0.7786|0.6349|0.6994
    NW + RW_1To1  2            100      611|785|964               0.7783|0.6338|0.6987
    NW + RW_1To1  2            500      610|784|964               0.7781|0.6328|0.6979
    NW + RW_1To1  2            1000     610|782|964               0.7801|0.6328|0.6987
    NW + RW_1To1  2            10000    608|778|964               0.7815|0.6307|0.6980

    • Test on min. length:
      • Increasing it improves precision but hurts recall.
      • Using a smaller number does not increase precision.
      • TPs start to drop after 5, which may result in better or worse F1.
      • No TPs from RW-1To1 when it is >= 8, because all corrections in the development set are shorter than 8 characters.
      • Choose 2 for more recall with the same F1 and precision. This means a target word of length 1 is not a valid real-word for 1-to-1 correction.
    • Test on min. WC (word count):
      • Increasing it improves precision and run time but hurts recall.
      • Using a smaller number does not increase precision.
      • Choose 1 for more recall with the same F1 and precision.
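    Assuming the raw-data columns are TP|Found|Gold (which matches the numbers, e.g. 607/777 = 0.7812), the performance columns can be reproduced as Precision|Recall|F1:

    ```java
    public class PrfSketch {
        // precision = TP / found, recall = TP / gold, F1 = harmonic mean.
        public static double[] prf(int tp, int found, int gold) {
            double precision = (double) tp / found;
            double recall = (double) tp / gold;
            double f1 = 2.0 * precision * recall / (precision + recall);
            return new double[] {precision, recall, f1};
        }

        public static void main(String[] args) {
            double[] s = prf(607, 777, 964);  // the NW (All) baseline row
            System.out.printf("%.4f|%.4f|%.4f%n", s[0], s[1], s[2]); // 0.7812|0.6297|0.6973
        }
    }
    ```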

  • Candidates (check on candidates):
    Function      Min. Length  Min. WC  Raw data (TP|Found|Gold)  Performance (Precision|Recall|F1)
    NW (All)      N/A          N/A      607|777|964               0.7812|0.6297|0.6973
    NW + RW_1To1  1            1        612|786|964               0.7786|0.6349|0.6994
    NW + RW_1To1  2            1        612|786|964               0.7786|0.6349|0.6994
    NW + RW_1To1  3            1        612|787|964               0.7776|0.6349|0.6990
    NW + RW_1To1  4            1        612|785|964               0.7796|0.6349|0.6998
    NW + RW_1To1  5            1        612|785|964               0.7796|0.6349|0.6998
    NW + RW_1To1  6            1        609|779|964               0.7818|0.6317|0.6988
    NW + RW_1To1  7            1        608|778|964               0.7815|0.6307|0.6980
    NW + RW_1To1  2            1        612|786|964               0.7786|0.6349|0.6994
    NW + RW_1To1  2            10       612|787|964               0.7776|0.6349|0.6990
    NW + RW_1To1  2            100      612|791|964               0.7737|0.6349|0.6974
    NW + RW_1To1  2            1000     611|791|964               0.7724|0.6338|0.6963
    NW + RW_1To1  2            10000    608|782|964               0.7775|0.6307|0.6964
    • Candidate min. length:
      • Increasing it improves precision but hurts recall.
      • Past a certain threshold, both recall and precision drop.
      • Best F1 when it is 4-5, because all TPs have length >= 4 (see the examples below).
      • This number must be coordinated with the min. focus-word length.
      • Choose 2 (a candidate of length 1 is not a valid candidate).
    • Candidate min. WC:
      • Increasing it improves precision but hurts recall.
      • Choose 1 (corrections might have small WC).

  • Rankers - confidence factor for selecting and validating the top candidate:
    Function      C Factor  C Score                 F Score            Raw data (TP|Found|Gold)  Performance (Precision|Recall|F1)
    NW (All)      N/A       N/A                     N/A                607|777|964               0.7812|0.6297|0.6973
    NW + RW_1To1  0.00      0.01|0.00|0.085|-0.085  0.035|0.0006|0.02  612|786|964               0.7786|0.6349|0.6994
    NW + RW_1To1  0.01      0.01|0.00|0.085|-0.085  0.035|0.0006|0.02  612|789|964               0.7757|0.6349|0.6982
    NW + RW_1To1  0.10      0.01|0.00|0.085|-0.085  0.035|0.0006|0.02  612|813|964               0.7528|0.6349|0.6888
    NW + RW_1To1  0.50      0.01|0.00|0.085|-0.085  0.035|0.0006|0.02  612|998|964               0.6132|0.6349|0.6239
    NW + RW_1To1  0.00      0.01|0.00|0.085|-0.085  0.035|0.0006|0.02  612|786|964               0.7786|0.6349|0.6994
    NW + RW_1To1  0.00      0.10|0.00|0.085|-0.085  0.035|0.0006|0.02  612|786|964               0.7786|0.6349|0.6994
    ... TBD ...
    • Confidence factor:
      • A very strict restriction on the confidence factor is needed to eliminate FPs.
      • Choose a C factor of 0.00 (the top candidate is only valid when the focus token has a negative context score and the top candidate has a positive one).

III. Observations from the Development Test Set (F1 = 0.6994)

  • [TP] real-word 1-To-1 corrections:
    ID    Source  Detected Word  Corrected Word  Text                                                                          Notes
    TP-1  11225   weather        whether         from one Person to another. Weather it can happen or
    TP-2  11597   bowl           bowel           irregular bowl movements.
    TP-3  12748   effect         affect          what is TSD/Clubfoot, and how does it effect a baby
    TP-4  13922   their          there           in the Chicago area hospitals is their a surgeon familiar with the shoudice
    TP-5  17713   small          smell           lost ability to taste and small, and who is profoundly depressed              smell size

    Example: smell vs. small

    • taste and small, foul small, bad small, small an odor, sense of small
    • smell size, smell amounts, a smell sip of water, smeller amounts, smell intestine

  • [FP] real-word 1-To-1:
    ID    Source  Detected Word  Corrected Word  Text
    FP-1  10349   please         place           ...give me good advice please
    FP-3  18855   head           had             ... back also inner head pain.com
    FP-4  2       causes         cases           What are some causes of anorexia
    • FP-3: the corpus has more "also" + "had" than "inner" + "head".
    • FP-4: "some causes of anorexia" alone is not corrected, but adding "are" causes "causes" to be corrected to "cases". However, it is OK for "What are some causes of pain" or "What are causes of anorexia".

  • [FN] real-word 1-To-1:
    ID     Source  Focus Word  Corrected Word  Text
    FN-1   32      then        than
    FN-2   51      thing       think
    FN-3   10138   know        now
    FN-4   10375   tried       tired
    FN-5   10934   specially   especially
    FN-6   11186   repot       report
    FN-7   11378   then        than            Is Radioiodine treatment better then surgery for me?
    FN-8   16734   weather     whether         I was particularly interested in learning weather parents should be worried about cribs death
    FN-9   12286   lesson      lessen          What can I do to lesson the severity of the adema
    FN-10  12757   pregnancy   pregnant
    FN-11  12788   leave       live
    FN-12  15759   tent        tend
    FN-13  16256   access      excess
    FN-14  16297   loosing     losing
    • FN-9: "lesson" is not in the word2Vec corpus.
      => Only "lessons" is in it. Maybe use inflVars for detection.
      => A much bigger corpus is needed for word2Vec.
      => word2vec is very good on precision; however, the corpus used for training has to include such information (words and their context).