Real-word Spelling (1-To-1)
This page describes the processes for real-word spelling (1-to-1) detection and correction.
I. Processes
- Detector:
RealWord1To1Detector.java
- Not corrected previously in the CSpell pipeline.
- real-word: valid word (in checkDic)
- Not one of the exception types: digit, punctuation, digit/punctuation, URL, email, empty string, measurement, properNoun, abbreviation/acronym
- word has context score
- word WC >= 65 (configurable: CS_DETECTOR_RW_1TO1_WORD_MIN_WC)
- word length >= 2 (configurable: CS_DETECTOR_RW_1TO1_WORD_MIN_LENGTH)
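The detector criteria above can be sketched as a single eligibility test. This is a minimal sketch, assuming the dictionary lookup, exception typing, and word-count lookups are done elsewhere; the class, method, and parameter names are illustrative, not the actual CSpell API.

```java
// Sketch of the detector's eligibility test described above. Names are
// illustrative (not CSpell's API); thresholds mirror the listed defaults.
public class RealWord1To1DetectorSketch {
    static final long WORD_MIN_WC = 65;    // CS_DETECTOR_RW_1TO1_WORD_MIN_WC
    static final int WORD_MIN_LENGTH = 2;  // CS_DETECTOR_RW_1TO1_WORD_MIN_LENGTH

    // A token is a real-word 1-to-1 detection target only if it is a valid
    // dictionary word, is not an exception type, has word2vec context
    // information, and meets the word-count and length thresholds.
    public static boolean isDetectTarget(String word, boolean inCheckDic,
            boolean isExceptionType, boolean hasContextScore, long wordCount) {
        return inCheckDic
            && !isExceptionType
            && hasContextScore
            && (wordCount >= WORD_MIN_WC)
            && (word.length() >= WORD_MIN_LENGTH);
    }
}
```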
- Candidates:
RealWord1To1Candidates.java
- Max. length of real-word <= 10 (configurable: CS_CAN_RW_1TO1_WORD_MAX_LENGTH)
Real-word 1-to-1 candidates are only generated for words up to this length, to prevent over-generation and slow performance. Recall decreases if this number is too small (in exchange for faster speed).
- Generate all possible candidates, as in the non-word correction
- Filter out invalid candidates (IsValid1To1Cand)
=> Ideally, we only correct a real-word with candidates that are very similar to the inWord, that is, they look (orthographic) and sound (phonetic) alike. If we loosen this restriction, real-word correction relies mainly on the context score (word2vec). In this version, our corpus for word2vec is relatively small, so it generates too much noise [FP] and results in low precision and F1. The looks-and-sounds-alike restriction also helps (a little) with run-time performance (fewer context score calculations in ranking).
- in suggDic (valid word)
- has context score (word2Vec)
- WC >= 1 (has word count, configurable: CS_CAN_RW_1TO1_CAND_MIN_WC)
- length >= 2 (configurable: CS_CAN_RW_1TO1_CAND_MIN_LENGTH)
- candidate is not an inflectional variant of inWord
In this version, we do not correct grammar, so no inflectional variants (such as plural nouns, 3rd-person singular verbs, etc.) are corrected.
- Heuristic rules of looks and sounds alike:
- sounds alike: both phonetic codes of double metaphone and refined soundex must be the same
- same double metaphone code (pmDist = 0)
- same refined soundex code (prDist = 0)
- looks alike: small edit distances combined with similar sounds
- leadDist + endDist + lengthDist + pmDist + prDist < 3
- editDist + pmDist + prDist < 4
- same double metaphone code (pmDist = 0)
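The heuristic rules above reduce to simple threshold checks. Below is a minimal sketch of just that threshold logic; the distance inputs (edit distance, phonetic code distances, etc.) are assumed to be computed elsewhere, and the class and method names are illustrative, not CSpell's own.

```java
// Sketch of the looks/sounds-alike heuristics used by IsValid1To1Cand.
// Only the threshold logic from the rules above is shown; the distance
// inputs are assumed given.
public class SimilarityHeuristicSketch {
    // Sounds alike: identical double metaphone AND refined soundex codes.
    public static boolean soundsAlike(int pmDist, int prDist) {
        return (pmDist == 0) && (prDist == 0);
    }

    // Looks alike: small edit distances combined with similar sounds.
    public static boolean looksAlike(int leadDist, int endDist, int lengthDist,
            int editDist, int pmDist, int prDist) {
        return (leadDist + endDist + lengthDist + pmDist + prDist < 3)
            && (editDist + pmDist + prDist < 4)
            && (pmDist == 0);  // same double metaphone code
    }
}
```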
- Key size in the HashMap that stores real-word 1-To-1 candidates in memory: 1,000,000,000 (configurable: CS_CAN_RW_1TO1_CAND_MAX_KEY_SIZE)
Run-time performance is slow because there are too many real-words and candidates; generating all possible candidates on the fly is expensive. To resolve this issue, we save generated candidates (values) with the real-word (key) in memory (in a HashMap). Our test showed the elapsed time improved from 25+ min. to 3.5 min. on the training set. This works because:
- many real-words are repeated
- the candidates of a given real-word are always the same
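The caching scheme above can be sketched as a memoized lookup. This is a minimal sketch under the stated assumptions; the candidate generator is a placeholder, and the class, field, and method names are illustrative, not CSpell's own.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the candidate cache described above. Because many real-words
// repeat and a given real-word always yields the same candidate set, the
// generated candidates (value) are memoized per real-word (key) in a
// HashMap. The key limit mirrors CS_CAN_RW_1TO1_CAND_MAX_KEY_SIZE.
public class CandCacheSketch {
    static final long MAX_KEY_SIZE = 1000000000L;
    private final Map<String, Set<String>> cache = new HashMap<>();
    public int generateCalls = 0;  // counts on-the-fly generations, for illustration

    public Set<String> getCandidates(String realWord) {
        Set<String> cands = cache.get(realWord);
        if (cands == null) {
            cands = generate(realWord);         // expensive on-the-fly step
            if (cache.size() < MAX_KEY_SIZE) {  // bound the memory use
                cache.put(realWord, cands);
            }
        }
        return cands;
    }

    // Placeholder for the real 1-to-1 candidate generator.
    private Set<String> generate(String realWord) {
        generateCalls++;
        return new HashSet<>();
    }
}
```

Repeated real-words then cost only a hash lookup, which is where the 25+ min. to 3.5 min. improvement comes from.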
- Ranker:
RankRealWord1To1ByCSpell.java
- Find the top-ranked candidate
Sort the candidates by CSpellScoreRw1To1Comparator.java:
OrthographicScoreComparator
The top-ranked candidate (highest orthographic score) must also have the highest of the following scores in the candidate list:
- FrequencyScore
- EditDistSimilarityScore
- PhoneticSimilarityScore
- OverlapSimilarityScore
- Validate the top-ranked candidate
Use the context score to validate the top-ranked candidate (IsTopCandValid):
- context radius = 2 (configurable: CS_RW_1TO1_CONTEXT_RADIUS)
- Set RealWord_1To1_Confidence_Factor = 0.0 (configurable: CS_RANKER_RW_1TO1_C_FAC) for a stricter restriction that avoids false-positive candidates
- orgScore < 0
- & topScore > 0
- Context Score Check (on min., distance, and ratio)
- Min: topScore > rw1To1CandMinCs (0.00, configurable: CS_RANKER_RW_1TO1_CAND_MIN_CS)
- Dist: topScore - orgScore > rw1To1CandCsDist (0.085, configurable: CS_RANKER_RW_1TO1_CAND_CS_DIST)
- Ratio: (topScore/-orgScore) > rw1To1CandCsFactor (0.1, configurable: CS_RANKER_RW_1TO1_CAND_CS_FAC)
- Min: orgScore > rw1To1WordMinCs (-0.085, configurable: CS_RANKER_RW_1TO1_WORD_MIN_CS)
- Frequency Score Check (on min., distance, and ratio)
- Min: topFScore > rw1To1CandMinFs (0.0006, configurable: CS_RANKER_RW_1TO1_CAND_MIN_FS)
- Dist: topFScore > orgFScore, or (orgFScore - topFScore) < rw1To1CandFsDist (0.02, configurable: CS_RANKER_RW_1TO1_CAND_FS_DIST)
- Ratio: (topFScore/orgFScore) > rw1To1CandFsFactor (0.035, configurable: CS_RANKER_RW_1TO1_CAND_FS_FAC)
- & topScore < 0 & topScore * RealWord_1To1_Confidence_Factor > orgScore
- orgScore > 0
- & topScore * RealWord_1To1_Confidence_Factor > orgScore
=> Never happens because RealWord_1To1_Confidence_Factor is 0.0
- orgScore = 0
- No real-word 1-to-1 correction, because such words are excluded by the detector (no word2Vec information on the inspected word)
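The validation decision tree above can be sketched as follows. This is a simplified sketch, not CSpell's actual code: the detailed context-score and frequency-score threshold checks are collapsed into a single boolean parameter, and all names are illustrative.

```java
// Sketch of the IsTopCandValid decision tree listed above. With the default
// confidence factor of 0.0, the orgScore > 0 branch can never accept a
// candidate, and orgScore == 0 cases are already excluded by the detector.
public class TopCandValidSketch {
    static final double C_FACTOR = 0.0;  // CS_RANKER_RW_1TO1_C_FAC

    public static boolean isTopCandValid(double orgScore, double topScore,
            boolean passesScoreChecks) {
        if (orgScore < 0.0) {
            if (topScore > 0.0) {
                return passesScoreChecks;  // context + frequency score checks
            }
            // topScore < 0 branch, as listed above
            return (topScore < 0.0) && (topScore * C_FACTOR > orgScore);
        } else if (orgScore > 0.0) {
            return topScore * C_FACTOR > orgScore;  // never true when C_FACTOR = 0.0
        }
        return false;  // orgScore == 0: excluded by the detector
    }
}
```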
- Corrector:
OneToOneCorrector.java
- Update the focused (inspected) token with the top ranked candidate.
- Update process history to real-word-1To1
II. Development Tests
Tested different real-word 1-to-1 factors on the revised gold standard (with real-word corrections included) from the training set. Each test takes about 3~5 min. (depending on the computer and memory size).
- Detector (check on focus token):
Function | Min. Length | Min. WC | Raw data (TP\|Detected\|Gold) | Performance (P\|R\|F1)
---|---|---|---|---
NW (All) | N/A | N/A | 607\|777\|964 | 0.7812\|0.6297\|0.6973
NW + RW_1To1 | 1 | 65 | 612\|786\|964 | 0.7786\|0.6349\|0.6994
NW + RW_1To1 | 2 | 65 | 612\|786\|964 | 0.7786\|0.6349\|0.6994
NW + RW_1To1 | 3 | 65 | 612\|786\|964 | 0.7786\|0.6349\|0.6994
NW + RW_1To1 | 4 | 65 | 612\|786\|964 | 0.7786\|0.6349\|0.6994
NW + RW_1To1 | 5 | 65 | 611\|783\|964 | 0.7803\|0.6338\|0.6995
NW + RW_1To1 | 6 | 65 | 609\|781\|964 | 0.7798\|0.6317\|0.6980
NW + RW_1To1 | 7 | 65 | 608\|778\|964 | 0.7815\|0.6307\|0.6980
NW + RW_1To1 | 8 | 65 | 607\|777\|964 | 0.7812\|0.6297\|0.6973
NW + RW_1To1 | 2 | 1 | 612\|786\|964 | 0.7786\|0.6349\|0.6994
NW + RW_1To1 | 2 | 10 | 612\|786\|964 | 0.7786\|0.6349\|0.6994
NW + RW_1To1 | 2 | 65 | 612\|786\|964 | 0.7786\|0.6349\|0.6994
NW + RW_1To1 | 2 | 100 | 611\|785\|964 | 0.7783\|0.6338\|0.6987
NW + RW_1To1 | 2 | 500 | 610\|784\|964 | 0.7781\|0.6328\|0.6979
NW + RW_1To1 | 2 | 1000 | 610\|782\|964 | 0.7801\|0.6328\|0.6987
NW + RW_1To1 | 2 | 10000 | 608\|778\|964 | 0.7815\|0.6307\|0.6980
- Test on Min. length:
- Increasing it improves precision and worsens recall.
- Using a small number does not increase precision.
- TPs start to drop after 5. This might result in a better or worse F1.
- No TPs from RW-1To1 when it is >= 8, because all corrections in the development set are shorter than 8 characters.
- Choose 2 for more recall with the same F1 and precision. This means a target word of length 1 is not a valid real-word for 1-To-1 correction.
- Test on Min. WC (word count):
- Increasing it improves precision, worsens recall, and speeds up the run time.
- Using a small number does not increase precision.
- Choose 1 for more recall with the same F1 and precision.
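For reference, the Performance columns in these tables follow from the raw counts, read here as (TP, total detected, gold total). A minimal sketch (class and method names are mine, not CSpell's):

```java
// Sketch showing how the Performance columns (precision | recall | F1)
// follow from the Raw data columns, read as TP | total detected | gold total.
public class EvalMetricsSketch {
    public static double precision(int tp, int detected) {
        return (double) tp / detected;
    }
    public static double recall(int tp, int gold) {
        return (double) tp / gold;
    }
    public static double f1(double p, double r) {
        return 2.0 * p * r / (p + r);
    }
}
```

For the NW (All) row (607|777|964): precision = 607/777 ≈ 0.7812, recall = 607/964 ≈ 0.6297, and F1 ≈ 0.6973, matching the table.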
- Candidates (check on candidates):
Function | Min. Length | Min. WC | Raw data (TP\|Detected\|Gold) | Performance (P\|R\|F1)
---|---|---|---|---
NW (All) | N/A | N/A | 607\|777\|964 | 0.7812\|0.6297\|0.6973
NW + RW_1To1 | 1 | 1 | 612\|786\|964 | 0.7786\|0.6349\|0.6994
NW + RW_1To1 | 2 | 1 | 612\|786\|964 | 0.7786\|0.6349\|0.6994
NW + RW_1To1 | 3 | 1 | 612\|787\|964 | 0.7776\|0.6349\|0.6990
NW + RW_1To1 | 4 | 1 | 612\|785\|964 | 0.7796\|0.6349\|0.6998
NW + RW_1To1 | 5 | 1 | 612\|785\|964 | 0.7796\|0.6349\|0.6998
NW + RW_1To1 | 6 | 1 | 609\|779\|964 | 0.7818\|0.6317\|0.6988
NW + RW_1To1 | 7 | 1 | 608\|778\|964 | 0.7815\|0.6307\|0.6980
NW + RW_1To1 | 2 | 1 | 612\|786\|964 | 0.7786\|0.6349\|0.6994
NW + RW_1To1 | 2 | 10 | 612\|787\|964 | 0.7776\|0.6349\|0.6990
NW + RW_1To1 | 2 | 100 | 612\|791\|964 | 0.7737\|0.6349\|0.6974
NW + RW_1To1 | 2 | 1000 | 611\|791\|964 | 0.7724\|0.6338\|0.6963
NW + RW_1To1 | 2 | 10000 | 608\|782\|964 | 0.7775\|0.6307\|0.6964
- Candidate Min. length:
- Increasing it improves precision and worsens recall.
- Past a certain threshold, both recall and precision drop.
- F1 is best at 4-5 because all TPs have length >= 4 (see example below).
- This number must be coordinated with the min. focus-word length.
- Choose 2 (a candidate of length 1 is not a valid candidate).
- Candidate Min. WC:
- Increasing it improves precision and worsens recall.
- Choose 1 (corrections might occur at small WC).
- Rankers - confidence factor for selecting and validating the top candidate:
Function | C Factor | C Score | F Score | Raw data (TP\|Detected\|Gold) | Performance (P\|R\|F1)
---|---|---|---|---|---
NW (All) | N/A | N/A | N/A | 607\|777\|964 | 0.7812\|0.6297\|0.6973
NW + RW_1To1 | 0.00 | 0.01\|0.00\|0.085\|-0.085 | 0.035\|0.0006\|0.02 | 612\|786\|964 | 0.7786\|0.6349\|0.6994
NW + RW_1To1 | 0.01 | 0.01\|0.00\|0.085\|-0.085 | 0.035\|0.0006\|0.02 | 612\|789\|964 | 0.7757\|0.6349\|0.6982
NW + RW_1To1 | 0.10 | 0.01\|0.00\|0.085\|-0.085 | 0.035\|0.0006\|0.02 | 612\|813\|964 | 0.7528\|0.6349\|0.6888
NW + RW_1To1 | 0.50 | 0.01\|0.00\|0.085\|-0.085 | 0.035\|0.0006\|0.02 | 612\|998\|964 | 0.6132\|0.6349\|0.6239
NW + RW_1To1 | 0.00 | 0.01\|0.00\|0.085\|-0.085 | 0.035\|0.0006\|0.02 | 612\|786\|964 | 0.7786\|0.6349\|0.6994
NW + RW_1To1 | 0.00 | 0.10\|0.00\|0.085\|-0.085 | 0.035\|0.0006\|0.02 | 612\|786\|964 | 0.7786\|0.6349\|0.6994

... TBD ...
- Confidence Factor:
- A very strict restriction on the confidence factor is needed to eliminate FPs.
- Choose a C factor of 0.00 (the top candidate is only valid when the focus token has a negative context score and the top candidate has a positive one).
III. Observations from Development test set (F1 = 0.6994)
- [TP] real-word 1-To-1 corrections:
ID | Source | Detected Word | Corrected Word | Text | Notes
---|---|---|---|---|---
TP-1 | 11225 | weather | whether | from one Person to another. Weather it can happen or |
TP-2 | 11597 | bowl | bowel | irregular bowl movements. |
TP-3 | 12748 | effect | affect | what is TSD/Clubfoot, and how does it effect a baby |
TP-4 | 13922 | their | there | in the Chicago area hospitals is their a surgeon familiar with the shoudice |
TP-5 | 17713 | small | smell | lost ability to taste and small, and who is profoundly depressed | smell size

Example: smell vs. small
- taste and small, foul small, bad small, small an odor, sense of small
- smell size, smell amounts, a smell sip of water, smeller amounts, smell intestine
- [FP] real-word 1-To-1:
ID | Source | Detected Word | Corrected Word | Text
---|---|---|---|---
FP-1 | 10349 | please | place | ...give me good advice please
FP-3 | 18855 | head | had | ... backalso inner head pain.com
FP-4 | 2 | causes | cases | What are some causes of anorexia
- FP-3: The corpus has more occurrences of "also and had" than "inner head".
- FP-4: With "some causes of anorexia" alone, no correction occurs, but adding "are" makes "causes" get corrected to "cases". However, the system behaves correctly for "What are some causes of pain" and "What are causes of anorexia".
- [FN] real-word 1-To-1:
ID | Source | Focus Word | Corrected Word | Text
---|---|---|---|---
FN-1 | 32 | then | than |
FN-2 | 51 | thing | think |
FN-3 | 10138 | know | now |
FN-4 | 10375 | tried | tired |
FN-5 | 10934 | specially | especially |
FN-6 | 11186 | repot | report |
FN-7 | 11378 | then | than | Is Radioiodine treatment better then surgery for me?
FN-8 | 16734 | weather | whether | I was particularly interested in learning weather parents should be worried about cribs death
FN-9 | 12286 | lesson | lessen | What can I do to lesson the severity of the adema
FN-10 | 12757 | pregnancy | pregnant |
FN-11 | 12788 | leave | live |
FN-12 | 15759 | tent | tend |
FN-13 | 16256 | access | excess |
FN-14 | 16297 | loosing | losing |

- FN-9: "lesson" is not in the word2Vec corpus.
=> Only "lessons" is in it. Maybe use inflVars for detection.
=> A much bigger corpus is needed for word2Vec.
=> Word2vec is very good on precision; however, the corpus used for training has to include the relevant information (words and their contexts).