Real-word Spelling (1-To-1)
This page describes the processes for real-word spelling (1-to-1) detection and correction.
I. Processes
- Detector:
RealWord1To1Detector.java
- Not corrected previously in the CSpell pipeline.
- real-word: valid word (in checkDic)
- Not one of the exception types: digit, punctuation, digit/punctuation, URL, email, empty string, measurement, properNoun, abbreviation/acronym
- word has context score
- word WC >= 65 (configurable: CS_DETECTOR_RW_1TO1_WORD_MIN_WC)
- word length >= 2 (configurable: CS_DETECTOR_RW_1TO1_WORD_MIN_LENGTH)
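The detector criteria above can be sketched as a single eligibility test. This is a minimal sketch, assuming the dictionary lookup, exception typing, and word-count lookups are done elsewhere; the class, method, and parameter names are illustrative, not the actual CSpell API.

```java
// Sketch of the detector's eligibility test described above. Names are
// illustrative (not CSpell's API); thresholds mirror the listed defaults.
public class RealWord1To1DetectorSketch {
    static final long WORD_MIN_WC = 65;    // CS_DETECTOR_RW_1TO1_WORD_MIN_WC
    static final int WORD_MIN_LENGTH = 2;  // CS_DETECTOR_RW_1TO1_WORD_MIN_LENGTH

    // A token is a real-word 1-to-1 detection target only if it is a valid
    // dictionary word, is not an exception type, has word2vec context
    // information, and meets the word-count and length thresholds.
    public static boolean isDetectTarget(String word, boolean inCheckDic,
            boolean isExceptionType, boolean hasContextScore, long wordCount) {
        return inCheckDic
            && !isExceptionType
            && hasContextScore
            && (wordCount >= WORD_MIN_WC)
            && (word.length() >= WORD_MIN_LENGTH);
    }
}
```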
- Candidates:
RealWord1To1Candidates.java
- Max. length of real-word <= 10 (configurable: CS_CAN_RW_1TO1_WORD_MAX_LENGTH)
Real-word 1-to-1 candidates are only generated for words up to this length, to prevent over-generation and slow performance. Recall decreases if this number is too small (in exchange for faster speed).
- Generate all possible candidates, as in the non-word correction
- Filter out invalid candidates (IsValid1To1Cand)
=> Ideally, we only correct a real-word with candidates that are very similar to the inWord, that is, they look (orthographic) and sound (phonetic) alike. If we loosen this restriction, real-word correction relies mainly on the context score (word2vec). In this version, our corpus for word2vec is relatively small, so it generates too much noise [FP] and results in low precision and F1. The looks-and-sounds-alike restriction also helps (a little) with run-time performance (fewer context score calculations in ranking).
- in suggDic (valid word)
- has context score (word2Vec)
- WC >= 1 (has word count, configurable: CS_CAN_RW_1TO1_CAND_MIN_WC)
- length >= 2 (configurable: CS_CAN_RW_1TO1_CAND_MIN_LENGTH)
- candidate is not an inflectional variant of inWord
In this version, we do not correct grammar, so no inflectional variants (such as plural nouns, 3rd-person singular verbs, etc.) are corrected.
- Heuristic rules of looks and sounds alike:
- sounds alike: both phonetic codes of double metaphone and refined soundex must be the same
- same double metaphone code (pmDist = 0)
- same refined soundex code (prDist = 0)
- looks alike: small edit distances combined with similar sounds
- leadDist + endDist + lengthDist + pmDist + prDist < 3
- editDist + pmDist + prDist < 4
- same double metaphone code (pmDist = 0)
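The heuristic rules above reduce to simple threshold checks. Below is a minimal sketch of just that threshold logic; the distance inputs (edit distance, phonetic code distances, etc.) are assumed to be computed elsewhere, and the class and method names are illustrative, not CSpell's own.

```java
// Sketch of the looks/sounds-alike heuristics used by IsValid1To1Cand.
// Only the threshold logic from the rules above is shown; the distance
// inputs are assumed given.
public class SimilarityHeuristicSketch {
    // Sounds alike: identical double metaphone AND refined soundex codes.
    public static boolean soundsAlike(int pmDist, int prDist) {
        return (pmDist == 0) && (prDist == 0);
    }

    // Looks alike: small edit distances combined with similar sounds.
    public static boolean looksAlike(int leadDist, int endDist, int lengthDist,
            int editDist, int pmDist, int prDist) {
        return (leadDist + endDist + lengthDist + pmDist + prDist < 3)
            && (editDist + pmDist + prDist < 4)
            && (pmDist == 0);  // same double metaphone code
    }
}
```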
- Key size in the HashMap that stores real-word 1-To-1 candidates in memory: 1,000,000,000 (configurable: CS_CAN_RW_1TO1_CAND_MAX_KEY_SIZE)
Run-time performance is slow because there are too many real-words and candidates; generating all possible candidates on the fly is expensive. To resolve this issue, we save generated candidates (values) with the real-word (key) in memory (in a HashMap). Our test showed the elapsed time improved from 25+ min. to 3.5 min. on the training set. This works because:
- many real-words are repeated
- the candidates of a given real-word are always the same
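The caching scheme above can be sketched as a memoized lookup. This is a minimal sketch under the stated assumptions; the candidate generator is a placeholder, and the class, field, and method names are illustrative, not CSpell's own.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the candidate cache described above. Because many real-words
// repeat and a given real-word always yields the same candidate set, the
// generated candidates (value) are memoized per real-word (key) in a
// HashMap. The key limit mirrors CS_CAN_RW_1TO1_CAND_MAX_KEY_SIZE.
public class CandCacheSketch {
    static final long MAX_KEY_SIZE = 1000000000L;
    private final Map<String, Set<String>> cache = new HashMap<>();
    public int generateCalls = 0;  // counts on-the-fly generations, for illustration

    public Set<String> getCandidates(String realWord) {
        Set<String> cands = cache.get(realWord);
        if (cands == null) {
            cands = generate(realWord);         // expensive on-the-fly step
            if (cache.size() < MAX_KEY_SIZE) {  // bound the memory use
                cache.put(realWord, cands);
            }
        }
        return cands;
    }

    // Placeholder for the real 1-to-1 candidate generator.
    private Set<String> generate(String realWord) {
        generateCalls++;
        return new HashSet<>();
    }
}
```

Repeated real-words then cost only a hash lookup, which is where the 25+ min. to 3.5 min. improvement comes from.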
- Ranker:
RankRealWord1To1ByCSpell.java
- Find the top-ranked candidate
Sort the candidates by CSpellScoreRw1To1Comparator.java:
OrthographicScoreComparator
The top-ranked candidate (highest orthographic score) must also have the highest of the following scores in the candidate list:
- FrequencyScore
- EditDistSimilarityScore
- PhoneticSimilarityScore
- OverlapSimilarityScore
- Validate the top-ranked candidate
Use the context score to validate the top-ranked candidate (IsTopCandValid):
- context radius = 2 (configurable: CS_RW_1TO1_CONTEXT_RADIUS)
- Set RealWord_1To1_Confidence_Factor = 0.0 (configurable: CS_RANKER_RW_1TO1_C_FAC) for a stricter restriction that avoids false-positive candidates
- orgScore < 0
- & topScore > 0
- Context Score Check (on min., distance, and ratio)
- Min: topScore > rw1To1CandMinCs (0.00, configurable: CS_RANKER_RW_1TO1_CAND_MIN_CS)
- Dist: topScore - orgScore > rw1To1CandCsDist (0.085, configurable: CS_RANKER_RW_1TO1_CAND_CS_DIST)
- Ratio: (topScore/-orgScore) > rw1To1CandCsFactor (0.1, configurable: CS_RANKER_RW_1TO1_CAND_CS_FAC)
- Min: orgScore > rw1To1WordMinCs (-0.085, configurable: CS_RANKER_RW_1TO1_WORD_MIN_CS)
- Frequency Score Check (on min., distance, and ratio)
- Min: topFScore > rw1To1CandMinFs (0.0006, configurable: CS_RANKER_RW_1TO1_CAND_MIN_FS)
- Dist: topFScore > orgFScore, or (orgFScore - topFScore) < rw1To1CandFsDist (0.02, configurable: CS_RANKER_RW_1TO1_CAND_FS_DIST)
- Ratio: (topFScore/orgFScore) > rw1To1CandFsFactor (0.035, configurable: CS_RANKER_RW_1TO1_CAND_FS_FAC)
- & topScore < 0 & topScore * RealWord_1To1_Confidence_Factor > orgScore
- orgScore > 0
- & topScore * RealWord_1To1_Confidence_Factor > orgScore
=> Never happens because RealWord_1To1_Confidence_Factor is 0.0
- orgScore = 0
- No real-word 1-to-1 correction, because such words are excluded by the detector (no word2Vec information on the inspected word)
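The validation decision tree above can be sketched as follows. This is a simplified sketch, not CSpell's actual code: the detailed context-score and frequency-score threshold checks are collapsed into a single boolean parameter, and all names are illustrative.

```java
// Sketch of the IsTopCandValid decision tree listed above. With the default
// confidence factor of 0.0, the orgScore > 0 branch can never accept a
// candidate, and orgScore == 0 cases are already excluded by the detector.
public class TopCandValidSketch {
    static final double C_FACTOR = 0.0;  // CS_RANKER_RW_1TO1_C_FAC

    public static boolean isTopCandValid(double orgScore, double topScore,
            boolean passesScoreChecks) {
        if (orgScore < 0.0) {
            if (topScore > 0.0) {
                return passesScoreChecks;  // context + frequency score checks
            }
            // topScore < 0 branch, as listed above
            return (topScore < 0.0) && (topScore * C_FACTOR > orgScore);
        } else if (orgScore > 0.0) {
            return topScore * C_FACTOR > orgScore;  // never true when C_FACTOR = 0.0
        }
        return false;  // orgScore == 0: excluded by the detector
    }
}
```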
- Corrector:
OneToOneCorrector.java
- Update the focused (inspected) token with the top ranked candidate.
- Update process history to real-word-1To1
II. Development Tests
Tested different real-word 1-to-1 factors on the revised gold standard (with real-word corrections included) from the training set. Each test takes about 3~5 min. (depending on the computer and memory size).
- Detector (check on focus token):
Function | Min. Length | Min. WC | Raw data (TP\|Detected\|Gold) | Performance (P\|R\|F1)
---|---|---|---|---
NW (All) | N/A | N/A | 607\|777\|964 | 0.7812\|0.6297\|0.6973
NW + RW_1To1 | 1 | 65 | 612\|786\|964 | 0.7786\|0.6349\|0.6994
NW + RW_1To1 | 2 | 65 | 612\|786\|964 | 0.7786\|0.6349\|0.6994
NW + RW_1To1 | 3 | 65 | 612\|786\|964 | 0.7786\|0.6349\|0.6994
NW + RW_1To1 | 4 | 65 | 612\|786\|964 | 0.7786\|0.6349\|0.6994
NW + RW_1To1 | 5 | 65 | 611\|783\|964 | 0.7803\|0.6338\|0.6995
NW + RW_1To1 | 6 | 65 | 609\|781\|964 | 0.7798\|0.6317\|0.6980
NW + RW_1To1 | 7 | 65 | 608\|778\|964 | 0.7815\|0.6307\|0.6980
NW + RW_1To1 | 8 | 65 | 607\|777\|964 | 0.7812\|0.6297\|0.6973
NW + RW_1To1 | 2 | 1 | 612\|786\|964 | 0.7786\|0.6349\|0.6994
NW + RW_1To1 | 2 | 10 | 612\|786\|964 | 0.7786\|0.6349\|0.6994
NW + RW_1To1 | 2 | 65 | 612\|786\|964 | 0.7786\|0.6349\|0.6994
NW + RW_1To1 | 2 | 100 | 611\|785\|964 | 0.7783\|0.6338\|0.6987
NW + RW_1To1 | 2 | 500 | 610\|784\|964 | 0.7781\|0.6328\|0.6979
NW + RW_1To1 | 2 | 1000 | 610\|782\|964 | 0.7801\|0.6328\|0.6987
NW + RW_1To1 | 2 | 10000 | 608\|778\|964 | 0.7815\|0.6307\|0.6980
- Test on Min. length:
- Increasing it improves precision and worsens recall.
- Using a small number does not increase precision.
- TPs start to drop after 5. This might result in a better or worse F1.
- No TPs from RW-1To1 when it is >= 8, because all corrections in the development set are shorter than 8 characters.
- Choose 2 for more recall with the same F1 and precision. This means a target word of length 1 is not a valid real-word for 1-To-1 correction.
- Test on Min. WC (word count):
- Increasing it improves precision, worsens recall, and speeds up the run time.
- Using a small number does not increase precision.
- Choose 1 for more recall with the same F1 and precision.
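For reference, the Performance columns in these tables follow from the raw counts, read here as (TP, total detected, gold total). A minimal sketch (class and method names are mine, not CSpell's):

```java
// Sketch showing how the Performance columns (precision | recall | F1)
// follow from the Raw data columns, read as TP | total detected | gold total.
public class EvalMetricsSketch {
    public static double precision(int tp, int detected) {
        return (double) tp / detected;
    }
    public static double recall(int tp, int gold) {
        return (double) tp / gold;
    }
    public static double f1(double p, double r) {
        return 2.0 * p * r / (p + r);
    }
}
```

For the NW (All) row (607|777|964): precision = 607/777 ≈ 0.7812, recall = 607/964 ≈ 0.6297, and F1 ≈ 0.6973, matching the table.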
- Candidates (check on candidates):
Function | Min. Length | Min. WC | Raw data (TP\|Detected\|Gold) | Performance (P\|R\|F1)
---|---|---|---|---
NW (All) | N/A | N/A | 607\|777\|964 | 0.7812\|0.6297\|0.6973
NW + RW_1To1 | 1 | 1 | 612\|786\|964 | 0.7786\|0.6349\|0.6994
NW + RW_1To1 | 2 | 1 | 612\|786\|964 | 0.7786\|0.6349\|0.6994
NW + RW_1To1 | 3 | 1 | 612\|787\|964 | 0.7776\|0.6349\|0.6990
NW + RW_1To1 | 4 | 1 | 612\|785\|964 | 0.7796\|0.6349\|0.6998
NW + RW_1To1 | 5 | 1 | 612\|785\|964 | 0.7796\|0.6349\|0.6998
NW + RW_1To1 | 6 | 1 | 609\|779\|964 | 0.7818\|0.6317\|0.6988
NW + RW_1To1 | 7 | 1 | 608\|778\|964 | 0.7815\|0.6307\|0.6980
NW + RW_1To1 | 2 | 1 | 612\|786\|964 | 0.7786\|0.6349\|0.6994
NW + RW_1To1 | 2 | 10 | 612\|787\|964 | 0.7776\|0.6349\|0.6990
NW + RW_1To1 | 2 | 100 | 612\|791\|964 | 0.7737\|0.6349\|0.6974
NW + RW_1To1 | 2 | 1000 | 611\|791\|964 | 0.7724\|0.6338\|0.6963
NW + RW_1To1 | 2 | 10000 | 608\|782\|964 | 0.7775\|0.6307\|0.6964
- Candidate Min. length:
- Increasing it improves precision and worsens recall.
- Past a certain threshold, both recall and precision drop.
- F1 is best at 4-5 because all TPs have length >= 4 (see example below).
- This number must be coordinated with the min. focus-word length.
- Choose 2 (a candidate of length 1 is not a valid candidate).
- Candidate Min. WC:
- Increasing it improves precision and worsens recall.
- Choose 1 (corrections might occur at small WC).
- Rankers - confidence factor for selecting and validating the top candidate:
Function | C Factor | C Score | F Score | Raw data (TP\|Detected\|Gold) | Performance (P\|R\|F1)
---|---|---|---|---|---
NW (All) | N/A | N/A | N/A | 607\|777\|964 | 0.7812\|0.6297\|0.6973
NW + RW_1To1 | 0.00 | 0.01\|0.00\|0.085\|-0.085 | 0.035\|0.0006\|0.02 | 612\|786\|964 | 0.7786\|0.6349\|0.6994
NW + RW_1To1 | 0.01 | 0.01\|0.00\|0.085\|-0.085 | 0.035\|0.0006\|0.02 | 612\|789\|964 | 0.7757\|0.6349\|0.6982
NW + RW_1To1 | 0.10 | 0.01\|0.00\|0.085\|-0.085 | 0.035\|0.0006\|0.02 | 612\|813\|964 | 0.7528\|0.6349\|0.6888
NW + RW_1To1 | 0.50 | 0.01\|0.00\|0.085\|-0.085 | 0.035\|0.0006\|0.02 | 612\|998\|964 | 0.6132\|0.6349\|0.6239
NW + RW_1To1 | 0.00 | 0.01\|0.00\|0.085\|-0.085 | 0.035\|0.0006\|0.02 | 612\|786\|964 | 0.7786\|0.6349\|0.6994
NW + RW_1To1 | 0.00 | 0.10\|0.00\|0.085\|-0.085 | 0.035\|0.0006\|0.02 | 612\|786\|964 | 0.7786\|0.6349\|0.6994

... TBD ...
- Confidence Factor:
- A very strict restriction on the confidence factor is needed to eliminate FPs.
- Choose a C factor of 0.00 (the top candidate is only valid when the focus token has a negative context score and the top candidate has a positive one).
III. Observations from Development test set (F1 = 0.6994)
- [TP] real-word 1-To-1 corrections:
ID | Source | Detected Word | Corrected Word | Text | Notes
---|---|---|---|---|---
TP-1 | 11225 | weather | whether | from one Person to another. Weather it can happen or |
TP-2 | 11597 | bowl | bowel | irregular bowl movements. |
TP-3 | 12748 | effect | affect | what is TSD/Clubfoot, and how does it effect a baby |
TP-4 | 13922 | their | there | in the Chicago area hospitals is their a surgeon familiar with the shoudice |
TP-5 | 17713 | small | smell | lost ability to taste and small, and who is profoundly depressed | smell size

Example: smell vs. small
- taste and small, foul small, bad small, small an odor, sense of small
- smell size, smell amounts, a smell sip of water, smeller amounts, smell intestine
- [FP] real-word 1-To-1:
ID | Source | Detected Word | Corrected Word | Text
---|---|---|---|---
FP-1 | 10349 | please | place | ...give me good advice please
FP-3 | 18855 | head | had | ... backalso inner head pain.com
FP-4 | 2 | causes | cases | What are some causes of anorexia
- FP-3: The corpus has more occurrences of "also and had" than "inner head".
- FP-4: With "some causes of anorexia" alone, no correction occurs, but adding "are" makes "causes" get corrected to "cases". However, the system behaves correctly for "What are some causes of pain" and "What are causes of anorexia".
- [FN] real-word 1-To-1:
ID | Source | Focus Word | Corrected Word | Text
---|---|---|---|---
FN-1 | 32 | then | than |
FN-2 | 51 | thing | think |
FN-3 | 10138 | know | now |
FN-4 | 10375 | tried | tired |
FN-5 | 10934 | specially | especially |
FN-6 | 11186 | repot | report |
FN-7 | 11378 | then | than | Is Radioiodine treatment better then surgery for me?
FN-8 | 16734 | weather | whether | I was particularly interested in learning weather parents should be worried about cribs death
FN-9 | 12286 | lesson | lessen | What can I do to lesson the severity of the adema
FN-10 | 12757 | pregnancy | pregnant |
FN-11 | 12788 | leave | live |
FN-12 | 15759 | tent | tend |
FN-13 | 16256 | access | excess |
FN-14 | 16297 | loosing | losing |

- FN-9: "lesson" is not in the word2Vec corpus.
=> Only "lessons" is in it. Maybe use inflVars for detection.
=> A much bigger corpus is needed for word2Vec.
=> Word2vec is very good on precision; however, the corpus used for training has to include the relevant information (words and their contexts).