12085.txt: gaurdian's -> guardian's
Redesign the model to handle possessives systematically:
- Issue: Correct the main word and keep the possessive
- Current: xxx's is in dictionary, corpus, WordVec
- Proposal: remove the possessive and use only the root word
- Assumption: there is very little chance of mistyping 's (namely, no 's is a typo; however, a typo might occur in the root word)
- Use the root word to check for a valid word (done); TBD: candidates, score, ...
- The possessive is not as important as the root in NLP
- Implement a possessiveObj and possessiveUtil
|
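The proposal above (strip the possessive, correct the root, re-attach the possessive) can be sketched as follows. This is only an illustration under stated assumptions: `split_possessive`, `correct_with_possessive`, and the `correct_root` callback are hypothetical names, not the actual possessiveObj / possessiveUtil design.

```python
# Hypothetical sketch: names and behavior are assumptions for illustration only.
POSSESSIVE_SUFFIXES = ("'s", "\u2019s")  # straight and curly apostrophe forms


def split_possessive(token):
    """Return (root, suffix); suffix is the trailing possessive or ''."""
    for suffix in POSSESSIVE_SUFFIXES:
        if token.lower().endswith(suffix) and len(token) > len(suffix):
            return token[:-len(suffix)], token[-len(suffix):]
    return token, ""


def correct_with_possessive(token, correct_root):
    """Correct only the root word, then re-attach the original possessive."""
    root, suffix = split_possessive(token)
    return correct_root(root) + suffix
```

For example, with a toy corrector that maps "gaurdian" to "guardian", `correct_with_possessive("gaurdian's", ...)` returns "guardian's" while the dictionary lookup only ever sees the root word.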
| 10 | Case sensitive correction | - 12969.txt: cysys -> (Cyss) -> cysts
- 17756.txt: stil -> STIL -> still
- 32.txt: piruvate -> PruVate -> pyruvate
| Correct the main word and keep the possessive |
|
11 | Use Metaphone if it is the same | - 86.txt: trisomie -> trisomy
- 10475.txt: diagnost -> diagnosed
| If the graphic rankings are similar and the Metaphone codes are the same, use it
|
|
12 | Ignore case for Pre-Correction | | Case should be ignored for units in the Pre-Correction split
|
|
13 | Performance Test Tool should take care of spVars | - can't -> can not
- home town -> hometown
| Need to consider spelling variants as correct answers in the evaluation tools
|
|
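One way the evaluation could accept spelling variants is sketched below. The variant groups and the names `build_variant_index` and `is_correct` are assumptions made for illustration; they are not part of the existing Performance Test Tool.

```python
# Hypothetical sketch: treat a prediction as correct if it equals the gold
# answer or belongs to the same spelling-variant group.
def build_variant_index(groups):
    """Map each lowercased variant to the set of all variants in its group.

    If a term appears in several groups, the last group wins (a simplification).
    """
    index = {}
    for group in groups:
        members = frozenset(term.lower() for term in group)
        for term in members:
            index[term] = members
    return index


def is_correct(predicted, gold, index):
    """Case-insensitive match against the gold answer or any of its variants."""
    p, g = predicted.lower(), gold.lower()
    return p == g or p in index.get(g, frozenset())
```

With groups such as ("can't", "can not") and ("home town", "hometown"), a prediction of "can not" would count as correct against a gold answer of "can't".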
14 | Check Split case | - friendshare -> friend share
- aftercaremail -> aftercare mail
- unknowledgeable
| Need to check that these are split correctly
|
|
15 | Use Noisy Channel to rank merge | | Need more merge cases to test
|
|
16 | Handle possessive in the coreTerm | | A better and more graceful software design
|
|
17 | Special Pattern Issues in Context Score | - 16734.txt: [CONTACT] -> [EMAIL]
| - Three special Patterns in context: [NUM], [EMAIL], [URL]
- The test data include [CONTACT], which could be [EMAIL] or [NUM].
- They need to be synchronized
- Also, the CoreTerm operation changes [CONTACT] to "contact", which needs to be handled differently.
|
|
18 | Add numbers and ordinals (1st, 2nd, 3rd, etc.) to the merge dictionary | - 13423.txt: 3rd stage -> 3rd-stage
| Need more merge cases to test
|
|
19 | Add the max. word length for rw/nw split | | Need to prevent wasting time on splitting long words
|
|
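A length guard of this kind might look like the sketch below. It only covers two-way splits, and `split_candidates`, the `dictionary` set, and the default `max_length=30` are assumptions for illustration rather than the tool's actual configuration.

```python
# Hypothetical sketch: skip split generation for overly long words.
def split_candidates(word, dictionary, max_length=30):
    """Return all two-way splits whose halves are both dictionary words.

    Words longer than max_length are skipped to avoid wasted work.
    """
    if len(word) > max_length:
        return []
    return [
        (word[:i], word[i:])
        for i in range(1, len(word))
        if word[:i] in dictionary and word[i:] in dictionary
    ]
```

The guard costs one length comparison per word and bounds the number of split positions examined for any single token.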
20 | Bigger corpus | | Need a bigger and more complete corpus for word2Vec and suggDic. "lesson" cannot be corrected to "lessen" because the WC of "lesson" is 2 and thus "lesson" does not have w2v.
|
|
21 | Skip context of a real-word correction | | Real-word correction uses the context score, which assumes the context is correct when there is a real-word correction. Thus, the tokens in that context should be marked and not corrected again in real-word correction. The order of RW: merge -> split -> 1-to-1.
|
|
22 | Update context if there is a correction | | If there are multiple real-word corrections in a sentence within a context window, each corrected token should be updated so that the following real-word correction can use the correct context.
|
|
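The idea of feeding earlier corrections into later context windows can be sketched as follows. `correct_sequentially` and the `correct_token(token, context)` callback are hypothetical; the real context score is not modeled here.

```python
# Hypothetical sketch: correct tokens left to right so that each correction
# is visible in the context of the tokens that follow.
def correct_sequentially(tokens, correct_token, window=2):
    """Apply correct_token to each token using an up-to-date context window."""
    out = list(tokens)
    for i in range(len(out)):
        left = out[max(0, i - window):i]   # already-corrected tokens
        right = out[i + 1:i + 1 + window]  # not yet corrected
        out[i] = correct_token(out[i], left + right)
    return out
```

Because `out` is updated in place, a token corrected at position i appears in its corrected form in the left context of position i + 1.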
23 | Make swap score smaller in EditDist Score | | It seems a swap should have a smaller edit distance
|
|
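A cheaper swap can be expressed as a weighted edit distance. The sketch below uses the optimal string alignment variant of Damerau-Levenshtein; the function name and the costs `swap_cost=0.5` and `other_cost=1.0` are assumed values for illustration, not the weights used in EditDist Score.

```python
# Hypothetical sketch: edit distance where an adjacent swap costs less than
# an insertion, deletion, or substitution.
def edit_distance(a, b, swap_cost=0.5, other_cost=1.0):
    """Optimal-string-alignment distance with a reduced transposition cost."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i * other_cost  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j * other_cost  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0.0 if a[i - 1] == b[j - 1] else other_cost
            d[i][j] = min(
                d[i - 1][j] + other_cost,  # deletion
                d[i][j - 1] + other_cost,  # insertion
                d[i - 1][j - 1] + sub,     # match or substitution
            )
            # Adjacent transposition, e.g. "au" <-> "ua"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + swap_cost)
    return d[-1][-1]
```

Under these assumed costs, "gaurdian" -> "guardian" (one swap) scores lower than "stil" -> "still" (one insertion), so swap candidates rank closer to the input.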
24 | Enhancement: "imple ment ation" is merged to "implementimplementation" | | Merged twice without the correct context:
- implement (imple)
- implementation(ation)
, but the context is "implement ment ation"; nonEmptyList needs to be updated right away when there is a merge.
This is fixed by taking care of contain/overlap for all mergeObj before the merge. A better solution is to correct the text as soon as a merge happens (instead of applying all merges at one time).
|
|
25 | Change all ranking to CSpell Score | | All ranking should use cSpell score
|
|
26 | Change flat files to database or inversion file system | | Requires fast init time and small footprint
|
|
27 | add feature of reading str from a specified field | |
|
|
28 | add feature of keeping input str | | | Done
|
29 | add maxLength of 1To1 Candidate to config file | | - CS_CAN_NW_1TO1_WORD_MAX_LENGTH
- CS_CAN_RW_1TO1_WORD_MAX_LENGTH
| Done
|
30 | speed optimization | | |
|
31 | Add orthographic weighting factors in config | | - default value: 1.0, 0.7, 0.8
| Done
|
32 | Add get non-word candidates API | | - both stage 1 and stage 2 candidates
- only candidates in stage 2
| 33 | Add Is non-word (detection) API | | - In the dictionary
- exclude those errors that can't find correction?
|
|
|