CSpell

To Do List (after CSpell.2018)

This page contains the to-do list for feature enhancements and bug fixes:

ID  Description  Example  Notes  Status
1. Should CoreTermUtil handle a term that ends with "." followed by ","?
  • i.e., -> i.e.
Find examples in the Lexicon to exclude this, or use a pattern of ., .?
2. Terms that include "/"
  • nnol/d -> mmol/d
  • a/b -> a/b
Should be included in checking and correction
3. Special tags
  • [ORGANIZATION]
  • [DATE]
  • [NAME]
  • [CONTACT]
  • [LOCATION]
  • [NB: THE HELL]
Treat these tags as valid tokens
4. Remove non-English files from the training set
  • 11199.txt (French)
Move to 11199.txt.rm
Done
5. Convert non-ASCII Unicode to ASCII
  • 13090.txt (Italian): doesn’t -> doesn't
From non-English-speaking customers
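The conversion in item 5 could be sketched in Python as follows. This is only an illustration: the mapping table `PUNCT_MAP` and the function `to_ascii` are hypothetical names, not part of CSpell, and the sketch assumes that dropping diacritics is acceptable for the training text.

```python
import unicodedata

# Hypothetical mapping for common typographic punctuation (an assumption,
# not CSpell's actual table).
PUNCT_MAP = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
}


def to_ascii(text: str) -> str:
    """Map typographic punctuation to ASCII, then strip diacritics."""
    text = text.translate(str.maketrans(PUNCT_MAP))
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop whatever is still not ASCII (for example, the combining marks).
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

Characters with no mapping and no ASCII decomposition are silently dropped here; a production version would probably log them instead.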
6. Terms that include "-"
  • 15835.txt: myostatin-related
  • gastreonterology-colonoscopy -> gastroenterology-colonoscopy
  • 63.txt: private-pay
  • 10054.txt: klippel-tranaunay -> klippel-trenaunay
Check every word obtained by splitting on "-" (e.g., myostatin and related)
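A minimal Python sketch of the check in item 6: split the term on "-", correct each part, and rejoin. The names `correct_hyphenated`, `toy_correct`, and `TOY_FIXES` are hypothetical; the toy lookup stands in for a real single-word corrector.

```python
from typing import Callable


def correct_hyphenated(term: str, correct_word: Callable[[str], str]) -> str:
    """Correct each "-"-separated part of a term and rejoin the parts."""
    parts = term.split("-")
    # Empty parts (from leading, trailing, or doubled hyphens) are kept as-is.
    return "-".join(correct_word(p) if p else p for p in parts)


# Toy corrector for illustration only; a real one would rank candidates.
TOY_FIXES = {"gastreonterology": "gastroenterology", "tranaunay": "trenaunay"}


def toy_correct(word: str) -> str:
    return TOY_FIXES.get(word, word)
```

For example, `correct_hyphenated("klippel-tranaunay", toy_correct)` corrects only the second part and leaves the hyphen in place.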
7. Terms that include "."
  • 16282.txt: w.b.c. -> wbc
Remove "." (and any space between words)
8. Informal contraction: I 'll -> I'll
  • 17170.txt: I 'll -> I'll
Remove the space
9. Keep the possessive after correction
  • 12085.txt: gaurdian's -> guardian's
Redesign the model to handle possessives systematically:
  • Issue: correct the main word and keep the possessive
  • Current: xxx's is in the dictionary, corpus, and WordVec
  • Proposal: remove the possessive and use only the root word
  • Assumption: there is very little chance of mistyping 's (namely, no 's is a typo; however, a typo might occur in the root word)
  • Use the root word to check for a valid word (done); TBD: candidates, score, ...
  • The possessive is not as important as the root in NLP
  • Implement a possessiveObj and possessiveUtil
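The proposal in item 9 (strip the possessive, correct the root, reattach the possessive) might look like this in Python. The function names are hypothetical and are not the planned possessiveObj/possessiveUtil API; the sketch assumes only the 's form needs handling.

```python
from typing import Callable, Tuple

# Straight and curly apostrophe forms of the possessive suffix.
POSSESSIVE_SUFFIXES = ("'s", "\u2019s")


def split_possessive(token: str) -> Tuple[str, str]:
    """Return (root, suffix); suffix is "" when the token is not possessive."""
    for suffix in POSSESSIVE_SUFFIXES:
        if token.lower().endswith(suffix) and len(token) > len(suffix):
            return token[:-len(suffix)], token[-len(suffix):]
    return token, ""


def correct_keep_possessive(token: str, correct_word: Callable[[str], str]) -> str:
    """Correct only the root word and put the original possessive back."""
    root, suffix = split_possessive(token)
    return correct_word(root) + suffix
```

With a corrector that maps "gaurdian" to "guardian", `correct_keep_possessive("gaurdian's", ...)` yields "guardian's", which matches the 12085.txt example.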
10. Case-sensitive correction
  • 12969.txt: cysys -> (Cyss) -> cysts
  • 17756.txt: stil -> STIL -> still
  • 32.txt: piruvate -> PruVate -> pyruvate
Correct the main word and keep the possessive
11. Use Metaphone if the codes are the same
  • 86.txt: trisomie -> trisomy
  • 10475.txt: diagnost -> diagnosed
If the graphic rankings are similar and the Metaphone codes are the same, use that candidate
12. Ignore case in pre-correction
  • 12MG -> 12 mg
Case should be ignored for units in the pre-correction split
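A small Python sketch of a case-insensitive number/unit split for item 12. The unit list `UNITS` and the function `split_number_unit` are assumptions made for illustration; CSpell's actual unit list is not shown on this page.

```python
import re

# Hypothetical unit list for illustration only.
UNITS = ("mg", "ml", "kg", "mcg")

# A number (optionally with decimals) immediately followed by a unit;
# IGNORECASE lets "MG", "Mg", and "mg" all match.
_NUM_UNIT_RE = re.compile(
    r"^(\d+(?:\.\d+)?)(" + "|".join(UNITS) + r")$",
    re.IGNORECASE,
)


def split_number_unit(token: str) -> str:
    """Split e.g. "12MG" into "12 mg"; return other tokens unchanged."""
    match = _NUM_UNIT_RE.match(token)
    if match is None:
        return token
    return f"{match.group(1)} {match.group(2).lower()}"
```

Lowercasing the unit follows the "12MG -> 12 mg" example above; whether the original case should be kept instead is a design choice this sketch does not settle.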
13. Performance test tool should account for spelling variants (spVars)
  • can't -> can not
  • home town -> hometown
Spelling variants need to be counted as correct answers in the evaluation tools
14. Check split cases
  • friendshare -> friend share
  • aftercaremail -> aftercare mail
  • unknowledgeable
Need to check that these are split correctly
15. Use Noisy Channel to rank merges
  • TBD
Need more merge cases to test
16. Handle possessives in the coreTerm
  • TBD
Needs a better, more graceful software design
17. Special pattern issues in context score
  • 16734.txt: [CONTACT] -> [EMAIL]
  • Three special patterns in context: [NUM], [EMAIL], [URL]
  • The test data include [CONTACT], which could be [EMAIL] or [NUM].
  • They need to be synchronized
  • Also, the CoreTerm operation changes [CONTACT] to "contact"; this needs to be handled differently.
18. Add numbers and ordinals (1st, 2nd, 3rd, etc.) to the merge dictionary
  • 13423.txt: 3rd stage -> 3rd-stage
Need more merge cases to test
19. Add a max. word length for rw/nw split
Needed to avoid wasting time on splitting long words
20. Bigger corpus
Need a bigger and more complete corpus for word2Vec and suggDic. "lesson" cannot be corrected to "lessen" because the word count (WC) of "lesson" is 2 and thus "lesson" does not have a w2v entry.
21. Skip the context of a real-word correction
Real-word correction uses the context score, which assumes the context is correct when there is a real-word correction. Thus, the tokens in that context should be marked and not corrected again in real-word correction. The order of RW: merge -> split -> 1-to-1.
22. Update the context if there is a correction
If there are multiple real-word corrections in a sentence within a context window, the corrected token should be updated so that the following real-word correction can use the correct context.
23. Make the swap score smaller in the EditDist score
It seems a swap (transposition of two adjacent characters) should have a smaller edit distance
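To illustrate item 23, here is a Python sketch of an edit distance with a configurable swap cost (the optimal string alignment variant of Damerau-Levenshtein). It is not CSpell's EditDist implementation; the function name, the cost parameters, and the value 0.5 used in the example are assumptions.

```python
def edit_distance(a: str, b: str, swap_cost: float = 1.0,
                  ins_cost: float = 1.0, del_cost: float = 1.0,
                  sub_cost: float = 1.0) -> float:
    """Weighted edit distance where adjacent transpositions cost swap_cost."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * del_cost          # delete all of a[:i]
    for j in range(1, cols):
        d[0][j] = j * ins_cost          # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            d[i][j] = min(
                d[i - 1][j] + del_cost,
                d[i][j - 1] + ins_cost,
                d[i - 1][j - 1] + (0.0 if same else sub_cost),
            )
            # Adjacent transposition: "au" in a versus "ua" in b.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + swap_cost)
    return d[rows - 1][cols - 1]
```

With the default costs, `edit_distance("gaurdian", "guardian")` counts the swap as one edit; passing `swap_cost=0.5` ranks that candidate closer than a candidate reached by a single insertion, deletion, or substitution.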
24. Enhancement: "imple ment ation" is merged to "implementimplementation"
The merge is applied twice without the correct context:
  • implement (imple)
  • implementation (ation)
However, the context is "implement ment ation"; nonEmptyList needs to be updated right away when there is a merge.

This is fixed by handling contain/overlap for all mergeObj before the merge. A better solution is to correct the text as soon as a merge happens (instead of correcting all merges at one time).
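The contain/overlap handling described for item 24 can be sketched in Python as below. `MergeSpan`, `resolve_overlaps`, and `apply_merges` are hypothetical names, not CSpell's mergeObj API, and keeping the longest span among overlapping merges is an assumption of this sketch.

```python
from typing import List, NamedTuple


class MergeSpan(NamedTuple):
    start: int   # index of the first merged token (inclusive)
    end: int     # index of the last merged token (inclusive)
    text: str    # the merged word


def resolve_overlaps(spans: List[MergeSpan]) -> List[MergeSpan]:
    """Greedily keep the longest spans and drop any span that overlaps one."""
    kept: List[MergeSpan] = []
    for span in sorted(spans, key=lambda s: (-(s.end - s.start), s.start)):
        if all(span.end < k.start or span.start > k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


def apply_merges(tokens: List[str], spans: List[MergeSpan]) -> List[str]:
    """Replace each kept span of tokens with its merged word."""
    by_start = {s.start: s for s in resolve_overlaps(spans)}
    out: List[str] = []
    i = 0
    while i < len(tokens):
        span = by_start.get(i)
        if span is not None:
            out.append(span.text)
            i = span.end + 1
        else:
            out.append(tokens[i])
            i += 1
    return out
```

For the tokens "imple", "ment", "ation" with one merge covering the first two tokens and another covering all three, only the three-token merge survives, so the output is a single "implementation" rather than "implementimplementation".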

25. Change all ranking to the CSpell score
All ranking should use the CSpell score
26. Change flat files to a database or inversion file system
Requires a fast init time and a small footprint
27. Add a feature to read the input string from a specified field
28. Add a feature to keep the input string
  • Added option -si
Done
29. Add maxLength of 1To1 candidate to the config file
  • CS_CAN_NW_1TO1_WORD_MAX_LENGTH
  • CS_CAN_RW_1TO1_WORD_MAX_LENGTH
Done
30. Speed optimization
31. Add orthographic weighting factors to the config
  • Default values: 1.0, 0.7, 0.8
Done
32. Add a get non-word candidates API
  • Both stage 1 and stage 2 candidates
  • Only candidates in stage 2
33. Add an is non-word (detection) API
  • In the dictionary
  • Exclude errors for which no correction can be found?