CSpell

To Do List (after Cspell.2018)

This page lists to-do items for feature enhancements and bug fixes:

ID	Description	Example	Notes	Status
1	CoreTermUtil should handle terms that end with . followed by ,?	i.e., -> i.e.	Find examples from the Lexicon to exclude this, or use the pattern of ., .?
2	Terms include "/"	nnol/d -> mmol/d a/b -> a/b	Should include both check and correct
3	Special tag	[ORGANIZATION] [DATE] [NAME] [CONTACT] [LOCATION] [NB: THE HELL]	Treat these tags as a valid token
4	Remove non-English files from the training set	11199.txt (French)	Moved to 11199.txt.rm	Done
5	Convert non-ASCII Unicode to ASCII	13090.txt (Italian): doesn’t -> doesn't	From non-English speaking customers
6	Terms include "-"	15835.txt: myostatin-related gastreonterology-colonoscopy -> gastroenterology-colonoscopy 63.txt: private-pay 10054.txt: klippel-tranaunay -> klippel-trenaunay	Check all words split by "-" (myostatin and related)
7	Terms include "."	16282.txt: w.b.c. -> wbc	Remove "." (and space between words)
8	Informal I 'll to I'll	17170.txt: I 'll -> I'll	Remove space
9	Retain possessive after correction	12085.txt: gaurdian's -> guardian's	Redesign the model to handle possessives systematically. Issue: correct the main word and keep the possessive. Current: xxx's is in the dictionary, corpus, and WordVec. Proposal: remove the possessive and use only the root word. Assumption: there is very little chance of mistyping 's (namely, the 's itself is not the typo, although a typo might occur in the root word). Use the root word to check for a valid word (done); TBD: candidates, score, ... The possessive is not as important as the root in NLP. Implement a possessiveObj and possessiveUtil.
10	Case sensitive correction	12969.txt: cysys -> (Cyss) -> cysts 17756.txt: stil -> STIL -> still 32.txt: piruvate -> PruVate -> pyruvate	Correct the main word and keep the possessive
11	Use Metaphone if it is the same	86.txt: trisomie -> trisomy 10475.txt: diagnost -> diagnosed	If the graphic rankings are similar and the Metaphone codes are the same, use it
12	Ignore case for Pre-Correction	12MG -> 12 mg	Case should be ignored for units in the Pre-Correction split
13	Performance Test Tool should take care of spVars	can't -> can not home town -> hometown	Need to consider spelling variants as correct answers in the evaluation tools
14	Check Split case	friendshare -> friend share aftercaremail -> aftercare mail unknowledgeable	Need to check that words are split correctly
15	Use Noisy Channel to rank merge	TBD	Need more merge cases to test
16	Handle possessive in the coreTerm	TBD	Needs a better, more graceful software design
17	Special Pattern Issues in Context Score	16734.txt: [CONTACT] -> [EMAIL]	Three special patterns in context: [NUM], [EMAIL], [URL]. The test data include [CONTACT], which could be [EMAIL] or [NUM]. They need to be synchronized. Also, the CoreTerm operation changes [CONTACT] to "contact", which needs to be handled differently.
18	Add numbers and ordinals (1st, 2nd, 3rd, etc.) to the merge dictionary	13423.txt: 3rd stage -> 3rd-stage	Need more merge cases to test
19	Add the max. word length for rw/nw split		Need to prevent wasting time on splitting long words
20	Bigger corpus		Need a bigger and more complete corpus for word2Vec and suggDic. "lesson" cannot be corrected to "lessen" because the WC of "lesson" is 2 and thus "lesson" does not have w2v.
21	Skip context of a real-word correction		Real-word correction uses the context score, which assumes the context is correct when there is a real-word correction. Thus, these tokens in the context should be marked and not corrected again in the real-word correction. The order of RW: merge -> split -> 1-to-1.
22	Update context if there is a correction		If there are multiple real-word corrections in a sentence within a context window, the corrected token should be updated so that the following real-word correction can use the correct context.
23	Make swap score smaller in EditDist Score		It seems a swap should have a smaller edit distance
24	Enhancement: "imple ment ation" is merged to "implementimplementation"		Merges twice without the correct context: implement (imple) and implementation (ation), but the context is "implement ment ation". Need to update nonEmptyList right away when there is a merge. This is fixed by taking care of contain/overlap for all mergeObj before the merge. A better solution is to correct the text as soon as a merge happens (instead of correcting all merges at one time).
25	Change all ranking to CSpell Score		All ranking should use the CSpell score
26	Change flat files to database or inversion file system		Requires fast init time and small footprint
27	add feature of reading str from a specified field
28	add feature of keeping input str		added option -si	Done
29	add maxLength of 1To1 Candidate to config file		CS_CAN_NW_1TO1_WORD_MAX_LENGTH CS_CAN_RW_1TO1_WORD_MAX_LENGTH	Done
30	speed optimization
31	Add orthographic weighting factors in config		default value: 1.0, 0.7, 0.8	Done
32	Add get non-word candidates API		Both stage 1 and stage 2 candidates; only candidates in stage 2
33	Add Is non-word (detection) API		In the dictionary; exclude those errors for which no correction can be found?
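
Illustrative sketch for items 5, 9 and 12: the Python below is not CSpell code (CSpell's own possessiveObj and possessiveUtil are only proposed above). It is a minimal sketch, under stated assumptions, of three behaviors described in the table: mapping non-ASCII punctuation to ASCII, correcting the root word while keeping the possessive, and ignoring case when splitting a number from a unit. All function names, the toy correction table, and the unit list are hypothetical and exist only for this example.

```python
import re
import unicodedata

# Hypothetical toy data; a real system would use its dictionary and ranking.
TOY_CORRECTIONS = {"gaurdian": "guardian"}
TOY_UNITS = {"mg", "ml", "kg"}

# Curly quotes mapped to their ASCII equivalents (item 5).
QUOTE_MAP = {0x2018: "'", 0x2019: "'", 0x201C: '"', 0x201D: '"'}


def to_ascii(text):
    """Replace curly quotes, then drop any remaining non-ASCII marks."""
    text = text.translate(QUOTE_MAP)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def split_possessive(token):
    """Return (root, suffix); suffix is "'s" or the empty string."""
    match = re.match(r"^(.+)('s)$", token, re.IGNORECASE)
    if match:
        return match.group(1), match.group(2)
    return token, ""


def correct_keep_possessive(token, corrections):
    """Correct only the root word and reattach the possessive (item 9)."""
    root, suffix = split_possessive(token)
    fixed = corrections.get(root.lower(), root)
    return fixed + suffix


def split_number_unit(token, units=TOY_UNITS):
    """Split digits from a known unit, ignoring the unit's case (item 12)."""
    match = re.match(r"^(\d+(?:\.\d+)?)([A-Za-z]+)$", token)
    if match and match.group(2).lower() in units:
        return match.group(1) + " " + match.group(2).lower()
    return token  # leave unknown patterns untouched


if __name__ == "__main__":
    print(to_ascii("doesn\u2019t"))
    print(correct_keep_possessive("gaurdian's", TOY_CORRECTIONS))
    print(split_number_unit("12MG"))
```

Under these assumptions the three calls reproduce the examples in the table (doesn't, guardian's, 12 mg). Tokens that match no pattern are returned unchanged, so each step can be applied without affecting other tokens.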