CSpell

Configuration Setup

CSpell Java lets users choose among different setup options through a configuration file. The default configuration file is ${CSPELL_DIR}/data/Config/cSpell.properties. The default values of the variables in the configuration file are the empirically best values and are listed in the following tables. "Relative path" refers to a path relative to the CSpell top directory, ${CSPELL_DIR}.

I. Configuration Variables

Directories and Files (13)
Variable Names	Descriptions	Variable Values (Default)
CS_DIR	the absolute path of the CSpell directory	CS_AUTO_MODE (use the current directory; CSpell must be invoked from ${CSPELL_DIR}), or an absolute path such as /Projects/cSpell2018 or d:/Projects/cSpell2018/
CS_INFORMAL_EXP_FILE	the relative path of the informal expression file	data/Misc/informalExpression.data
CS_CHECK_DIC_FILES	the relative path of the check dictionary file	data/Dictionary/check.dic
CS_SUGGEST_DIC_FILES	the relative path of the suggestion dictionary file	data/Dictionary/sugg.dic data/Dictionary/check.dic
CS_SPLIT_WORD_DIC_FILES	the relative path of the split word dictionary file	data/Dictionary/split.dic
CS_MW_DIC_FILE	the relative path of the multiword dictionary file	data/Dictionary/lexicon.mw.dic
CS_UNIT_DIC_FILE	the relative path of the units file	data/Dictionary/unit.data
CS_SV_DIC_FILE	the relative path of the spelling variants dictionary file	data/Dictionary/sv.dic
CS_AA_DIC_FILE	the relative path of the abbreviation/acronym dictionary file	data/Dictionary/lexicon.aa.dic
CS_PN_DIC_FILE	the relative path of the proper noun dictionary file	data/Dictionary/lexicon.pn.dic
CS_FREQUENCY_FILE	the relative path of the word frequency file	data/Frequency/wcWord.data
CS_W2V_IM_FILE	the relative path of the word2Vec CBOW input matrix file	data/Context/syn0.data
CS_W2V_OM_FILE	the relative path of the word2Vec CBOW output matrix file	data/Context/syn1n.data
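
The relative paths in this table are resolved against the CSpell top directory. The following Python sketch illustrates that resolution. It is not part of CSpell (which is a Java program): the helper name `resolve_path` is hypothetical, and treating CS_AUTO_MODE as "use the current working directory" follows the description in the CS_DIR row.

```python
import os
import posixpath


def resolve_path(cs_dir, relative_path):
    """Join a relative path from the configuration onto the CSpell top directory."""
    # Assumption: CS_AUTO_MODE means "use the directory CSpell was invoked from".
    if cs_dir == "CS_AUTO_MODE":
        cs_dir = os.getcwd()
    return posixpath.join(cs_dir, relative_path)


check_dic = resolve_path("/Projects/cSpell2018", "data/Dictionary/check.dic")
print(check_dic)  # /Projects/cSpell2018/data/Dictionary/check.dic
```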

Modes Setup (2)
Variable Names	Descriptions	Variable Values (Default)
CS_FUNC_MODE	Functional mode	0: non-dictionary-based correction (ND); 1: ND + non-word (NW), 1-to-1; 2: ND + non-word, split; 3: ND + non-word, merge; 4: ND + non-word, split and spelling (1-to-1); 5: ND + non-word, all; 6: ND + NW + real-word, 1-to-1; 7: ND + NW + real-word, split; 8: ND + NW + real-word, merge; 9: ND + NW + real-word, merge and split; 10: ND + NW + real-word, all (default)
CS_RANK_MODE	Ranking mode for non-word, 1-to-1 and split	0: Orthographic score; 1: Frequency score; 2: Context score; 3: Noisy Channel score; 4: Ensemble score; 5: CSpell score (default)
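
Mode values are stored in the configuration file as plain numbers, so it can help to validate them before use. The Python sketch below maps a CS_RANK_MODE value to its label from the table above and rejects values outside 0-5. The names `RANK_MODES` and `describe_rank_mode` are hypothetical and are not part of CSpell.

```python
# Labels copied from the CS_RANK_MODE row of the table above.
RANK_MODES = {
    0: "Orthographic score",
    1: "Frequency score",
    2: "Context score",
    3: "Noisy Channel score",
    4: "Ensemble score",
    5: "CSpell score",
}


def describe_rank_mode(value):
    """Return the ranking-mode label for a CS_RANK_MODE value given as text."""
    mode = int(value)
    if mode not in RANK_MODES:
        raise ValueError(f"CS_RANK_MODE must be 0-5, got {mode}")
    return RANK_MODES[mode]


print(describe_rank_mode("5"))  # CSpell score
```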

Detector Variables (5)
Variable Names	Descriptions	Variable Values (Default)
CS_MAX_LEGIT_TOKEN_LENGTH	The maximum length of a legit token for spelling detection and correction.	30
CS_DETECTOR_RW_SPLIT_WORD_MIN_LENGTH	The minimum length for real-word split detection.	4
CS_DETECTOR_RW_SPLIT_WORD_MIN_WC	The minimum word count (frequency) for real-word split detection.	200
CS_DETECTOR_RW_1TO1_WORD_MIN_LENGTH	The minimum length for real-word 1-to-1 detection.	2
CS_DETECTOR_RW_1TO1_WORD_MIN_WC	The minimum word count for real-word 1-to-1 detection.	65

Candidate Generator Variables (17)
Variable Names	Descriptions	Variable Values (Default)
CS_CAN_MAX_CANDIDATE_NO	The maximum number of candidates.	30
CS_CAN_ND_MAX_SPLIT_NO	The maximum number of non-dictionary splits.	5
CS_CAN_NW_1TO1_WORD_MAX_LENGTH	The maximum length of word for non-word 1-to-1 correction.	25
CS_CAN_NW_MAX_SPLIT_NO	The maximum number of splits for non-words.	5
CS_CAN_NW_MAX_MERGE_NO	The maximum number of words to merge for non-words.	2
CS_CAN_NW_MERGE_WITH_HYPHEN	Boolean flag for merging with hyphen for non-words.	true
CS_CAN_RW_1TO1_WORD_MAX_LENGTH	The maximum length of word for real-word 1-to-1 correction.	10
CS_CAN_RW_MAX_SPLIT_NO	The maximum number of splits for real-words.	2
CS_CAN_RW_MAX_MERGE_NO	The maximum number of words to merge for real-words.	2
CS_CAN_RW_MERGE_WITH_HYPHEN	Boolean flag for merging with hyphen for real-words.	false
CS_CAN_RW_SHORT_SPLIT_WORD_LENGTH	The length of short split word for real-word split.	3
CS_CAN_RW_MAX_SHORT_SPLIT_WORD_NO	The maximum number of short split words for real-word split.	2
CS_CAN_RW_MERGE_CAND_MIN_WC	The minimum word count for real-word merge candidates.	15
CS_CAN_RW_SPLIT_CAND_MIN_WC	The minimum word count for real-word split candidates.	200
CS_CAN_RW_1TO1_CAND_MIN_WC	The minimum word count for real-word 1-to-1 candidates.	1
CS_CAN_RW_1TO1_CAND_MIN_LENGTH	The minimum length of real-word 1-to-1 candidates.	2
CS_CAN_RW_1TO1_CAND_MAX_KEY_SIZE	The maximum size of keys in HashMap for real-word 1-to-1 candidates in memory.	1,000,000,000 (default); maximum theoretical value: 2**31-1 = 2,147,483,647; empirical value: < 1,500,000,000

Ranker Variables (12)
Variable Names	Descriptions	Variable Values (Default)
CS_RANKER_NW_S1_RANK_RANGE_FAC	The range factor of the top orthographic score for qualifying stage-2 ranking for non-word split/1-to-1.	0.08
CS_RANKER_NW_S1_MIN_OSCORE	The minimum orthographic score for 1 candidate in stage-2 ranking for non-word split/1-to-1.	2.70
CS_RANKER_RW_MERGE_C_FAC	The confidence factor of context score for real-word merge.	0.60
CS_RANKER_RW_SPLIT_C_FAC	The confidence factor of context score for real-word split.	0.01
CS_RANKER_RW_1TO1_C_FAC	The confidence factor of context score for real-word 1-to-1.	0.00
CS_RANKER_RW_1TO1_CAND_MIN_CS	The minimum context score of the top candidate for real-word 1-to-1.	0.00
CS_RANKER_RW_1TO1_CAND_CS_DIST	The minimum distance of context score between the top candidate and the original token for real-word 1-to-1.	0.085
CS_RANKER_RW_1TO1_CAND_CS_FAC	The factor of context score between the top candidate and the original token for real-word 1-to-1.	0.10
CS_RANKER_RW_1TO1_WORD_MIN_CS	The minimum context score of the original token for real-word 1-to-1.	-0.085
CS_RANKER_RW_1TO1_CAND_MIN_FS	The minimum frequency score of the original token for real-word 1-to-1.	0.0006
CS_RANKER_RW_1TO1_CAND_FS_DIST	The minimum distance of frequency score between the top candidate and the original token for real-word 1-to-1.	0.02
CS_RANKER_RW_1TO1_CAND_FS_FAC	The factor of frequency score between the top candidate and the original token for real-word 1-to-1.	0.035

Score Variables (3)
Variable Names	Descriptions	Variable Values (Default)
CS_ORTHO_SCORE_ED_DIST_FAC	Weighting factor of edit distance for orthographic score.	1.00
CS_ORTHO_SCORE_PHONETIC_FAC	Weighting factor of phonetic for orthographic score.	0.70
CS_ORTHO_SCORE_OVERLAP_FAC	Weighting factor of overlap for orthographic score.	0.80

Context Setup Variables (7)
Variable Names	Descriptions	Variable Values (Default)
CS_W2V_SKIP_WORD	Boolean flag for skipping context words that have no word2Vec score.	true
CS_NW_1TO1_CONTEXT_RADIUS	Context radius for non-word 1-to-1.	2
CS_NW_SPLIT_CONTEXT_RADIUS	Context radius for non-word split.	2 (not used: CSpell combines non-word split and 1-to-1 in one model)
CS_NW_MERGE_CONTEXT_RADIUS	Context radius for non-word merge.	2
CS_RW_1TO1_CONTEXT_RADIUS	Context radius for real-word 1-to-1.	2
CS_RW_SPLIT_CONTEXT_RADIUS	Context radius for real-word split.	2
CS_RW_MERGE_CONTEXT_RADIUS	Context radius for real-word merge.	2
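
To make the idea of a context radius concrete, the Python sketch below collects up to `radius` tokens on each side of a target token. How CSpell actually builds its context is not described in this table, so the function `context_window`, its `has_score` parameter, and the rule of skipping unscored words and continuing outward (one possible reading of CS_W2V_SKIP_WORD) are illustrative assumptions only.

```python
def context_window(tokens, index, radius=2, has_score=None):
    """Collect up to `radius` context tokens on each side of tokens[index].

    If `has_score` is given, tokens for which it returns False are skipped
    and the search continues outward (an assumed reading of CS_W2V_SKIP_WORD).
    """
    def collect(positions):
        picked = []
        for pos in positions:
            if len(picked) == radius:
                break
            if has_score is None or has_score(tokens[pos]):
                picked.append(tokens[pos])
        return picked

    # The left side is gathered nearest-first, then reversed into text order.
    left = collect(range(index - 1, -1, -1))[::-1]
    right = collect(range(index + 1, len(tokens)))
    return left + right


tokens = ["he", "was", "dianosed", "with", "diabetes", "yesterday"]
print(context_window(tokens, 2))  # ['he', 'was', 'with', 'diabetes']
```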

II. Syntax

# -- comment lines begin with "#".
variable=value: set variable to value
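
As a rough illustration of the two syntax rules above, the following Python sketch reads such a file into a dictionary. It is not CSpell's own loader (CSpell is a Java program). The function name `parse_properties` is hypothetical, the sample values are taken from the tables above, and the sketch handles only "#" comment lines and variable=value lines.

```python
def parse_properties(text):
    """Parse '#' comments and variable=value lines into a dict (simplified)."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines that begin with "#".
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        name, sep, value = line.partition("=")
        if not sep:
            continue  # not a variable=value line; ignored in this sketch
        settings[name.strip()] = value.strip()
    return settings


sample = """\
# -- functional and ranking modes
CS_FUNC_MODE=10
CS_RANK_MODE=5
CS_CHECK_DIC_FILES=data/Dictionary/check.dic
"""

config = parse_properties(sample)
print(config["CS_FUNC_MODE"])  # 10
```

All values are kept as strings here; converting them to numbers or Booleans is left to the caller.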

III. File Location

default: ${CSPELL_DIR}/data/Config/cSpell.properties
may be specified by option -x:config_file_absolute_path
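
The lookup order described above (an explicit -x: option first, otherwise the default location) can be sketched in Python as follows. This only illustrates the rule and is not how CSpell itself parses its command line: the function `config_file_path` is hypothetical, and the directory /Projects/cSpell2018 and the path /tmp/my.properties are example values.

```python
import posixpath

DEFAULT_RELATIVE_PATH = "data/Config/cSpell.properties"


def config_file_path(cspell_dir, args):
    """Return the -x: path if one is given, else the default configuration file."""
    for arg in args:
        if arg.startswith("-x:"):
            return arg[len("-x:"):]
    return posixpath.join(cspell_dir, DEFAULT_RELATIVE_PATH)


print(config_file_path("/Projects/cSpell2018", []))
# /Projects/cSpell2018/data/Config/cSpell.properties
print(config_file_path("/Projects/cSpell2018", ["-x:/tmp/my.properties"]))
# /tmp/my.properties
```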

Note: The CSpell installation program generates ${CSPELL_DIR}/data/Config/cSpell.properties automatically (from ${CSPELL_DIR}/data/Config/cSpell.properties.TEMPLATE) according to the options the user chose during installation.