CSpell

Frequently Asked Questions

(Please read before asking a question)

  • How can I ask a question?
    See Contact Us

  • What is the license agreement for CSpell?
    CSpell is distributed to public users by NLM under an open source agreement. It allows use and redistribution without warranty. Please refer to the NLM copyright information and terms and conditions for details. CSpell uses third-party software. The license information is shown below:

    Third-party software      License
    Apache Commons Codec      The Apache Software Foundation license
    GNU Cygwin package        GNU General Public License (GPL)

  • How do I use the CSpell Java APIs?
    The simplest process is summarized as follows:
    • Download and install CSpell
      • The installation configures CSpell automatically.
    • Instantiate a CSpellApi object and run it.
        ...
        import gov.nih.nlm.nls.cSpell.Api.CSpellApi;
        ...
        // Instantiate a CSpellApi with the CSpell config file
        String configFile = "/CSpell/cSpell2018/data/Config/cSpell.properties";
        CSpellApi cSpellApi = new CSpellApi(configFile);
        ...
        // Process inText and save the result to outText
        String outText = cSpellApi.ProcessToStr(inText);
        ...
      	
      Please refer to:
    • Compile and run the Java code
      • Make sure the CSpell configuration file (cSpell.properties) is set up correctly
      • Include cSpell2018api.jar to compile
      • Include cSpell2018dist.jar to run

  • What are the limitations of CSpell?
    The limitations are summarized below:

    Component: Detector
    Limitation: Words that are not in the dictionary are detected as non-words, with some exceptions
    Notes:
    • The dictionary can be easily customized
    • Telephone numbers, digits, e-mail addresses, URLs, etc. are exceptions and are considered valid tokens

    Component: Candidates
    Limitation: The maximum edit distance of a correction (from the error) is 2
    Notes:
    • Uses the reverse minimum edit distance technique with an edit distance of 2, which covers over 91.24% of errors while keeping performance fast
    • Errors with an edit distance >= 3 are hard to correct

    Component: Context score
    Limitation: The corrected word and its context information must be in the training corpus for context-dependent correction to be used
    Notes:
    • Both word vectors [IM] and [OM] can be trained and configured in CSpell (see the next question for details)
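
    The edit-distance limit on candidates can be illustrated with a small sketch. This is not CSpell's implementation: it uses plain Levenshtein distance rather than the reverse minimum edit distance technique, and the class name, method names, and sample dictionary are hypothetical. It only shows why an error at distance 1 or 2 from a dictionary word yields a candidate while an error at distance 3 does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration; not part of the CSpell API.
class EditDistanceSketch {
    // Assumed cap, matching the maximum edit distance of 2 stated in the table.
    static final int MAX_EDIT_DISTANCE = 2;

    // Classic Levenshtein distance (insert, delete, substitute) using two rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Keep only dictionary words within MAX_EDIT_DISTANCE of the misspelling.
    static List<String> candidates(String error, List<String> dictionary) {
        List<String> result = new ArrayList<>();
        for (String word : dictionary) {
            if (distance(error, word) <= MAX_EDIT_DISTANCE) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy dictionary, made up for this example.
        List<String> dictionary = Arrays.asList("diabetes", "diagnosis", "dialysis");
        // "diabetis" is 1 edit away from "diabetes", so a candidate is found.
        System.out.println(candidates("diabetis", dictionary));
        // "dibts" is 3 edits away from "diabetes", so no candidate is found.
        System.out.println(candidates("dibts", dictionary));
    }
}
```

    A brute-force scan over the dictionary like this is far slower than what a production speller would do; it is shown only to make the distance cutoff concrete.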

  • How do I generate IM and OM from word2vec for dual embedding?
    The process is:
    • Download the word2vec source code
    • Compile the C code (make) and run it to make sure it works on your computer. The output is the IM (syn0).
    • Modify word2vec.c:
      • Save the word vectors syn1-neg (OM)
      • Print out the word vectors syn1-neg (OM)
    • Compile and run the modified C code (word2vec.c)
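
    Once both matrices have been written out, they can be read back and combined. The sketch below assumes both files use word2vec's text output format (a header line with vocabulary size and dimension, then one word and its values per line) and that the modified code prints the OM in that same format; neither assumption is confirmed here. The class and method names are hypothetical, the vectors are toy values, and the dot product of one word's IM vector with another word's OM vector is shown as one common way to use dual embeddings, not necessarily how CSpell computes its context score.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical reader for illustration; not part of CSpell or word2vec.
class DualEmbeddingSketch {
    // Parse text-format vectors: "<vocabSize> <dim>" header, then "<word> v1 ... vDim".
    static Map<String, double[]> load(Reader in) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        String header = reader.readLine();
        String[] head = (header == null) ? new String[0] : header.trim().split("\\s+");
        if (head.length < 2) {
            throw new IOException("expected header line: <vocabSize> <dimension>");
        }
        int dim = Integer.parseInt(head[1]);
        Map<String, double[]> vectors = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != dim + 1) {
                continue; // skip blank or malformed lines
            }
            double[] vector = new double[dim];
            for (int i = 0; i < dim; i++) {
                vector[i] = Double.parseDouble(parts[i + 1]);
            }
            vectors.put(parts[0], vector);
        }
        return vectors;
    }

    // Convenience wrapper for in-memory text; real use would pass a FileReader to load().
    static Map<String, double[]> parse(String text) {
        try {
            return load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Dot product of two vectors of equal length.
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // Toy 2-dimensional vectors, made up for this example.
        Map<String, double[]> im = parse("2 2\ncat 1.0 0.0\ndog 0.5 0.5\n");
        Map<String, double[]> om = parse("2 2\ncat 0.0 2.0\ndog 4.0 2.0\n");
        // Score "cat" (IM) against "dog" (OM) as a context word.
        System.out.println(dot(im.get("cat"), om.get("dog")));
    }
}
```

    Keeping IM and OM in separate maps makes it explicit which matrix each word is looked up in, which is the point of a dual embedding.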