CSpell

Corpora in CSpell

I. Introduction

The corpus is used to:

  • Calculate word frequency scores (word counts) and noisy channel scores.
  • Generate word vectors and use them to train word2vec to generate the IM and OM.
  • Generate the n-gram set (not used directly in CSpell).
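As a rough illustration of the first use, unigram word counts over a corpus can be sketched as below. This is not CSpell's implementation: the regular-expression tokenizer and the names `word_counts` and `relative_frequency` are assumptions made for the example, and the two sentences are toy data.

```python
import re
from collections import Counter

def word_counts(texts):
    """Count lowercased word tokens across an iterable of text strings."""
    counts = Counter()
    for text in texts:
        # Hypothetical tokenizer: runs of letters, digits, and apostrophes.
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    return counts

def relative_frequency(counts, word):
    """Share of all tokens accounted for by `word` (0.0 for an empty corpus)."""
    total = sum(counts.values())
    return counts[word.lower()] / total if total else 0.0

# Toy corpus, for illustration only.
corpus = ["Diabetes is a chronic disease.", "Managing diabetes takes care."]
counts = word_counts(corpus)
print(counts["diabetes"])                      # 2
print(relative_frequency(counts, "diabetes"))  # 2 of 9 tokens
```

A larger corpus gives counts for more words, which is why corpus size matters for the frequency-based ranking compared later in this page.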

II. Corpora Tested in CSpell

Three corpora were tested for comparison:

  • Baseline
    • Resources: 7 web sites
    • Statistics:
      • Articles: 8,590
      • Tokens: 5,771,363
      • Unique Word: 51,190
      • Dic Words in Corpus: 35,781|6.3115%
      • Dic Words WC: 5,637,436|98.4708%
  • Consumer Health Corpus
    • Resources: 20 (16) web sites
    • Statistics:
      • Articles: 17,139
      • Sentences: 550,193
      • Tokens: 10,228,699
      • Unique Word: 192,818
      • Unique CoreTerm.Lc: 109,175
      • Dic Words in Corpus: 48,690|8.5886%
      • Dic Words WC: 9,979,195|97.6123%
  • Medline N-gram Set
    • Resources: The Medline N-gram Set (2017 release)
    • Statistics:
      • Articles: 26,759,399
      • Sentences: 163,021,640
      • Tokens: 3,386,661,350
      • Unique Word: 976,872 (WC > 30)
      • Unique CoreTerm.Lc: 496,388
      • Dic Words in Corpus: 214,581|37.8507%
      • Dic Words WC: 3,224,585,163|97.1439%

PS.

  • Total words in CSpell Suggesting Dictionary: 566,914
    • lexicon.enEwLc.dic.addRm
    • customerDic.data
    • NRVAR.1.uSort.data
  • shell> ${PRE_PROCESS}/bin/RunCorpus
  • shell> ${PRE_PROCESS}/bin/RunPreProc
    Setup the dictionary and corpus in CSpell for test
    4
    71
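The "Dic Words in Corpus" percentages in the statistics above are consistent with dividing the number of dictionary words found in each corpus by the total size of the suggesting dictionary (566,914). The short check below reproduces them; the interpretation is inferred from the arithmetic rather than stated in the source, and the names `DICTIONARY_SIZE` and `coverage_percent` are made up for the example.

```python
# Total words in the CSpell suggesting dictionary (figure from the text).
DICTIONARY_SIZE = 566_914

# Dictionary words found in each corpus (figures from the statistics above).
dic_words_in_corpus = {
    "Baseline": 35_781,
    "Consumer Health Corpus": 48_690,
    "Medline N-gram Set": 214_581,
}

def coverage_percent(found, total=DICTIONARY_SIZE):
    """Percentage of dictionary words that occur in a corpus."""
    return 100.0 * found / total

for name, found in dic_words_in_corpus.items():
    print(f"{name}: {coverage_percent(found):.4f}%")
# Baseline: 6.3115%
# Consumer Health Corpus: 8.5886%
# Medline N-gram Set: 37.8507%
```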

III. Corpora

IV. Development Tests

  • Compare the three corpora above on the word frequency test in CSpell:
    • Setup: Non-word, Revised GoldStd, unigram model for word frequency, new rank sorting algorithm

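      Results by model and corpus (Ensemble Corpus, Consumer Health Corpus, Medline.2017):

      • Frequency Only
        • Frequency - Halil
          • Ensemble Corpus: 438|770|774; 0.5688|0.5659|0.5674
          • Consumer Health Corpus: 438|770|774; 0.5688|0.5659|0.5674
          • Medline.2017: 404|770|774; 0.5247|0.5220|0.5233
        • Frequency - cSpell-Dev-1
          • Ensemble Corpus: 536|769|774; 0.6970|0.6925|0.6948
          • Consumer Health Corpus: 534|770|774; 0.6935|0.6899|0.6917
          • Medline.2017: 521|770|774; 0.6766|0.6731|0.6749
        • Frequency - cSpell-Dev-2
          • Ensemble Corpus: 536|769|774; 0.6970|0.6925|0.6948
          • Consumer Health Corpus: 534|770|774; 0.6935|0.6899|0.6917
          • Medline.2017: 522|770|774; 0.6779|0.6744|0.6762
      • Combined method
        • Noisy Channel
          • Ensemble Corpus: 552|769|774; 0.7178|0.7132|0.7155
          • Consumer Health Corpus: 551|770|774; 0.7156|0.7119|0.7137
          • Medline.2017: 523|770|774; 0.6792|0.6757|0.6775
        • CSpell Combined (Orthographic and Frequency)
          • Ensemble Corpus: 598|769|774; 0.7776|0.7726|0.7751
          • Consumer Health Corpus: 598|769|774; 0.7776|0.7726|0.7751
          • Medline.2017: 597|769|774; 0.7763|0.7713|0.7738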
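The page does not label the two triples reported for each result. One reading that matches the numbers is that the first triple is true positives | retrieved | relevant, and the second is precision | recall | F1. The sketch below checks that reading against two of the reported results; the function name `precision_recall_f1` and its parameter names are assumptions for the example.

```python
def precision_recall_f1(true_pos, retrieved, relevant):
    """Return (precision, recall, F1) from raw counts."""
    precision = true_pos / retrieved if retrieved else 0.0
    recall = true_pos / relevant if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def fmt(scores):
    """Format scores in the same a|b|c style used in the results."""
    return "|".join(f"{s:.4f}" for s in scores)

# Frequency - Halil, first corpus column: 438|770|774
print(fmt(precision_recall_f1(438, 770, 774)))  # 0.5688|0.5659|0.5674

# CSpell Combined, first corpus column: 598|769|774
print(fmt(precision_recall_f1(598, 769, 774)))  # 0.7776|0.7726|0.7751
```

Under this reading, the third number in each count triple (774) is the same across all models and corpora, as would be expected for the number of relevant items in a fixed gold standard.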

V. Notes

  • The corpus is used for word2vec. It needs to be large enough (for recall) to cover the words and their frequencies. The current corpus needs to be enhanced for better context ranking.