Text Categorization

Legal Words

Legal words are the words in the domain of interest that are used to calculate the JD or ST scores. Different applications, training sets, or control words should have different sets of legal words. Our tools provide two default settings of legal words: one for JDI (and STRI) and one for STI. The default settings of STRI are the same as those of JDI because STRI first uses JDI to get the JD scores and then computes the similarity. The default settings of legal words are:

  • Word with length greater than 2 (min. word length is 3)
  • Not a stopword
  • Must be a restrictword (from UMLS)
  • Within the range of normalized signal (only for JDI)
  • Meets the word count criteria
  • Meets the document count criteria

  • Description:

    This Java method detects whether a word is a legal word for JDI, STI, or STRI. The class provides the following options:

    • min. word length:
      • use min. word length criteria
      • set value of min. word length
    • remove stopwords: remove or keep
    • use restrictwords: use or not use
    • min. normalized signal
      • use min. normalized signal criteria
      • set value of min. normalized signal
    • max. normalized signal
      • use max. normalized signal criteria
      • set value of max. normalized signal
    • min. total word count
      • use min. total word count criteria
      • set value of min. word count
    • min. total document count
      • use min. total document count criteria
      • set value of min. document count
    Please refer to LegalWordsOption for details.

  • Usage:
    • IsLegalWord(String word, LegalWordsOption option)

  • Inputs:
    This Java class reads in three files:
    • StopWords
    • RestrictWords
    • Word-Signal-Wc-Dc

  • Algorithm:
      • Check minimum length
      • Check stopword: should not be a stopword
      • Check restrictword: should be a restrictword
      • Check normalized signal range: within lower and upper limit
      • Check word count: should be greater than 1 (lower limit)
      • Check document count: should be greater than 1 (lower limit)
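
The checks listed above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the tool's actual source: the class LegalWordSketch, the Options and WordStats types, and all field names are hypothetical stand-ins (Options loosely mirrors the switches described for LegalWordsOption). The sketch also assumes that the three input files have already been loaded into in-memory collections, that the signal limits are inclusive, that the count limits are exclusive ("greater than 1"), and that a word missing from the Word-Signal-Wc-Dc data fails any enabled statistics check.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the legal-word check; names and defaults are assumptions. */
class LegalWordSketch {

    /** Statistics for one word, as a row of Word-Signal-Wc-Dc might provide (assumed layout). */
    static final class WordStats {
        final double normalizedSignal;
        final int wordCount;
        final int docCount;

        WordStats(double normalizedSignal, int wordCount, int docCount) {
            this.normalizedSignal = normalizedSignal;
            this.wordCount = wordCount;
            this.docCount = docCount;
        }
    }

    /** Assumed stand-in for LegalWordsOption: one on/off switch plus one value per criterion. */
    static final class Options {
        boolean useMinWordLength = true;
        int minWordLength = 3;            // words of length 1 or 2 are rejected
        boolean removeStopWords = true;
        boolean useRestrictWords = true;
        boolean useMinSignal = false;     // signal range applies to JDI only
        double minSignal = 0.0;           // placeholder value, not a documented default
        boolean useMaxSignal = false;
        double maxSignal = 1.0;           // placeholder value, not a documented default
        boolean useMinWordCount = true;
        int minWordCount = 1;             // count must be greater than this limit
        boolean useMinDocCount = true;
        int minDocCount = 1;              // count must be greater than this limit
    }

    static boolean isLegalWord(String word, Options opt, Set<String> stopWords,
                               Set<String> restrictWords, Map<String, WordStats> stats) {
        // 1. Minimum length
        if (opt.useMinWordLength && word.length() < opt.minWordLength) return false;
        // 2. Must not be a stopword
        if (opt.removeStopWords && stopWords.contains(word)) return false;
        // 3. Must be a restrictword
        if (opt.useRestrictWords && !restrictWords.contains(word)) return false;

        boolean needsStats = opt.useMinSignal || opt.useMaxSignal
                || opt.useMinWordCount || opt.useMinDocCount;
        if (!needsStats) return true;

        WordStats s = stats.get(word);
        if (s == null) return false;      // assumption: unknown words fail statistics checks

        // 4. Normalized signal within [minSignal, maxSignal]
        if (opt.useMinSignal && s.normalizedSignal < opt.minSignal) return false;
        if (opt.useMaxSignal && s.normalizedSignal > opt.maxSignal) return false;
        // 5. Word count and 6. document count above their lower limits
        if (opt.useMinWordCount && s.wordCount <= opt.minWordCount) return false;
        if (opt.useMinDocCount && s.docCount <= opt.minDocCount) return false;
        return true;
    }

    public static void main(String[] args) {
        Set<String> stop = Set.of("the", "and");
        Set<String> restrict = Set.of("heart", "rare");
        Map<String, WordStats> stats = Map.of(
                "heart", new WordStats(0.5, 10, 4),
                "rare", new WordStats(0.5, 1, 1));
        Options opt = new Options();
        System.out.println(isLegalWord("heart", opt, stop, restrict, stats)); // true
        System.out.println(isLegalWord("the", opt, stop, restrict, stats));   // false: stopword
        System.out.println(isLegalWord("rare", opt, stop, restrict, stats));  // false: counts not above 1
    }
}
```

The cheap checks (length, stopword, restrictword) run first so that the statistics lookup is skipped for words that are already rejected. The sample words and statistics in main are made up for illustration only.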