
Text Categorization

PreProcess: ST-JDs Table


  • Description:

    JDI is applied to the St-Documents to produce the St-JD Scores table. The JD score vector includes:

    • word count score
    • document count score

  • Input:

  • Java File & Algorithm:

    Run JDI (using the latest word-JD table) on each ST through its St-Documents to get:

    • Word count score
    • Document count score

    The default input filter option of JDI should be used. The settings are as follows:

    • Remove stopwords
    • Use restrictwords
    • Use normalized signal filter between 2 and 510754
      => Please note that the default max. signal in JDI.2008 is 645881 (not 510754). This is because there was an SCR (SCR-44) for this change after the STI table was generated. Together with the 5 stop word changes (SCR-43), there are minor differences in the stJdsTables for ftcn, neop, and orgf.
      => The max. signal must include cancer, blood, and risk, and exclude function and therapy. Susanne suggests using "cancer" as the upper limit since it is not a stop word.
    • Use min. word count of 2
    • Use min. document count of 2
    • Use min. length of 3
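
The filter settings above can be sketched as a single acceptance test per word. This is a minimal illustration, not the JDI implementation: the class name `JdiInputFilterSketch`, the method `accept`, and the sample counts in `main` are hypothetical. It assumes the signal range and the minimum counts are inclusive bounds, and that "length" means the number of characters in the word. The restrictwords option is left out because its semantics are not described here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the input filter settings; not part of the JDI API.
class JdiInputFilterSketch {
    // Thresholds copied from the settings listed above.
    static final long MIN_SIGNAL = 2L;
    static final long MAX_SIGNAL = 510754L;
    static final int MIN_WORD_COUNT = 2;
    static final int MIN_DOC_COUNT = 2;
    static final int MIN_LENGTH = 3;

    // Returns true if the word survives every filter (bounds assumed inclusive).
    static boolean accept(String word, long signal, int wordCount,
                          int docCount, Set<String> stopWords) {
        if (word.length() < MIN_LENGTH) return false;          // min. length
        if (stopWords.contains(word)) return false;            // remove stopwords
        if (signal < MIN_SIGNAL || signal > MAX_SIGNAL) return false; // signal range
        return wordCount >= MIN_WORD_COUNT && docCount >= MIN_DOC_COUNT;
    }

    public static void main(String[] args) {
        // Made-up stop word list and counts, for illustration only.
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "and"));
        System.out.println(accept("cancer", 510754L, 12, 4, stopWords)); // true
        System.out.println(accept("the", 100L, 50, 20, stopWords));      // false
    }
}
```

Making the upper bound inclusive matches the note that the max. signal must still include "cancer" when it is used as the upper limit.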

  • Output Files:
    • stJdsTable.txt

      ST | TUI | Wc Score | Dc Score | Jdid | Jd Name
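
A reader consuming stJdsTable.txt could map each row onto the six fields named in the header (ST, TUI, Wc score, Dc score, Jdid, Jd name). The sketch below is only an illustration under stated assumptions: the class `StJdsRecord` is hypothetical, the fields are assumed to be separated by `|`, and both scores are assumed to be numeric. The sample row in `main` uses made-up score, Jdid, and Jd name values.

```java
// Hypothetical holder for one stJdsTable.txt row; not part of JDI.
class StJdsRecord {
    final String st;
    final String tui;
    final double wcScore;
    final double dcScore;
    final String jdid;
    final String jdName;

    StJdsRecord(String st, String tui, double wcScore, double dcScore,
                String jdid, String jdName) {
        this.st = st;
        this.tui = tui;
        this.wcScore = wcScore;
        this.dcScore = dcScore;
        this.jdid = jdid;
        this.jdName = jdName;
    }

    // Parses one row, assuming six '|'-separated fields in header order.
    static StJdsRecord parse(String line) {
        String[] f = line.split("\\|", -1); // -1 keeps trailing empty fields
        if (f.length != 6) {
            throw new IllegalArgumentException(
                "Expected 6 fields, got " + f.length);
        }
        return new StJdsRecord(f[0].trim(), f[1].trim(),
            Double.parseDouble(f[2].trim()), Double.parseDouble(f[3].trim()),
            f[4].trim(), f[5].trim());
    }

    public static void main(String[] args) {
        // Made-up sample row, for illustration only.
        StJdsRecord r = parse("neop|T191|0.5|0.25|JD0000|Example name");
        System.out.println(r.st + " " + r.wcScore); // neop 0.5
    }
}
```

If the real file uses a different delimiter or score format, only `parse` would need to change.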

  • Notes: