
Text Categorization

STRI: Text


  • Description:

    Read in the input text and perform ST real-time indexing based on:

    • word frequency count
    • document count for word
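
The two statistics listed above can be illustrated with a small sketch. This is not the STRI implementation; the function name `index_counts` and the sample documents are hypothetical, and the page does not say how the counts are combined into an ST score, so only the counting itself is shown.

```python
from collections import Counter


def index_counts(documents):
    """Return (word_freq, doc_count) for an iterable of token lists.

    word_freq: total number of occurrences of each word.
    doc_count: number of documents in which each word appears at least once.
    """
    word_freq = Counter()
    doc_count = Counter()
    for tokens in documents:
        word_freq.update(tokens)        # every occurrence counts
        doc_count.update(set(tokens))   # each word counts once per document
    return word_freq, doc_count


# Hypothetical, already-tokenized documents (e.g. title plus abstract).
docs = [["heart", "attack", "heart"], ["heart", "surgery"]]
word_freq, doc_count = index_counts(docs)
```

Here `word_freq["heart"]` is 3 because the word occurs three times overall, while `doc_count["heart"]` is 2 because it appears in two documents.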

  • Inputs:
    • a phrase, such as the combination of title and abstract
    • a file, such as 9801.2004.TIAB.in

  • Algorithm:
    • Pre-Process (Input Filter):
      • Tokenize all words of the input term
      • Apply Word Extraction Filter
      • Apply acronym filter (TBD)
      • Filter out words that are not legal
      • Filter out duplicate words if the unique flag is true
      • Assign the final words for processing
    • Process:
    • Post-process (Output Filter):
      • Print out input text (term)
      • Detail output filter
      • Number of score entries to display
      • No output message
      • Cluster option
      • ST candidates
      • Use alphabetical order for STs that have the same score
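
As a rough sketch of the pre-process and post-process steps listed above, the following Python shows tokenizing, filtering, optional de-duplication, and score ordering with an alphabetical tie-break. Everything named here is an assumption made for illustration: the tokenization rule, the `ILLEGAL_WORDS` set, the `unique` and `limit` parameters, and the function names do not come from STRI. The Word Extraction Filter and the acronym filter are left out because this page does not define them.

```python
import re

# Hypothetical stand-in for the list of words that are not legal.
ILLEGAL_WORDS = {"the", "of", "and"}


def preprocess(text, unique=False, illegal_words=ILLEGAL_WORDS):
    """Tokenize the input term and apply the input filters."""
    # Assumed tokenization: lowercase, split on runs of non-alphanumerics.
    tokens = [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]
    # Filter out words that are not legal.
    tokens = [t for t in tokens if t not in illegal_words]
    if unique:
        # Drop duplicates while keeping first-occurrence order.
        tokens = list(dict.fromkeys(tokens))
    return tokens


def rank(scores, limit=None):
    """Order (name, score) pairs by descending score, then alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    # Optionally keep only the first `limit` entries for display.
    return ranked if limit is None else ranked[:limit]


words = preprocess("The heart and the heart valve", unique=True)
top = rank({"B": 1.0, "A": 1.0, "C": 2.0})
```

With the sample input, `words` is `["heart", "valve"]`, and `top` lists `C` first, followed by `A` and `B` in alphabetical order because their scores are equal.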

  • Sample commands:
    > stri -p
    => index text from standard input, with a prompt
    
    > stri -i:9801.2004.TIAB.in -o:9801.2004.TIAB.out
    => index text from the file 9801.2004.TIAB.in and send the results to the file 9801.2004.TIAB.out
    

  • Sample Outputs:
    • a file, such as 9801.2004.TIAB.out