Text Categorization

JDI: Text and MeSH


  • Description:

    Read in text and MeSH terms, then perform JD (Journal Descriptor) indexing based on:

    • word frequency count
    • document count for word

    In the input, text, MH (MeSH heading), and SH (subheading) are separated by '|'.

  • Inputs:
    • text and MeSH terms, with text, MH, and SH separated by '|'
    • a file, such as 9801.2004.TIABMH.in
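
As a rough illustration of the input format, the sketch below splits one input line on '|'. It assumes exactly three fields in the order text, MH, SH, which this page does not spell out. The names `JdiInput` and `parse_input_line` are hypothetical and are not part of jdi.

```python
from typing import NamedTuple


class JdiInput(NamedTuple):
    """Hypothetical container for one input line (not part of jdi)."""
    text: str
    mh: str
    sh: str


def parse_input_line(line: str) -> JdiInput:
    """Split one 'text|MH|SH' line into its three fields.

    Assumption: at most three '|'-separated fields, in that order.
    Missing trailing fields are returned as empty strings.
    """
    parts = [p.strip() for p in line.rstrip("\n").split("|", 2)]
    parts += [""] * (3 - len(parts))  # pad when MH or SH is absent
    return JdiInput(*parts)
```

A line with only text still parses; its MH and SH fields come back empty.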

  • Algorithm:
    • Pre-Process (Input Filter):
      • Separate phrase and MeSH

      • Tokenize all words of the input phrase
      • Apply Word Extraction Filter (if it is MEDLINE TI or AB)
      • Apply acronym filter (TBD)
      • Filter out illegal words
      • Filter out duplicated words if unique flag is true
      • Assign the final words for processing

      • Tokenize SH and MH from the input MeSH terms
      • Filter out illegal MeSH terms (those not in the Mh-Jd Table or Sh-Jd Table)
      • Assign the legal MeSH terms
    • Process:
    • Post-process (Output Filter):
      • Print out Input term (text and MeSH)
      • Output filter details
      • Score entries display number
      • No output message
      • Cluster option
      • JD candidates
      • Use alphabetical order for JDs that have the same score (Ex: "taylor", "assault")
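
The pre-processing steps and the tie-breaking rule above can be sketched as follows. This is a minimal sketch under stated assumptions, not the jdi implementation: the tokenizer (lowercased alphanumeric runs), the set-based lookup tables, and all function names are hypothetical, and the Word Extraction and acronym filters are omitted.

```python
import re
from typing import Iterable


def preprocess_words(text: str, legal_words: set[str], unique: bool = True) -> list[str]:
    """Tokenize, drop words not in legal_words, optionally de-duplicate."""
    # Assumed tokenizer: lowercase, then take runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    words = [t for t in tokens if t in legal_words]
    if unique:
        words = list(dict.fromkeys(words))  # keeps first-seen order
    return words


def filter_mesh(terms: Iterable[str], table: set[str]) -> list[str]:
    """Keep only MeSH terms found in the given (Mh-Jd or Sh-Jd) table."""
    return [t for t in terms if t in table]


def rank_jds(scores: dict[str, float]) -> list[tuple[str, float]]:
    """Sort by descending score; equal scores fall back to alphabetical order."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Sorting on the pair (negated score, name) keeps the output order deterministic when several JDs share a score.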

  • Sample commands:
    > jdi -itmh -d -p
    => index text and MeSH terms read from standard input, with a prompt and detailed scores
    
    > jdi -itmh -d -f:ml -i:9801.2004.TIABMH.in -o:9801.2004.TIABMH.out
    => index text and MeSH terms from the file 9801.2004.TIABMH.in, using the MEDLINE filter options with detailed scores, and send the results to the file 9801.2004.TIABMH.out
    

  • Sample Outputs:
    • a file, such as 9801.2004.TIABMH.out