
Text Categorization

PreProcess: St-Documents & Data Set

There are 6 steps to generate the complete data set of STI and STRI. They are detailed as follows:

  • Setup Input data:

    shell> cd ${TC_PRE_PROCESS}/tcPre2008/data/${YEAR}/Sti/Input

    Update the following files:

    • MRSTY: latest version of MRSTY.${YEAR-1}AC
    • MRCONSO.RRF: latest version of MRCONSO.RRF.${YEAR-1}AC
    • SRDEF.txt: copy from previous version
    • stGroups.txt: generated in step 2
    • jds.txt: from results of JDI
    • wordJdidWcDcTable.txt: from results of JDI
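
Before running the steps, it can help to confirm that every input listed above is present. The following is a minimal sketch in Python, assuming the six file names from the list sit directly in the Input directory. The helper name `missing_inputs` is hypothetical and is not part of the TC tools.

```python
from pathlib import Path

# File names taken from the input list above; the helper itself is hypothetical.
REQUIRED_INPUTS = [
    "MRSTY",
    "MRCONSO.RRF",
    "SRDEF.txt",
    "stGroups.txt",
    "jds.txt",
    "wordJdidWcDcTable.txt",
]


def missing_inputs(input_dir):
    """Return the required file names that are not present in input_dir."""
    base = Path(input_dir)
    return [name for name in REQUIRED_INPUTS if not (base / name).is_file()]
```

An empty list means all six files are in place.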

  • Step 1: Generate Semantic Types file
    • Program:
      • GenSti
    • Inputs:
      • Input/MRSTY
      • Input/SRDEF.txt
    • Outputs:
      • Output/sts.txt
    • Duration: 5 Sec.

  • Step 2: Modify Semantic Type Groups file
    • Program:
      • GenSti
    • Inputs:
      • Input/stGroups.txt
    • Outputs:
      • Output/stGroups.txt
    • Duration: 1 Sec.

  • Step 3: Generate stDocuments
    • Program:
      • GenSti
    • Inputs:
      • Input/MRCONSO.RRF
      • Input/MRSTY

      • Output/sts.txt (from step 1)
      • Output/stGroups.txt (from step 2)
    • Outputs:
      • Output/stDocument.txt (combination of stDocument1.txt and stDocument2.txt)
      • Output/stDocument1.txt (words associated with only 1 St-Group)
      • Output/stDocument2.txt (words associated with multiple St-Groups)
    • Duration: 5 Min.
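
The split between stDocument1.txt and stDocument2.txt can be pictured as a partition on the number of St-Groups per word. The following is a minimal sketch, assuming an in-memory mapping from each word to its St-Group names. The function name and data layout are hypothetical, and this is not known to be how GenSti implements the step.

```python
def split_by_group_count(word_groups):
    """Partition words by how many distinct St-Groups they are associated with.

    word_groups maps each word to a collection of St-Group names.
    Returns (single, multiple): two dicts mapping word -> set of groups.
    Words with no group at all are dropped.
    """
    single, multiple = {}, {}
    for word, groups in word_groups.items():
        distinct = set(groups)
        if len(distinct) == 1:
            single[word] = distinct
        elif len(distinct) > 1:
            multiple[word] = distinct
    return single, multiple
```

Here `single` corresponds to the words that would go to stDocument1.txt and `multiple` to those for stDocument2.txt.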

  • Step 4: Generate St-Jd Table
    • Program:
      • GenSti
    • Inputs:
      • Output/stDocument.txt.in (link to the results from step 3 or 3-1)
      • Output/stDocument.txt.sort (sorted from stDocument.txt.in)
      • tcConfigFile:
        /export/home/lu/Projects/TC/tc2009/data/Config/tc.properties.preProc
        • Use DB_NAME in tc.properties.preProc to run JDI on St-Documents

      • SetMaxSignal:
        • 2007: 510754
        • 2008: 645881
        • 2009: 705815
        • 2010: 754648
        • 2011: 792054
        • 201X: ...

      • Output/sts.txt (from step 1)
    • Outputs:
      • Output/stJdsTable.txt
    • Duration: 3 Min.
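
The per-year SetMaxSignal values listed above can be kept in a small lookup so that a run for an undocumented year fails loudly instead of reusing an old value. This is a sketch only. The names `MAX_SIGNAL_BY_YEAR` and `max_signal_for` are hypothetical, and years after 2011 are left out because this document does not give them.

```python
# Values copied from the SetMaxSignal list above; later years are not
# given in this document and are deliberately left out.
MAX_SIGNAL_BY_YEAR = {
    2007: 510754,
    2008: 645881,
    2009: 705815,
    2010: 754648,
    2011: 792054,
}


def max_signal_for(year):
    """Return the documented SetMaxSignal value, or raise if none is listed."""
    try:
        return MAX_SIGNAL_BY_YEAR[year]
    except KeyError:
        raise ValueError(f"no SetMaxSignal value documented for {year}") from None
```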

  • Step 5: Generate Word-St Table
    • Program:
      • GenSti
    • Inputs:
      • Jdi/Output/jds.txt (from JDI)
      • Jdi/Output/wordJdidWcDc/wordJdidWcDcTable.txt (from JDI)
      • Sti/Output/stJdsTable.txt (from step 4)
    • Outputs:
      • Sti/Output/wordStsTable.txt.sort
    • Duration: 70 Min.

  • Refine stDocument.txt
    • Program: Use STRI to refine words in stDocuments.
      Note that this program must be run after step 4 is done, since step 4 produces the stJdTable.txt it needs.
      • RefineStDoc
    • Inputs:
      • Output/stDocumentRefine.txt.in (from step 3)
      • tcConfigFile
        /export/home/lu/Projects/TC/tc2009/data/Config/tc.properties.refine
        • Use DB_NAME in tc.properties.refine for JDI
        • Use ST_JD_FILE in tc.properties.refine for STRI (stJdTable.txt.2008x)
      • Refine Type
        • optimum:
        • stdDev: within 1 standard deviation of the top DC score
        • a number: for example, 5 means top 5 DC rank

      • Output/sts.txt (from step 1)
    • Outputs:
      • Output/stDocumentRefine.txt
    • Duration: 10 Min.
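
The stdDev and numeric refine types can be illustrated as two ways of cutting a list ranked by DC score. The following is a minimal sketch and not the RefineStDoc implementation. It assumes that higher DC scores rank first, that "stdDev" keeps items scoring within one population standard deviation of the top score, and that a number N keeps the top N by rank. The function name `select_top` and the (label, score) layout are hypothetical. The "optimum" type is not described here, so the sketch does not attempt it.

```python
import statistics


def select_top(scored, refine_type):
    """Keep the best-scoring items according to refine_type.

    scored: list of (label, dc_score) pairs; a higher score ranks first.
    refine_type: "stdDev" or a positive integer N (keep the top N by rank).
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if not ranked:
        return []
    if refine_type == "stdDev":
        scores = [score for _, score in ranked]
        # Assumption: population standard deviation over all scores.
        cutoff = scores[0] - statistics.pstdev(scores)
        return [(label, score) for label, score in ranked if score >= cutoff]
    if isinstance(refine_type, int) and refine_type > 0:
        return ranked[:refine_type]
    # "optimum" is not specified in enough detail to sketch.
    raise ValueError(f"unsupported refine type: {refine_type!r}")
```

For example, `select_top(scored, 5)` keeps the five highest DC scores, while `select_top(scored, "stdDev")` lets the cut adapt to how spread out the scores are.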

  • Combine stDocuments
    • Program: Combine stDocuments (1 stGroup and 2 stGroups)
      • CombineStDoc
    • Inputs:
      • Output/stDocument.1.txt.in (from step 3)
      • Output/stDocument.2.txt.in (from step 3)

      • Output/sts.txt (from step 1)
    • Outputs:
      • Output/stDocument.txt.combine
    • Duration: 10 Sec.
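
If combining is taken to mean plain concatenation of the two stDocument files, which is an assumption since the behavior of CombineStDoc is not described here, the step can be sketched as follows. The function name `combine_files` is hypothetical.

```python
from pathlib import Path


def combine_files(part_paths, combined_path):
    """Concatenate the given text files, in order, into combined_path."""
    with open(combined_path, "w", encoding="utf-8") as out:
        for part in part_paths:
            text = Path(part).read_text(encoding="utf-8")
            out.write(text)
            # Keep records on separate lines if a part lacks a final newline.
            if text and not text.endswith("\n"):
                out.write("\n")
```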

  • Backup Generated Files:
    • Directory:
      • Source: ${TC_PRE}/data/${YEAR}/Sti/Output
      • Target: /nfsvol/crfiler-lex/Development/TC/2008a/data/2008X
    • Files:
      • ./stDoc/stDocument.txt
      • ./Sti/wordStsTable.txt
      • ./Stri/stJdTable.txt.2008X
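
The backup can be scripted as a copy of each generated file into its place under the target directory. The following is a minimal sketch. The function name `backup_files` is hypothetical, and the caller supplies the pairing of each source file with its relative target path, because this document does not spell out that mapping.

```python
import shutil
from pathlib import Path


def backup_files(pairs, target_dir):
    """Copy each (source_file, relative_target) pair into target_dir.

    Parent directories under target_dir are created as needed.
    Returns the list of destination paths that were written.
    """
    copied = []
    for source_file, relative_target in pairs:
        destination = Path(target_dir) / relative_target
        destination.parent.mkdir(parents=True, exist_ok=True)
        # copy2 also preserves timestamps, which helps when comparing backups.
        shutil.copy2(source_file, destination)
        copied.append(destination)
    return copied
```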