Text Categorization

PreProcess: St-Documents & Data Set

There are 6 steps to generate the complete data set of STI and STRI. They are detailed as follows:

  • Setup Input data:

    shell> cd ${TC_PRE_PROCESS}/tcPre2008/data/${YEAR}/Sti/Input

    Update the following files:

    • MRSTY: latest version of MRSTY.${YEAR-1}AC
    • MRCONSO.RRF: latest version of MRCONSO.RRF.${YEAR-1}AC
    • SRDEF.txt: copy from previous version
    • stGroups.txt: generated in step 2
    • jds.txt: from results of JDI
    • wordJdidWcDcTable.txt: from results of JDI
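
Before running step 1, it can help to confirm that every input file listed above is in place. The following is a minimal sketch, not part of the TC tools: the function name check_sti_inputs is hypothetical, and it simply checks for the six file names from the list, treating an empty file as missing.

```shell
#!/bin/sh
# Hypothetical helper (not part of the TC tools): verify that the input
# files named in the setup list exist and are non-empty.
# Usage: check_sti_inputs INPUT_DIR
check_sti_inputs() {
    input_dir=$1
    missing=0
    for name in MRSTY MRCONSO.RRF SRDEF.txt stGroups.txt jds.txt wordJdidWcDcTable.txt; do
        if [ ! -s "$input_dir/$name" ]; then
            # Report each problem on stderr so stdout stays clean.
            echo "missing or empty: $input_dir/$name" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only when nothing is missing.
    [ "$missing" -eq 0 ]
}
```

A typical call would be check_sti_inputs "${TC_PRE_PROCESS}/tcPre2008/data/${YEAR}/Sti/Input", run before starting step 1.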

  • Step 1: Generate Semantic Types file
    • Program:
      • GenSti
    • Inputs:
      • Input/MRSTY
      • Input/SRDEF.txt
    • Outputs:
      • Output/sts.txt
    • Duration: 5 Sec.

  • Step 2: Modify Semantic Type Groups file
    • Program:
      • GenSti
    • Inputs:
      • Input/stGroups.txt
    • Outputs:
      • Output/stGroups.txt
    • Duration: 1 Sec.

  • Step 3: Generate stDocuments
    • Program:
      • GenSti
    • Inputs:
      • Input/MRCONSO.RRF
      • Input/MRSTY

      • Output/sts.txt (from step 1)
      • Output/stGroups.txt (from step 2)
    • Outputs:
      • Output/stDocument.txt (combination of stDocument1.txt and stDocument2.txt)
      • Output/stDocument1.txt (words associated with only 1 St-Group)
      • Output/stDocument2.txt (words associated with multiple St-Groups)
    • Duration: 5 Min.

  • Step 4: Generate St-Jd Table
    • Program:
      • GenSti
    • Inputs:
      • Output/stDocument.txt.in (link to the results from step 3 or 3-1)
      • Output/stDocument.txt.sort (sorted from stDocument.txt.in)
      • tcConfigFile:
        /export/home/lu/Projects/TC/tc2009/data/Config/tc.properties.preProc
        • Use DB_NAME in tc.properties.preProc to run JDI on St-Documents

      • SetMaxSignal:
        • 2007: 510754
        • 2008: 645881
        • 2009: 705815
        • 2010: 754648
        • 2011: 792054
        • 201X: ...

      • Output/sts.txt (from step 1)
    • Outputs:
      • Output/stJdsTable.txt
    • Duration: 3 Min.
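
The two input files for step 4 (the .in link and the .sort file) can be prepared along the following lines. This is a sketch under assumptions: the function name prepare_step4_input is hypothetical, and plain byte-order sorting with LC_ALL=C is a guess, since the sort order GenSti expects is not stated here.

```shell
#!/bin/sh
# Hypothetical helper (not part of the TC tools): create the
# stDocument.txt.in link and the sorted stDocument.txt.sort file.
# Usage: prepare_step4_input RESULT_FILE OUTPUT_DIR
# RESULT_FILE should be an absolute path, because a relative symlink
# target is resolved from the directory that holds the link.
prepare_step4_input() {
    result_file=$1
    output_dir=$2
    # Point stDocument.txt.in at the chosen step 3 result.
    ln -sf "$result_file" "$output_dir/stDocument.txt.in" || return 1
    # Assumption: byte-order sort of whole lines; adjust if GenSti
    # requires a different key or collation.
    LC_ALL=C sort "$output_dir/stDocument.txt.in" > "$output_dir/stDocument.txt.sort"
}
```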

  • Step 5: Generate Word-St Table
    • Program:
      • GenSti
    • Inputs:
      • Jdi/Output/jds.txt (from JDI)
      • Jdi/Output/wordJdidWcDc/wordJdidWcDcTable.txt (from JDI)
      • Sti/Output/stJdsTable.txt (from step 4)
    • Outputs:
      • Sti/Output/wordStsTable.txt.sort
    • Duration: 70 Min.

  • Refine stDocument.txt
    • Program: Uses STRI to refine words in stDocuments.
      Note that this program must be run after step 4 is complete, because it needs stJdTable.txt.
      • RefineStDoc
    • Inputs:
      • Output/stDocumentRefine.txt.in (from step 3)
      • tcConfigFile
        /export/home/lu/Projects/TC/tc2009/data/Config/tc.properties.refine
        • Use DB_NAME in tc.properties.refine for JDI
        • Use ST_JD_FILE in tc.properties.refine for STRI (stJdTable.txt.2008x)
      • Refine Type
        • optimum:
        • stdDev: within 1 standard deviation of the top DC score
        • a number: for example, 5 means the top 5 by DC rank

      • Output/sts.txt (from step 1)
    • Outputs:
      • Output/stDocumentRefine.txt
    • Duration: 10 Min.
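
To illustrate the stdDev and numeric refine types, here is a rough sketch of the selection logic, not the RefineStDoc implementation. It assumes a hypothetical input of name|score lines, reads stdDev as "keep entries whose score is within one population standard deviation of the top score", and reads a number N as "keep the N highest scores". The optimum type is not described above, so the sketch rejects it.

```shell
#!/bin/sh
# Hypothetical illustration of the refine types (not RefineStDoc itself).
# Usage: select_by_refine_type TYPE SCORES_FILE
# SCORES_FILE is assumed to hold lines of the form name|score.
select_by_refine_type() {
    refine_type=$1
    scores_file=$2
    case $refine_type in
        stdDev)
            # Keep entries with score >= (top score - standard deviation).
            awk -F'|' '
                { name[NR] = $1; score[NR] = $2; sum += $2; sumsq += $2 * $2
                  if (NR == 1 || $2 > max) max = $2 }
                END {
                    if (NR == 0) exit
                    mean = sum / NR
                    var = sumsq / NR - mean * mean
                    if (var < 0) var = 0   # guard against rounding error
                    sd = sqrt(var)
                    for (i = 1; i <= NR; i++)
                        if (score[i] >= max - sd) print name[i] "|" score[i]
                }' "$scores_file"
            ;;
        *[!0-9]*|'')
            # Anything that is not stdDev or a plain number, including optimum.
            echo "unsupported refine type: $refine_type" >&2
            return 1
            ;;
        *)
            # Numeric type: keep the N highest scores.
            sort -t'|' -k2,2nr "$scores_file" | head -n "$refine_type"
            ;;
    esac
}
```

For example, select_by_refine_type 5 scores.txt would print the five highest-scoring lines of a hypothetical scores.txt.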

  • Combine stDocuments
    • Program: Combine stDocuments (1 stGroup and 2 stGroups)
      • CombineStDoc
    • Inputs:
      • Output/stDocument.1.txt.in (from step 3)
      • Output/stDocument.2.txt.in (from step 3)

      • Output/sts.txt (from step 1)
    • Outputs:
      • Output/stDocument.txt.combine
    • Duration: 10 Sec.

  • Backup Generated Files:
    • Directory:
      • Source: ${TC_PRE}/data/${YEAR}/Sti/Output
      • Target: /nfsvol/crfiler-lex/Development/TC/2008a/data/2008X
    • Files:
      • ./stDoc/stDocument.txt
      • ./Sti/wordStsTable.txt
      • ./Stri/stJdTable.txt.2008X
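
The backup can be scripted one file at a time. The sketch below is an assumption about how one might do it, not an existing tool: backup_file is a hypothetical helper that copies a single file under the target directory, creates the subdirectory if needed, and verifies the copy.

```shell
#!/bin/sh
# Hypothetical helper (not part of the TC tools): copy one generated
# file into the backup tree and verify the copy.
# Usage: backup_file SOURCE_FILE TARGET_ROOT RELATIVE_PATH
backup_file() {
    src_file=$1
    target_root=$2
    rel_path=$3
    dest="$target_root/$rel_path"
    # Create the target subdirectory (for example stDoc) if it is absent.
    mkdir -p "$(dirname "$dest")" || return 1
    # -p keeps the timestamp and mode of the generated file.
    cp -p "$src_file" "$dest" || return 1
    # Succeed only if the copy is byte-identical to the source.
    cmp -s "$src_file" "$dest"
}
```

For the first file in the list this would be called as backup_file "${TC_PRE}/data/${YEAR}/Sti/Output/stDocument.txt" /nfsvol/crfiler-lex/Development/TC/2008a/data/2008X stDoc/stDocument.txt, with similar calls for the other two files.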