The SPECIALIST Lexicon

Procedures: Generate Element Words from MEDLINE (TBD)

This page describes the details of generating high frequency element words from MEDLINE.

I. Description

  • Retrieve all TI (title) and AB (abstract) fields from Medline
  • Use Lexical Tools - wordIndex to get the word list (lowercase, remove punctuation, and use space as the word separator)
  • For each (single) word, update the total word count
  • For each (single) word, assign the type (LEXICON, NUMBER, NON_WORD, DIGIT, TBD)
  • Analyze results and generate reports
    • List of high-frequency words of type TBD
    • List of all words, sorted by frequency
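
The counting and typing steps above can be sketched as follows. This is a minimal illustration, not the actual programs: the class WordCountSketch and its methods are hypothetical, tokenize is only a stand-in for wordIndex (it assumes punctuation is replaced by a space, which may differ from the real tool), and the order in which the types are checked is an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative sketch only; not the actual GetWordCountFromTiAbFiles.java.
final class WordCountSketch {
    enum WordType { LEXICON, NUMBER, NON_WORD, DIGIT, TBD }

    // Stand-in for wordIndex (assumption): lowercase, turn punctuation into
    // spaces, then split on whitespace.
    static String[] tokenize(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{N}\\s]", " ")
                .trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    // Adds every word of one title or abstract to the running totals.
    static void addCounts(String text, Map<String, Long> counts) {
        for (String word : tokenize(text)) {
            counts.merge(word, 1L, Long::sum);
        }
    }

    // Assumed check order: DIGIT, NUMBER, LEXICON, NON_WORD, then TBD.
    static WordType classify(String word, Set<String> lexicon,
                             Set<String> numbers, Set<String> exceptions) {
        if (word.matches("[0-9]+")) return WordType.DIGIT;
        if (numbers.contains(word)) return WordType.NUMBER;
        if (lexicon.contains(word)) return WordType.LEXICON;
        if (exceptions.contains(word)) return WordType.NON_WORD;
        return WordType.TBD;
    }

    public static void main(String[] args) {
        // Tiny made-up stand-ins for inflVars.data, NRVAR, and exceptions.data.
        Set<String> lexicon = new HashSet<>(Arrays.asList("of", "effects", "in"));
        Set<String> numbers = new HashSet<>(Arrays.asList("nine"));
        Set<String> exceptions = new HashSet<>(Arrays.asList("iii"));

        Map<String, Long> counts = new TreeMap<>();
        addCounts("Effects of IL-2 in nine mice.", counts);
        addCounts("Phase III: effects of 9 doses", counts);

        // Print one word|count|type line per distinct word.
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            WordType type = classify(e.getKey(), lexicon, numbers, exceptions);
            System.out.println(e.getKey() + "|" + e.getValue() + "|" + type);
        }
    }
}
```

NUMBER is checked before LEXICON here so that a word such as nine, which may appear in both input lists, is reported as a number; the real program may resolve such overlaps differently.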

II. Processes

  • Root directory: ${LEXICON_DIR}/Components/Medline
  • Data:
    • Input
      • ${ROOT_DIR}/data/Medline/${YEAR}/medline${YY}n${NNNN}.txt: Medline files
      • ${ROOT_DIR}/data/${YEAR}/inData/inflVars.data: all existing words in the Lexicon
      • ${ROOT_DIR}/data/${YEAR}/inData/NRVAR: all number variants included in the Lexicon release
      • ${ROOT_DIR}/data/${YEAR}/inData/exceptions.data: words that are not in the Lexicon and do not need to be added, such as ii, iii, etc. This list should be updated periodically.
    • Output directory: ${ROOT_DIR}/data/${YEAR}/outData/

  • Procedures:
    • MedlineFileList
      • Program: GenFileList.java
      • Input: Medline files (${ROOT_DIR}/data/Medline/${YEAR}/medline${YY}n${NNNN}.txt)
      • Algorithm: get the list of Medline.${YEAR} files
      • Output: ${OUT_DIR}/MedlineFiles2014.txt
    • Generate PmidTiAb${YY}n${NNNN}.txt
      • Program: GenPmidTiAbFiles.java
      • Input: Medline files (${ROOT_DIR}/data/Medline/${YEAR}/medline${YY}n${NNNN}.txt)
      • Algorithm: retrieve the PMID, title, and abstract from Medline.${YEAR}, separated by spaces, keeping the original case.
      • Output: ${OUT_DIR}/PmidTiAb/PmidTiAb${YY}n${NNNN}.txt
    • Generate words|count|type
      • Program: GetWordCountFromTiAbFiles.java
      • Input:
        • Medline files: ${ROOT_DIR}/data/Medline/${YEAR}/medline${YY}n${NNNN}.txt
        • inflVars: ${IN_DIR}/inflVars.data
        • numbers: ${IN_DIR}/NRVAR
        • exceptions: ${IN_DIR}/exceptions.data (words that should not be included in the Lexicon, such as iii)
      • Algorithm: get word counts from the titles and abstracts in Medline.${YEAR}
        • Use wordIndex to get the word list (lowercase all words and use space as the word separator)
        • Update count
      • Output: ${OUT_DIR}/wordCount.out, with the following format:
        word|count|type

        where type can be:
        LEXICON   an existing word in the Lexicon, such as of
        NUMBER    an existing number in the Lexicon, such as nine
        NON_WORD  not in the Lexicon and not a real word or an element of a multiword, such as iii
        DIGIT     a digit, such as 9
        TBD       to be done, none of the above types
    • Analyze results and generate reports
      • Program:
        • AnalyzeWordCountFile.java
        • GetTbdWords.java
      • Input: ${OUT_DIR}/wordCount.out
      • Algorithm:
      • Output:
        • ${OUT_DIR}/wordCount.sum: summary
        • ${OUT_DIR}/wordCount.rpt: report, sorted by frequency
          rank | word | type | word count | cum. word count | cum. coverage (recall)
        • ${OUT_DIR}/wordCount.csv: in csv format for diagram
          rank, word count, type
        • ${OUT_DIR}/wordCount.tbd
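
Since the algorithm for the report step is not spelled out, the following is only a sketch of how the frequency-sorted report and the TBD list could be derived from wordCount.out. The class WordCountReportSketch, the pipe-delimited input and output layout, the alphabetical tie-breaking, and the minCount threshold are all assumptions for illustration; they are not taken from AnalyzeWordCountFile.java or GetTbdWords.java.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; not the actual report programs.
final class WordCountReportSketch {
    // One parsed line of wordCount.out (assumed layout: word|count|type).
    static final class Row {
        final String word;
        final long count;
        final String type;

        Row(String word, long count, String type) {
            this.word = word;
            this.count = count;
            this.type = type;
        }
    }

    static List<Row> parse(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        for (String line : lines) {
            String[] f = line.split("\\|", -1);
            rows.add(new Row(f[0], Long.parseLong(f[1]), f[2]));
        }
        return rows;
    }

    // Sorts by descending count (ties broken alphabetically) and emits
    // rank|word|type|count|cumulative count|cumulative coverage.
    static List<String> report(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort((a, b) -> {
            int byCount = Long.compare(b.count, a.count);
            return byCount != 0 ? byCount : a.word.compareTo(b.word);
        });
        long total = 0;
        for (Row r : sorted) {
            total += r.count;
        }
        List<String> out = new ArrayList<>();
        long cumulative = 0;
        int rank = 0;
        for (Row r : sorted) {
            rank++;
            cumulative += r.count;
            // Coverage: share of all word occurrences covered down to this rank.
            double coverage = total == 0 ? 0.0 : (double) cumulative / total;
            out.add(String.format(Locale.ROOT, "%d|%s|%s|%d|%d|%.4f",
                    rank, r.word, r.type, r.count, cumulative, coverage));
        }
        return out;
    }

    // Words of type TBD whose count is at least minCount (hypothetical cutoff).
    static List<String> tbdWords(List<Row> rows, long minCount) {
        List<String> out = new ArrayList<>();
        for (Row r : rows) {
            if ("TBD".equals(r.type) && r.count >= minCount) {
                out.add(r.word);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not real MEDLINE counts.
        List<Row> rows = parse(Arrays.asList("of|6|LEXICON", "abc|3|TBD", "9|1|DIGIT"));
        report(rows).forEach(System.out::println);
        System.out.println(tbdWords(rows, 2));
    }
}
```

Cumulative coverage is computed here as the cumulative word count divided by the total count over all words, which is one plausible reading of "cum. coverage (recall)".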

  • Program: