Procedures: Generate Element Words from MEDLINE (TBD)
This page describes how high-frequency element words are generated from MEDLINE.
I. Description
- Retrieve all TI (titles) and AB (abstracts) from Medline
- Use Lexical Tools - wordIndex to get the word list (lowercase, remove punctuation, and use space as word separator)
- For each (single) word, update the total word count
- For each (single) word, assign a type (LEXICON, NUMBER, NON_WORD, DIGIT, TBD)
- Analyze results and generate reports
- List of high-frequency words of type TBD
- List of all words, sorted by frequency
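The tokenizing and counting steps above can be sketched as follows. This is a minimal illustration, not the Lexical Tools wordIndex implementation: the class `WordCountSketch` and its methods are hypothetical names, and the punctuation handling (every run of characters that are not letters or digits becomes a single space) is an assumption about what "remove punctuation" means here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; not the Lexical Tools wordIndex implementation.
class WordCountSketch {

    // Lowercase the text, replace non-alphanumeric runs with a space
    // (an assumption about "remove punctuation"), then split on whitespace.
    static List<String> tokenize(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT)
                             .replaceAll("[^\\p{L}\\p{Nd}]+", " ")
                             .trim();
        List<String> words = new ArrayList<>();
        if (cleaned.isEmpty()) {
            return words;
        }
        for (String w : cleaned.split("\\s+")) {
            words.add(w);
        }
        return words;
    }

    // Add the words of one title or abstract to the running totals.
    static void updateCounts(Map<String, Long> counts, String text) {
        for (String w : tokenize(text)) {
            counts.merge(w, 1L, Long::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new TreeMap<>();
        updateCounts(counts, "Effects of aspirin: a randomized trial.");
        updateCounts(counts, "Aspirin and the risk of stroke");
        System.out.println(counts);
    }
}
```

A single map keyed by the lowercased word keeps the total count across all titles and abstracts, which is what the later type assignment and frequency reports operate on.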
II. Processes
- Root directory: ${LEXICON_DIR}/Components/Medline
- Data:
- Input
- ${ROOT_DIR}/data/Medline/${YEAR}/medline${YY}n${NNNN}.txt: Medline files
- ${ROOT_DIR}/data/${YEAR}/inData/inflVars.data: all existing words in Lexicon
- ${ROOT_DIR}/data/${YEAR}/inData/NRVAR: all number variants included in the Lexicon release
- ${ROOT_DIR}/data/${YEAR}/inData/exceptions.data: words that are not in the Lexicon and do not need to be added, such as ii, iii, etc. This list should be updated periodically.
- Output directory: ${ROOT_DIR}/data/${YEAR}/outData/
- Procedures:
- MedlineFileList
- Program: GenFileList.java
- Input: Medline files (${ROOT_DIR}/data/Medline/${YEAR}/medline${YY}n${NNNN}.txt)
- Algorithm: get the file list of Medline.${YEAR}
- Output: ${OUT_DIR}/MedlineFiles2014.txt
- Generate PmidTiAb${YY}n${NNNN}.txt
- Program: GenPmidTiAbFiles.java
- Input: Medline files (${ROOT_DIR}/data/Medline/${YEAR}/medline${YY}n${NNNN}.txt)
- Algorithm: retrieve the PMID, title, and abstract from Medline.${YEAR}, separated by spaces, keeping the original case.
- Output: ${OUT_DIR}/PmidTiAb/PmidTiAb${YY}n${NNNN}.txt
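The extraction step above might look like the sketch below. It rests on an assumption this page does not confirm: that the Medline .txt files use the MEDLINE tagged text layout, where fields start with tags such as `PMID- `, `TI  - `, and `AB  - ` and continuation lines are indented by six spaces. The class `PmidTiAbSketch` and its methods are hypothetical names, not part of GenPmidTiAbFiles.java.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; assumes MEDLINE tagged text input ("PMID- ", "TI  - ", "AB  - ").
class PmidTiAbSketch {

    // Returns one "PMID title abstract" line per record, original case kept.
    static List<String> extract(List<String> lines) {
        List<String> out = new ArrayList<>();
        String pmid = null;
        StringBuilder ti = new StringBuilder();
        StringBuilder ab = new StringBuilder();
        StringBuilder current = null; // field that continuation lines extend

        for (String line : lines) {
            if (line.startsWith("PMID- ")) {
                // A new record starts: flush the previous one, if any.
                if (pmid != null) {
                    out.add(join(pmid, ti, ab));
                }
                pmid = line.substring(6).trim();
                ti.setLength(0);
                ab.setLength(0);
                current = null;
            } else if (line.startsWith("TI  - ")) {
                current = ti.append(line.substring(6).trim());
            } else if (line.startsWith("AB  - ")) {
                current = ab.append(line.substring(6).trim());
            } else if (line.startsWith("      ") && current != null) {
                // Continuation of the most recent TI or AB field.
                current.append(' ').append(line.trim());
            } else {
                current = null; // any other tag or a blank line ends the field
            }
        }
        if (pmid != null) {
            out.add(join(pmid, ti, ab));
        }
        return out;
    }

    // Join the three fields with single spaces, as described in the step above.
    private static String join(String pmid, StringBuilder ti, StringBuilder ab) {
        return (pmid + " " + ti + " " + ab).trim();
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "PMID- 12345",
            "TI  - Aspirin and the risk",
            "      of stroke.",
            "AB  - BACKGROUND: Aspirin is widely used.");
        extract(sample).forEach(System.out::println);
    }
}
```

Case is left untouched at this stage; lowercasing happens later, when the word list is built.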
- Generate words|count|type
- Program: GetWordCountFromTiAbFiles.java
- Input:
- Medline files: ${ROOT_DIR}/data/Medline/${YEAR}/medline${YY}n${NNNN}.txt
- inflVars: ${IN_DIR}/inflVars.data
- numbers: ${IN_DIR}/NRVAR
- exceptions: ${IN_DIR}/exceptions.data (words that should not be included in the Lexicon, such as iii)
- Algorithm: get word count from title and abstract from Medline.${YEAR}
- Use wordIndex to get the word list (lowercase all words and use space as the word separator)
- Update count
- Output: ${OUT_DIR}/wordCount.out, with the format word|count|type, where type can be:
Type | Description
---|---
LEXICON | an existing word in the Lexicon, such as of
NUMBER | an existing number in the Lexicon, such as nine
NON_WORD | not in the Lexicon and not a real word or an element of a multiword, such as iii
DIGIT | a digit, such as 9
TBD | To be done; not any of the above types
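The type assignment described above can be sketched as a chain of lookups. This is only an illustration under several assumptions: the three input files have already been loaded into sets of lowercased words (their file formats are not described here), the checks are applied in the order the types are listed, and DIGIT means a word made up entirely of digits. The class `WordTypeSketch` is a hypothetical name, not part of GetWordCountFromTiAbFiles.java.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of assigning a type to each word.
class WordTypeSketch {

    enum WordType { LEXICON, NUMBER, NON_WORD, DIGIT, TBD }

    // lexiconWords: words from inflVars.data; numberWords: words from NRVAR;
    // exceptions: words from exceptions.data. The order of checks is an assumption.
    static WordType classify(String word, Set<String> lexiconWords,
                             Set<String> numberWords, Set<String> exceptions) {
        if (lexiconWords.contains(word)) {
            return WordType.LEXICON;
        }
        if (numberWords.contains(word)) {
            return WordType.NUMBER;
        }
        if (exceptions.contains(word)) {
            return WordType.NON_WORD;
        }
        if (!word.isEmpty() && word.chars().allMatch(Character::isDigit)) {
            return WordType.DIGIT;
        }
        return WordType.TBD; // everything else is left for later review
    }

    public static void main(String[] args) {
        Set<String> lexicon = new HashSet<>(Arrays.asList("of"));
        Set<String> numbers = new HashSet<>(Arrays.asList("nine"));
        Set<String> exceptions = new HashSet<>(Arrays.asList("iii"));
        for (String w : Arrays.asList("of", "nine", "iii", "9", "xyzzy")) {
            System.out.println(w + "|" + classify(w, lexicon, numbers, exceptions));
        }
    }
}
```

Words that fall through to TBD are the candidates that the analysis step reports by frequency.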
- Analyze results and generate reports
- Program:
- AnalyzeWordCountFile.java
- GetTbdWords.java
- Input: ${OUT_DIR}/wordCount.out
- Algorithm:
- Output:
- ${OUT_DIR}/wordCount.sum: summary
- ${OUT_DIR}/wordCount.rpt: report, sorted by frequency, with the following columns:
rank | word | type | word count | cum. word count | cum. coverage (recall)
---|---|---|---|---|---
- ${OUT_DIR}/wordCount.csv: in csv format for diagram
- ${OUT_DIR}/wordCount.tbd
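Since the algorithm for this step is not spelled out, the following is only one plausible way to derive the report columns from the word counts. It assumes input lines of the form word|count|type, that cumulative coverage is the cumulative word count divided by the total word count, that ties are broken alphabetically, and that output fields are separated by `|`. The class `CoverageReportSketch` is a hypothetical name, not part of AnalyzeWordCountFile.java.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of building the frequency-sorted report.
class CoverageReportSketch {

    // Input: lines in word|count|type form (assumed format).
    // Output: rank|word|type|count|cumulative count|cumulative coverage.
    static List<String> report(List<String> wordCountLines) {
        List<String[]> rows = new ArrayList<>();
        long total = 0;
        for (String line : wordCountLines) {
            String[] fields = line.split("\\|");
            rows.add(fields);
            total += Long.parseLong(fields[1]);
        }
        // Highest count first; alphabetical order breaks ties (an assumption).
        rows.sort((a, b) -> {
            int byCount = Long.compare(Long.parseLong(b[1]), Long.parseLong(a[1]));
            return byCount != 0 ? byCount : a[0].compareTo(b[0]);
        });

        List<String> out = new ArrayList<>();
        long cumulative = 0;
        int rank = 0;
        for (String[] fields : rows) {
            rank++;
            long count = Long.parseLong(fields[1]);
            cumulative += count;
            double coverage = total == 0 ? 0.0 : (double) cumulative / total;
            out.add(String.format(Locale.ROOT, "%d|%s|%s|%d|%d|%.4f",
                    rank, fields[0], fields[2], count, cumulative, coverage));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "of|6|LEXICON", "iii|1|NON_WORD", "nine|3|NUMBER");
        report(sample).forEach(System.out::println);
    }
}
```

Filtering the same sorted rows to those whose type is TBD would give a frequency-ordered list of unresolved words.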
- Program: