SPECIALIST Lexicon

Procedures: Generate Element Words from MEDLINE (TBD)

This page describes the details of generating high frequency element words from MEDLINE.

I. Description

Retrieve all TI (titles) and AB (abstracts) from Medline
Use Lexical Tools - wordIndex to get the word list (lowercase, remove punctuation, and use space as word separator)
For each (single) word, update the total word count
For each (single) word, assign the type (LEXICON, NUMBER, NON_WORD, DIGIT, TBD)
Analyze results and generate reports
- List of high-frequency words of type TBD
- Full word list sorted by frequency
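
The two reports above can be sketched as follows. This is a minimal illustration, not the actual report program: the class WordReportSketch, its Entry holder, and the minCount threshold are hypothetical names, and the source does not state what frequency counts as "high".

```java
import java.util.ArrayList;
import java.util.List;

final class WordReportSketch {
    enum Type { LEXICON, NUMBER, NON_WORD, DIGIT, TBD }

    // One row of the word list: word, total count, assigned type.
    static final class Entry {
        final String word;
        final long count;
        final Type type;

        Entry(String word, long count, Type type) {
            this.word = word;
            this.count = count;
            this.type = type;
        }
    }

    // Full word list, most frequent first; ties are broken alphabetically.
    static List<Entry> sortedByFrequency(List<Entry> entries) {
        List<Entry> sorted = new ArrayList<>(entries);
        sorted.sort((a, b) -> {
            int byCount = Long.compare(b.count, a.count); // descending count
            return byCount != 0 ? byCount : a.word.compareTo(b.word);
        });
        return sorted;
    }

    // TBD words whose count is at least minCount (threshold is an assumption).
    static List<Entry> highFrequencyTbd(List<Entry> entries, long minCount) {
        List<Entry> result = new ArrayList<>();
        for (Entry e : sortedByFrequency(entries)) {
            if (e.type == Type.TBD && e.count >= minCount) {
                result.add(e);
            }
        }
        return result;
    }
}
```

Sorting once and filtering afterwards keeps both reports in the same order, so the TBD report is a strict subset of the full list.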

II. Processes

Procedures:

MedlineFileList
- Program: GenFileList.java
- Input: Medline files (${ROOT_DIR}/data/Medline/${YEAR}/medline${YY}n${NNNN}.txt)
- Algorithm: get the file list of Medline.${YEAR}
- Output: ${OUT_DIR}/MedlineFiles2014.txt
Generate PmidTiAb${YY}n${NNNN}.txt
- Program: GenPmidTiAbFiles.java
- Input: Medline files (${ROOT_DIR}/data/Medline/${YEAR}/medline${YY}n${NNNN}.txt)
- Algorithm: retrieve the PMID, title, and abstract from Medline.${YEAR}, separated by spaces, keeping the original case.
- Output: ${OUT_DIR}/PmidTiAb/PmidTiAb${YY}n${NNNN}.txt
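
The file naming used by these two steps can be illustrated with a short sketch. It is not the code of GenFileList.java or GenPmidTiAbFiles.java. The class MedlineFileListSketch and its methods are hypothetical, and the pattern assumes that ${YY} is a two-digit year and ${NNNN} a four-digit file number, which is inferred only from the placeholder names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

final class MedlineFileListSketch {
    // medline${YY}n${NNNN}.txt: assumed two-digit year, literal 'n', four-digit number.
    private static final Pattern NAME = Pattern.compile("medline\\d{2}n\\d{4}\\.txt");

    // Keep only names that match the Medline pattern, in sorted order.
    static List<String> select(List<String> fileNames) {
        List<String> selected = new ArrayList<>();
        for (String name : fileNames) {
            if (NAME.matcher(name).matches()) {
                selected.add(name);
            }
        }
        Collections.sort(selected);
        return selected;
    }

    // Map an input file name to the matching PmidTiAb output file name.
    static String outputName(String medlineName) {
        return medlineName.replaceFirst("^medline", "PmidTiAb");
    }
}
```

For example, under these assumptions medline14n0001.txt maps to PmidTiAb14n0001.txt, and names that do not fit the pattern are skipped.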

Generate words|count|type

Program: GetWordCountFromTiAbFiles.java
Input:
- Medline files: ${ROOT_DIR}/data/Medline/${YEAR}/medline${YY}n${NNNN}.txt
- inflVars: ${IN_DIR}/inflVars.data
- numbers: ${IN_DIR}/NRVAR
- exceptions: ${IN_DIR}/exceptions.data (words that should not be included in the Lexicon, such as iii)
Algorithm: get word counts from the titles and abstracts in Medline.${YEAR}
- Use wordIndex to get the word list (lowercase all words and use space as the word separator)
- Update count
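
The tokenize-and-count step above can be approximated as follows. This sketch does not call the Lexical Tools wordIndex; the tokenize method is a simplified stand-in that lowercases the text, treats ASCII punctuation as a separator (an assumption about how punctuation is removed), and splits on whitespace. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class WordCountSketch {
    // Simplified stand-in for wordIndex: lowercase, replace ASCII punctuation
    // with spaces, then split on runs of whitespace.
    static String[] tokenize(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT)
                .replaceAll("\\p{Punct}", " ")
                .trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    // Add every word of one title or abstract to the running totals.
    static void addCounts(Map<String, Long> counts, String text) {
        for (String word : tokenize(text)) {
            counts.merge(word, 1L, Long::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        addCounts(counts, "The cell. The CELL wall");
        System.out.println(counts);
    }
}
```

Keeping one shared map across all titles and abstracts yields the total count per word for the whole Medline year.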

Output: ${OUT_DIR}/wordCount.out, in the following format:

word	count	type

where type can be:

LEXICON	an existing word in the Lexicon, such as of
NUMBER	an existing number in the Lexicon, such as nine
NON_WORD	does not exist in the Lexicon and is not a real word or an element of a multiword, such as iii
DIGIT	a digit, such as 9
TBD	to be done; none of the above types
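
The type assignment described above might look like the following sketch. It is not the code of GetWordCountFromTiAbFiles.java, and several points are assumptions: the three sets stand for words already loaded from inflVars.data, NRVAR, and exceptions.data (the file formats are not described here), NUMBER is checked before LEXICON because a number word such as nine would otherwise match as an ordinary Lexicon word, and the "|" field separator is taken from the step name "words|count|type".

```java
import java.util.Set;

final class WordTypeSketch {
    enum Type { LEXICON, NUMBER, NON_WORD, DIGIT, TBD }

    // Assign a type to one lowercased word. The check order is an assumption.
    static Type classify(String word, Set<String> lexiconWords,
                         Set<String> numberWords, Set<String> exceptions) {
        if (!word.isEmpty() && word.chars().allMatch(ch -> ch >= '0' && ch <= '9')) {
            return Type.DIGIT;      // only ASCII digits, such as 9
        }
        if (numberWords.contains(word)) {
            return Type.NUMBER;     // number word, such as nine
        }
        if (lexiconWords.contains(word)) {
            return Type.LEXICON;    // known Lexicon word, such as of
        }
        if (exceptions.contains(word)) {
            return Type.NON_WORD;   // listed exception, such as iii
        }
        return Type.TBD;            // everything else still needs review
    }

    // One output line; the "|" separator is assumed, not confirmed.
    static String formatLine(String word, long count, Type type) {
        return word + "|" + count + "|" + type;
    }
}
```

Under these assumptions, a word that appears in none of the three lists and is not all digits falls through to TBD, which is the pool from which candidate element words are drawn.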

Program: