Pre-Process: Word-Signal-Wc-Dc (Gt1)
- Description:
This file contains the Word-Signal-Wc-Dc information for all words in the training set (MEDLINE) with a document count greater than 1. The word signal is the normalized total word count and is used in place of the word count in JDI.
- Input:
- Java Files & Algorithm:
- GenerateWordSignalWcDc.java
- Read global word count and document count from wordWcDcGt1.txt
- Calculate word signal
- Read in all words (Gt1) of titles and abstracts from uiTiAbWords.${NUM}.txt
- Update the local word count of each word for each citation (UI/PMID) from uiTiAbWords.${NUM}.txt if the word is in wordWcDcGt1.txt
- Calculate normalized global word signal:
global Wc: the total count of the word over the entire training set
go through all documents in the training set and, for each document, calculate:
- local Wc: the count of the word in the document we are working on
- local noise = (local Wc / global Wc) * log2(global Wc / local Wc)
- local signal = log2(global Wc) - local noise
- local signal weight = (local Wc) * local signal
- local normalized signal = log2(1 + local signal weight)
- Add the local normalized signal to the global word signal (used as the normalized word count in the next step)
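The per-document steps above can be sketched in Java as follows. This is only an illustration of the listed formulas, not the code in GenerateWordSignalWcDc.java: the class name WordSignalSketch, its method names, and the choice to pass the per-document counts as an array are all assumptions made for the example.

```java
// Hypothetical sketch of the word signal formulas listed above.
// Class and method names are illustrative; they are not taken from the JDI source.
final class WordSignalSketch {

    // Base-2 logarithm.
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Local normalized signal of one word in one document.
    // localWc: count of the word in the document; globalWc: count over the training set.
    static double localNormalizedSignal(long localWc, long globalWc) {
        if (localWc <= 0 || globalWc <= 0) {
            return 0.0; // assumption: a word that does not occur contributes nothing
        }
        double localNoise = ((double) localWc / globalWc) * log2((double) globalWc / localWc);
        double localSignal = log2(globalWc) - localNoise;
        double localSignalWeight = localWc * localSignal;
        return log2(1.0 + localSignalWeight);
    }

    // Global word signal: sum of the local normalized signals over all documents.
    static double wordSignal(long[] localWcs, long globalWc) {
        double signal = 0.0;
        for (long localWc : localWcs) {
            signal += localNormalizedSignal(localWc, globalWc);
        }
        return signal;
    }

    public static void main(String[] args) {
        // A word seen 2, 4, and 2 times in three documents (global Wc = 8).
        long[] localWcs = {2, 4, 2};
        System.out.println(wordSignal(localWcs, 8));
    }
}
```

For a word with local Wc = 2 and global Wc = 8, the steps give local noise 0.5, local signal 2.5, local signal weight 5, and a local normalized signal of log2(6), about 2.585.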
- Print out Word-Signal-Wc-Dc
- Output file:
- wordSignalWcDcGt1.txt, used in JDI
Word | word signal (normalized) | total word count | total document count
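A minimal sketch of reading one line of this output, assuming the four fields are separated by a literal "|" character as the format line suggests. The class WordSignalLine and the sample line used with it are hypothetical and only illustrate the field order.

```java
// Hypothetical reader for one line of wordSignalWcDcGt1.txt.
// Assumes fields are "|"-separated in the order:
// word | word signal | total word count | total document count
final class WordSignalLine {
    final String word;
    final double signal;
    final long wordCount;
    final long docCount;

    private WordSignalLine(String word, double signal, long wordCount, long docCount) {
        this.word = word;
        this.signal = signal;
        this.wordCount = wordCount;
        this.docCount = docCount;
    }

    // Splits on "|" and trims whitespace around each field.
    static WordSignalLine parse(String line) {
        String[] fields = line.split("\\|", -1);
        if (fields.length != 4) {
            throw new IllegalArgumentException("Expected 4 fields: " + line);
        }
        return new WordSignalLine(
                fields[0].trim(),
                Double.parseDouble(fields[1].trim()),
                Long.parseLong(fields[2].trim()),
                Long.parseLong(fields[3].trim()));
    }

    public static void main(String[] args) {
        // Made-up sample line, not real data from the file.
        WordSignalLine r = WordSignalLine.parse("heart|12.5|40|7");
        System.out.println(r.word + " " + r.signal + " " + r.wordCount + " " + r.docCount);
    }
}
```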