| Step | Description | Inputs | Outputs | Notes |
|---|---|---|---|---|
| 1 | Grep terms (nGrams) then sort | | | Must create a link of the input nGram.${YEAR} |
| 2 | Filter pipe (\|) from nGrams | | nGram.${YEAR}.noPipe<br>nGram.${YEAR}.pipe | |
| 3 | Group nGrams by core-term<br>NGramUtil.GroupByCoreTerm | | nGram.${YEAR}.noPipe.core<br>nGram.${YEAR}.noPipe.core.detail | Group by core-term, also update the WC |
| 4 | Group nGrams by norm-term<br>NGramUtil.GroupByNormTerm | | nGram.${YEAR}.noPipe.norm<br>nGram.${YEAR}.noPipe.norm.detail | Group by norm-term, also update the WC |
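
Steps 2 and 3 can be sketched as follows. This is a minimal Python illustration, not the NGramUtil implementation. It assumes each input line has the form DC|WC|TERM (the format named in the later steps), that a line with more than two separators has a pipe inside the term, that a core-term is the term with surrounding punctuation and whitespace stripped, and that "update the WC" means summing WC across the grouped records. All function names are hypothetical.

```python
import string
from collections import defaultdict


def split_pipe_terms(lines):
    """Step 2 sketch: separate records whose term contains a pipe."""
    no_pipe, pipe = [], []
    for line in lines:
        # DC|WC|TERM has exactly two separators; more means a pipe in the term.
        (pipe if line.count("|") > 2 else no_pipe).append(line)
    return no_pipe, pipe


def core_term(term):
    """Assumed definition: strip surrounding punctuation and whitespace."""
    return term.strip(string.punctuation + string.whitespace)


def group_by_core_term(lines):
    """Step 3 sketch: sum WC per core-term and keep the member records."""
    totals = defaultdict(int)
    detail = defaultdict(list)
    for line in lines:
        _dc, wc, term = line.split("|", 2)
        key = core_term(term)
        totals[key] += int(wc)
        detail[key].append(line)
    return dict(totals), dict(detail)


records = ["3|5|heart attack", "2|4|(heart attack)", "1|1|a|b"]
no_pipe, pipe = split_pipe_terms(records)
totals, detail = group_by_core_term(no_pipe)
for key, wc in totals.items():
    print(f"{wc}|{key}")  # WC|core-term
```

Here `totals` plays the role of the `.core` output and `detail` the role of the `.core.detail` output, under the assumptions stated above.
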

Convert from WC|core-term back to DC|WC|TERM

| Step | Description | Inputs | Outputs | Notes |
|---|---|---|---|---|
| 5 | Sort nGrams by DC\|WC\|Term<br>NGramUtil.SortNGramsByDcWc | | nGram.${YEAR}.noPipe.sort.WcDcTerm | input is sorted by N, then DC\|WC\|Term |
| 6 | Convert (ungroup) core-term to nGrams<br>NGramUtil.CoreTermToNGram | nGram.${YEAR}.noPipe.core<br>nGram.${YEAR}.noPipe.core.detail | nGram.${YEAR}.noPipe.core.ungroup | the result is sorted, same as results from Step 5<br>in format (core-term): WC\|core-term<br>out format (core-term): DC\|WC\|TERM |
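
The ordering in Step 5 and the ungrouping in Step 6 could look like the sketch below. It assumes that N is the number of space-separated words in the term, that DC and WC sort in descending order, and that the `.detail` file can be read into a mapping from each core-term to its original DC|WC|TERM records. None of these details is confirmed by the table, and the function names are hypothetical.

```python
def sort_key(record):
    """Order by N (words in the term), then DC, WC, and term."""
    dc, wc, term = record.split("|", 2)
    # Assumption: higher DC and WC come first within the same N.
    return (len(term.split()), -int(dc), -int(wc), term)


def ungroup(core_terms, detail):
    """Expand WC|core-term records back into sorted DC|WC|TERM records."""
    records = []
    for line in core_terms:
        _wc, core = line.split("|", 1)
        records.extend(detail[core])
    return sorted(records, key=sort_key)


core_terms = ["9|heart attack", "7|heart"]
detail = {
    "heart attack": ["2|4|(heart attack)", "3|5|heart attack"],
    "heart": ["6|7|heart"],
}
ungrouped = ungroup(core_terms, detail)
for record in ungrouped:
    print(record)
```

Because the ungrouped records go through the same sort key as Step 5, the two outputs can be compared line by line.
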

Convert from WC|core-term.lc back to WC|core-term

| Step | Description | Inputs | Outputs | Notes |
|---|---|---|---|---|
| 7 | Group nGrams by core-term.lc<br>NGramUtil.GroupByCoreTerm | | nGram.${YEAR}.noPipe.core.lc<br>nGram.${YEAR}.noPipe.core.lc.detail | Results are the same because the input is all lowercase |
| 8 | core-term.lc to core-term nGrams<br>NGramUtil.CoreTermLcToCoreTerm | nGram.${YEAR}.noPipe.core.lc<br>nGram.${YEAR}.noPipe.core.lc.detail | nGram.${YEAR}.noPipe.core.lc.core<br>nGram.${YEAR}.noPipe.core.lc.core.detail | Results are the same because the input is all lowercase |
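
Steps 7 and 8 apply the same grouping idea with a lowercased key. The following is a rough sketch, assuming core-term.lc is simply the lowercased core-term and that WC is summed across case variants. The helper names are hypothetical and do not reflect the actual NGramUtil code.

```python
from collections import defaultdict


def group_by_lowercase(core_terms):
    """Group WC|core-term records by lowercased core-term, summing WC."""
    totals = defaultdict(int)
    detail = defaultdict(list)
    for line in core_terms:
        wc, core = line.split("|", 1)
        totals[core.lower()] += int(wc)
        detail[core.lower()].append(line)
    return dict(totals), dict(detail)


def lowercase_to_core(lc_keys, detail):
    """Map core-term.lc keys back to their original WC|core-term records."""
    return [line for key in lc_keys for line in detail[key]]


core_terms = ["9|Heart attack", "4|heart attack", "7|aspirin"]
totals, detail = group_by_lowercase(core_terms)
print(totals)
print(lowercase_to_core(["heart attack"], detail))
```

When every input term is already lowercase, each group holds a single record and the totals equal the input WC values, which is consistent with the note in the table.
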

Group n-gram set by core-term.lc

| Step | Description | Inputs | Outputs | Notes |
|---|---|---|---|---|
| 10 | Group nGram set by core-term.lc<br>NGramUtil.CoreTermToNGram | ${NGRAM_DIR}nGramSet.${YEAR}.30 | ${NGRAM_DIR}nGramSet.${YEAR}.30.core.lc<br>${NGRAM_DIR}nGramSet.${YEAR}.core.lc.detail | |
| 11 | Group distilled nGram set by core-term.lc<br>NGramUtil.CoreTermToNGram | ${NGRAM_DIR}distilledNGram.${YEAR} | ${NGRAM_DIR}distilledNGram.${YEAR}.core.lc<br>${NGRAM_DIR}distilledNGram.${YEAR}.core.lc.detail | |