Real-word Split
This page describes the processes for real-word split detection and correction.
I. Processes
RealWordDetector.java
CS_DETECTOR_RW_SPLIT_WORD_MIN_LENGTH
CS_DETECTOR_RW_SPLIT_WORD_MIN_WC
SplitCandidates.java
CS_CAN_RW_MAX_SPLIT_NO
CS_CAN_RW_MAX_SHORT_SPLIT_WORD_NO
CS_CAN_RW_SHORT_SPLIT_WORD_LENGTH
| Src | Candidate | Notes |
|---|---|---|
| 17942.txt | "something" -> "so me thing" | Both "so" and "me" are short split words; two short split words make this an invalid split. |
| 16369.txt | "suggestion" -> "suggest i on" | Both "i" and "on" are short split words; two short split words make this an invalid split. |
| 60.txt | "upon" -> "up on" | |
| 30.txt | "soon" -> "so on" | |
| 12353.txt | "another" -> "a not her", "an other" | "a not her" is an invalid candidate, "an other" is a valid candidate. |
| 15721.txt | "anyone" -> "any one" | |
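The short-split-word rule illustrated in the table above can be sketched as follows. This is a minimal illustration, not the code in SplitCandidates.java: the class and method names are hypothetical, and two details are assumptions, namely that a word counts as short when its length is at most the short split word length, and that a candidate is rejected once it contains the maximum number of short split words or more.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not the actual SplitCandidates.java code.
final class ShortSplitFilter {

    // Counts the words in a split candidate whose length is at most shortWordLength.
    // Assumption: "short" is inclusive (length <= shortWordLength).
    static long countShortWords(String candidate, int shortWordLength) {
        return Arrays.stream(candidate.trim().split("\\s+"))
                .filter(w -> w.length() <= shortWordLength)
                .count();
    }

    // Assumption: a candidate is rejected once it contains
    // maxShortWordNo or more short split words.
    static boolean isValidSplit(String candidate, int shortWordLength, int maxShortWordNo) {
        return countShortWords(candidate, shortWordLength) < maxShortWordNo;
    }
}
```

With a short split word length of 3 and a limit of 2, this sketch rejects "so me thing", "suggest i on", and "a not her", and accepts "an other", which matches the notes in the table. Rows without notes are not covered by this sketch.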
CS_CAN_RW_SPLIT_CAND_MIN_WC
| Src | Candidate | Notes |
|---|---|---|
| 17536.txt | "inversion" -> "in version" | where "in" is a unit, short for "inch" |
| 10136.txt | "everyday" -> "every day" | where "day" is a unit |
| Src | Candidate | Notes |
|---|---|---|
| 16661.txt | "human" -> "hu man" | where "Hu" is a proper noun |
| 16481.txt | "children" -> "child ren" | where "Ren" is a proper noun |
RankRealWordSplitByContext.java
CS_RW_SPLIT_CONTEXT_RADIUS
where:
CS_RANKER_RW_SPLIT_C_FAC
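As a rough illustration of what a context radius means, the sketch below collects up to a given number of tokens on each side of a target word. It is not taken from RankRealWordSplitByContext.java: the class and method names are hypothetical, and leaving the target word itself out of the context is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not the actual ranker code.
final class ContextWindow {

    // Returns up to `radius` tokens on each side of the token at `index`.
    // Assumption: the target token itself is excluded from the context.
    static List<String> around(List<String> tokens, int index, int radius) {
        int from = Math.max(0, index - radius);
        int to = Math.min(tokens.size(), index + radius + 1);
        List<String> context = new ArrayList<>(tokens.subList(from, index));
        context.addAll(tokens.subList(index + 1, to));
        return context;
    }
}
```

For the tokens "not be apart of this group" with "apart" as the target, a radius of 2 yields "not", "be", "of", "this"; a radius of 9 is clipped at the text boundaries and also includes "group".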
ProcRealWordSplit.java
II. Development Tests
Tested different real-word split confidence factors, context radii, and maximum split numbers on the revised gold standard (real-word errors included) from the training set, with the following setup:
CS_DETECTOR_RW_SPLIT_WORD_MIN_LENGTH=4
CS_CAN_RW_SHORT_SPLIT_WORD_LENGTH=3
CS_CAN_RW_MAX_SHORT_SPLIT_WORD_NO=2
| Function | Confidence Factor | Context Radius | Max. SplitNo | Raw data (TP / retrieved / relevant) | Performance (precision / recall / F1) |
|---|---|---|---|---|---|
| NW (1-to-1, Split, Merge) | N/A | N/A | 2 | 604 / 775 / 964 | 0.7794 / 0.6266 / 0.6947 |
| NW + RW_SPLIT | 0.00 | 2 | 5 | 605 / 789 / 964 | 0.7668 / 0.6276 / 0.6902 |
| NW + RW_SPLIT | 0.01 | 2 | 5 | 605 / 789 / 964 | 0.7668 / 0.6276 / 0.6902 |
| NW + RW_SPLIT | 0.02 | 2 | 5 | 605 / 790 / 964 | 0.7658 / 0.6276 / 0.6899 |
| NW + RW_SPLIT | 0.03 | 2 | 5 | 605 / 790 / 964 | 0.7658 / 0.6276 / 0.6899 |
| NW + RW_SPLIT | 0.05 | 2 | 5 | 605 / 791 / 964 | 0.7649 / 0.6276 / 0.6895 |
| NW + RW_SPLIT | 0.10 | 2 | 5 | 605 / 792 / 964 | 0.7639 / 0.6276 / 0.6891 |
| NW + RW_SPLIT | 0.20 | 2 | 5 | 605 / 792 / 964 | 0.7639 / 0.6276 / 0.6891 |
| NW + RW_SPLIT | 0.40 | 2 | 5 | 605 / 809 / 964 | 0.7478 / 0.6276 / 0.6825 |
| NW + RW_SPLIT | 0.60 | 2 | 5 | 607 / 835 / 964 | 0.7269 / 0.6297 / 0.6748 |
| NW + RW_SPLIT | 0.80 | 2 | 5 | 608 / 875 / 964 | 0.6949 / 0.6307 / 0.6612 |
| NW + RW_SPLIT | 0.01 | 9 | 0 | 604 / 775 / 964 | 0.7794 / 0.6266 / 0.6947 |
| NW + RW_SPLIT | 0.01 | 9 | 1 | 606 / 777 / 964 | 0.7799 / 0.6286 / 0.6962 |
| NW + RW_SPLIT | 0.01 | 9 | 2 | 606 / 777 / 964 | 0.7799 / 0.6286 / 0.6962 |
| NW + RW_SPLIT | 0.01 | 9 | 3 | 606 / 777 / 964 | 0.7799 / 0.6286 / 0.6962 |
| NW + RW_SPLIT | 0.01 | 9 | 4 | 606 / 777 / 964 | 0.7799 / 0.6286 / 0.6962 |
| NW + RW_SPLIT | 0.01 | 9 | 5 | 606 / 777 / 964 | 0.7799 / 0.6286 / 0.6962 |
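The performance values in the table above are consistent with reading the three raw counts as true positives, corrections made (retrieved), and gold-standard errors (relevant), and the three performance values as precision, recall, and F1. That reading is inferred from the arithmetic rather than stated on this page. A minimal sketch of the calculation, with hypothetical class and method names:

```java
// Hypothetical helper for illustration only; the names are not from the source.
final class SplitMetrics {

    // Precision: correct corrections over all corrections made.
    static double precision(int tp, int retrieved) {
        return (double) tp / retrieved;
    }

    // Recall: correct corrections over all errors in the gold standard.
    static double recall(int tp, int relevant) {
        return (double) tp / relevant;
    }

    // F1: harmonic mean of precision and recall,
    // which simplifies to 2 * TP / (retrieved + relevant).
    static double f1(int tp, int retrieved, int relevant) {
        return 2.0 * tp / (retrieved + relevant);
    }
}
```

For the baseline row (604, 775, 964) this gives 604/775 ≈ 0.7794, 604/964 ≈ 0.6266, and 1208/1739 ≈ 0.6947, matching the table.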
III. Observations from Training Set
| ID | Source | Original Words | Split Word |
|---|---|---|---|
| TP-1 | 10349 | along | a long |
| TP-2 | 10349 | along | a long |
| TP-3 | 13165 | iam | i am |
| TP-4 | 18669 | iam | i am |
| ID | Source | Original Words | Split Word |
|---|---|---|---|
| FP-1 | 10349 | along | a long |
| FP-2 | 10061 | however | how ever |
| FP-3 | 39 | without | with out |
| FP-4 | 39 | because | be cause |
| FP-5 | 41 | anywhere | any where |
| ID | Source | Original Words | Split Word |
|---|---|---|---|
| FN-3 | 13864 | apart | a part |
| Input | Output | Notes |
|---|---|---|
| apart | apart | |
| apart of | a part of | |
| apart of this | apart of this | |
| apart of this study | apart of this study | |
| apart of this group | a part of this group | Good |
| apart of this process | a part of this process | Good |
| apart of this effect | a part of this effect | Good |
| be apart | be apart | |
| be apart of | be a part of | Good |
| to be apart of | to be apart of | |
| not be apart of | not be a part of | Good |
| weeks apart of | weeks apart of | Good |
| weeks apart of 160 mg | weeks apart of 160 mg | Good |
| distance apart of | distance apart of | Good |
| distance apart of the | distance apart of the | Good |
| apart from | apart from | Good |