Real-word Split
This page describes the processes for real-word split detection and correction.
I. Processes
RealWordDetector.java
CS_DETECTOR_RW_SPLIT_WORD_MIN_LENGTH
CS_DETECTOR_RW_SPLIT_WORD_MIN_WC
SplitCandidates.java
CS_CAN_RW_MAX_SPLIT_NO
CS_CAN_RW_MAX_SHORT_SPLIT_WORD_NO
CS_CAN_RW_SHORT_SPLIT_WORD_LENGTH
Src | Candidate | Notes |
---|---|---|
17942.txt | "something" -> "so me thing" | "so" and "me" are both short split words; two short split words make the split invalid |
16369.txt | "suggestion" -> "suggest i on" | "i" and "on" are both short split words; two short split words make the split invalid |
60.txt | "upon" -> "up on" | |
30.txt | "soon" -> "so on" | |
12353.txt | "another" -> "a not her", "an other" | "a not her" is an invalid candidate; "an other" is a valid candidate |
15721.txt | "anyone" -> "any one" | |
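The candidate rules above can be illustrated with a minimal sketch. It is not the code in SplitCandidates.java: the class name, the toy dictionary, and the exact comparisons are assumptions. A split word is treated as short when its length is at most 3, and a candidate is kept only when it has fewer than 2 short split words. These thresholds were chosen because they agree with the notes in the table; "max split number" is read here as the maximum number of inserted spaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Illustrative only: names, thresholds and comparisons are assumptions.
final class RwSplitCandidateSketch {
    static final int MAX_SPLIT_NO = 2;            // cf. CS_CAN_RW_MAX_SPLIT_NO (assumed value)
    static final int SHORT_SPLIT_WORD_LENGTH = 3; // cf. CS_CAN_RW_SHORT_SPLIT_WORD_LENGTH
    static final int MAX_SHORT_SPLIT_WORD_NO = 2; // cf. CS_CAN_RW_MAX_SHORT_SPLIT_WORD_NO

    // Hypothetical toy dictionary, just large enough for the example.
    static final Set<String> TOY_DICTIONARY =
            Set.of("a", "an", "not", "her", "other", "another");

    // Enumerates every way to cut `word` into dictionary words using
    // between 1 and maxSplitNo inserted spaces.
    static List<String> generate(String word, Set<String> dictionary, int maxSplitNo) {
        List<String> out = new ArrayList<>();
        collect(word, 0, new ArrayList<>(), dictionary, maxSplitNo, out);
        return out;
    }

    private static void collect(String word, int start, List<String> pieces,
                                Set<String> dictionary, int maxSplitNo, List<String> out) {
        if (start == word.length()) {
            if (pieces.size() > 1) {
                out.add(String.join(" ", pieces)); // at least one split was made
            }
            return;
        }
        if (pieces.size() > maxSplitNo) {
            return; // all allowed split points are used but text remains
        }
        for (int end = start + 1; end <= word.length(); end++) {
            String piece = word.substring(start, end);
            if (dictionary.contains(piece)) {
                pieces.add(piece);
                collect(word, end, pieces, dictionary, maxSplitNo, out);
                pieces.remove(pieces.size() - 1);
            }
        }
    }

    // Assumed rule: a candidate passes when it has fewer than
    // MAX_SHORT_SPLIT_WORD_NO split words of length <= SHORT_SPLIT_WORD_LENGTH.
    static boolean hasAcceptableShortWordCount(String candidate) {
        long shortWords = Arrays.stream(candidate.split(" "))
                .filter(w -> w.length() <= SHORT_SPLIT_WORD_LENGTH)
                .count();
        return shortWords < MAX_SHORT_SPLIT_WORD_NO;
    }

    public static void main(String[] args) {
        for (String candidate : generate("another", TOY_DICTIONARY, MAX_SPLIT_NO)) {
            System.out.println(candidate + " -> " + hasAcceptableShortWordCount(candidate));
        }
        // prints:
        // a not her -> false
        // an other -> true
    }
}
```

Under these assumed thresholds the sketch also rejects "so me thing" and "suggest i on", as the notes in the table describe. It makes no claim about the rows whose notes are empty.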
CS_CAN_RW_SPLIT_CAND_MIN_WC
Src | Candidate | Notes |
---|---|---|
17536.txt | "inversion" -> "in version" | where "in" is a unit, short for "inch" |
10136.txt | "everyday" -> "every day" | where "day" is a unit |
Src | Candidate | Notes |
---|---|---|
16661.txt | "human" -> "hu man" | where "Hu" is a proper noun |
16481.txt | "children" -> "child ren" | where "Ren" is a proper noun |
RankRealWordSplitByContext.java
CS_RW_SPLIT_CONTEXT_RADIUS
where:
CS_RANKER_RW_SPLIT_C_FAC
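The context radius setting suggests that the ranker looks at a window of tokens on either side of the target word. The following is a minimal sketch of such a window, not the code in RankRealWordSplitByContext.java: the class and method names are hypothetical, and excluding the target token from the window is an assumption. The scoring itself and the role of the confidence factor are not sketched here.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: names and the exclusion of the target token are assumptions.
final class ContextWindowSketch {
    // Returns up to `radius` tokens on each side of the target, target excluded.
    static List<String> contextWindow(List<String> tokens, int targetIndex, int radius) {
        if (targetIndex < 0 || targetIndex >= tokens.size() || radius < 0) {
            throw new IllegalArgumentException("target index or radius out of range");
        }
        int from = Math.max(0, targetIndex - radius);
        int to = Math.min(tokens.size(), targetIndex + radius + 1);
        List<String> window = new ArrayList<>(tokens.subList(from, targetIndex));
        window.addAll(tokens.subList(targetIndex + 1, to));
        return window;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("not", "be", "apart", "of", "this", "group");
        System.out.println(contextWindow(tokens, 2, 2)); // [not, be, of, this]
        System.out.println(contextWindow(tokens, 2, 9)); // [not, be, of, this, group]
    }
}
```

A radius larger than the available text simply yields a window clipped at the text boundaries.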
ProcRealWordSplit.java
II. Development Tests
Different real-word split confidence factors, context radii, and maximum split numbers were tested on the training set, using the revised gold standard (real-word errors included), with the following setup:
CS_DETECTOR_RW_SPLIT_WORD_MIN_LENGTH=4
CS_CAN_RW_SHORT_SPLIT_WORD_LENGTH=3
CS_CAN_RW_MAX_SHORT_SPLIT_WORD_NO=2
Function | Confidence Factor | Context Radius | Max. SplitNo | TP | Retrieved | Relevant | Precision | Recall | F1 |
---|---|---|---|---|---|---|---|---|---|
NW (1-to-1, Split, Merge) | N/A | N/A | 2 | 604 | 775 | 964 | 0.7794 | 0.6266 | 0.6947 |
NW + RW_SPLIT | 0.00 | 2 | 5 | 605 | 789 | 964 | 0.7668 | 0.6276 | 0.6902 |
NW + RW_SPLIT | 0.01 | 2 | 5 | 605 | 789 | 964 | 0.7668 | 0.6276 | 0.6902 |
NW + RW_SPLIT | 0.02 | 2 | 5 | 605 | 790 | 964 | 0.7658 | 0.6276 | 0.6899 |
NW + RW_SPLIT | 0.03 | 2 | 5 | 605 | 790 | 964 | 0.7658 | 0.6276 | 0.6899 |
NW + RW_SPLIT | 0.05 | 2 | 5 | 605 | 791 | 964 | 0.7649 | 0.6276 | 0.6895 |
NW + RW_SPLIT | 0.10 | 2 | 5 | 605 | 792 | 964 | 0.7639 | 0.6276 | 0.6891 |
NW + RW_SPLIT | 0.20 | 2 | 5 | 605 | 792 | 964 | 0.7639 | 0.6276 | 0.6891 |
NW + RW_SPLIT | 0.40 | 2 | 5 | 605 | 809 | 964 | 0.7478 | 0.6276 | 0.6825 |
NW + RW_SPLIT | 0.60 | 2 | 5 | 607 | 835 | 964 | 0.7269 | 0.6297 | 0.6748 |
NW + RW_SPLIT | 0.80 | 2 | 5 | 608 | 875 | 964 | 0.6949 | 0.6307 | 0.6612 |
NW + RW_SPLIT | 0.01 | 9 | 0 | 604 | 775 | 964 | 0.7794 | 0.6266 | 0.6947 |
NW + RW_SPLIT | 0.01 | 9 | 1 | 606 | 777 | 964 | 0.7799 | 0.6286 | 0.6962 |
NW + RW_SPLIT | 0.01 | 9 | 2 | 606 | 777 | 964 | 0.7799 | 0.6286 | 0.6962 |
NW + RW_SPLIT | 0.01 | 9 | 3 | 606 | 777 | 964 | 0.7799 | 0.6286 | 0.6962 |
NW + RW_SPLIT | 0.01 | 9 | 4 | 606 | 777 | 964 | 0.7799 | 0.6286 | 0.6962 |
NW + RW_SPLIT | 0.01 | 9 | 5 | 606 | 777 | 964 | 0.7799 | 0.6286 | 0.6962 |
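The three raw counts and the three performance values in each row above are consistent with reading the counts as true positives, retrieved, and relevant, and the values as precision, recall, and F1. The sketch below recomputes the performance values from the raw counts so that any row can be checked; the class and method names are hypothetical and are not part of the project.

```java
import java.util.Locale;

// Illustrative only: recomputes precision, recall and F1 from raw counts.
final class PrfSketch {
    static double precision(int truePositives, int retrieved) {
        return (double) truePositives / retrieved;
    }

    static double recall(int truePositives, int relevant) {
        return (double) truePositives / relevant;
    }

    // Harmonic mean of precision and recall, simplified to 2*TP / (retrieved + relevant).
    static double f1(int truePositives, int retrieved, int relevant) {
        return 2.0 * truePositives / (retrieved + relevant);
    }

    // Formats the three values to four decimals, separated by '|'.
    static String format(int truePositives, int retrieved, int relevant) {
        return String.format(Locale.ROOT, "%.4f|%.4f|%.4f",
                precision(truePositives, retrieved),
                recall(truePositives, relevant),
                f1(truePositives, retrieved, relevant));
    }

    public static void main(String[] args) {
        System.out.println(format(604, 775, 964)); // 0.7794|0.6266|0.6947
        System.out.println(format(606, 777, 964)); // 0.7799|0.6286|0.6962
    }
}
```

Read this way, raising the confidence factor from 0.00 to 0.80 at context radius 2 adds at most 3 true positives (605 to 608) while retrieved instances grow from 789 to 875, which is why precision and F1 fall as the factor increases.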
III. Observations from Training Set
ID | Source | Original Words | Split Word |
---|---|---|---|
TP-1 | 10349 | along | a long |
TP-2 | 10349 | along | a long |
TP-3 | 13165 | iam | i am |
TP-4 | 18669 | iam | i am |
ID | Source | Original Words | Split Word |
---|---|---|---|
FP-1 | 10349 | along | a long |
FP-2 | 10061 | however | how ever |
FP-3 | 39 | without | with out |
FP-4 | 39 | because | be cause |
FP-5 | 41 | anywhere | any where |
ID | Source | Original Words | Split Word |
---|---|---|---|
FN-3 | 13864 | apart | a part |
Input | Output | Notes |
---|---|---|
apart | apart | |
apart of | a part of | |
apart of this | apart of this | |
apart of this study | apart of this study | |
apart of this group | a part of this group | Good |
apart of this process | a part of this process | Good |
apart of this effect | a part of this effect | Good |
be apart | be apart | |
be apart of | be a part of | Good |
to be apart of | to be apart of | |
not be apart of | not be a part of | Good |
weeks apart of | weeks apart of | Good |
weeks apart of 160 mg | weeks apart of 160 mg | Good |
distance apart of | distance apart of | Good |
distance apart of the | distance apart of the | Good |
apart from | apart from | Good |