Real-word Split
This page describes the processes for real-word split detection and correction.
I. Processes
RealWordDetector.java
CS_DETECTOR_RW_SPLIT_WORD_MIN_LENGTH
CS_DETECTOR_RW_SPLIT_WORD_MIN_WC
SplitCandidates.java
CS_CAN_RW_MAX_SPLIT_NO
CS_CAN_RW_MAX_SHORT_SPLIT_WORD_NO
CS_CAN_RW_SHORT_SPLIT_WORD_LENGTH
Src | Candidate | Notes |
---|---|---|
17942.txt | "something" -> "so me thing" | Both "so" and "me" are short split words; two short split words make this an invalid split. |
16369.txt | "suggestion" -> "suggest i on" | Both "i" and "on" are short split words; two short split words make this an invalid split. |
60.txt | "upon" -> "up on" | |
30.txt | "soon" -> "so on" | |
12353.txt | "another" -> "a not her", "an other" | "a not her" is an invalid candidate, "an other" is a valid candidate. |
15721.txt | "anyone" -> "any one" | |
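The short-split-word rule illustrated in the table can be sketched as follows. This is a minimal illustration, not the CSpell source: the class and method names are hypothetical, and two details are assumptions, namely that a split word is "short" when its length is at most CS_CAN_RW_SHORT_SPLIT_WORD_LENGTH, and that a candidate is rejected once it contains CS_CAN_RW_MAX_SHORT_SPLIT_WORD_NO or more short words. The values 3 and 2 are the ones listed in the development test setup.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a short-split-word filter; not the CSpell implementation. */
final class ShortSplitWordFilter {

    /** Counts split words with length <= shortWordLength (assumed meaning of "short"). */
    static long countShortWords(List<String> splitWords, int shortWordLength) {
        return splitWords.stream()
                .filter(w -> w.length() <= shortWordLength)
                .count();
    }

    /** Assumed rule: reject a candidate with maxShortWordNo or more short split words. */
    static boolean isValidSplit(String candidate, int shortWordLength, int maxShortWordNo) {
        List<String> splitWords = Arrays.asList(candidate.trim().split("\\s+"));
        return countShortWords(splitWords, shortWordLength) < maxShortWordNo;
    }

    public static void main(String[] args) {
        int shortWordLength = 3; // CS_CAN_RW_SHORT_SPLIT_WORD_LENGTH in the development setup
        int maxShortWordNo = 2;  // CS_CAN_RW_MAX_SHORT_SPLIT_WORD_NO in the development setup
        String[] candidates = {"so me thing", "suggest i on", "a not her", "an other"};
        for (String candidate : candidates) {
            System.out.println(candidate + " -> valid: "
                    + isValidSplit(candidate, shortWordLength, maxShortWordNo));
        }
    }
}
```

Under these assumptions the sketch agrees with the rows that carry a note: "so me thing", "suggest i on", and "a not her" are rejected, while "an other" is kept. Rows without a note are not covered by this check.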
CS_CAN_RW_SPLIT_CAND_MIN_WC
Src | Candidate | Notes |
---|---|---|
17536.txt | "inversion" -> "in version" | where "in" is a unit, short for "inch" |
10136.txt | "everyday" -> "every day" | where "day" is a unit |
Src | Candidate | Notes |
---|---|---|
16661.txt | "human" -> "hu man" | where "Hu" is a proper noun |
16481.txt | "children" -> "child ren" | where "Ren" is a proper noun |
RankRealWordSplitByContext.java
CS_RW_SPLIT_CONTEXT_RADIUS
where:
CS_RANKER_RW_SPLIT_C_FAC
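To make the role of a context radius concrete, here is a minimal sketch of collecting the tokens around a target word. It is an illustration only: the class and method names are hypothetical, and it assumes the radius counts tokens on each side of the target, which this page does not spell out. It does not show how CSpell scores the context.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a context window; not the CSpell implementation. */
final class ContextWindow {

    /** Returns up to radius tokens on each side of targetIndex, excluding the target itself. */
    static List<String> contextOf(List<String> tokens, int targetIndex, int radius) {
        List<String> context = new ArrayList<>();
        int start = Math.max(0, targetIndex - radius);
        int end = Math.min(tokens.size() - 1, targetIndex + radius);
        for (int i = start; i <= end; i++) {
            if (i != targetIndex) {
                context.add(tokens.get(i));
            }
        }
        return context;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("not", "be", "apart", "of", "this", "group");
        // Target is "apart" at index 2; a radius of 2 is one of the values used in the development tests.
        System.out.println(contextOf(tokens, 2, 2));
    }
}
```

With a radius of 2 the target "apart" sees "not", "be", "of", and "this"; a larger radius such as 9 would take in the whole of this short phrase.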
ProcRealWordSplit.java
II. Development Tests
Different real-word split settings (confidence factor, context radius, and maximum split number) were tested on the revised gold standard from the training set, which includes real-word errors, with the following setup:
CS_DETECTOR_RW_SPLIT_WORD_MIN_LENGTH=4
CS_CAN_RW_SHORT_SPLIT_WORD_LENGTH=3
CS_CAN_RW_MAX_SHORT_SPLIT_WORD_NO=2
Function | Confidence Factor | Context Radius | Max. Split No. | TP | Retrieved | Relevant | Precision | Recall | F1 |
---|---|---|---|---|---|---|---|---|---|
NW (1-to-1, Split, Merge) | N/A | N/A | 2 | 604 | 775 | 964 | 0.7794 | 0.6266 | 0.6947 |
NW + RW_SPLIT | 0.00 | 2 | 5 | 605 | 789 | 964 | 0.7668 | 0.6276 | 0.6902 |
NW + RW_SPLIT | 0.01 | 2 | 5 | 605 | 789 | 964 | 0.7668 | 0.6276 | 0.6902 |
NW + RW_SPLIT | 0.02 | 2 | 5 | 605 | 790 | 964 | 0.7658 | 0.6276 | 0.6899 |
NW + RW_SPLIT | 0.03 | 2 | 5 | 605 | 790 | 964 | 0.7658 | 0.6276 | 0.6899 |
NW + RW_SPLIT | 0.05 | 2 | 5 | 605 | 791 | 964 | 0.7649 | 0.6276 | 0.6895 |
NW + RW_SPLIT | 0.10 | 2 | 5 | 605 | 792 | 964 | 0.7639 | 0.6276 | 0.6891 |
NW + RW_SPLIT | 0.20 | 2 | 5 | 605 | 792 | 964 | 0.7639 | 0.6276 | 0.6891 |
NW + RW_SPLIT | 0.40 | 2 | 5 | 605 | 809 | 964 | 0.7478 | 0.6276 | 0.6825 |
NW + RW_SPLIT | 0.60 | 2 | 5 | 607 | 835 | 964 | 0.7269 | 0.6297 | 0.6748 |
NW + RW_SPLIT | 0.80 | 2 | 5 | 608 | 875 | 964 | 0.6949 | 0.6307 | 0.6612 |
NW + RW_SPLIT | 0.01 | 9 | 0 | 604 | 775 | 964 | 0.7794 | 0.6266 | 0.6947 |
NW + RW_SPLIT | 0.01 | 9 | 1 | 606 | 777 | 964 | 0.7799 | 0.6286 | 0.6962 |
NW + RW_SPLIT | 0.01 | 9 | 2 | 606 | 777 | 964 | 0.7799 | 0.6286 | 0.6962 |
NW + RW_SPLIT | 0.01 | 9 | 3 | 606 | 777 | 964 | 0.7799 | 0.6286 | 0.6962 |
NW + RW_SPLIT | 0.01 | 9 | 4 | 606 | 777 | 964 | 0.7799 | 0.6286 | 0.6962 |
NW + RW_SPLIT | 0.01 | 9 | 5 | 606 | 777 | 964 | 0.7799 | 0.6286 | 0.6962 |
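The performance figures in the table can be recomputed from the three raw counts in each row. The sketch below assumes those counts are, in order, true positives, corrections returned, and errors in the gold standard; the page does not label them, but the arithmetic matches under that reading. The class and method names are hypothetical.

```java
import java.util.Locale;

/** Hypothetical sketch: recompute precision, recall, and F1 from raw counts. */
final class SplitMetrics {

    static double precision(int truePositives, int retrieved) {
        return (double) truePositives / retrieved;
    }

    static double recall(int truePositives, int relevant) {
        return (double) truePositives / relevant;
    }

    /** F1 is the harmonic mean of precision and recall. */
    static double f1(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    /** Formats the three scores to four decimal places, as in the table. */
    static String report(int truePositives, int retrieved, int relevant) {
        double p = precision(truePositives, retrieved);
        double r = recall(truePositives, relevant);
        return String.format(Locale.ROOT, "P=%.4f R=%.4f F1=%.4f", p, r, f1(p, r));
    }

    public static void main(String[] args) {
        System.out.println(report(604, 775, 964)); // baseline NW row
        System.out.println(report(606, 777, 964)); // best NW + RW_SPLIT row
    }
}
```

For the baseline row (604, 775, 964) this gives P=0.7794 R=0.6266 F1=0.6947, and for the best real-word split row (606, 777, 964) it gives P=0.7799 R=0.6286 F1=0.6962, matching the table.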
III. Observations from Training Set
True positives (TP):
ID | Source | Original Word | Split Word |
---|---|---|---|
TP-1 | 10349 | along | a long |
TP-2 | 10349 | along | a long |
TP-3 | 13165 | iam | i am |
TP-4 | 18669 | iam | i am |
False positives (FP):
ID | Source | Original Word | Split Word |
---|---|---|---|
FP-1 | 10349 | along | a long |
FP-2 | 10061 | however | how ever |
FP-3 | 39 | without | with out |
FP-4 | 39 | because | be cause |
FP-5 | 41 | anywhere | any where |
False negatives (FN):
ID | Source | Original Word | Split Word |
---|---|---|---|
FN-3 | 13864 | apart | a part |
Input | Output | Notes |
---|---|---|
apart | apart | |
apart of | a part of | |
apart of this | apart of this | |
apart of this study | apart of this study | |
apart of this group | a part of this group | Good |
apart of this process | a part of this process | Good |
apart of this effect | a part of this effect | Good |
be apart | be apart | |
be apart of | be a part of | Good |
to be apart of | to be apart of | |
not be apart of | not be a part of | Good |
weeks apart of | weeks apart of | Good |
weeks apart of 160 mg | weeks apart of 160 mg | Good |
distance apart of | distance apart of | Good |
distance apart of the | distance apart of the | Good |
apart from | apart from | Good |