Real-word Merge
This page describes the processes for real-word merge detection and correction.
I. Processes
RealWordMergeDetector.java
MergeCandidates.java
Configuration variables: CS_CAN_RW_MAX_MERGE_NO, CS_CAN_RW_MERGE_WITH_HYPHEN, CS_CAN_RW_SPLIT_CAND_MIN_WC
Input text | Candidate | Notes |
---|---|---|
me at | meat | |
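The candidate step can be illustrated with a minimal sketch. It is not the code in MergeCandidates.java: the class name, method names, and the toy dictionary are hypothetical, and it assumes that the maximum merge number is the largest number of spaces removed per candidate and that the hyphen option also tries a hyphen-joined form. CS_CAN_RW_SPLIT_CAND_MIN_WC is not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MergeCandidateSketch {

    // Returns dictionary words formed by joining adjacent tokens.
    // maxMergeNo: assumed to be the maximum number of spaces removed per candidate.
    // withHyphen: assumed to also try joining the tokens with hyphens.
    public static List<String> candidates(List<String> tokens, int maxMergeNo,
                                          boolean withHyphen, Set<String> dictionary) {
        List<String> result = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            for (int mergeNo = 1; mergeNo <= maxMergeNo; mergeNo++) {
                int end = start + mergeNo;   // index of the last token in the span
                if (end >= tokens.size()) {
                    break;                   // span would run past the end of the text
                }
                List<String> span = tokens.subList(start, end + 1);
                addIfKnown(String.join("", span), dictionary, result);
                if (withHyphen) {
                    addIfKnown(String.join("-", span), dictionary, result);
                }
            }
        }
        return result;
    }

    // Keeps a merged form only if it is a known word and not already listed.
    private static void addIfKnown(String word, Set<String> dictionary, List<String> out) {
        if (dictionary.contains(word) && !out.contains(word)) {
            out.add(word);
        }
    }

    public static void main(String[] args) {
        // Toy dictionary for illustration only.
        Set<String> dictionary = new HashSet<>(Arrays.asList("meat", "onset"));
        List<String> tokens = Arrays.asList("let", "me", "at", "it");
        System.out.println(candidates(tokens, 2, false, dictionary)); // prints [meat]
    }
}
```

Because "me" and "at" are both valid words, only a dictionary lookup on the merged form can surface "meat" as a candidate; deciding whether to apply it is left to the ranking step.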
RankRealWordMergeByContext.java
Configuration variable: CS_RW_MERGE_CONTEXT_RADIUS
where the confidence factor is configurable (CS_RANKER_RW_MERGE_C_FAC)
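As a rough illustration of what a context radius means, the sketch below collects the tokens within a given radius on each side of the words being merged. It is an assumption about how CS_RW_MERGE_CONTEXT_RADIUS might be applied, not the logic of RankRealWordMergeByContext.java; the class and method names are hypothetical, and the context scoring and the confidence-factor comparison are not modeled because this page does not reproduce them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ContextWindowSketch {

    // Returns up to `radius` tokens on each side of the span [start, end]
    // (inclusive indexes), excluding the span itself.
    public static List<String> contextWindow(List<String> tokens, int start,
                                             int end, int radius) {
        List<String> context = new ArrayList<>();
        int from = Math.max(0, start - radius);              // clamp at text start
        int to = Math.min(tokens.size() - 1, end + radius);  // clamp at text end
        for (int i = from; i <= to; i++) {
            if (i < start || i > end) {
                context.add(tokens.get(i));
            }
        }
        return context;
    }

    public static void main(String[] args) {
        List<String> tokens =
            Arrays.asList("he", "was", "with", "out", "a", "plan", "today");
        // Span 2..3 covers "with out"; radius 2 keeps two tokens on each side.
        System.out.println(contextWindow(tokens, 2, 3, 2)); // prints [he, was, a, plan]
    }
}
```

The same window would be used to score both the original words and the merged candidate, so that the two scores are comparable.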
MergeCorrector.java
II. Development Tests
Different real-word merge settings (confidence factor, context radius, and maximum merge number) were tested against the revised gold standard from the training set, which includes real-word errors.
Function | Confidence Factor | Context Radius | Max. Merge No | TP | Retrieved | Relevant | Precision | Recall | F1 |
---|---|---|---|---|---|---|---|---|---|
NW (1-to-1, Split, Merge) | N/A | N/A | 2 | 604|775|964 | 0.7794|0.6266|0.6947 |
NW + RW_MERGE | 0.20 | 2 | 2 | 609|783|964 | 0.7778|0.6317|0.6972* |
NW + RW_MERGE | 0.25 | 2 | 2 | 610|785|964 | 0.7771|0.6328|0.6975 |
NW + RW_MERGE | 0.30 | 2 | 2 | 610|783|964 | 0.7791|0.6328|0.6983 |
NW + RW_MERGE | 0.33 | 2 | 2 | 610|785|964 | 0.7771|0.6328|0.6975 |
NW + RW_MERGE | 0.40 | 2 | 2 | 610|783|964 | 0.7791|0.6328|0.6983 |
NW + RW_MERGE | 0.50 | 2 | 2 | 610|786|964 | 0.7761|0.6328|0.6971 |
NW + RW_MERGE | 0.55 | 2 | 2 | 612|787|964 | 0.7776|0.6349|0.6990 |
NW + RW_MERGE | 0.60 | 2 | 2 | 613|786|964 | 0.7799|0.6359|0.7006 |
NW + RW_MERGE Fixed LC on W2V | 0.60 | 2 | 2 | 614|788|964 | 0.7792|0.6369|0.7009 |
NW + RW_MERGE | 0.70 | 2 | 2 | 613|790|964 | 0.7759|0.6359|0.6990 |
NW + RW_MERGE | 0.80 | 2 | 2 | 614|791|964 | 0.7762|0.6369|0.6997 |
NW + RW_MERGE | 0.90 | 2 | 2 | 614|792|964 | 0.7753|0.6369|0.6993 |
NW + RW_MERGE | 1.00 | 2 | 2 | 615|794|964 | 0.7746|0.6380|0.6997 |
NW + RW_MERGE | 0.60 | 1 | 2 | 610|783|964 | 0.7791|0.6328|0.6983 |
NW + RW_MERGE | 0.60 | 2 | 2 | 613|786|964 | 0.7799|0.6359|0.7006 |
NW + RW_MERGE | 0.60 | 3 | 2 | 611|784|964 | 0.7793|0.6338|0.6991 |
NW + RW_MERGE | 0.60 | 4 | 2 | 609|783|964 | 0.7778|0.6317|0.6972 |
NW + RW_MERGE | 0.60 | 5 | 2 | 608|782|964 | 0.7775|0.6307|0.6964 |
NW + RW_MERGE | 0.60 | 6 | 2 | 610|784|964 | 0.7781|0.6328|0.6979 |
NW + RW_MERGE | 0.60 | 7 | 2 | 607|779|964 | 0.7792|0.6297|0.6965 |
NW + RW_MERGE | 0.60 | 8 | 2 | 607|778|964 | 0.7802|0.6297|0.6969 |
NW + RW_MERGE | 0.60 | 9 | 2 | 607|779|964 | 0.7792|0.6297|0.6965 |
NW + RW_MERGE | 0.60 | 10 | 2 | 606|778|964 | 0.7789|0.6286|0.6958 |
NW + RW_MERGE | 0.60 | 2 | 1 | 613|786|964 | 0.7799|0.6359|0.7006 |
NW + RW_MERGE | 0.60 | 2 | 2 | 613|786|964 | 0.7799|0.6359|0.7006 |
NW + RW_MERGE | 0.60 | 2 | 3 | 613|786|964 | 0.7799|0.6359|0.7006 |
NW + RW_MERGE | 0.60 | 2 | 4 | 613|786|964 | 0.7799|0.6359|0.7006 |
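The performance values in the table are consistent with precision, recall, and F1 computed from the three raw counts, read as true positives, corrections returned by the system, and corrections in the gold standard. The sketch below reproduces the baseline row under that reading; the class and method names are hypothetical.

```java
import java.util.Locale;

public class MergeMetricsSketch {

    // Precision: share of returned corrections that are correct.
    public static double precision(int tp, int retrieved) {
        return (double) tp / retrieved;
    }

    // Recall: share of gold-standard corrections that were found.
    public static double recall(int tp, int relevant) {
        return (double) tp / relevant;
    }

    // F1 = 2PR / (P + R), which simplifies to 2 * TP / (retrieved + relevant).
    public static double f1(int tp, int retrieved, int relevant) {
        return 2.0 * tp / (retrieved + relevant);
    }

    // Formats the three metrics to four decimals, as in the table.
    public static String format(int tp, int retrieved, int relevant) {
        return String.format(Locale.ROOT, "%.4f|%.4f|%.4f",
            precision(tp, retrieved), recall(tp, relevant),
            f1(tp, retrieved, relevant));
    }

    public static void main(String[] args) {
        // Baseline row: NW (1-to-1, Split, Merge) with raw data 604|775|964.
        System.out.println(format(604, 775, 964)); // prints 0.7794|0.6266|0.6947
    }
}
```

Recomputing a row this way is a quick check for transcription errors, since any two rows with identical raw counts must have identical performance values.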
III. Observations from the Development Test Set
True positives (TP):
ID | Source | Original Words | Merged Word |
---|---|---|---|
TP-1 | 1 | on set | onset |
TP-2 | 39 | under developed | underdeveloped |
TP-3 | 39 | some what | somewhat |
TP-4 | 62 | life long | lifelong |
TP-5 | 11579 | anti psychotic | antipsychotic |
TP-6 | 13645 | non prescription | nonprescription |
TP-7 | 13864 | my self | myself |
TP-8 | 14296 | some one | someone |
TP-9 | 15759 | anti depresants | antidepressants |
TP-10 | 16974 | non drug | nondrug |
TP-11 | 18766 | some times | sometimes |
TP-12 | 12745 | extra corporeal | extracorporeal |
False positives (FP):
ID | Source | Original Words | Merged Word |
---|---|---|---|
FP-2 | 12261 | a while | awhile |
FP-3 | 16481 | me anyt | meant |
FP-5 | 18903 | over time | overtime |
FP-6 | 12630 | every day | everyday |
False negatives (FN):
ID | Source | Original Words | Merged Word |
---|---|---|---|
FN-1 | 24 | some thing | something |
FN-2 | 30 | there after | thereafter |
FN-3 | 33 | web site | website |
FN-4 | 74 | great full | grateful |
FN-5 | 74 | use full | useful |
FN-6 | 11225 | over read | overread |
FN-7 | 11435 | some time | sometime |
FN-8 | 11579 | with out | without |
FN-9 | 11579 | worth while | worthwhile |
FN-10 | 11757 | care taker | caretaker |
FN-11 | 12271 | in to | into |
FN-12 | 12520 | post menopause | postmenopause |
FN-13 | 12646 | what ever | whatever |
FN-14 | 12800 | through out | throughout |
FN-15 | 13287 | grand child | grandchild |
FN-16 | 16823 | after noon | afternoon |
FN-17 | 16829 | grand father | grandfather |
FN-18 | 19818 | boy friend | boyfriend |