CSpell

Real-word Split

This page describes the processes for real-word split detection and correction.

I. Processes

  • Detector:
    RealWordDetector.java (see the detector sketch after this list)
    • Not corrected previously in the CSpell pipeline.
    • real-word: valid word (in checkDic)
    • Not an exception (digit, punctuation, digit/punctuation, email, URL, empty string, measurement, Aa, proper noun)
    • focus token has word2Vec
    • focus token has length >= 4 (configurable: CS_DETECTOR_RW_SPLIT_WORD_MIN_LENGTH)
    • focus token: WC >= 200 (configurable: CS_DETECTOR_RW_SPLIT_WORD_MIN_WC)
  • Candidates:
    SplitCandidates.java (see the candidate generation and validation sketch after this list)
    • Get the splitSet from all possible splits, as in the non-word split process
      • SplitNo <= 2 (configurable: CS_CAN_RW_MAX_SPLIT_NO)
    • The split candidate is valid if it is a Lexicon multiword
    • If not a multiword, check if it is a valid split candidate:
      • Check short split words in split candidate
        • Number of short split words <= 2 (configurable: CS_CAN_RW_MAX_SHORT_SPLIT_WORD_NO)
          The total number of short split words must not exceed maxShortSplitWordNo (2).
        • Length of a short split word <= 3 (configurable: CS_CAN_RW_SHORT_SPLIT_WORD_LENGTH)
          By default, a short split word is a word with a length of 3 or less.
        • Heuristic rules are used to avoid splitting into many invalid short split words. For example:

          Src         Candidate                               Notes
          17942.txt   "something" -> "so me thing"            Both "so" and "me" are short split words; two of them means it is not a valid split.
          16369.txt   "suggestion" -> "suggest i on"          Both "i" and "on" are short split words; two of them means it is not a valid split.
          60.txt      "upon" -> "up on"
          30.txt      "soon" -> "so on"
          12353.txt   "another" -> "a not her", "an other"    "a not her" is an invalid candidate; "an other" is a valid candidate.
          15721.txt   "anyone" -> "any one"

        • keep: "away" -> "a way", "along" -> "a long", etc.
      • Check all split words (element words) in split candidate
        • in splitDic (Not pure Aa)
        • has context score (word2Vec)
        • WC > min. threshold (200 configurable: CS_CAN_RW_SPLIT_CAND_MIN_WC)
          example: ploytension -> poly tension
        • not unit
          examples:

          Src         Candidate                      Notes
          17536.txt   "inversion" -> "in version"    where "in" is a unit, short for "inch"
          10136.txt   "everyday" -> "every day"      where "day" is a unit
        • not proper noun
          examples:

          Src         Candidate                    Notes
          16661.txt   "human" -> "hu man"          where "Hu" is a proper noun
          16481.txt   "children" -> "child ren"    where "Ren" is a proper noun
  • Ranker:
    RankRealWordSplitByContext.java (see the ranking sketch after this list)
    • Rank split candidates by context scores
      • context radius = 2 (configurable, CS_RW_SPLIT_CONTEXT_RADIUS)
    • Validate the top-ranked candidate
      Compare the top-ranked candidate to the original token for correction:
      • orgScore < 0
        • & topScore > 0
        • & topScore < 0 & topScore * RealWord_Split_Confidence_Factor > orgScore
      • orgScore > 0
        • & topScore * RealWord_Split_Confidence_Factor > orgScore
      • orgScore = 0
        • No real-word split correction, because there is no word2Vec information for the original word; this case is filtered out in the detection

      where:

      • orgScore: the context score of the original token
      • topScore: the context score of the top-ranked candidate
      • RealWord_Split_Confidence_Factor = 0.01 (Configurable: CS_RANKER_RW_SPLIT_C_FAC)
    • TBD: the ranking could be improved if n-gram frequencies were available. Frequency with context would be a better ranking source for split candidates.
  • Corrector:
    ProcRealWordSplit.java (see the corrector sketch after this list)
    • FlatMap the split word (OneToOneSplitCorrector.AddOneToOneSplitCorrection)
    • Update process history to real-word-split
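
The sketches below are illustrative only: they are not the CSpell source code, and the class names, method signatures, and dictionary/word2Vec/word-count lookups are stand-ins. First, a minimal sketch of the detector checks:

  import java.util.Map;
  import java.util.Set;

  public class RealWordSplitDetectorSketch {
      // Defaults of the configuration keys listed above.
      static final int MIN_WORD_LENGTH = 4;  // CS_DETECTOR_RW_SPLIT_WORD_MIN_LENGTH
      static final int MIN_WORD_COUNT = 200; // CS_DETECTOR_RW_SPLIT_WORD_MIN_WC

      // True if the focus token is a real-word split detection case.
      static boolean isRealWordSplitCase(String token, boolean correctedBefore,
              Set<String> checkDic, Set<String> exceptions,
              Set<String> word2VecVocab, Map<String, Integer> wordCounts) {
          String key = token.toLowerCase();
          return !correctedBefore                      // not corrected previously in the pipeline
                  && checkDic.contains(key)            // real word: valid word in checkDic
                  && !exceptions.contains(key)         // not digit/punctuation/email/url/measurement/Aa/proper noun
                  && word2VecVocab.contains(key)       // focus token has word2Vec information
                  && token.length() >= MIN_WORD_LENGTH // length >= 4
                  && wordCounts.getOrDefault(key, 0) >= MIN_WORD_COUNT; // WC >= 200
      }
  }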
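
A minimal sketch of candidate generation and validation. The short-split-word check implements only the stated thresholds; the actual SplitCandidates.java also applies the additional heuristics shown in the examples above:

  import java.util.HashSet;
  import java.util.Map;
  import java.util.Set;

  public class RealWordSplitCandidatesSketch {
      static final int MAX_SPLIT_NO = 2;            // CS_CAN_RW_MAX_SPLIT_NO (maxSplitNo passed to getSplitSet)
      static final int MAX_SHORT_SPLIT_WORD_NO = 2; // CS_CAN_RW_MAX_SHORT_SPLIT_WORD_NO
      static final int SHORT_SPLIT_WORD_LENGTH = 3; // CS_CAN_RW_SHORT_SPLIT_WORD_LENGTH
      static final int SPLIT_CAND_MIN_WC = 200;     // CS_CAN_RW_SPLIT_CAND_MIN_WC

      // Generate all split candidates with up to maxSplitNo inserted spaces,
      // as in the non-word split process.
      static Set<String> getSplitSet(String word, int maxSplitNo) {
          Set<String> splitSet = new HashSet<>();
          if (maxSplitNo <= 0 || word.length() < 2) {
              return splitSet;
          }
          for (int i = 1; i < word.length(); i++) {
              String left = word.substring(0, i);
              String right = word.substring(i);
              splitSet.add(left + " " + right);                    // one split point
              for (String sub : getSplitSet(right, maxSplitNo - 1)) {
                  splitSet.add(left + " " + sub);                  // more split points to the right
              }
          }
          return splitSet;
      }

      // Validate one split candidate against the rules listed above.
      static boolean isValidSplitCandidate(String candidate, Set<String> multiwordDic,
              Set<String> splitDic, Set<String> word2VecVocab, Map<String, Integer> wordCounts,
              Set<String> unitDic, Set<String> properNounDic) {
          if (multiwordDic.contains(candidate)) {
              return true;                              // a Lexicon multiword is accepted directly
          }
          String[] splitWords = candidate.split(" ");
          // Limit the number of short split words (length <= 3),
          // e.g. to avoid "something" -> "so me thing".
          int shortSplitWordNo = 0;
          for (String splitWord : splitWords) {
              if (splitWord.length() <= SHORT_SPLIT_WORD_LENGTH) {
                  shortSplitWordNo++;
              }
          }
          if (shortSplitWordNo > MAX_SHORT_SPLIT_WORD_NO) {
              return false;
          }
          // Every split word must be a known, frequent word with context
          // information, and neither a unit nor a proper noun.
          for (String splitWord : splitWords) {
              String key = splitWord.toLowerCase();
              if (!splitDic.contains(key)                                     // in splitDic (not pure Aa)
                      || !word2VecVocab.contains(key)                         // has context score (word2Vec)
                      || wordCounts.getOrDefault(key, 0) <= SPLIT_CAND_MIN_WC // WC > min. threshold
                      || unitDic.contains(key)                                // not a unit ("in", "day")
                      || properNounDic.contains(key)) {                       // not a proper noun ("Hu", "Ren")
                  return false;
              }
          }
          return true;
      }
  }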
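
A minimal sketch of the ranker's validation rule; the word2Vec context scores (orgScore, topScore) are assumed to be computed elsewhere:

  public class RealWordSplitRankerSketch {
      // RealWord_Split_Confidence_Factor (CS_RANKER_RW_SPLIT_C_FAC)
      static final double SPLIT_CONFIDENCE_FACTOR = 0.01;

      // True if the top-ranked split candidate should replace the original token.
      static boolean useTopCandidate(double orgScore, double topScore) {
          if (orgScore < 0.0) {
              // the candidate scores positive, or both are negative and the
              // scaled candidate score is still better than the original score
              return topScore > 0.0
                      || (topScore < 0.0 && topScore * SPLIT_CONFIDENCE_FACTOR > orgScore);
          } else if (orgScore > 0.0) {
              // the scaled candidate score must beat the original score
              return topScore * SPLIT_CONFIDENCE_FACTOR > orgScore;
          }
          // orgScore = 0: no word2Vec information; filtered out in detection
          return false;
      }
  }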
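
Finally, a minimal sketch of the corrector's flat-map step; TokenObj here is a simplified stand-in, not the CSpell class:

  import java.util.ArrayList;
  import java.util.List;

  public class RealWordSplitCorrectorSketch {
      // Simplified stand-in for a token with its process history.
      static class TokenObj {
          final String tokenStr;
          final List<String> procHist = new ArrayList<>();
          TokenObj(String tokenStr) { this.tokenStr = tokenStr; }
      }

      // Flat-map one focus token (e.g. "along") into its split words ("a", "long"),
      // carrying over and updating the process history.
      static List<TokenObj> addSplitCorrection(TokenObj inToken, String splitStr) {
          List<TokenObj> outTokens = new ArrayList<>();
          for (String word : splitStr.split(" ")) {
              TokenObj outToken = new TokenObj(word);
              outToken.procHist.addAll(inToken.procHist);
              outToken.procHist.add("REAL-WORD-SPLIT"); // update process history to real-word-split
              outTokens.add(outToken);
          }
          return outTokens;
      }
  }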

II. Development Tests

Tested different real-word split confidence factors on the revised gold standard (with real-word errors included) from the training set, with the following setup:

  • CS_DETECTOR_RW_SPLIT_WORD_MIN_LENGTH=4
  • CS_CAN_RW_SHORT_SPLIT_WORD_LENGTH=3
  • CS_CAN_RW_MAX_SHORT_SPLIT_WORD_NO=2

Function                     Confidence Factor   Context Radius   Max. SplitNo   Raw data (TP|TP+FP|TP+FN)   Performance (Precision|Recall|F1)
NW (1-to-1, Split, Merge)    N/A                 N/A              2              604|775|964                 0.7794|0.6266|0.6947
NW + RW_SPLIT                0.00                2                5              605|789|964                 0.7668|0.6276|0.6902
NW + RW_SPLIT                0.01                2                5              605|789|964                 0.7668|0.6276|0.6902
NW + RW_SPLIT                0.02                2                5              605|790|964                 0.7658|0.6276|0.6899
NW + RW_SPLIT                0.03                2                5              605|790|964                 0.7658|0.6276|0.6899
NW + RW_SPLIT                0.05                2                5              605|791|964                 0.7649|0.6276|0.6895
NW + RW_SPLIT                0.10                2                5              605|792|964                 0.7639|0.6276|0.6891
NW + RW_SPLIT                0.20                2                5              605|792|964                 0.7639|0.6276|0.6891
NW + RW_SPLIT                0.40                2                5              605|809|964                 0.7478|0.6276|0.6825
NW + RW_SPLIT                0.60                2                5              607|835|964                 0.7269|0.6297|0.6748
NW + RW_SPLIT                0.80                2                5              608|875|964                 0.6949|0.6307|0.6612
NW + RW_SPLIT                0.01                9                0              604|775|964                 0.7794|0.6266|0.6947
NW + RW_SPLIT                0.01                9                1              606|777|964                 0.7799|0.6286|0.6962
NW + RW_SPLIT                0.01                9                2              606|777|964                 0.7799|0.6286|0.6962
NW + RW_SPLIT                0.01                9                3              606|777|964                 0.7799|0.6286|0.6962
NW + RW_SPLIT                0.01                9                4              606|777|964                 0.7799|0.6286|0.6962
NW + RW_SPLIT                0.01                9                5              606|777|964                 0.7799|0.6286|0.6962
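
The performance columns follow from the raw data columns; a small sketch of the computation for the first row, assuming the raw data are TP | TP+FP | TP+FN:

  public class SplitPerformanceSketch {
      public static void main(String[] args) {
          int tp = 604;      // true positives
          int tpFp = 775;    // true positives + false positives (corrections made)
          int tpFn = 964;    // true positives + false negatives (gold-standard total)
          double precision = (double) tp / tpFp;                        // 0.7794
          double recall = (double) tp / tpFn;                           // 0.6266
          double f1 = 2.0 * precision * recall / (precision + recall);  // 0.6947
          System.out.printf("%.4f|%.4f|%.4f%n", precision, recall, f1);
      }
  }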

  • A bigger confidence factor increases the [TP] and [FP]. A value of 0.01 seems to reach the best F1.
  • A bigger context radius first decreases the [TP] and [FP], then increases them; a value of 9 seems to reach the best F1.
    => Real-word split involves understanding the meaning of the text; the software needs more context for better precision.
  • The value of max. split No. does not seem to have much impact on F1. Use the empirical value of 2 as the default, since it is unlikely that a word merged from more than two words happens to be a real word. Using 2 (instead of a bigger number) saves running time and improves speed.

III. Observations from Training Set

  • [TP] real-word split:
    ID      Source    Original Word    Split Words
    TP-1    10349     along            a long
    TP-2    10349     along            a long
    TP-3    13165     iam              i am
    TP-4    18669     iam              i am
    • 10349.txt: "sounding in my ear every time for along time."
    • TP-3 and TP-4 are done in the ND splitter

  • [FP] real-word split:
    ID      Source    Original Word    Split Words
    FP-1    10349     along            a long
    FP-2    10061     however          how ever
    FP-3    39        without          with out
    FP-4    39        because          be cause
    FP-5    41        anywhere         any where

  • [FN] real-word split:
    ID      Source    Original Word    Split Words
    FN-3    13864     apart            a part
    • FN-3: The original input text is "... I donate my self to be apart of this study." The word2Vec model needs to be improved with a bigger corpus. This split case is very sensitive to context, as shown below:

      Input                    Output                     Notes
      apart                    apart
      apart of                 a part of
      apart of this            apart of this
      apart of this study      apart of this study
      apart of this group      a part of this group       Good
      apart of this process    a part of this process     Good
      apart of this effect     a part of this effect      Good
      be apart                 be apart
      be apart of              be a part of               Good
      to be apart of           to be apart of
      not be apart of          not be a part of           Good
      weeks apart of           weeks apart of             Good
      weeks apart of 160 mg    weeks apart of 160 mg      Good
      distance apart of        distance apart of          Good
      distance apart of the    distance apart of the      Good
      apart from               apart from                 Good