LexBuild

LexBuild - Check Spelling Variants

Spelling variants entered in LexBuild may contain typos, which result in:

  • erroneous spelling variants in the lexical record
  • erroneous inflectional variants generated from them

This feature identifies possible errors by comparing the edit distance between:

  • the base form and its spelling variants
  • the spelling variants themselves

In our 2006 LEXICON, the distribution of edit distance between base forms and spelling variants is as follows:

Edit Distance   Percent   Cumulative Percent   Notes
-------------   -------   ------------------   -----------------
0                 4.87%     4.87%              Different cases
1                75.38%    80.26%
2                15.15%    95.41%              Used to cover 95%
3                 3.65%    99.07%
4                 0.68%    99.75%
5+                0.25%   100.00%

Based on the table above, LexBuild uses an edit distance threshold of 2 for sending a warning message, which gives LB users a second chance to verify the spelling variants they entered. A threshold of 2 covers more than 95% of correct spelling variants. LB users can ignore the warning message if the input is correct (the remaining 5%).
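
The check described above can be sketched roughly as follows. This is a minimal illustration, not LexBuild's actual code: it assumes the plain Levenshtein distance, assumes case is ignored (suggested by the "Different cases" note for distance 0), and assumes a warning is raised when the distance is greater than 2. The names edit_distance, check_spelling_variants and WARN_THRESHOLD are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Assumption: warn only when the distance exceeds the threshold of 2.
WARN_THRESHOLD = 2


def check_spelling_variants(base, variants):
    """Compare the base form with each variant, and the variants with
    each other; return (form, other_form, distance) for suspicious pairs."""
    forms = [base] + list(variants)
    warnings = []
    for i in range(len(forms)):
        for j in range(i + 1, len(forms)):
            # Assumption: comparison is case-insensitive.
            d = edit_distance(forms[i].lower(), forms[j].lower())
            if d > WARN_THRESHOLD:
                warnings.append((forms[i], forms[j], d))
    return warnings


# "kolours" is a made-up typo used only for illustration.
suspicious = check_spelling_variants("color", ["colour", "kolours"])
```

Here only the pair ("color", "kolours") is flagged, with a distance of 3. The pairs involving "colour" stay at or below the threshold.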

Program:

shell> $LEXBUILD_DIR/Tools/PostProcessing/AnalyzeEditDistance

Inputs:

$LEXBUILD_DIR/data/WebApp/Outputs/Lexicon/LEXICON

Outputs:

  • $LEXBUILD_DIR/data/WebApp/Outputs/PostProc/editDistance.data

    Format:

    Edit Distance|EUI|Base|Spelling Vars|Category

  • screen output: duplicated spelling variants
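
A record in this pipe-delimited format could be read along the following lines. This is only a sketch: the EditDistanceRecord type, the parse_record helper, and the sample line are hypothetical, and the Spelling Vars field is kept as a raw string because its internal layout is not described here.

```python
from typing import NamedTuple


class EditDistanceRecord(NamedTuple):
    edit_distance: int
    eui: str
    base: str
    spelling_vars: str  # kept as-is; internal layout is not specified
    category: str


def parse_record(line: str) -> EditDistanceRecord:
    """Split one line of editDistance.data into its five fields."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    dist, eui, base, spelling_vars, category = fields
    return EditDistanceRecord(int(dist), eui, base, spelling_vars, category)


# Made-up sample line, not taken from the real LEXICON.
record = parse_record("1|E0000001|color|colour|noun\n")
```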

Notes:

editDistance.data is sorted (shell> sort -r editDistance.data > editDistance.sort.data) for further analysis. Because sort -r compares lines as text rather than as numbers, a manual sort might be needed for records with an edit distance of 10 or more.
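
One way to avoid the manual step would be to sort on the numeric value of the first field. The sketch below assumes the pipe-delimited format given under Outputs. The sort_by_distance helper and the sample lines are made up for illustration.

```python
def sort_by_distance(lines):
    """Sort records by the leading edit distance field, largest first,
    comparing the field as a number rather than as text."""
    return sorted(lines, key=lambda line: int(line.split("|", 1)[0]),
                  reverse=True)


# Made-up records, not taken from the real LEXICON.
sample_lines = [
    "2|E0000002|base two|variant two|noun",
    "12|E0000003|base twelve|variant twelve|noun",
    "9|E0000004|base nine|variant nine|noun",
]

numeric_order = [l.split("|")[0] for l in sort_by_distance(sample_lines)]
text_order = [l.split("|")[0] for l in sorted(sample_lines, reverse=True)]
```

With these samples, numeric_order is 12, 9, 2, while the text-based order is 9, 2, 12, which shows why records with two-digit distances end up misplaced. On the command line, something like sort -t'|' -k1,1nr should give the same numeric ordering.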