Approximate String Matching Algorithms for Limited-Vocabulary OCR Output Correction
Five methods for matching words misrecognized by optical character recognition (OCR) to their most likely counterparts in a reference dictionary were tested on data from the archives of the National Library of Medicine. The five methods were an adaptation of the cross-correlation algorithm, the generic edit distance algorithm, the edit distance algorithm with a probabilistic substitution matrix, Bayesian analysis, and Bayesian analysis on an actively thinned reference dictionary. All five were implemented and their accuracy rates compared. Of the five, the Bayesian algorithm produced the most correct matches (87%) and had the added advantage of producing scores with a useful, practical interpretation.
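As an illustration of the simplest of these approaches, the generic edit distance method can be sketched as follows. This is a minimal sketch, not the implementation evaluated in the study: it assumes unit costs for insertion, deletion, and substitution, and the function names `edit_distance` and `best_match`, as well as the sample words, are hypothetical.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs (an assumption of this sketch)."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            substitution_cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,                      # delete ca from a
                curr[j - 1] + 1,                  # insert cb into a
                prev[j - 1] + substitution_cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def best_match(word: str, dictionary: Iterable[str]) -> str:
    """Return the dictionary entry closest to word (first entry wins ties)."""
    return min(dictionary, key=lambda entry: edit_distance(word, entry))


if __name__ == "__main__":
    # Hypothetical OCR error: "m" read as "rn".
    print(best_match("rnedicine", ["medicine", "medical", "library"]))
```

A variant with a probabilistic substitution matrix would, under the same assumptions, replace the fixed `substitution_cost` with a cost looked up per character pair, so that confusions typical of OCR are penalized less than unlikely ones.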