Skip to content

Flowchart for probabilistic OCR correction.

by on October 14, 2012

Optically transcribed documents have errors, and while we can improve OCR, I suspect it’s going to be a long time before it’s totally error-free.

Once you get to the later nineteenth century, OCR errors may be random enough that they don’t constitute a huge problem for data mining. But before about 1820 they’re shaped by period typography, and are therefore distributed unevenly enough to pose a problem. If you topic-model uncorrected 19c corpora, you get topics containing, e.g., all words with a medial or initial s.

So I’ve been developing a workflow for OCR correction. My goal is not to correct everything, but to correct the most common kinds of errors, especially ones that affect relatively common words (say, the top 50,000 or so).

I’ve borrowed the probabilistic (roughly Bayesian) approach adopted by Lasko and Hauser in this paper for the National Library of Medicine.

    1. When you hit a token not in the dictionary, search for fuzzy matches in the dictionary.
    2. Find the “edit distance” separating the token from each fuzzy match. The edit distance is normally the number of characters that have to be added, deleted, or substituted in order to turn one word into another. But in correcting OCR, some substitutions, like s->f or ct -> d, are much more likely than others, so I weight substitutions by their observed probability in the corpus. The program can learn this as it goes along. The probability of a given correction should also be weighted by the frequency of the “correct” word. E.g., a correction to “than” is more likely than one to “thane.”
    3. Decide whether the closest match is close enough to risk making the correction. You have to be cautious correcting short tokens (since there’s not much evidence to go on).

These rules were embodied in a Python script I wrote last year, which worked well enough to make OCR “artefacts” vanish in my analysis of 4,000 volumes. But now that I’m confronting a corpus of 500k volumes, Python is a bit too slow — so I’m rewriting the core of the process in Java. Also, I’ve realized that the scale of the corpus itself gives me certain new kinds of leverage. For instance, you can record the average OCR quality of documents where a given token-type appears, and use that information probabilistically to help decide whether the type is an error needing correction.

I’m now about halfway through designing a workflow to do probabilistic correction on 500,000 HathiTrust documents, and thought I would share my workflow in case people have suggestions or critiques. Click to enlarge.

Two areas where I know people are going to have questions:

1) Why use Titlecase to identify proper nouns? Why not do NLP? The short answer is, I’m mortal and lazy. The longer answer is that there’s a circular problem with NLP and dirty data. With 18c documents, I suspect I may need to do initial cleaning before NLP becomes practical. And since I’m correcting 18/19c books as one gigantic corpus, the 18c habit of titlecasing common nouns isn’t too disruptive. However, this is a place where I would welcome help, if someone happens to have a huge gazetteer of known proper nouns.

2) Why have separate precision and recall dictionaries? In part, this helps avoid improbable corrections. “\vholes” could be an error for the Dickens character Mr. Vholes. But it’s much more likely to be an error for “wholes.” So, while I want to recognize “Vholes” as a valid word, I don’t really want to consider it as a possible “fuzzy match.” You could achieve the same thing with probabilistic weighting, but using a relatively small “fuzzy matching” dictionary also significantly boosts speed, and last year that was an issue. It may be less of an issue now that I’m working in concurrent Java, and running the script on a cluster.

From → All posts, Illinois

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: