Flowchart for probabilistic OCR correction.

by tedunderwood on October 14, 2012

Optically transcribed documents have errors, and while we can improve OCR, I suspect it’s going to be a long time before it’s totally error-free.

Once you get to the later nineteenth century, OCR errors may be random enough that they don’t constitute a huge problem for data mining. But before about 1820 they’re shaped by period typography, and are therefore distributed unevenly enough to pose a problem. If you topic-model uncorrected 19c corpora, you get topics containing, e.g., all words with a medial or initial s.

So I’ve been developing a workflow for OCR correction. My goal is not to correct everything, but to correct the most common kinds of errors, especially ones that affect relatively common words (say, the top 50,000 or so).

I’ve borrowed the probabilistic (roughly Bayesian) approach adopted by Lasko and Hauser in this paper for the National Library of Medicine.

These rules were embodied in a Python script I wrote last year, which worked well enough to make OCR “artefacts” vanish in my analysis of 4,000 volumes. But now that I’m confronting a corpus of 500k volumes, Python is a bit too slow — so I’m rewriting the core of the process in Java. Also, I’ve realized that the scale of the corpus itself gives me certain new kinds of leverage. For instance, you can record the average OCR quality of documents where a given token-type appears, and use that information probabilistically to help decide whether the type is an error needing correction.

I’m now about halfway through designing a workflow to do probabilistic correction on 500,000 HathiTrust documents, and thought I would share my workflow in case people have suggestions or critiques. Click to enlarge.

Two areas where I know people are going to have questions:

1) Why use Titlecase to identify proper nouns? Why not do NLP? The short answer is, I’m mortal and lazy. The longer answer is that there’s a circular problem with NLP and dirty data. With 18c documents, I suspect I may need to do initial cleaning before NLP becomes practical. And since I’m correcting 18/19c books as one gigantic corpus, the 18c habit of titlecasing common nouns isn’t too disruptive. However, this is a place where I would welcome help, if someone happens to have a huge gazetteer of known proper nouns.

2) Why have separate precision and recall dictionaries? In part, this helps avoid improbable corrections. “\vholes” could be an error for the Dickens character Mr. Vholes. But it’s much more likely to be an error for “wholes.” So, while I want to recognize “Vholes” as a valid word, I don’t really want to consider it as a possible “fuzzy match.” You could achieve the same thing with probabilistic weighting, but using a relatively small “fuzzy matching” dictionary also significantly boosts speed, and last year that was an issue. It may be less of an issue now that I’m working in concurrent Java, and running the script on a cluster.

From → All posts, Illinois

One Comment

Trackbacks & Pingbacks

Experimenting with the imperfect: ECCO & OCR – Linguistic DNA

Flowchart for probabilistic OCR correction.

Trackbacks & Pingbacks

Leave a comment Cancel reply

Recent Posts

Archives

Categories

Meta

Participants’ blogs

Flowchart for probabilistic OCR correction.

Share this:

Related

Trackbacks & Pingbacks

Leave a comment Cancel reply

Recent Posts

Archives

Categories

Meta

Participants’ blogs