Basic OCR correction

by Ted Underwood and Loretta Auvil

Although optical character recognition is imperfect, we can often address the worst of the imperfections. The simplest approach is to develop a set of rules that translate individual erroneous tokens produced by OCR back into correctly spelled forms.
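
At its simplest, this is just a lookup table that maps known error forms to their corrections. Here is a minimal sketch in Python; the three example rules are invented placeholders rather than entries from the list discussed below.

```python
# A minimal sketch of token-level OCR correction: a lookup table of
# error forms and their corrections. The three rules here are invented
# placeholders, not entries from the actual rule list.

correction_rules = {
    "becaufe": "because",   # long s misread as f
    "himfelf": "himself",
    "alfo": "also",
}

def correct_tokens(tokens):
    """Replace any token that matches a known OCR error; leave the rest alone."""
    return [correction_rules.get(tok, tok) for tok in tokens]

print(correct_tokens("he did it becaufe he alfo could".split()))
# ['he', 'did', 'it', 'because', 'he', 'also', 'could']
```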

Obviously, this approach has limits. You can’t anticipate all the possible errors. Worse, you can’t anticipate all possible correct words. It’s always possible that you’ll correct “Defist” to “Desist,” when it was in this one case the name of a French nobleman.

However, with all that said, the practical reality is that many OCR errors are quite predictable. Especially in the period 1700-1820, errors often fall into predictable patterns of substitution produced by archaic typography or worn and broken type (s -> f, h -> li, h -> b, e -> c, sh -> m). Moreover, the predictable errors are the ones we really need to care about. Rare, random errors aren’t going to distort data mining significantly, but a systematic substitution of f for s, limited to a particular span of years, is a problem we have to address!

So it’s possible, and useful, to produce a pretty-good list of rules that simply translate individual tokens back into correctly spelled words. Here’s one such initial list, containing 50,000 translation rules produced by Ted Underwood and Loretta Auvil, specifically for the period 1700-1899.

We identified a set of predictable character substitutions and used the Google ngrams dataset as a source of common OCR errors. In cases where a limited number of predictable substitutions translated a common error back into one (and only one) dictionary word, we accepted that translation as a “correction rule.” In this list the rules are sorted by frequency, and the third column gives the number of occurrences.
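
The gist of that procedure might look something like the sketch below. The substitution set and dictionary here are tiny stand-ins for the real resources, which were built against the Google ngrams data.

```python
# A rough sketch of the rule-mining step: apply a small set of known OCR
# confusions to a frequent unrecognized token, and accept a correction rule
# only if exactly one dictionary word can be reached. The substitution set
# and dictionary below are tiny stand-ins for the real resources.

SUBSTITUTIONS = [("f", "s"), ("li", "h"), ("c", "e")]  # (OCR error, intended form)

def candidates(token):
    """All strings reachable by undoing the substitutions zero or more times."""
    seen = {token}
    frontier = [token]
    while frontier:
        current = frontier.pop()
        for wrong, right in SUBSTITUTIONS:
            start = 0
            while (i := current.find(wrong, start)) != -1:
                variant = current[:i] + right + current[i + len(wrong):]
                if variant not in seen:
                    seen.add(variant)
                    frontier.append(variant)
                start = i + 1
    return seen

def mine_rule(error_token, dictionary):
    """Return (error, correction) if exactly one dictionary word is reachable."""
    hits = [c for c in candidates(error_token) if c != error_token and c in dictionary]
    return (error_token, hits[0]) if len(hits) == 1 else None

dictionary = {"said", "sail", "house", "the"}
print(mine_rule("faid", dictionary))   # ('faid', 'said')
print(mine_rule("houfc", dictionary))  # ('houfc', 'house')
```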

Please note that this list is specific to the period 1700-1899. If you use it on twentieth-century texts, you might end up “correcting” some modern acronyms. Also, this list is designed to normalize everything to modern British spelling; if you want to preserve differences between American and British spelling, you’ll need a different approach.

Finally, there are many pairs of words, like “six/fix” or “soul/foul,” where a likely OCR error is also a correctly spelled word. In situations like this, no list of 1-gram rules will be adequate. Instead, you need a contextual spellchecking algorithm. Since “immortal soul” is a more probable phrase than “immortal foul,” context can help us correct words that a mere dictionary check would accept.
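
A toy version of that contextual check might look like the following. The bigram counts and the list of lookalike pairs are invented for illustration; a real implementation would draw its counts from a large corpus such as the Google ngrams data.

```python
# A toy sketch of context-sensitive correction for pairs like "foul"/"soul",
# where the likely OCR error is itself a correctly spelled word. The bigram
# counts and lookalike pairs below are invented for illustration.

bigram_counts = {
    ("immortal", "soul"): 9500,
    ("immortal", "foul"): 12,
}

# each word mapped to its long-s lookalike
LOOKALIKES = {"foul": "soul", "soul": "foul", "fix": "six", "six": "fix"}

def correct_in_context(prev_word, word, margin=10):
    """Swap a word for its lookalike only if context strongly favors the swap."""
    alt = LOOKALIKES.get(word)
    if alt is None:
        return word
    here = bigram_counts.get((prev_word, word), 0)
    there = bigram_counts.get((prev_word, alt), 0)
    return alt if there > margin * max(here, 1) else word

print(correct_in_context("immortal", "foul"))  # 'soul'
print(correct_in_context("immortal", "soul"))  # 'soul' (left alone)
```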

Underwood is about to generate a larger list of 1-gram rules using a more flexible probabilistic approach. We also have an algorithm for contextual spellchecking (“immortal foul” -> “immortal soul”). But neither of those resources is quite ready for prime time yet. So we’re providing this initial list of 50,000 correction rules as a stopgap for people interested in correcting 18th- and 19th-century OCR.

12 Comments
  1. James

    This is a great resource – I’m very grateful to you for making it available. I’m finding it tremendously useful in parsing and normalizing the raw Google ngrams data.

    I’ve noticed one feature which I think is an over-generalization, namely ‘f’ -> ‘s’ substitutions in cases where the ‘f’ is the last character of a word. My understanding is that the C18 long-s was not usually used in the end position, and therefore that correcting end-position ‘f’ to ‘s’ is probably erroneous.

    Looking at some examples in the list (e.g. ‘feaf’ -> ‘seas’), it’s clear that something is wrong with the form in column 1, but it’s by no means clear that the form in column 2 can be inferred. (I’d guess that many instances of ‘feaf’ may actually be errors for ‘feat’, but it’s only a guess.)

    For ‘f’ -> ‘s’ substitutions, I’ve found that it’s useful to use the Google ngrams to compare frequencies in the decades before and after 1800. If the correction is valid, the ‘f’ form should decrease sharply in frequency, and the ‘s’ form should increase, with the two crossing over around 1800. This seems to be an effective way to automatically detect bigram pairs of the ‘immortal foul’/’immortal soul’ kind (a rough sketch of this check appears at the end of this comment).

    At least, that was the case for the 2009 Google ngrams dataset; I notice that the more recent release seems to have cleaned up many of the long-s errors, although I can’t find this documented anywhere. Compare http://books.google.com/ngrams/graph?content=Chriftian%2CChristian&year_start=1750&year_end=1850&corpus=0 and http://books.google.com/ngrams/graph?content=Chriftian%2CChristian&year_start=1750&year_end=1850&corpus=15

    One other comment: Forms in the list are all downcased, which may give the impression that corrections are case-insensitive. But this is not always true. For example, ‘0ctober’ -> ‘october’ is not valid, since a leading zero should probably be corrected to an upper-case ‘O’, not a lower-case ‘o’.

    Thank you again.
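
    A rough sketch of the decade-comparison check described above, with invented yearly counts standing in for the Google ngrams frequencies:

    ```python
    # A rough sketch of the decade-comparison check: a rule like
    # "chriftian" -> "christian" is supported if the error form dominates
    # before ~1800 and the corrected form dominates after. The yearly
    # counts below are invented; real ones would come from Google ngrams.

    def crossover_supports_rule(error_freqs, correct_freqs, cutoff=1800):
        """True if the error form falls off after the cutoff while the
        corrected form rises, as expected for long-s confusions."""
        err_before = sum(n for year, n in error_freqs.items() if year < cutoff)
        err_after = sum(n for year, n in error_freqs.items() if year >= cutoff)
        cor_before = sum(n for year, n in correct_freqs.items() if year < cutoff)
        cor_after = sum(n for year, n in correct_freqs.items() if year >= cutoff)
        return err_before > err_after and cor_after > cor_before

    chriftian = {1760: 900, 1780: 850, 1810: 40, 1830: 5}       # invented
    christian = {1760: 200, 1780: 300, 1810: 1200, 1830: 1500}  # invented
    print(crossover_supports_rule(chriftian, christian))  # True
    ```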

    • Thanks much, James. I agree with you about final “f” — that rule should really be position-sensitive. I also really, really like your idea to use change-over-time as a clue about the validity of a correction rule. That won’t work equally well for all OCR errors, but for f/s it’s perfect.

      Re case: I do also have another version of these rules where the forms-to-be-corrected are case-sensitive. That’s important for things like “WiUiam => William.” Errors of that kind are usually a capitalized U, for obvious visual reasons. I just hadn’t included those rules here yet. I’ll add them.

      Making the corrections themselves case-sensitive would also be possible. It’s a fairly low priority for my own research, since I end up compressing everything to lowercase anyway. But I suppose it might matter for people who want to do, e.g. entity extraction. Right now, I just reproduce the case (title, upper, lower, etc) of the original token. But obvs that won’t work for your example (since a zero has no case attribute).
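
      Concretely, “reproducing the case” of the original token amounts to something like the sketch below (an illustration of the idea, not the exact code behind the rule list):

      ```python
      # A sketch of carrying the original token's case pattern (upper, title,
      # lower) over to the corrected form. An illustration only, not the
      # exact code used to apply the rule list.

      def match_case(original, corrected):
          if original.isupper():
              return corrected.upper()
          if original[:1].isupper():
              return corrected.capitalize()
          return corrected

      print(match_case("Becaufe", "because"))  # 'Because'
      print(match_case("BECAUFE", "because"))  # 'BECAUSE'
      print(match_case("becaufe", "because"))  # 'because'
      ```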

  2. Bea Alex

    I’m working on the Trading Consequences project (http://tradingconsequences.blogs.edina.ac.uk/), for which we are analysing 18th and 19th century text with respect to commodity trading. Thanks for sharing the substitution resource. We are already doing something similar, but only for f-to-s conversion and end-of-line soft-hyphen deletion; your list includes rules for other kinds of errors which I’m hoping to make use of as well.

    Have you done evaluation on how much your substitutions improve text quality? I’d be interested to find out about that.

    • Thanks! Yes, we have assessed that. I’ll try to post some figures soon — graphs showing accuracy over time before and after correction.

  3. Jörg Knappen

    Your OCR correction patterns were a great help in getting started on our Royal Society corpus. We adapted the patterns to our specific corpus (dropping a lot of unused patterns and adding many others more or less specific to the corpus). We describe the process of our OCR correction pattern mining in this reference: http://www.ep.liu.se/ecp/article.asp?issue=133&article=003&volume=

    Our patterns for the Royal Society Corpus v2.0 are available for download here: http://fedora.clarin-d.uni-saarland.de/rsc/access.html

  4. Tom

    Hi, do you have the 50,000 translation rules list link? The one on this page is not working.

