
Removing running headers

by Mike Black and Ted Underwood

Printed books contain a lot of paratext that readers implicitly separate from the text itself. Every verso page may start “The Count of Monte Cristo,” and every recto page may start “Chapter Three,” but we don’t actually read those words.

Optical character recognition (OCR) unfortunately tends to merge running headers with the text itself, and this can be a significant source of distortion for text mining. Where statistical analysis is concerned, it doesn’t especially matter if 180 random words in a volume are garbled by OCR errors that make them unrecognizable. That sort of noise gets lost in the wash. But it does matter if we add 180 correctly transcribed occurrences of “Count,” one for each verso page in the volume.

To address this problem, we developed a set of Python scripts that remove running headers from HathiTrust documents. We use the HathiTrust data structure for many reasons, but for this project it has the particular advantage of representing pages as separate files. This makes it easy to algorithmically recognize the top of a page, and then identify phrases that recur often as the first (or nearly first) line on a page. Since page numbers and OCR errors introduce variability, we use fuzzy matching to identify recurring lines.
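To make the idea concrete, here is a minimal sketch of fuzzy header detection. It is not the code from our repository: the function names, the 0.8 similarity threshold, and the choice of the standard library's difflib are assumptions made for illustration. Each page is represented as a list of lines. A line near the top of a page is treated as a running header if, after digits and punctuation are stripped, it closely resembles a top line on several other pages.

```python
import re
from difflib import SequenceMatcher


def normalize(line):
    """Lowercase and drop everything but letters and spaces,
    so page numbers and stray punctuation do not block a match."""
    return re.sub(r"[^a-z ]", "", line.lower()).strip()


def similar(a, b, threshold=0.8):
    """True if two normalized lines are close enough to count as the same header."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def find_headers(pages, top_lines=2, min_repeats=3, threshold=0.8):
    """Return a set of (page_index, line_index) pairs judged to be running headers.

    pages: list of pages, each a list of text lines (hypothetical layout).
    """
    candidates = []  # (page_index, line_index, normalized_text)
    for p, page in enumerate(pages):
        seen = 0
        for i, line in enumerate(page):
            if not line.strip():
                continue  # blank lines do not count toward top_lines
            norm = normalize(line)
            if len(norm) >= 4:  # too short to match reliably (e.g. a bare page number)
                candidates.append((p, i, norm))
            seen += 1
            if seen == top_lines:
                break

    headers = set()
    for p, i, norm in candidates:
        # Count the other pages that carry a similar line near the top.
        matching_pages = {q for q, _, other in candidates
                          if q != p and similar(norm, other, threshold)}
        if len(matching_pages) >= min_repeats:
            headers.add((p, i))
    return headers


def strip_headers(pages, **kwargs):
    """Return copies of the pages with the detected header lines removed."""
    headers = find_headers(pages, **kwargs)
    return [[line for i, line in enumerate(page) if (p, i) not in headers]
            for p, page in enumerate(pages)]
```

This sketch compares every candidate with every other, which is quadratic in the number of pages; that is tolerable for a single volume but would need rethinking at scale. Lines that consist only of a page number are left in place here, and would need separate handling.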

In addition to removing recurring headers from the text, we have used them to identify document parts, which we then mark on the text as <div>s. For many reasons, this is a very imperfect approach to document segmentation. (The main problem is simply that not all books contain running headers.) However, we have preserved the information since it might be useful at later stages as a clue to supplement a more robust segmentation strategy.
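A rough sketch of how recurring headers could be turned into document parts follows. Again, this is an illustration under assumptions rather than our actual script: the function names, the title attribute on the <div>, and the 0.9 threshold are all hypothetical. It assumes we already have one header string per page (empty where none was found, or where the header is the unchanging book title), and it groups consecutive pages whose headers roughly agree.

```python
from difflib import SequenceMatcher
from html import escape


def group_pages_by_header(page_headers, threshold=0.9):
    """Group consecutive pages that share roughly the same header.

    page_headers: one string per page, '' where no part header was found.
    Returns a list of (label, first_page, last_page) tuples.
    """
    parts = []
    for p, header in enumerate(page_headers):
        same_part = bool(parts) and (
            not header  # a page without a header continues the current part
            or SequenceMatcher(None, parts[-1][0], header).ratio() >= threshold
        )
        if same_part:
            label, first, _ = parts[-1]
            parts[-1] = (label, first, p)
        elif header:
            parts.append((header, p, p))
        # Headerless pages before the first part belong to no part.
    return parts


def mark_divs(pages, parts):
    """Flatten pages into one list of lines, wrapping each part in a <div>.

    The title attribute is an assumed convention, chosen for this example.
    """
    out = []
    start = parts[0][1] if parts else len(pages)
    for page in pages[:start]:  # front matter before any detected part
        out.extend(page)
    for label, first, last in parts:
        out.append('<div title="{}">'.format(escape(label, quote=True)))
        for page in pages[first:last + 1]:
            out.extend(page)
        out.append("</div>")
    return out
```

The threshold here is deliberately stricter than one would use for detecting headers in the first place, because short headers such as “Chapter One” and “Chapter Two” differ by only a few characters and a loose match would merge them into a single part.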

Most of the coding on this project was done by Mike Black, with only occasional input from Underwood. Our Python scripts are available in this GitHub repository, with a readme.txt file that explains them in more depth. Note that they are written for Python 3.2 rather than the (still more common) 2.7.
