Resources

One of the purposes of the Uses of Scale project is to make it easier for researchers to normalize large collections of printed documents. Specific resources to do that will be linked from this page. But let’s back up. What’s so difficult about “normalizing” collections of text?

  1. First, there’s the problem of obtaining the texts themselves. For the most part that’s a task beyond the limited resources of this project, although we can recommend collaboration with HathiTrust and the HathiTrust Research Center.
  2. Then there’s the concrete problem of managing a collection that may encompass hundreds of thousands, if not millions, of documents. For humanists, the availability of appropriate hardware is a nontrivial aspect of that problem, and we hope to encourage collaborative solutions (especially among the thirteen institutions involved in Humanities Without Walls).
  3. Then there’s the reality that most digitized texts are transcribed optically, and contain errors. Not all kinds of error are problematic, but we need to think carefully about our tolerance for different sorts of error, and correct documents where it’s possible to improve recall without significant loss of precision.
  4. Then there are questions of language change. English spelling varies over time, for instance, and across the Atlantic Ocean (a rough sketch of this kind of normalization follows this list).
  5. Then there are a range of difficulties introduced by the mismatch between the structure of a physical volume and the structure of electronic files. Is the book title that appears at the top of every page part of the “text”? How do we divide a book into chapters, or distinguish the prose introduction from the verses that follow? TEI is designed to solve these problems, but unfortunately we don’t have collections of TEI on the requisite scale, so algorithmic solutions will be needed.
  6. Finally, there are problems associated with metadata. We would like to make metadata more reliable, but also — and perhaps more urgently — we’d like to have richer kinds of metadata, including, for instance, information about genre and authorial gender.
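To make the spelling and OCR problems (items 3 and 4) concrete, here is a minimal sketch of the kind of normalization script we have in mind. The lexicon file format, the sample correction rules, and the function names are illustrative assumptions, not finished resources from this project.

```python
import re

# Assumed lexicon layout: one tab-separated pair per line, e.g. "honour<TAB>honor".
def load_lexicon(path):
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            variant, standard = line.rstrip("\n").split("\t")
            lexicon[variant] = standard
    return lexicon

# A few illustrative rules for common OCR confusions
# (the long s in older printing is frequently read as "f").
OCR_RULES = {"fuch": "such", "fome": "some", "bufinefs": "business"}

def normalize(text, lexicon):
    # Split into alternating alphabetic / non-alphabetic chunks so that
    # punctuation and spacing survive the round trip.
    tokens = re.findall(r"[A-Za-z]+|[^A-Za-z]+", text)
    out = []
    for tok in tokens:
        key = tok.lower()
        # Prefer an OCR correction, then a lexicon entry, then the original token.
        out.append(OCR_RULES.get(key, lexicon.get(key, tok)))
    return "".join(out)
```

A real script would also need to preserve capitalization and weigh context before applying a rule, since the same letter sequence can be a genuine word in one period and an error in another.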

*

At the moment, scholars have to solve all of these problems themselves before digital research at scale becomes possible. We want to abbreviate the task by sharing resources. These may include:

  1. Actual collections of texts and metadata.
  2. Lexicons for normalizing spelling.
  3. Rulesets and scripts for correcting OCR (optical character recognition) errors.
  4. More complex sorts of workflows for addressing structural questions (volume segmentation, running headers, and so on); a sketch of running-header detection appears after this list.
  5. Tools for enriching metadata — for instance, automatically categorizing texts by genre (a baseline classification sketch also appears below).
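As an example of the structural workflows in item 4, running headers can often be detected simply because the same line recurs at the top of many pages. The sketch below assumes each volume has already been split into pages and lines; the 40% threshold is an arbitrary assumption, not a tested value.

```python
from collections import Counter
import re

def strip_running_headers(pages, threshold=0.4):
    """pages: a list of pages, each a list of text lines.
    Drop a page's first line when its normalized form recurs at the top of
    many pages, which usually marks a running header rather than body text."""
    def norm(line):
        # Ignore page numbers and spacing when comparing candidate headers.
        return re.sub(r"[\d\s]+", "", line).lower()

    first_lines = Counter(norm(page[0]) for page in pages if page)
    cutoff = threshold * len(pages)
    cleaned = []
    for page in pages:
        if page and first_lines[norm(page[0])] >= cutoff:
            cleaned.append(page[1:])
        else:
            cleaned.append(page)
    return cleaned
```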
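And as an illustration of item 5, one common baseline for categorizing texts by genre is a bag-of-words classifier. The scikit-learn pipeline below is a generic sketch, not the genre tool this project will release; the feature settings and labels are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_genre_classifier(texts, genres):
    """texts: a list of document strings; genres: matching labels
    such as "fiction", "poetry", or "nonfiction"."""
    model = make_pipeline(
        TfidfVectorizer(max_features=20000, sublinear_tf=True),
        LogisticRegression(max_iter=1000),
    )
    model.fit(texts, genres)
    return model

# model.predict(new_texts) then yields one predicted genre label per document.
```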

We won’t be providing a one-size-fits-all solution to these problems. Since data can be “clean enough” only in relation to a particular set of questions, researchers will still want to customize solutions for their own projects. Our goal is simply to make it easier to construct those solutions.
