Geolocation correction

by Matthew Wilkens

Algorithmic geolocation extraction from literary texts is nifty, but it’s not overwhelmingly accurate, especially in the case of older texts. The sources of inaccuracy are several (and surely extend beyond this list):

  • Named entity recognition is hard. I’ve used the Stanford CoreNLP package for my own work; it does a pretty good job, but it’s prone to confusing personal names with place names.
  • Most NER/NLP packages are trained on modern data, which may not match up well with the place names used in nineteenth-century texts. I’m planning to have an undergrad or two work on creating a new NER training set from nineteenth-century fiction sources, but we haven’t gotten that far yet.
  • Place names — especially ancient and international place names translated into English — have changed over time and may not match current gazetteers.
  • Some place names are ambiguous, especially out of context (think “Springfield”). And some places are referenced using colloquial names.
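The ambiguity problem in particular is easy to demonstrate. The sketch below is purely illustrative (the tiny gazetteer and the `candidates` function are invented for this example, not part of any real toolkit): a bare extracted string like "Springfield" maps to several plausible real-world places, while a historical name like "Abyssinia" may map to none of the entries in a modern gazetteer at all.

```python
# Illustrative sketch: why raw extracted strings can't be resolved
# without context. This toy gazetteer is made up for demonstration.
GAZETTEER = {
    "Springfield": ["Springfield, IL", "Springfield, MA",
                    "Springfield, MO", "Springfield, OH"],
    "Ethiopia": ["Ethiopia"],
    # "Abyssinia" is absent: a historical name a modern gazetteer may lack.
}

def candidates(place_string):
    """Return every gazetteer entry a raw string could refer to."""
    return GAZETTEER.get(place_string, [])

print(candidates("Springfield"))  # four plausible referents
print(candidates("Abyssinia"))    # [] -- no modern match
```

Hand review is what resolves both cases: context picks the right Springfield, and domain knowledge supplies the alias Abyssinia → Ethiopia.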

I’ve found that there’s no substitute for hand review and correction of the place names extracted from my own corpus of mid-nineteenth-century fiction. But this is a labor-intensive process; it requires checking at least a meaningful subset of all occurrences of a computationally extracted location string in their original (literary) contexts, determining the real-world location (if any) to which the string refers, and recording that real-world place in a way that Google (or another geodata provider) will recognize and return correctly. This takes anywhere from a minute to the better part of an hour for each unique string. It’s a pain.

In an attempt to lessen the pain of others doing similar work, I’ve uploaded here the results of my hand review and correction of 3,220 unique location strings extracted from a corpus of 1,093 volumes of American fiction published between 1851 and 1875. (The link is to a full listing of the corpus, but not the texts themselves.) This data is obviously tied closely to the corpus in question, but it should provide a useful starting point for anyone else facing the same (dull, painstaking) task of hand correction. Note that these 3,220 strings are only a subset of those identified in the corpus; to keep the scope of the review manageable, I excluded strings that occurred fewer than five times or that were used by only a single author.

The data file of hand-corrected location strings is a zipped TSV containing the following fields:

  • string — The original text string as identified in the source corpus. These are strings of text that Stanford CoreNLP identified as locations.
  • occurrences — Number of times the string in question was identified in the corpus as a location.
  • volumes — Number of volumes in the corpus in which the string was identified as a location.
  • authors — Number of authors in the corpus who used the string as a location.
  • ignore — Should the string be ignored as a location? Generally (but not always) a string is ignored because it’s an obvious error. Check the comment field for more information.
  • alias — Substitute string to which this string should be aliased, e.g., “Abyssinia” -> “Ethiopia.”
  • comment — Explanation (if any) for the ignore and alias fields.

The occurrences, volumes, and authors fields are strictly informational; I used them to make sure I wasn’t casually throwing away heavily used strings without a careful check and to limit the amount of time I spent reviewing low-frequency strings. They might also be useful as points of comparison against a different corpus.

You’ll notice that what’s missing from this file is any actual geographic information. That’s by design; you can easily regenerate that data by feeding these strings and aliases to Google’s geocoding API or to any other geodata provider. But I’m pretty sure I can’t just give away a big chunk of Google’s data.
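Regenerating the geographic data is a matter of sending each corrected string to a geocoder. The sketch below builds a request URL for Google's Geocoding API without making a network call; the `YOUR_KEY` placeholder is obviously an assumption, and you would substitute a real API key (and handle quotas, rate limits, and the JSON response) in practice.

```python
# Sketch: construct a Google Geocoding API request URL for a corrected
# place string. No request is actually sent here.
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"

def geocode_url(place, api_key="YOUR_KEY"):
    """Build the geocoding request URL for one place string."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": place, "key": api_key})

# Geocode the alias, not the original string: "Ethiopia", not "Abyssinia".
print(geocode_url("Ethiopia"))
```

Fetching that URL (with a valid key) returns JSON containing the place's coordinates, which is exactly the data withheld from the distributed file.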

Leave questions, corrections, extensions, praise, and abuse in the comments below or email Matthew Wilkens. For more information on the project and related geolocation work, see Wilkens’ blog, Work Product.

