Do Not Despair, Do Not Presume

A Cautionary Tale about JSTOR Data

A note of caution about working with text as data in the age of digital archives, and a note of caution about notes of caution.

“Raw data is an oxymoron”—as the title of a recent collection edited by Lisa Gitelman has it. Indeed yes. But what should this mean, in practice, for those who still want to aggregate data for analysis? Especially when the data in question is textual, and the analysis is meant to serve the scholarly study of culture? What do we do when the data is not only “cooked”—as all data is—but flawed? In this post, I want to suggest, as a twofold rule of thumb, one of Beckett’s favorite quotations: do not despair: one of the thieves was saved; do not presume: one of the thieves was damned.

For the last three years, JSTOR has offered a Data for Research service which gives anyone who signs up the ability to download a great deal of information about items in their digital archive—most tantalizingly, word counts for each document they store. In fact they offer not just word counts but n-grams (counts of word pairs, triples, and quadruples) and a quite rich set of document metadata. Suddenly it becomes possible to conceive of analyzing the entire archive of scholarship in the aggregate. These possibilities have really been exhilarating for me. But they also require relying on data cooked by JSTOR’s own processes of scanning, OCR, and counting words—processes whose workings are normally invisible to the user of DfR.

I recently received a somewhat startling reminder of the non-rawness of the data this “data service” provides. Here are some word counts I downloaded from DfR for the same article (Anne Ruggles Gere and Sarah R. Robbins, “Gendered Literacy in Black and White: Turn-of-the-Century African-American and European-American Club Women’s Printed Texts,” Signs 21, no. 3 [Spring 1996]: 643–78, www.jstor.org/stable/3175174) on two different dates this fall:

                      the     of    and     in     to      a   club     as    for  women
9/19/13 download    15568  12512   8400   6160   5840   3920   3680   3024   2784   2768
11/5/13 download      973    782    525    385    365    245    230    189    174    173

And it goes on like this. The rank order of words is pretty much the same, but the numbers are wildly different. Hmm.

Well—I am not being entirely straightforward in presenting these varying counts as a raw phenomenon. I am actually part of the explanation for the difference between the two rows of the table. While working on the DfR data for Signs earlier in the fall, I noticed some articles whose wordcount data was clearly inflated by a factor of ten or more (a 35-page article is not going to have 15000 occurrences of the unless it is some Oulipian contraption). I wrote to JSTOR (as did other researchers I was collaborating with) and, two months later, the new counts show that JSTOR’s Advanced Technology Group is in the process of correcting a sporadic but—as you can see—significant error.
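That eyeball check is easy to make systematic. Here is a minimal sketch in R of the sort of red flag I mean; it is not the code in the gist linked at the end, and the per-page threshold is an arbitrary guess of my own, not anything JSTOR specifies.

```r
# A rough heuristic, not JSTOR's and not the gist code: flag a document
# whose count of "the" works out to an implausible number per page.
# The threshold of 60 occurrences per page is my own guess at "too many."
flag_inflated <- function(counts, n_pages, per_page_max = 60) {
  the_count <- counts["the"]
  if (is.na(the_count)) return(FALSE)
  unname(the_count / n_pages > per_page_max)
}

# The September figures for the 35-page Gere and Robbins article:
counts_sept <- c(the = 15568, of = 12512, and = 8400)
flag_inflated(counts_sept, n_pages = 35)   # TRUE: roughly 445 "the"s a page
```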

How sporadic and how significant? That is up to JSTOR to study and, I trust, publicly explain. I have not seen errors of this magnitude in data from DfR for any journal other than Signs, but I have looked at only small slices of their archive.

To check that the changes from September to November were indeed corrections, I downloaded Gere and Robbins’s article from JSTOR. Making use of the embedded text in the PDF, we can obtain another set of counts using the pdfgrep tool (I’ve put a link to my code for doing this at the end of the post):

                      the     of    and     in     to      a   club     as    for  women
9/19/13 download    15568  12512   8400   6160   5840   3920   3680   3024   2784   2768
11/5/13 download      973    782    525    385    365    245    230    189    174    173
pdfgrep               480    450    391    267    256    201    184    149     57    177

At least the pdfgrep counts are in the same ballpark. Some differences are to be expected, given the intricacies of what gets counted and how words are divided, but it is hard to account for all of them without redoing the OCR myself. I haven't done that, even for this one file, and doing it for the whole archive of PDFs would be impossible, or at least legally actionable. But to convince myself that the November counts were indeed plausible, I counted occurrences of "women" in Gere and Robbins's essay by hand (a strange exercise, but it was a nice chance to read that very interesting article) and found 182 occurrences. That is at least pretty close.
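For anyone who wants to replicate this kind of check, here is a rough sketch of how the counting can be done from R by shelling out to pdfgrep. It is not the code from the gist linked at the end; it assumes pdfgrep is installed with PCRE support, it relies on my understanding that pdfgrep's --count option counts every match rather than matching lines, and the file name is hypothetical.

```r
# A sketch, not the gist code: count case-insensitive whole-word matches
# in a PDF's embedded text layer by shelling out to pdfgrep.
# -P : Perl-compatible regular expressions (for the \b word boundaries)
# -i : ignore case
# -c : print a count of matches for the file
count_word_pdf <- function(word, pdf) {
  pattern <- paste0("\\b", word, "\\b")
  out <- system2("pdfgrep",
                 args = c("-Pic", shQuote(pattern), shQuote(pdf)),
                 stdout = TRUE)
  as.integer(out)
}

count_word_pdf("women", "gere-robbins.pdf")   # hypothetical file name
```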

So what is the extent of the change in the data this process of correction is making? How many badly distorted word-counts was JSTOR serving for Signs in September? For a quick sense of the answer to this question, I compared word counts available from DfR for all 3886 Signs items (the full run of the journal in JSTOR, from Autumn 1975 [vol. 1, no. 1] to Autumn 2013 [vol. 39, no. 1]), downloaded on September 19 and November 5. These include 2161 items classified as “full-length articles,” 1223 “book reviews,” 497 “miscellaneous,” and 5 other. (The metadata classifications also have errors, but that is an issue for another day.) As a rough measure to see where the major revisions to the available word counts were, I computed the sum of the changes in wordcounts for each article (the “taxicab distance,” or sum of the absolute values of the differences in the counts for each word in the document).
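In code, this measure is just the L1 ("taxicab") distance between the two downloads' count vectors. Here is a minimal sketch to make the measure concrete; my working code is in the gist linked at the end, and this version simply assumes each download has already been read into a named numeric vector of per-word counts for a given document.

```r
# A minimal sketch of the measure described above: the taxicab (L1)
# distance between two downloads' word counts for the same document.
# Each download is assumed to be a named numeric vector, word -> count;
# a word absent from one download is treated as a count of zero.
taxicab <- function(a, b) {
  words <- union(names(a), names(b))
  a_all <- ifelse(is.na(a[words]), 0, a[words])
  b_all <- ifelse(is.na(b[words]), 0, b[words])
  sum(abs(a_all - b_all))
}

# The top three words from the table above:
sept <- c(the = 15568, of = 12512, and = 8400)
nov  <- c(the = 973,   of = 782,   and = 525)
taxicab(sept, nov)   # 34200 on these three words alone
```

The 135 items discussed in the next paragraph are those for which this distance exceeds 1000.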

Fewer than a third of the items (1112 out of 3886) were completely unchanged. But let's focus on the 135 items whose word counts have changed, in total, by more than 1000 words. I take a change of that size to be a major difference, one that would significantly affect most computational text analyses relying on the data.

These are mostly articles (80; the rest are classified as miscellaneous). The items occur across the run of the journal:

           no. 1   no. 2   no. 3   no. 4
vol. 2         0       0       0      10
vol. 8        11       0       0       0
vol. 10       13       0       0       0
vol. 12        0      12       0      12
vol. 13        0      12       0       0
vol. 14        0      10      10       0
vol. 15        0      10       0       0
vol. 17        7       0       0       0
vol. 18        0       0       0      11
vol. 21        0       0      10       0
vol. 26        0       0       6       0
vol. 36        0       0       1       0
If you were trying to understand the verbal make-up of Signs in the 1980s with the old data, you'd be significantly misled. Hopefully, the corrections JSTOR has made (and, I understand, is continuing to make) mean this is no longer the case. But there is still a lesson to draw. My motto, then, reverses Beckett's order (for reasons that will become clear): Do not presume; do not despair.

Do not presume

The first moral of the story is: The "data" can be really, really wrong. Though from one perspective 3.5% (135 of 3886 items) is a small proportion of problem articles, it's more than enough to cause trouble for all kinds of analyses; it represents a considerable part of the archive one wants to analyze (here, the journal Signs), and, without any way to trace the source of the error, it makes one worry about other hidden problems. It's one thing to expect a rate of error from OCR (this issue is mentioned on the DfR FAQ page) but quite another to wonder just how much garbage one is feeding in. So everyone who works with data of this kind needs to do spot checks of multiple kinds. And, in the words of Justin Grimmer and Brandon Stewart's excellent essay "Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts": "validate, validate, validate."

Furthermore, this exercise reveals something quite troubling about the opacity of the "service" or information-provision model DfR represents. The representation of the very same documents can change, silently, over time, according to private decisions by JSTOR. Such decisions are no doubt made in order to improve the quality of the data. But what are the priorities that define which aspects of quality matter most? I was corresponding with the people at JSTOR and so knew that the Signs counts were being revised. What if I hadn't been? What if other researchers aren't? What does it mean to think of this pre-digested data as a moving target?

Do not despair

The second moral of the story, however, is: well, welcome to the world of data. Data is produced by institutions with their own imperatives, codes, flaws, agendas. Social scientists and historians know this very well; it's just that, as a literary scholar, I am facing the problem in my own work for the first time. It's one thing to know the principle, and another to see these nearly-raw counts change before my eyes.

I can easily imagine a certain kind of literary scholar responding to a moment like this by saying, See? I told you so! This business of computers and quantifying and data is fraught with error. It’s certain to do violence to texts, writers, and cultural history. A close reader would never make mistakes on this scale. Just read the texts.

But such a response would be wrongheaded. Intensive ("close") reading is just as liable to mistakes about the scale of effects in texts: normally, it gambles that a few hand-picked textual examples stand in a meaningful and interpretable relation to larger cultural or historical wholes. These gambles can be gloriously right. But how do we know? They have their own biases: towards contemporary canons of literary value, towards the deeply ingrained convictions of the expert reader, towards the texts that have been made available for intensive reading by publication or archival curation. The very reason I have become interested in trying to analyze large numbers of texts is to find some correctives or supplements to what I am able to find by reading intensively in the individual text. The aggregate, in the best case, gives access to a cultural field, a context, a collective pattern that is not accessible in a handful of examples.

So I don’t think the task of literary and cultural scholarship can do without methods for analyzing cultural production in the aggregate. As we continue to develop those methods, we have to find ways of acknowledging that producing good data for scholarship is an important responsibility of the scholarly community, not a bit of scutwork we can delegate to digital information companies (for-profit or non-profit) and then put out of mind. That means we have to train one another to handle data ourselves, and we should recognize the work that goes into producing data as an important contribution to scholarship.

But it also means learning that flawed evidence can still be evidence, and discovering errors—even glitches—is one route to new knowledge. What matters is not crystalline perfection or ultimate conclusiveness but arguments that are as explicit as possible about our interpretive assumptions. So I’m not giving up on using data for literary scholarship, or on JSTOR Data for Research. I will continue to be as compulsive as possible about sanity-checking, spot-checking, cross-checking—and reading. Provided we take sufficient care, JSTOR’s data is, despite the problems, still able to tell us meaningful, valid—but always potentially defeasible—things about the history of scholarship. It is just as true of a small, intensively analyzed selection of texts as of 3886 wordcounts*.CSV files that the conclusions I can draw are subject to revision by scholars considering more and better evidence. The possibility of such revision is what allows us to think of arguments about texts—including arguments that aggregate many texts into data—as part of a cumulative process for producing knowledge about cultures.

My R code for counting words with pdfgrep and for taking the taxicab distances can be found in this gist.

[Edited 11/9: fixed a few incorrect dates.]