THATCamp Theory 2012: Session proposal


Theory by the numbers

or, The yack of the hack of the yack

This is a longer version of my proposal for a THATCamp Theory 2012 session.

How should we go about mining the digital archive of the history of scholarship for theoretical resources? Let’s talk about text-mining journals, quantitatively analyzing metadata about scholarship, and living with closed access as theorists. And perhaps we can work on a dataset or two: I’ll bring some example data and laughably primitive visualizations!

The explanation

One of theory’s major tasks is to describe how scholarship is done—and then to prescribe how it should be done. Often the description leads to the prescription: theory as scholarship about scholarship. Well, yes. It is characteristic of a whole family of genres that belong to theory, from De la grammatologie to Orientalism to Ahmad’s In Theory, Laclau and Mouffe’s Hegemony and Socialist Strategy to Sheldon Pollock theorizing a “Political Philology” in a memorial essay about the scholarship of D.D. Kosambi.

Meanwhile, over in digital-land, one of the richest digital archives we have is the archive of scholarship itself. But we are used to using these archives for search, not as objects of analysis in themselves. That is what I’d like to explore in this session. What does the MLA Bibliography tell us—in the aggregate? What theoretical possibilities can we open up by mining the extraordinary archive represented by JSTOR’s Data for Research service?

I can talk about two datasets I’ve done a little work on: one from the MLA Bibliography and one from JSTOR’s archive of PMLA. Please feel free to bring your own datasets, or leads, or inspirations, or problems, or concerns.

The MLA Bibliography and “modernism”

A bit more than a year ago I got permission to download all the MLAIB hits for “modernism” (8000 or so). This was a crude attempt to understand the history of the study of “modernism” as a totality.

Who is a modernist author, according to scholarship? Has it changed? Here are the frequencies, as title words over time, of the author names that appear most often in the whole corpus of titles:

Frequency of selected author names in titles over time

And of a few more figures:

Frequency of selected author names in titles over time
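
For anyone who wants to try this kind of count on their own download, here is a minimal sketch of the idea: for each decade, the share of titles that contain a given author's surname. The records, the list of names, and the decade binning are all made up for illustration; this is not the code or the data behind the plots above.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample records: (publication year, title). Not real MLAIB data.
records = [
    (1974, "Joyce and the Modernist Novel"),
    (1978, "Eliot, Pound, and Tradition"),
    (1993, "Woolf and the Politics of Modernism"),
    (1995, "Reading Joyce after Woolf"),
    (1998, "Woolf's London"),
]

# Illustrative surnames to track (an assumption, not the original list).
names = ["joyce", "eliot", "pound", "woolf"]


def decade(year):
    """Map a year to the first year of its decade, e.g. 1974 -> 1970."""
    return year - year % 10


def name_frequencies(records, names):
    """Return {decade: {name: share of that decade's titles containing the name}}."""
    titles_per_decade = Counter()
    hits = defaultdict(Counter)
    wanted = set(names)
    for year, title in records:
        d = decade(year)
        titles_per_decade[d] += 1
        # Tokenize on letters only, so "Woolf's" yields "woolf" and "s".
        tokens = set(re.findall(r"[a-z]+", title.lower()))
        for name in wanted & tokens:
            hits[d][name] += 1
    return {
        d: {name: hits[d][name] / total for name in names}
        for d, total in titles_per_decade.items()
    }


freqs = name_frequencies(records, names)
for d in sorted(freqs):
    print(d, freqs[d])
```

Dividing by the number of titles per decade matters here, because the raw number of hits grows with the size of the bibliography itself. Matching bare surnames is crude (it cannot tell T. S. Eliot from George Eliot), so any real use would need a more careful name list.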

What is a period? What years get covered? I represented periods (using years found in subject headings) by histograms. In the 1970s, “modernism” meant these years:

Distribution of periods covered by KW modernism, 1970s

In the 1990s:

Distribution of periods covered by KW modernism, 1990s
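
A rough sketch of how such a histogram could be built: pull four-digit years out of each record's subject headings, then count them in bins, separately for each decade of publication. The record format, the sample headings, and the 50-year bin width are assumptions for illustration, and the sketch counts only the years actually named (the endpoints of a range like 1900-1999), which may differ from how the plots above were made.

```python
import re
from collections import Counter

# Hypothetical records: (publication year, subject-heading string).
# Made up for illustration; not real MLAIB entries.
records = [
    (1975, "English literature; 1900-1999; Eliot, T. S.; modernism"),
    (1979, "French literature; 1800-1899; Baudelaire, Charles; modernism"),
    (1994, "American literature; 1900-1999; Hurston, Zora Neale; modernism"),
    (1996, "English literature; 1900-1999; Woolf, Virginia; modernism"),
]

# Four-digit years from 1000 to 2099, standing alone as a token.
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def period_histogram(records, pub_decade, bin_width=50):
    """Count subject-heading years, in bins, for items published in one decade."""
    counts = Counter()
    for pub_year, subjects in records:
        if pub_year - pub_year % 10 != pub_decade:
            continue  # keep only items published in the requested decade
        for match in YEAR.findall(subjects):
            year = int(match)
            counts[year - year % bin_width] += 1
    return dict(sorted(counts.items()))


# Print a crude text histogram for each publication decade.
for pub_decade in (1970, 1990):
    print(f"Published in the {pub_decade}s:")
    for bin_start, n in period_histogram(records, pub_decade).items():
        print(f"  {bin_start}: {'#' * n}")
```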

PMLA: a topic model

JSTOR’s Data for Research service makes it possible to download word counts for whole runs of journals. In a pilot project, I and several others got word counts for the full run of PMLA and created several topic models of the archive of that journal.

Here are top keywords for some cherry-picked examples of the 100 topics that MALLET’s latent Dirichlet allocation algorithm inferred:

Topic 4
text line manuscript mss manuscripts reading wrote written copy readings 
word scribe order group texts notes copies left note

Topic 6
structure pattern point action theme final end work unity contrast 
relationship effect central view part terms dramatic important basic

Topic 78
text literary cultural texts discourse theory studies reading literature 
work culture relation critical language textual trans writing critique 

Topic 85
head back house body dead face hair hand hands people red eyes white boy 
side small blood black day

And here are some time courses for the presence of those topics in PMLA by 5-year window:

Frequency over time of topic 4 text line manuscript mss

Frequency over time of topic 6 structure pattern point action

Frequency over time of topic 78 text literary cultural texts

Frequency over time of topic 85 head back house body dead
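
One simple way to get a time course like these, assuming you already have each article's topic proportions from the model, is to average a topic's share over all the articles in each 5-year window. The sketch below does just that. The input format and the numbers are hypothetical, and an unweighted mean per window is an assumption: weighting by article length would be another reasonable choice and may be closer to what the plots show.

```python
from collections import defaultdict

# Hypothetical per-article topic proportions: (year, {topic id: share of article}).
# These numbers are invented for illustration.
docs = [
    (1921, {4: 0.30, 78: 0.00}),
    (1923, {4: 0.10, 78: 0.00}),
    (1991, {4: 0.00, 78: 0.40}),
    (1994, {4: 0.02, 78: 0.20}),
]


def time_course(docs, topic, window=5):
    """Return {window start year: mean share of `topic` among that window's articles}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for year, proportions in docs:
        start = year - year % window  # e.g. 1923 -> 1920 when window is 5
        sums[start] += proportions.get(topic, 0.0)
        counts[start] += 1
    return {start: sums[start] / counts[start] for start in sorted(sums)}


for topic in (4, 78):
    print(topic, time_course(docs, topic))
```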

The question

Can this sort of data-mining do theoretical work for us—as digital humanists, as humanists? What is needed to move from my current state (“Gee whiz, this is a thing I can sort of do”) to something more directed (“Here is an argument about the history and the future of scholarship”)?


MLAIB unfortunately requires special dispensation for bulk downloads. But JSTOR’s interface is available to anyone after a free registration. So if anyone is inspired, it’s easy to put in a request for a dataset: JSTOR will give you word counts and metadata for up to 1000 articles represented by any single JSTOR search. The data is usually available soon after you submit the request.

How about some of the scholarly subject headings searchable in the amazing Bookworm?

An example of the kind of intervention that’s possible: consider Andrew Abbott’s hilarious essay on the non-impact of concordances on pre-WWII literary scholarship. It’s a shaft aimed straight at the heart of DH (or at least at the fantasies about the transformative power of search as such).