Topic Modeling in the JDH


This is a little belated: the new special topic-modeling issue of the Journal of Digital Humanities is out. The guest editors chose Ted Underwood’s and my blog post on modeling PMLA for inclusion: “What Can Topic Models of PMLA Teach Us About the History of Literary Scholarship?” It’s great to be part of it. The whole issue is beautifully put together in both its web and PDF forms, and it gives a sense of the excitement of the discussions in computational text analysis for the humanities right now. After the jump, my all-too-brief appreciations, plus a note of speculation about the future of this conversation.

JDH is a striking experiment in scholarly publishing, and the issue is notable, among other things, for its combination of serious work and informal presentation. I’ve read the whole with great interest. Ted and I left our post unrevised, as our next major revision is still under way, and will be more argumentative (and will model more journals). It is, however, gratifying to see it there as part of a conversation with some very impressive work. I particularly call attention to Lisa M. Rhody’s work topic-modeling a corpus of poetry; this work moves beyond modeling themes to modeling rhetoric, and, just as important, it shows how much genre matters.

I found Ben Schmidt’s revised version of his critique of topic modeling totally compelling. I am very glad that so much of the work in the JDH issue, including Ted’s and mine, has been out there in blog form, because it facilitated this very searching, and yet near-instantaneous, challenge from Ben. Like Matt Jockers and David Mimno’s work on topic-model quality and valid inference from LDA (see this by Jockers and Mimno and this by Mimno et al., for example), Ben’s essay shows how thoroughly we have to look into the models we are using, whether they come from LDA or any other technique. But LDA merits special circumspection, especially when we label topics, because of the semi-miraculous appearance of instant validity that MALLET’s initial output often presents.

I know I have much more to figure out, and much more to say, about how to use topic models as evidence for the kinds of arguments we want to make in disciplines like literary studies and history.

Fevered speculation on a sequel

It is striking that the JDH issue makes little mention (that I could see) of any work by social scientists (social scientists other than historians, that is) in this area. To be clear, this is not a criticism of Scott and Elijah, the JDH issue editors, or any of the contributors. On the contrary, JDH 2.1 is an admirably coherent and enviably interdisciplinary journal issue, and I am deeply impressed by the bibliographies of the more formal contributions.

In thinking about where this scholarly conversation will go next, however, I do worry that DH’s occasional “Big Data” talk, some prominent DH projects’ affinity for metaphors from the natural sciences (“labs,” “experiments”), and the richness of the DH work being done in CS itself may be slowing down the recognition that DH text-analysis work is often very close indeed to sociological methods and questions.¹ There is an enormous overlap, just to take the specific example at hand, between the kinds of analysis we humanists are trying to carry out using LDA (etc.) and the aggregative analysis that social scientists do all the time when they analyze survey, interview, and archived text data.

I have (for only tangentially related reasons) recently been reading handbooks of content analysis for sociologists and political scientists, and the connections with the concerns in the DH topic modeling discussion are palpable. I see some of the same concerns over theme labeling, the same desire for explanations of historical and social variation, the same anxieties about and needs for algorithmic “coding” of language. Here is an example from a content-analysis discussion from the olden days (the Eighties):

Perhaps the weakest form of validity is face validity…A category has face validity to the extent that it appears to measure the construct it is intended to measure. Even if a number of expert judges agree, face validity is still a weak claim as it rests on a single variable….Unfortunately, content analysts tend to rely heavily on face validity; consequently, other social scientists often view their results with some skepticism. (Robert Philip Weber, Basic Content Analysis [Beverly Hills: Sage, 1985], 19)

What has changed since 1985 in content analysis in sociology and political science? I am still finding out, and doubtless my own newness to DH means I am ignorant of all kinds of already-ongoing DH connections with the social sciences. Yet I still think it’s worth asking how DH’s dialogue with the social sciences is going to develop.

  1. By “some of us” I may just mean “me.” But I think it’s not just personal. The possibility of sharing in “Big X” (X = Science, Data, Money) is seductive, whereas the humanities, especially literary studies, has an endemic frenemy relationship with the “third culture” of sociology.