The Quiet Transformations of Literary Studies


My article with Ted Underwood on the long-term history of literary studies is now available in preprint form (pdf). This essay will appear in NLH, but the publisher has kindly given permission for us to make the submitted manuscript available now. Please keep your eyes open for the final published version. It’s been a long (though enjoyable) journey from our initial exploratory blog post in late 2012 to this essay.

The article’s primary evidence is a topic model of seven scholarly journals in literary studies that span the last century and a bit more. In order to make this form of evidence more accessible, I designed a website that allows you to explore the model. Ted and I spent a long time exploring the many, many different ways you can slice and dice topic models before arriving at the claims we make in the essay. To get a sense of how we might have gotten there, spend some time surfing the model, seeing both its power to reveal and its inevitable froth of randomness and approximation. And, hopefully, you’ll discover patterns of your own.

What I’m most proud of in our essay is our effort to show why our approach responds to central questions in literary history and literary theory. We are interested in topic modeling in itself, but I see our essay as aiming primarily at two things. We try to revise literary scholarship’s conception of its own history. And we argue that quantitative methodologies in literary history, far from dispensing with interpretation, attempt to scale up the interpretation of cultural texts while keeping the major problems of interpretation in view. This scaling-up tends to bring humanistic scholarship into closer proximity to the challenges and methodologies of the social sciences.

I’m planning to expand on that last claim in my talk at DH 2014…. Ted’s and my joint formulations are in the essay. Read it, and browse the website.

After the jump, the rest of this post is all geeking out about programming and workflows.

Reflections on R

We did the business of generating and analyzing topic models almost, but not quite, entirely in R. R has many helpful features that can make programming a magical journey in which one learns one’s full capacity for rage and soul-crushing despair. Nonetheless, it is irresistible for data analysis because of the concision of the language, the power of its interactive mode, the incredible depth of CRAN, and the beauty of ggplot2 visualizations. The existence of an R wrapper for MALLET sealed the deal.

But R is weak when it comes to any project that requires more than one script file. A topic modeling analysis involves:

  1. making a bunch of different models;

  2. doing some legwork to pull the modeling results, which are not simple because the model is hierarchical, into memory and exploring them;

  3. understanding the corpus composition on its own terms (with all the accompanying data-cleaning this requires); and

  4. finally, working out an analysis that yields the numerical results and plots you need.

As we worked on this, our script files and data-output directories rapidly multiplied, and it got harder and harder to manage this and to share code that didn’t break when it went from one file system to another. R offers only two mechanisms for managing code dependencies: source() and packages. It’s technically true that R has deep namespacing mechanisms but nothing like the straightforward module mechanisms of perl and python (or Haskell!!oneone) or the class architecture of a language like C++. Plus R’s object orientation is not exactly transparent enough to let you just start cooking up classes the way you would in Ruby or Java or whatever. So our strategy involved a bunch of scripts that got sourced in some order and filled up the global namespace like the dickens. This led to lots of Heisenbugs when something that appeared to work actually depended on some piece of state I’d created interactively in the session before testing a script.

I now think that packages are the way to go for any R project of more than one script file. Use Wickham’s devtools and this gets a lot more tractable. Anyway, my nightmarish tangle of R scripts (plus some Python and a little Perl here and there) is available on github, because we wanted our work to be documented. (We are working on finding a repository home for the data files.) However, if you actually want to follow along the Data-for-Research-topic-modeling-with-MALLET lines laid down by our endeavor, please take a look at my continuing work on dfrtopics, an adaptation of the nightmarish tangle into a better-documented and better-organized R package. More to the point, I urge you to follow the development of David Mimno’s mallet R package and Ben Marwick’s JSTORr as well.

Reflections on browsing in the browser

Once we started showing and discussing this work in public, it was clear that people liked, and gained from, being able to look over model visualizations and interpret them for themselves. That motivates the dfr-browser project, which is the basis for the topic model browser accompanying our essay. The space of topic model visualizations seems a bit overcrowded to me right now, and I’m sure some of my choices would make a professional sad, but the thing—and this is the thing not only about visualization but about modeling as well—is that domain experts want to use the visualization and the model for their own ends. As a result, some really impressive existing designs don’t quite fit the uses we have in mind. I am convinced that, just as social scientists and natural scientists grit their teeth and learn to program and produce visualizations when they need to support their analysis, so too must those of us in the humanities. Off-the-shelf solutions are not great at answering expert research questions. What should come off the shelf are components that the researcher knows how to put together.

d3 makes this kind of thing tractable for interactive information design on the web. The library is beautifully documented both by its author and in Scott Murray’s nice introduction, soon to be complemented by Elijah Meeks’s forthcoming book. Then I got sucked into the pleasures of web programming and the infinite gadgetry of HTML5. I don’t know if anyone would actually want to follow the pattern I have set up in dfr-browser. It is my novice effort at a web data application, shaped by my eccentric formation as an amateur programmer whose ideas of interface design were set by the System 7-era Macintosh Toolbox Essentials. But anyway my approach emerges from the following constraints:

  1. Access to static web hosting is easy, and a perk of university membership; all other kinds of hosting present various barriers (cost, setup time, institutions) and involve more serious system administration.

  2. If I am presenting my research, I want tyrannical control over every aesthetic dimension.

d3 is very well suited to this kind of do-it-yourself enterprise. The hosting constraints resulted in:

  1. Compressing the data. Having had to learn about sparse matrices in order to manipulate the 100,000 × 21,000 term-document matrix for the whole corpus in R, I realized that compressed sparse column format was a pretty good way to save space on the doc-topics matrix too. It cut the data size by more than half, from an impractical 7 MB to a more practical 3 MB. There are other ways to compress big arrays. Grown-ups use server-side databases, but: static hosting.

  2. Some use of concurrency. I figured this out in two stages.

    a. First of all, d3 provides a nice set of asynchronous file-request calls. The design I opted for calls a view_refresh function every time a file is loaded; meanwhile, a view simply stops rendering, or displays a loading message, if the data it needs hasn’t arrived yet. Even if you’re still waiting for metadata, you can draw a topic overview grid, for example. Thus the browser program proper begins by spawning a series of calls to a data-loading function, with callbacks that store each piece of loaded data and then refresh the view.

    b. With compressed matrix formats, what you gain in space you lose in speed. When I was testing on a model with 64 topics and a few thousand documents, this wasn’t a big problem, but on the 150-topic, 21,000-document model, the web browser would block for too long: the spinning color wheel of death was not acceptable as a regular part of the user experience. So I moved the computationally intensive matrix operations into a Web Worker. This was not too hard to do, since Workers are deliberately simplified: all communication with the worker thread passes through a single message queue. The main-thread end of the channel is defined by my data-model object implementation; the worker, which just holds the document-topic matrix, has to be a script file of its own.

  3. An attempt to map states of the visualization onto URLs of the site. The key, for static hosting, is the hashchange event, which becomes the driving force of interactivity in the application. Every hash change triggers a “refresh” event in which a new view can be selected. No new resources need to be requested from the server; in essence, the page becomes a mini-server of its own, responding to requests placed after the # in the URL (#/model/grid, #/topic/16, #/bib, and so on). Interface controls can be simple anchor elements bearing appropriately structured links.
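To make the column-compressed storage in item 1 concrete, here is a minimal sketch (not dfr-browser’s actual code; all names are illustrative) of decoding one document’s topic weights from the three standard CSC arrays:

```javascript
// Read one column (one document) out of a compressed-sparse-column store.
// `p` holds column pointers, `i` row indices, and `x` non-zero values.
function cscColumn(m, col, nRows) {
  var dense = new Array(nRows).fill(0);        // most entries stay zero
  for (var k = m.p[col]; k < m.p[col + 1]; k++) {
    dense[m.i[k]] = m.x[k];                    // scatter non-zeros into place
  }
  return dense;
}

// A toy 3 × 2 matrix [[5, 0], [0, 7], [2, 0]] in CSC form:
var m = { p: [0, 2, 3], i: [0, 2, 1], x: [5, 2, 7] };
cscColumn(m, 0, 3); // → [5, 0, 2]
cscColumn(m, 1, 3); // → [0, 7, 0]
```

Only the non-zero entries and the index arrays travel over the wire, which is where the space savings come from; the price is this small decoding loop on every column access.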
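The load-then-refresh design in item 2a can be sketched as follows (function and key names are hypothetical, not dfr-browser’s API): each asynchronous request stores its result and pokes the view, and the view renders only once the data it needs has arrived.

```javascript
var data = {};                         // loaded pieces accumulate here

function view_refresh() {
  if (data.dt === undefined) {
    return "Loading...";               // doc-topics not here yet: show a message
  }
  // metadata may lag behind; a topic overview needs only the doc-topics matrix
  return "rendered with " + Object.keys(data).join(", ");
}

function data_loaded(key, result) {    // callback for each asynchronous request
  data[key] = result;
  return view_refresh();               // redraw whatever is now drawable
}

// In the browser these would be d3 requests, e.g.
//   d3.json("data/dt.json", function (dt) { data_loaded("dt", dt); });
view_refresh();                        // → "Loading..."
data_loaded("dt", [[1, 0], [0, 1]]);   // → "rendered with dt"
```

The point of the pattern is that no load blocks any other: every arrival triggers a cheap, idempotent refresh, and the page fills in as the files come down.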
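The Worker protocol of item 2b might look like the following sketch (message names are made up for illustration). Writing the handler as a plain function shows the single-queue discipline without the browser machinery:

```javascript
var docTopics = null;  // the worker's only state: the document-topic matrix

function handle_message(msg) {
  switch (msg.what) {
  case "set_dt":       // main thread ships the matrix over once, up front
    docTopics = msg.dt;
    return { what: "set_dt", result: true };
  case "topic_docs":   // ask for one topic's weight in every document
    return {
      what: "topic_docs",
      result: docTopics.map(function (row) { return row[msg.t]; })
    };
  default:
    return { what: "error", result: msg.what };
  }
}

// In worker.js this would be wired up as:
//   onmessage = function (e) { postMessage(handle_message(e.data)); };
handle_message({ what: "set_dt", dt: [[1, 9], [4, 6]] });
handle_message({ what: "topic_docs", t: 1 }).result; // → [9, 6]
```

Because everything crosses a single message queue, the main thread never touches the matrix directly; it just posts a request and redraws when the reply arrives.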
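And the hash routing of item 3 reduces to a small parser plus a hashchange listener. A sketch under the same caveats (parse_hash is an illustrative name, not the actual dfr-browser code):

```javascript
// Turn a fragment identifier like "#/topic/16" into a view request.
function parse_hash(hash) {
  var parts = hash.replace(/^#\/?/, "").split("/");
  return {
    view: parts[0] || "model",  // empty hash falls back to an overview
    param: parts[1]             // e.g. a topic number; may be undefined
  };
}

// In the browser, the page re-renders on every change to the fragment:
//   window.onhashchange = function () {
//     render(parse_hash(window.location.hash));
//   };
parse_hash("#/topic/16");  // → { view: "topic", param: "16" }
parse_hash("#/bib");       // → { view: "bib", param: undefined }
parse_hash("#");           // → { view: "model", param: undefined }
```

Since only the fragment changes, the back button, bookmarks, and shareable links all work without a single extra server request.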

As for aesthetic obsessiveness, the nice thing about d3 is that it lets you stay close to the underlying HTML/SVG. I like this, but I still adopted a framework, the very handy Bootstrap. This provides a pretty good balance of most of the basic interface components I want and a very minimal framework. It does introduce a jQuery dependency, which feels sort of silly in a d3 project. I hope someone is rewriting all of jQuery in d3.

Aesthetic Obsession

Okay, there’s one part of the R coding that isn’t wrapped up in dfrtopics that might be of interest: namely, the code used to draw our figures in the essay. That is largely to be found in github/agoldst/tmhls/figures.Rmd. Again, this is a sort of kludgy solution; since starting work on this project, I’ve gotten more practice with knitr and would be happier composing a whole essay in R markdown. In this case, though, knitr is just a harness for figure generation.

It’s still pretty useful, though. First of all, I put each figure in a chunk. Then knitr can keep track of which figures need redrawing as you change the code, and it does a good job of caching R objects needed for rendering. So figures.Rmd is kind of a Makefile1 for figures.
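Concretely, a figure chunk in that setup might look like this (the chunk and function names here are hypothetical, not taken from figures.Rmd); cache=TRUE and dependson are the knitr options doing the Makefile-ish bookkeeping:

````
```{r topic_trend_figure, cache=TRUE, dependson="load_model", dev="tikz"}
## re-run only when this code, or the cached "load_model" chunk, changes
plot_topic_trend(m, topic = 16)
```
````

knitr hashes the chunk code (and the chunks named in dependson) and skips re-rendering when nothing relevant has changed.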

Second, knitr wrangles TikZ graphics for you. This is crucial for obsession-satisfaction: it outputs your R graphics in TeX code, then uses the TeX engine to render to PDF. Any typography you can do in TeX, you can now do in R figures. For our purposes, the most crucial thing is the ability to use system fonts. (It’s also nice to know you could use TeX math mode if you wanted.)

Wait! I hear you cry. System fonts with TeX? Yes, provided your TeX engine is XeTeX. However, the tikz-graphics-plus-xelatex combination is pretty fragile, and on my system, at least, it regularly segfaults when I try to use it directly, calling a tikz() device in R and sticking the results in a TeX file with \input. Fortunately, knitr is smarter than I am: it runs tikz and xelatex without a hitch, rendering the figure to PDF from there. All you need to know is that you have to add the following utterly transparent commands to the start of your R markdown file (more precisely, to a chunk that all the other chunks depend on):

    ## assuming knitr's tikz device: the font line goes into tikzDevice's
    ## preamble option for the xelatex engine
    options(tikzDefaultEngine = "xetex")
    options(tikzXelatexPackages = c(
        getOption("tikzXelatexPackages"),  # keep the default preamble lines
        "\\setmainfont[Ligatures=TeX]{Minion Pro}\n"))

Replace Minion Pro with the name of your desired OpenType system font. Ganz einfach! I look forward to learning that this sequence breaks in the next incremental update of knitr, ggplot2, R, tikz, Mac OS X, or the Oxford English Dictionary.

Otherwise our preprint is typeset from basically straightforward LaTeX, using xelatex for its font capabilities.2 I see no reason why a preprint should be in double-spaced Times New Roman. So we’ve tried to make a document you can read fluently on screen or on paper.

And after all, the point is to produce arguments that are worth reading. Here’s hoping you find that our essay is.

  1. Though it’s impossible to overstate the usefulness of actual Makefiles. Make is one language/tool I really do think everyone should learn a little of. Mike Bostock, the author of d3, agrees.↩︎

  2. Actually I used the truly excellent latexmk, which automagically calculates dependencies and runs xelatex, biber, etc. as many times as needed. ↩︎