Topic modeling: a software update

DH, kludgetastic

I have spent a lot of time experimenting with and exploring topic models of text. Aside from an article, some blog posts, and a bunch of strongly held opinions, that time also produced quite a few lines of computer code for handling topic models from MALLET. I started out with a big file of R functions, then escalated to a folder full of R functions. The organization got ever more byzantine, even more so when I collaborated with others. Finally I bit the bullet and, following the gospel of Wickham, converted my pile o’ scripts into an R package, called dfrtopics because I was making models of data from JSTOR’s Data for Research. There it has sat, on GitHub, accumulating bits now and then, plus some function documentation written in a fit of compulsion, but really not in a form that anyone but me could use (and, as time went on, becoming hard for me to use too). My website has had a note promising a tutorial demonstration of how to use my package for nigh-on two years, but no demonstration demonstrated itself.

The package was hard to use and document because of the messy and ad hoc way I represented pieces of the topic model. A hierarchical model is not easy to wrap your mind around, and different questions require different slices of the model to answer. And all the mess of code passing around random collections of data frames, lists, and who knows what else seemed like fertile ground for errors and glitches, even when the whole thing seemed more or less to do what I wanted most of the time.

So: in a questionable expenditure of energy, I’ve spent a few days applying some polish to the package. Herewith dfrtopics version 0.2. Install it from github with devtools::install_github("agoldst/dfrtopics").

Three things are new from a potential user’s perspective—and my hope is that the idea of a potential user is slightly less far-fetched than before. First, there is now an introductory tutorial in the form of a package vignette. Second, the whole package has been rewritten around the idea that a topic model is an object, stored in a single R variable. Third, I have tried to make everything as modular as possible, so that the usefulness of the package is not restricted to MALLET models of DfR data. If you have other textual data in wordcount-form—oh, let’s say, 180,000 18th-through-early-20th-century volumes—you might write up some variant file-loading code if necessary, then use the package functions to model those texts with MALLET. And if you have wordcount data that you want to analyze in R in some other way than with MALLET, there are still, I hope, some useful things here for converting those wordcounts into data frames or term-document matrices.

Any operation on the model might need some or all of the parts of the model or—and this is very important for cultural-historical work—metadata about the documents in the corpus. So all of these parts should follow that object around and be accessible at need. Say that the model is called m. The matrix of topic weights in documents is doc_topics(m); the matrix of word weights in topics is topic_words(m); the metadata is metadata(m); summary labels of topics are topic_labels(m); estimated hyperparameters are hyperparameters(m); and so on. There are write_mallet_model and load_mallet_model methods for conveniently saving and loading the necessary files. This tries to address the major user-interface difficulty presented by command-line MALLET, which is the multiplicity and somewhat arbitrary configuration of its many outputs.
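To make the "model as a single object" idea concrete, here is a minimal sketch of the pattern in base R. It is not the dfrtopics implementation: the toy_model constructor, the toy_* accessors, and top_doc are hypothetical names invented for illustration, and the numbers are made up. The point is only that once the pieces travel together, a function that needs both topic weights and metadata can take one argument.

```r
# Hypothetical constructor (not part of dfrtopics): bundle the pieces of a
# toy model into one object and check that their dimensions agree.
toy_model <- function(doc_topics, topic_words, metadata) {
  stopifnot(nrow(doc_topics) == nrow(metadata),
            ncol(doc_topics) == nrow(topic_words))
  structure(
    list(doc_topics = doc_topics,
         topic_words = topic_words,
         metadata = metadata),
    class = "toy_model"
  )
}

# Hypothetical S3 accessors: callers never reach into the list directly.
toy_doc_topics <- function(x) UseMethod("toy_doc_topics")
toy_doc_topics.toy_model <- function(x) x$doc_topics

toy_metadata <- function(x) UseMethod("toy_metadata")
toy_metadata.toy_model <- function(x) x$metadata

# Made-up data: 3 documents, 2 topics, 3 words.
m <- toy_model(
  doc_topics = matrix(c(8, 2,
                        1, 9,
                        5, 5), nrow = 3, byrow = TRUE),
  topic_words = matrix(c(6, 3, 1,
                         1, 2, 7), nrow = 2, byrow = TRUE,
                       dimnames = list(NULL, c("poem", "verse", "novel"))),
  metadata = data.frame(id = c("a", "b", "c"),
                        year = c(1905, 1910, 1915),
                        stringsAsFactors = FALSE)
)

# A question that needs two parts of the model at once:
# which document's metadata goes with the highest weight in a given topic?
top_doc <- function(x, topic) {
  weights <- toy_doc_topics(x)[, topic]
  toy_metadata(x)[which.max(weights), ]
}

top_doc(m, 2)
```

With a real model from the package, the analogous calls would be the accessors named above, such as doc_topics(m) and metadata(m).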

This approach tries to encourage flexibility in exploring model outputs, rather than offering prefabricated summaries that may or may not reveal the strengths and weaknesses of the modeling process. Naturally this means I have at best helped to substitute the frustrations of R for those of command-line MALLET. But R, especially in conjunction with dplyr, is a nice platform for analyzing well-organized tabular data. What I ended up writing a lot of code for was precisely getting from the MALLET-output stage to the organized stage.

The front end of the process also needs flexibility. Here I have not been rigorously object-oriented, since the extra complexity did not seem worth it: R’s data frame type (or a dplyr tbl if you want to get fancy) is good enough for the kind of corpus work you’d do in conjunction with this kind of modeling.[1] What’s important, again, is that it be possible to move thoughtfully from a folder full of texts in some form or other to the feature vectors or sequences that form inputs to the model. At each step you should be able to reason about which documents you are keeping, which words you are treating as instances of the same feature, how much mess you are willing to tolerate, etc. Some of the ways you can do this are highlighted in the vignette. They do not involve any fancy programming constructs. The dplyr pipeline is a nice idiom for doing this, however, and more straightforward than R’s basic subscripting syntax: so straightforward that when I taught dplyr to students after weeks working through vector subscripting, they demanded to know why they had to learn subscripting in the first place.
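As a rough illustration of that kind of step-by-step reasoning, here is a sketch of a dplyr pipeline over wordcounts in long form. The data frame, its column names (id, word, weight), and the stoplist are all made up for the example; they are assumptions, not the format the package requires. Each step is one explicit decision about features or documents.

```r
library(dplyr)

# Made-up wordcounts: one row per (document, word) pair.
counts <- data.frame(
  id = c("doc1", "doc1", "doc1", "doc2", "doc2", "doc3"),
  word = c("the", "poem", "Poem", "the", "novel", "the"),
  weight = c(10, 3, 1, 8, 2, 1),
  stringsAsFactors = FALSE
)
stoplist <- c("the", "of")

filtered <- counts %>%
  mutate(word = tolower(word)) %>%      # treat case variants as one feature
  filter(!word %in% stoplist) %>%       # drop stopwords
  group_by(id, word) %>%
  summarize(weight = sum(weight)) %>%   # merge counts of the unified features
  ungroup()

# The same steps with base subscripting, for comparison.
base <- counts
base$word <- tolower(base$word)
base <- base[!(base$word %in% stoplist), ]
base <- aggregate(weight ~ id + word, data = base, FUN = sum)
```

Note that doc3 consisted only of a stopword, so it vanishes from the result entirely. That is exactly the sort of consequence worth noticing before handing the features to a model.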

On the implementation side, I made some improvements, mostly for my own aesthetic pleasure: I’m not sure they make much other difference. I replaced plyr and reshape2 functions with either dplyr operations or sparse matrices (via the standard Matrix package) everywhere. That was a fun exercise in grappling with standard evaluation forms of dplyr functions. The ambiguity between matrix and data frame is more or less unavoidable with this kind of data, I think: sometimes you think strictly “rows are cases, columns are variables”—then you are in data-frame territory. But other times you think: “this is a two-subscript feature of the whole corpus” or “this is a map from a high-dimensional space to a lower-dimensional one”—then we’re talking matrices.
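The two views can be sketched with the Matrix package. This is an illustration under assumed inputs, not code from dfrtopics: the counts data frame and its column names are made up. The long data frame is the "rows are cases" view; the sparse term-document matrix is the "two-subscript feature of the whole corpus" view, and you can go back and forth between them.

```r
library(Matrix)

# Made-up wordcounts in long (data frame) form.
counts <- data.frame(
  id = c("doc1", "doc1", "doc2", "doc3", "doc3"),
  word = c("poem", "verse", "novel", "poem", "novel"),
  weight = c(4, 1, 2, 3, 5),
  stringsAsFactors = FALSE
)

docs <- unique(counts$id)
vocab <- unique(counts$word)

# Matrix view: rows are words, columns are documents.
tdm <- sparseMatrix(
  i = match(counts$word, vocab),
  j = match(counts$id, docs),
  x = counts$weight,
  dims = c(length(vocab), length(docs)),
  dimnames = list(vocab, docs)
)

# Two-subscript questions are natural here.
tdm["poem", "doc3"]   # count of one word in one document
colSums(tdm)          # document lengths

# And back to the data frame view via the matrix's triplet summary.
trip <- summary(tdm)  # columns i (row), j (column), x (value)
long <- data.frame(
  id = docs[trip$j],
  word = vocab[trip$i],
  weight = trip$x,
  stringsAsFactors = FALSE
)
```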

I suppose I fantasize that this package could now be extended to cover other kinds of models than vanilla LDA, but we’ll see about that.

The other novel aspect of the implementation is that I went a bit bananas with the unit tests. Continuing with the Hadleyverse theme, the enabling factor here was Wickham’s testthat. It gets more and more satisfying / dangerously engrossing to “prove” to oneself that one’s code works. Much of the package is still untested, but I wrote a lot of tests, and as of today (September 17, 2015) they all pass here at home. For what that’s worth—unfortunately you cannot simply install the package from github and rerun the tests. Because I wanted to test on real data, I wrote the tests in terms of a sample data set from JSTOR DfR. I don’t feel comfortable redistributing this data, though I must remark that it consists entirely of files produced from texts in the public domain (1905–1915 issues of PMLA and Modern Philology). Anyway, I suppose that means I should welcome bug reports. File an issue on github if you like (or better yet, send me a pull request), though, of course, no academic should make any promises about maintaining their software produced in the course of research.
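For readers who have not seen testthat, a test looks roughly like the following sketch. The function under test, normalize_rows, is a hypothetical helper made up for this example and is not part of the package; the data are invented too.

```r
library(testthat)

# Hypothetical helper (not from dfrtopics): rescale each row of a
# matrix of weights so that it sums to 1.
normalize_rows <- function(x) {
  x / rowSums(x)
}

test_that("normalize_rows yields rows that sum to 1", {
  x <- matrix(c(8, 2,
                1, 9,
                5, 5), nrow = 3, byrow = TRUE)
  result <- normalize_rows(x)

  expect_equal(dim(result), dim(x))
  expect_equal(rowSums(result), c(1, 1, 1))
  expect_equal(result[1, 1], 0.8)
})
```

Tests on made-up matrices like this one can be redistributed freely, though they exercise much less than tests on real data do.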

I’ve certainly reinvented many people’s wheels here. I’m obviously building on the work in MALLET and especially David Mimno’s work on R mallet, which this package depends on. I remain a machine-learning neophyte and so there is doubtless a whole universe of expert tools, including R software, that already does all this. But it seemed worthwhile to polish what I’d done on my own, since it bears the stamp of my desire to do a certain kind of work as a scholar of culture in a way that might be useful to others—if only as a monitory lesson.

In any case, if anyone uses this, I’d love to know.

  1. Yes, this is a passive-aggressive comment about the woeful tm package. When it comes to a “text-mining infrastructure for R”, I feel about it the way Gandhi is said to have felt about Western civilization.