I have spent a lot of time experimenting with and exploring topic models of text. Aside from an article, some blog posts, and a bunch of strongly held opinions, that time also produced quite a few lines of computer code for handling topic models from MALLET. I started out with a big file of R functions, then escalated to a folder full of R functions. The organization got ever more byzantine, even more so when I collaborated with others. Finally I bit the bullet and, following the gospel of Wickham, converted my pile o’ scripts into an R package, called dfrtopics because I was making models of data from JSTOR’s Data for Research. There it has sat, on GitHub, accumulating bits now and then, plus some function documentation written in a fit of compulsion, but really not in a form that anyone but me could use (and, as time went on, becoming hard for me to use too). For nigh-on two years my website has carried a note promising a tutorial demonstration of how to use the package, but no demonstration demonstrated itself.
The package was hard to use and document because of the messy and ad hoc way I represented pieces of the topic model. A hierarchical model is not easy to wrap your mind around, and different questions require different slices of the model to answer. And all the mess of code passing around random collections of data frames, lists, and who knows what else seemed like fertile ground for errors and glitches, even when the whole thing seemed more or less to do what I wanted most of the time.
So: in a questionable expenditure of energy, I’ve spent a few days applying some polish to the package. Herewith dfrtopics version 0.2, which can be installed from GitHub.
Three things are new from a potential user’s perspective (and my hope is that the idea of a potential user is slightly less far-fetched than before). First, there is now an introductory tutorial in the form of a package vignette. Second, the whole package has been rewritten around the idea that a topic model is an object, stored in a single R variable. Third, I have tried to make everything as modular as possible, so that the usefulness of the package is not restricted to MALLET models of DfR data. If you have other textual data in wordcount form (oh, let’s say, 180,000 18th-through-early-20th-century volumes), you might write up some variant file-loading code if necessary, then use the package functions to model those texts with MALLET. And if you have wordcount data that you want to analyze in R in some other way than with MALLET, there are still, I hope, some useful things here for converting those wordcounts into data frames or term-document matrices.
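To give a rough sense of what converting wordcounts into a term-document matrix involves, here is a minimal sketch. It is not the package's own code: the data frame, its column names (`id`, `word`, `weight`), and the helper `wordcounts_tdm` are all invented for illustration. The only library call is `sparseMatrix` from the Matrix package, which ships with R.

```r
library(Matrix)  # a recommended package, included with standard R installs

# Hypothetical long-format wordcounts: one row per (document, word) pair.
# The column names are assumptions made for this sketch.
counts <- data.frame(
    id = c("doc1", "doc1", "doc2", "doc2", "doc3"),
    word = c("topic", "model", "topic", "text", "model"),
    weight = c(3, 1, 2, 5, 4),
    stringsAsFactors = FALSE
)

# Hypothetical helper: build a sparse term-document matrix,
# with one row per word and one column per document.
wordcounts_tdm <- function(counts) {
    words <- sort(unique(counts$word))
    docs <- unique(counts$id)
    sparseMatrix(
        i = match(counts$word, words),   # row index: position of the word
        j = match(counts$id, docs),      # column index: position of the document
        x = counts$weight,               # cell value: the word's count
        dims = c(length(words), length(docs)),
        dimnames = list(words, docs)
    )
}

tdm <- wordcounts_tdm(counts)
colSums(tdm)  # total word count per document
```

A sparse representation is the natural choice here, since most words do not occur in most documents and a dense matrix for a large corpus would be almost entirely zeros.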