The other day I was talking with an innocent bystander about some of my past work in the digital humanities. It occurred to me to wonder what a person who went looking for that work would find. The abyss also looks into you. Anyhoo, once upon a time I spent a lot of time working with data from JSTOR’s Data for Research service, a thing that no longer exists, and I produced two fairly elaborate programming projects related to topic models of text: my dfrtopics R package and my dfr-browser topic-model visualization. I am writing this post to announce that those things are still available and continue to shamble on, zombie-like, into the coming apocalypse. But I don’t plan to develop them further.
The “dfr” in those two names referred to JSTOR’s web service offering downloads of wordcount data from its digitized database of journals. That service has been replaced by a more elaborate “platform,” Constellate, which offers comparable outputs as well as some built-in exploration and modeling capacity. I am not going to update my software to work with Constellate outputs directly. Past experience suggests that any data from the service will have its own idiosyncrasies and that, as always, much of the work of analysis will actually consist in data-cleaning, finding and dealing with errors, discovering how to get large data files into and out of R, and so on. Puttering around on Constellate, I find it shiny and responsive, and also full of extra features I don’t want. It is disappointing (unsurprising, but disappointing) to see the platformization of JSTOR’s digitized archive.1 Humanists who want to study this data don’t need yet another limited, encapsulated, moderately opaque portal. They need full access together with clear statements about the provenance and characteristics of the data.
But I digress. Since at some point in the past I had the bright idea of writing a series of tests for my dfrtopics package, I had the pleasure of running the tests to find out if something I last touched in 2019 (and which depended on, among other things, Java code dating back two decades) still runs in these bright new days. Answer: it needed a little fiddling, but yes, it still does.2 The other update I have made is to the project’s GitHub homepage, where I now recommend that users take a look at more recently developed general-purpose tools. stm offers some facilities for modeling, inference, and interpretation that aren’t in mallet and would have been, and still are, much to the purpose. I also point to the tidytext package, which might help on the preprocessing end. I guess those are probably the right starting points these days for anyone who isn’t too distracted by shiny “AI” news to want to try modeling textual data with actually sort-of interpretable models. Then again, it’s possible some of the things I stuck into dfrtopics because I couldn’t find other implementations might still have applications. My approach to aligning multiple topic models for comparison could be handy, simplistic as it is. I also implemented a couple of “posterior predictive checks” for topic models which could be used for identifying violations of modeling assumptions, and I’m not sure that is easily available in other packages.3
This seems as good a place as any to thank the people who’ve gotten in touch with me about dfr-browser and dfrtopics over the years. A number of people wrote me to tell me they were experimenting with dfr-browser or even using it productively for their work. Some kind souls even sent me pull requests on GitHub with fixes to the code while it was accumulating cobwebs. I’m sorry I never merged those in. Turns out maintaining open-source software takes, you know, work.
Not, however, without an avalanche of annoying warnings from the “tidyverse” packages about using deprecated methods. That’s because dplyr’s interface underwent a big overhaul in the intervening years, and the idioms I painfully mastered in 2015 or whatever have all been replaced with new, shiny idioms. By comparison, all that Java code in mallet works just fine. ↩︎
I would like to give myself half a pat on the back for having put some time into documentation, though the documentation I wrote is not quite so illuminating to me now as it was a few years ago. For what it’s worth, you can read about the model alignment functions with
help("align_topics", "dfrtopics")
and the model checks with
help("imi_check", "dfrtopics"). ↩︎
Also web standards and browsers that follow them. ↩︎