Some cultural datasets for teaching use

teaching, dh, kludgetastic

Manual pin: My article, “Origins of the US Genre-Fiction System, 1890–1956,” is just out in Book History. Read all about it.

I made an R package with some “cultural” datasets of various kinds that might be of pedagogical use. It is available on GitHub as agoldst/dataculture. See the repository page for a summary of the datasets, which I used to teach introductory analyses of:

  1. cultural tastes over time and social space (names, music genres, recipes)
  2. textual/paratextual signs of fictional genre (text-mining science fiction and crime)
  3. historical and fictional social networks (Hamlet characters and eighteenth-century Bostonian troublemakers)

If that sounds interesting, take a look at the package and the lecture slides and lab exercises from the course I created it for, “Data and Culture” (Fall 2022). I’m hoping to save someone somewhere a few steps of wheel-reinventing.

Eight years ago, I taught a Literary Data graduate seminar and struggled mightily. Of many challenges, finding literary datasets that yielded actually interesting results to elementary analysis was one of the hardest. Last fall I tried again, this time at the undergraduate level, offering Data and Culture with my colleague Meredith McGill.

Now it’s 2023, several degrees of global warming are locked in, and I shudder to think what 2024 will bring…and the problem of humanities datasets is still nearly as challenging. Humanities data appears abundant because of mass digitization, but pedagogically usable data for intro-level students remains very scarce. Beginning students need pre-cooked data that can be easily manipulated with a starter set of data-analysis techniques; and they ought to get data that bears some relation to intellectually interesting questions, not “toy” data which might help them learn “data wrangling” but provides no rationale for scholarship.[1] Statistical education in the social sciences copes with these constraints one way and another, but digital humanities has tended instead to be seduced by the sheer scale of mass-digitized datasets, or by the bespoke appeal of TEI-XML editions, which can be great as works of editing but whose potential as data remains largely hypothetical after more than thirty years of excited promises.

But the past years have not been entirely in vain. A number of open socialists won elected office, and two things happened which helped me compile some teachable datasets for humanities students. First, some people with more self-discipline than I have produced textbooks in data analysis for humanities students, and with these texts comes the real prize, usable data. Folgert Karsdorp, Mike Kestemont, and Allen Riddell made available all the data to go with their Humanities Data Analysis textbook (the book is open access too). Since their goal was to produce actually interesting case studies, their data is actually interesting! And the use of their data need be in no way constrained by their choice of Python as the language of exposition. I drew on two of their examples, a riff on Franco Moretti’s exploration of Hamlet’s character network and a study of historical American cookbooks.

Meanwhile, many people working in computational humanities and social sciences have embraced the “open science” idea and started sharing code and data systematically. So I hoped that it would be easier to find an article or book chapter I could assign as a reading and guide students through reproducing some part of its analysis. On some of the more sociological topics of the course, this was not too hard: though Stanley Lieberson’s analysis of baby naming patterns dates from 2000 and relied on what were then fairly hard-to-access records, the Social Security Administration’s baby names file is available these days as an easy-to-use R package. And the much-studied 1993 General Social Survey Culture Module on musical taste comes handily predigested in Kieran Healy’s marvelous gssr package. I hope Healy thinks imitation is the sincerest form of flattery, since I also modeled one of my network-analysis lessons on a blog post he wrote about Paul Revere’s ride and used the dataset he transcribed for it.
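For readers who want to see the basic move behind a membership-based network lesson of this kind, here is a minimal sketch in base R. It is not the lab itself: the matrix below is a made-up toy, and the person and club labels are hypothetical placeholders rather than the transcribed Revere data. The idea is that multiplying a person-by-group membership matrix by its transpose gives, for each pair of people, the number of groups they share.

```r
# Toy person-by-group membership matrix (1 = member, 0 = not a member).
# All labels are hypothetical placeholders, not the historical data.
membership <- matrix(
  c(1, 1, 0,
    1, 0, 0,
    0, 1, 1,
    0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    person = c("person_a", "person_b", "person_c", "person_d"),
    group  = c("club_x", "club_y", "club_z")
  )
)

# Person-by-person matrix: entry [i, j] counts the groups i and j share.
shared <- membership %*% t(membership)
diag(shared) <- 0  # ignore each person's tie to themselves

# Group-by-group matrix: entry [i, j] counts the members two groups share.
overlap <- t(membership) %*% membership
diag(overlap) <- 0

# Total shared memberships per person, a crude first measure of who is
# well connected, printed from most to least connected.
ties <- rowSums(shared)
print(ties[order(-ties)])
```

Swapping the order of the multiplication gives the group-by-group view instead, which is often the easier one to read first when the table is small.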

But when it comes to literary datasets, the situation remains much more challenging. There are many reasons for this, notably the degree to which primary sources are encumbered by copyright or locked up in proprietary databases. Then there is the intrinsic difficulty of textual data, which is messy, sparse, high-dimensional, etc.: this has led most recent computational humanists to experiment with pretty elaborate statistical techniques (mea culpa) whose workings and limitations I didn’t want to hand-wave about. We’re already drowning under endless bullshit about fancy statistical techniques applied to large text corpora, even though those techniques can’t even win at tic-tac-toe.

Anyhoo, that’s why I really admire the unique qualities of Ted Underwood’s Distant Horizons, which makes provocative, accessible claims, while keeping the analytical machinery (relatively) straightforward. I turned to his imposing replication repository as the source for my most elaborate series of labs and activities, focused on his argument about the “life cycles” of subgenres like science fiction and crime fiction. What distinguishes Underwood’s work in that book is the care lavished on the metadata about the texts and data derived from them, as well as the deliberateness with which his datasets are constructed to answer actual questions of interest (rather than the more typical research question, “I wonder what’s in this bunch of files I can download?”). Even the most elementary tabulations and word-counting procedures lead in interesting directions with his dataset, so I found it very worthwhile to repackage it for elementary exploration in R.[2]

As for the course itself…well, it was hard going for me. Having already authored one self-critical DH course post-mortem, I don’t think I have the right to do another. Suffice it to say I am no more optimistic about the digital humanities than before, and I think the challenges that the post-pandemic cohorts of students (and their teachers) face are really gigantic. But it’s still amazing that any undergrad with an internet connection and a laptop can get their hands on data about culture which took many years to accumulate, and that what was once the province of specialized high-end computing now takes a few lines of R.

  1. Though it’s conventional to remark that data wrangling is a huge pain in the neck that takes 90% of the time for any analysis, it’s also, let’s face it, just not that hard: an entire fake profession, “data scientist,” has been invented in less than two decades for people who learn to data-wrangle from free online tutorials and then make money by creating dashboards for managers or something. Teaching scholarship and understanding its methods and goals, by contrast, still requires the actual institution of higher education. ↩︎

  2. In my repository you can find code documenting just how I repackaged Underwood’s HathiTrust-derived data files as a single sparse Matrix object small enough to just stuff in an R package. Answer: tidytext::cast_sparse(). ↩︎
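
As a rough illustration of what that call does, here is a minimal sketch, assuming the tidytext and Matrix packages are installed. The table is a made-up toy: the column names (doc_id, word, n) and all the values are hypothetical placeholders, not rows from the actual data files or the code in the repository.

```r
library(tidytext)  # provides cast_sparse(); relies on the Matrix package

# Toy long-format counts, one row per (document, word) pair.
# The ids, words, and counts are hypothetical placeholders.
word_counts <- data.frame(
  doc_id = c("vol_1", "vol_1", "vol_2", "vol_2", "vol_3"),
  word   = c("rocket", "planet", "detective", "planet", "rocket"),
  n      = c(3, 1, 5, 2, 4)
)

# Rows become documents, columns become words, cells hold the counts.
# Pairs that never occur are not stored at all, which is what keeps
# a large document-term matrix small.
dtm <- cast_sparse(word_counts, doc_id, word, n)

print(dim(dtm))                 # documents by distinct words
print(dtm["vol_2", "detective"])  # look up one count by name
print(Matrix::colSums(dtm))     # total count of each word
```

The long table and the sparse matrix carry the same information; the matrix form is simply the shape that row and column arithmetic expects.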