Teaching Literary Data: What Makes It Hard

dh, teaching, research

Two years ago, I designed and taught a graduate seminar on approaches to Literary Data. I was invited to contribute an essay on the course to Debates in the Digital Humanities. The essay was some time in the writing, and it will be some time longer in the publishing: the next DDH volume is now due to appear at the start of 2018. But by kind permission of the editors, I am able to share a preprint of my essay: Teaching Quantitative Methods: What Makes It Hard (in Literary Studies). If you quote it, please cite its forthcoming publication.

The essay explains the rationale of the course, which combined a practicum in computing with literary data (using the R language) with theories of literary data from structuralism to the present. My own evaluation of the course is quite mixed, and I offer my materials and my experience not as a model but as evidence for an argument about the conditions of possibility for a successful quantitative methods pedagogy in literary studies. Pedagogy, in this case, also raises serious questions for research, and I hint at what I take to be the conditions for fruitful quantitative methodology tout court.

I couldn’t have wished for better students—that condition of possibility is indeed already realized. The major lessons I draw are (this is from the essay):

  1. Cultivating technical facility with computer tools—including programming languages—should receive less attention than methodologies for analyzing quantitative or aggregative evidence. Despite the widespread DH interest in the former, it has little scholarly use without the latter.

  2. Studying method requires pedagogically suitable material for study, but good teaching datasets do not exist. It will require communal effort to create them on the basis of existing research.

  3. Following the “theory” model, DH has typically been inserted into curricula as a single-semester course. Yet as a training in method, the analysis of aggregate data will undoubtedly require more time, and a different rationale, than that offered by what Gerald Graff calls “the field-coverage principle” in the curriculum.

Some more remarks on the essay and the course follow after the jump.

I wanted to include my course materials as an appendix to the essay. I deposited them in the MLA’s CORE repository as a safeguard against the disappearance of my own course site.1 No sooner had I done this than the MLACommons twitter account tweeted, “Teaching literary data analysis? @goldstoneandrew has put syllabus (http://dx.doi.org/10.17613/M69S30) & assignments (http://dx.doi.org/10.17613/M6602R) in CORE.” As though this material self-evidently served as an exemplar! It is at least as much a cautionary lesson, though I hope there is something useful in there somewhere.

I don’t have any taste for the self-praise which dominates so many academics’ writing about their own teaching. My teaching is a work in progress, and the only praise I give myself is that I am sometimes lucid about my mistakes. I feel the progress of scholarship under the “DH” label has been hindered by an occasionally desperate optimism about what can be accomplished in a short time by students or by researchers. This optimism, where it is not simply a necessity of short-term institutional survival, has too much in common with the culture of coding autodidacticism, with its endless free tutorials, getting-started guides, walkthroughs, and cool demos. That culture feeds the dreams of the high-tech precariat, but it bears little relation to a training in research methods, which should promise just the opposite of instant gratification. Anyway, the essay tries to sketch out some ideas of what a better kind of training might look like.

B-side: some practicalities

In my effort to keep the essay polemical, some of the more practical advice I derived from teaching went into my ever-growing scraps file. Well: what are blogs for? Here are some bits and pieces I picked up along the way about teaching a subject like literary data analysis. But you have to promise to read my essay before attempting to take these into account, or you’ll get it all wrong.

Problem sets are also meta-problem sets. I assigned regular homeworks, so that students could practice what they were learning and demonstrate their own growing skills to themselves. But giving assignments like this to literature grad students imposed an additional requirement on them that I didn’t fully foresee: not just problems to solve, but the problem of how to solve problems. The only solution to the meta-problem that everyone starts with is trial and error. Someone with a BA in English may be many years away from their last practice with other problem-solving strategies that can be applied in this domain, like divide-and-conquer, or formulating and solving a simpler related problem. Or restricting the domain of possible answers by assuming that the materials for a solution were indeed given in class. (Graduate students will assume they might have to do research.)

If you supply example code, you will see it again. Students build on the models they have. But this means you have to attend very carefully even to casual snippets of code, and explain (repeatedly) what details are important and what aren’t. Most of my students spent weeks believing that the iterator variable in any for loop always had to be called j, even when it represented a filename or a word in an array of words—all because the very first for loops I introduced were loops over integers, and out of old habit I used j as the iterator.
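A minimal R sketch of the point (the filenames are invented):

```r
filenames <- c("ch01.txt", "ch02.txt", "ch03.txt")

# Out of old habit: nothing requires the iterator to be called j,
# but students who only ever see this form may conclude that it does
for (j in seq_along(filenames)) {
    print(filenames[j])
}

# The same loop with a descriptive name makes the variable's role,
# and the arbitrariness of its name, easier to see
for (filename in filenames) {
    print(filename)
}
```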

No Windows. Tech support is an inevitable part of a course of this kind. I spent a lot of time getting students’ setups working and fixing things when they inexplicably broke.2 Doing this for my students with Macs was challenging but feasible. Keeping up with the endless problems of Windows, by contrast, was a nightmare. Each Windows system had its own idiosyncratic major failures. Windows is fine for many things, I’m sure, but it is terrible for platform-independent scientific computing. Graduate methodology courses need robust technical support, including some means to offer a fully standardized system to every student. Three months into the course, in frustration, I developed a virtual machine image for some of my more ambitious Windows-using students to try (note to self: that repository needs updating). That worked, but wow: VMs are resource-intensive. The truth is that there is no substitute for an institutional commitment to support the course, whether that means an IT person who has time for students from the course (ha) or making well-configured platforms available as loaner laptops, good lab machines that aren’t wiped daily, or usable remote logins on a server.

Literate programming is a Pandora’s box. I had the dream that, given enough ingenious setup on my part beforehand, my students could write their homeworks in R markdown and generate a PDF to turn in with a single button-click. I even believed that being “literate” in the Knuth way would be more congenial for highly-trained writers than conventional introductory programming. I put a lot of work into the setup and into teaching my students how to use R markdown and the templates I created. But it turns out to be all too easy to write innocent R markdown that generates LaTeX errors, usually because of some unexpected R output that TeX cannot parse. Asking students to convert their R markdown to HTML would be a less fraught path, because browsers are more liberal than TeX is. But reading twenty-page final papers as webpages is pretty unappealing. There are no good answers here: any teacher who wants the useful discipline that R markdown offers had better be prepared to investigate a lot of “Pandoc Error 43” messages for students.
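A constructed sketch of the kind of innocent chunk I mean: with the default chunk options, output is wrapped in a verbatim environment and is safe, but with `results='asis'` the raw output is pasted straight into the LaTeX source, where an unescaped underscore or percent sign will derail the TeX run.

````markdown
```{r, results='asis'}
# Looks harmless in R, but the underscore and the percent sign
# reach TeX unescaped and produce LaTeX errors
cat("word_count: 95% of tokens processed")
```
````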

Almost no one understands the file system. I spent the whole semester helping students with file-not-found errors. Highly sophisticated writers and researchers regularly neither understand nor take advantage of the hierarchical structure of the file system. The difference between system files, program files, and user files is not well-known. The distinction between the file system and the Finder or Windows Explorer is not intuitive; nor is that between opening files from within a program as opposed to from the Finder or Explorer. Shortcuts, aliases, and symlinks are “expert” features, not universally used. Filename “hygiene” (no spaces or punctuation other than hyphens, underscores, and periods; extensions for file types) is not at all obvious and needs to be taught multiple times before it sinks in.
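One habit that helped a little, sketched here in R (the path is hypothetical): make the working directory and the path explicit before blaming the code.

```r
getwd()                                       # where does R think it is?
data_file <- file.path("data", "poems.csv")   # portable: no hard-coded slashes
file.exists(data_file)                        # check before read.csv(data_file)
```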

The for loop is a doozy. This construct is much harder for beginners to understand than I expected. Reasoning about the flow of execution is the challenge, and it took many sessions for me to teach students how to do this (and they still had some trouble knowing what the iterator variable iterated over). In a way R made this harder, since I had to teach vectorized operations before explicit for loops. Then it’s hard to motivate the latter when R makes many loops implicit that would be explicit in other languages. The whole class was thrilled to learn the dplyr idioms and indignant that I’d made them work with explicit loops first. But the meaning of those idioms is nothing other than for loops.
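To illustrate with a constructed example (the words are invented): the explicit loop, the vectorized form, and the dplyr idiom all express the same iteration.

```r
library(dplyr)

words <- c("the", "whale", "the", "sea", "the")

# explicit loop: the flow of execution is visible
counts <- integer(0)
for (w in unique(words)) {
    counts[w] <- sum(words == w)
}

# vectorized comparison: the loop over words is implicit in ==
sum(words == "the")

# dplyr idiom: the same loop, hidden inside count()
tibble(word = words) %>% count(word)
```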

ggplot, though. The grammar of graphics, and Wickham’s glorious software package, provided some of the high points of the semester, bringing theory and practice together satisfyingly. Making visualizations went surprisingly smoothly and led to serious questions about their significance—as well as about the details of visualizations found elsewhere. ggplot was one of the main reasons for my own choice of R for the course. But I still have a cautionary lesson here: I learned that it is indeed possible for students to make very nice visualizations of data without having much statistical literacy about them. That is: without being able to ask, what quantitative features of the data produce the noticeable parts of a graph? are those features to be taken seriously, or are they likely to be artifacts of chance, the plotting method itself, or something else? DH is enchanted with visualization, to such an extent that many people in the humanities identify the whole of data analysis with the production of data visualizations. A good visualization can indeed summarize data or reveal its structure in powerful and appealing ways. But the apparatus for determining the goodness of a visualization—in the context of a researched argument from evidence, or what in statistics seems usually to be called “a scientific question”—is not visual alone.
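A cautionary sketch (the data are invented): ggplot will happily draw an authoritative-looking trend through very little evidence.

```r
library(ggplot2)

# ten hypothetical chapter word counts
chapters <- data.frame(chapter = 1:10,
                       words = c(4200, 3900, 5100, 4400, 6000,
                                 4100, 4800, 5300, 3700, 4600))

# the smoothed curve looks meaningful, but with ten noisy points
# it may be an artifact of the smoother as much as of the data
ggplot(chapters, aes(chapter, words)) +
    geom_point() +
    geom_smooth(method = "loess")
```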

…But, let me emphasize this once more, even a perfectly smooth programming practicum would not have come close to achieving what I wanted. The emphasis has to lie elsewhere, on understanding how to pose good quantitative questions, and how to tell good quantitative answers from bad.

[Edited, 1/7/2017: The preprint is now hosted on Rutgers’s institutional repository and can be permanently referred to by DOI: 10.7282/T3G44SKG. The file differs from the file originally posted on this site only by the addition of the repository’s cover page.]


  1. I hope my treasured hand-coded tilde-username website will be around until the end of time, but every new cloud-services IT deal at my university makes me worry a little bit more. At least the Internet Archive will keep most of the material safely preserved in Canada.
  2. An important lesson: you will encounter character-encoding problems. Here is a blog post where I tried to guide my students through the mess.