Fun with Dates

DH, kludgetastic

The website for Ted Underwood’s and my Quiet Transformations of Literary Studies has just been updated. Though this doesn’t add much in the way of new functionality (just a slightly modified bibliography display), every single chronological datum on the site has been changed to correct an error. Importantly, this means:

  1. Every time-series chart is slightly different (more explanation below).
  2. These charts now match the visualizations in our preprint article, which were not subject to this error.
  3. Article bibliographic information more closely matches JSTOR’s.

In not entirely unrelated news, I learned more about date encoding in JavaScript and d3.js. After the jump, an explanation of the bug.

All the time series we examine are based on the item dates supplied in the JSTOR Data for Research metadata. A typical line of metadata looks like:

10.2307/456708,10.2307/456708	,Reiz Ist Schönheit in Bewegung	,William Guild Howard	,PMLA	,24	,2	,1909-01-01T00:00:00Z	,pp. 286-293	,Modern Language Association	,fla	,	,

Today we’re interested in the date, which JSTOR helpfully supplies in UTC (specified according to ISO 8601). PMLA 24, no. 2 is dated to January 1909. Actually, I don’t have to hand the actual month of publication of this issue (I can’t find anything but “1909” in the digitized images of the issue in JSTOR), and given that all numbers of volume 24 are dated to January 1909 in the JSTOR metadata, it looks like JSTOR doesn’t either. (The JSTOR scan looks to be from a bound volume of the whole year rather than from an original serial publication.) Fortunately, since in our analysis we only pay attention to years of publication, it would seem that the arbitrary assignment of the month doesn’t matter at all.

But let’s take a closer look at that date. 1909-01-01T00:00:00Z means “January 1, 1909, 12:00 a.m., UTC.” Since only the year is meaningful, the rest has been filled in with minimum values: January 1st, midnight. Why not? Well, my code in dfr-browser naively read in date metadata by taking that date field and constructing a Date object, more or less like this [1]:

meta = d3.csv.parseRows(metadata_as_string, function (d) {
    // d[7] is the ISO 8601 date field; trim() drops the trailing tab
    var date = new Date(d[7].trim());
    return {
        date: date,
        // other fields...
    };
});

(You can see the actual current code on github; to contrast with the buggy code, look at this commit.)

The constructor for JavaScript Date objects is happy to parse that ISO-formatted date. Easy-peasy! The rest of the browser uses Date methods for extracting month and year information, doing things like meta[i].date.getFullYear(). Look how object-oriented and encapsulated we’re being! Good for us!

But alas. The Date methods return components of dates in your local time zone. And when does a few hours here or there matter? Right on the boundaries. Midnight, January 1st, 1909 UTC is midnight 1/1/1909 Greenwich Mean Time. What about, say, Eastern Standard Time? That’s five hours earlier, or 7 p.m., December 31st, 1908. Suddenly, January becomes December and 1909 has become 1908. Every article dated to January of one year is mistakenly counted in the previous year by the buggy browser code. Every bibliography entry is behind by a month. (Readers in the UK and points east to the international date line: not to worry! you’ve never seen this bug!)
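Here is a minimal sketch of that arithmetic. Because getFullYear depends on whatever time zone the reader’s machine is set to, the sketch imitates a fixed offset by shifting the timestamp and then reading the UTC year. The helper yearAtOffset is a hypothetical name for illustration only; it is not part of dfr-browser.

```javascript
// Hypothetical helper (not from dfr-browser): the calendar year a clock
// running at a fixed offset from UTC would show for a given ISO timestamp.
function yearAtOffset(isoString, offsetHours) {
    var utc = new Date(isoString);
    // Shift the instant by the offset, then read the year with a UTC getter,
    // so the result does not depend on the machine's own time zone.
    var shifted = new Date(utc.getTime() + offsetHours * 60 * 60 * 1000);
    return shifted.getUTCFullYear();
}

var stamp = "1909-01-01T00:00:00Z";
var yearInUTC = yearAtOffset(stamp, 0);   // 1909
var yearInEST = yearAtOffset(stamp, -5);  // 1908: 7 p.m. on December 31st
```

On a machine set to Eastern time, new Date(stamp).getFullYear() lands on the same wrong year as yearInEST.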

The fix is relatively simple. JavaScript also provides UTC-specific methods, and so does d3. Thus: getFullYear becomes getUTCFullYear; d3.time.format becomes d3.time.format.utc; and (importantly) d3.time.scale becomes d3.time.scale.utc. Now that all the calendrical functions respect the time zone of the original date, all the dates in the topic-model browser display match the dates specified in the JSTOR metadata.
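As a rough illustration of the Date half of the fix, here is a sketch that reads year and month with the UTC getters. The function name parseItemDate and the shape of the object it returns are assumptions made for this example, not the actual dfr-browser code, and the d3 calls are left out so that the snippet runs without the library.

```javascript
// Hypothetical helper: turn one JSTOR date field into UTC-based components.
function parseItemDate(field) {
    // trim() removes the stray tab that follows the date in the raw metadata
    var date = new Date(field.trim());
    return {
        date: date,
        year: date.getUTCFullYear(),   // not getFullYear()
        month: date.getUTCMonth() + 1  // getUTCMonth() counts from 0
    };
}

var item = parseItemDate("1909-01-01T00:00:00Z\t");
// item.year is 1909 and item.month is 1, whatever the local time zone
```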

Working in R, we were lazier about handling dates: instead of using all the date metadata supplied by JSTOR, we just stripped off the first four characters and saved it as the year. [2] This turned out to be a more robust approach.
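For comparison, the same lazy idea can be sketched in JavaScript (this is a transposition for illustration, not the R code we actually ran). The names yearFromField and dateFromYear are hypothetical. Since no Date object is built until a year is already in hand, the time zone never gets a chance to interfere.

```javascript
// Hypothetical helper: take the first four characters of the field as the year.
function yearFromField(field) {
    return parseInt(field.trim().slice(0, 4), 10);
}

// Hypothetical helper: when a Date is needed (say, for axis formatting),
// build it from the year alone, pinned to January 1st in UTC.
function dateFromYear(year) {
    return new Date(Date.UTC(year, 0, 1));
}

var year = yearFromField("1909-01-01T00:00:00Z\t");  // 1909
var jan1 = dateFromYear(year);
```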

The major change the bug fix causes, apart from making the bibliographic information match JSTOR’s more closely, is to slightly shift all of the graphs of topics over time. None of the time trends are significantly affected, but some peaks move up and down as articles that used to be in December of one year move into January of the next. Of course this is a good occasion to remember that binning topic counts by year is to a certain extent arbitrary. Aggregating helps to see trends over time. If we just look at raw numbers of words assigned to a given topic in each article over time, appearances can be a little deceptive. Compare two time-series images for a topic from our model that we discuss extensively, numbered 80 and featuring as key words power, violence, and fear:

First, a plot in which each dot represents one article, with vertical position corresponding to number of words assigned to the topic:

Document weights of topic 80 power violence fear over time

Second, a plot in which each bar represents the proportion of the year’s words assigned to that topic:

Yearly proportions of topic 80 over time

The scatterplot shows all the data, sort of, but gives no idea of the relevant baselines. In this dataset this is a big problem, since there are many more articles included (and, in fact, many more articles published) in recent years than circa 1900. The bars instead give us a relative sense of the role of this topic in each successive year. But why should the timeframe of comparison be twelve months from January to December and not from May to April, or sixteen months, or eight? In fact, it would be an interesting project to use topic models to try to figure out whether the calendar year (or the journal volume) actually has a discernible character. But that’s a data-analysis project for another day.
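To make the arbitrariness of the window concrete, here is a small sketch that computes the proportion of words assigned to a topic in twelve-month windows starting at any chosen month. Everything in it is assumed for illustration: the function names, the shape of each document record ({ date, topicWords, totalWords }), and the three sample records, which are invented numbers and not data from our model.

```javascript
// Label a date by the UTC year in which its twelve-month window begins.
// startMonth is 0-based: 0 = January, 4 = May.
function windowLabel(date, startMonth) {
    var year = date.getUTCFullYear();
    return date.getUTCMonth() >= startMonth ? year : year - 1;
}

// For each window: topic words divided by all words in that window.
function proportionsByWindow(docs, startMonth) {
    var sums = {};
    docs.forEach(function (doc) {
        var key = windowLabel(doc.date, startMonth);
        if (!sums[key]) {
            sums[key] = { topic: 0, total: 0 };
        }
        sums[key].topic += doc.topicWords;
        sums[key].total += doc.totalWords;
    });
    var result = {};
    Object.keys(sums).forEach(function (key) {
        result[key] = sums[key].topic / sums[key].total;
    });
    return result;
}

// Invented sample records, for illustration only.
var sampleDocs = [
    { date: new Date("1909-01-01T00:00:00Z"), topicWords: 10, totalWords: 100 },
    { date: new Date("1909-06-01T00:00:00Z"), topicWords: 30, totalWords: 100 },
    { date: new Date("1910-02-01T00:00:00Z"), topicWords: 0, totalWords: 200 }
];

var byCalendarYear = proportionsByWindow(sampleDocs, 0);  // January to December
var byMayToApril = proportionsByWindow(sampleDocs, 4);    // May to April
```

With these made-up records, the calendar-year bins show a drop from 1909 to 1910, while the May-to-April bins come out flat. The words are the same; only the bin edges moved.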

Moral. Edge cases: they’re sharp enough to cut you if you don’t handle them with care.

  1. Just in case there are any readers who are trying to adapt dfr-browser to their own uses: I say “more or less” because the browser doesn’t actually expect metadata raw from JSTOR. To save a little space, it expects a CSV of metadata with some extraneous columns stripped out. That’s why if you look at the actual code you’ll see that d[6] is where it looks for the date. Note added 8/9/14.
  2. There are places in our R code where R Date objects are used, because ggplot does good axis formatting with them. But when we need them we just construct them from the year…setting January 1 as the date.