Of Literary Standards and Logistic Regression: A Reproduction

DH, theory, sociological bandwagon, kludgetastic

This post is a discussion and partial replication of Ted Underwood and Jordan Sellers’s fascinating essay “How Quickly Do Literary Standards Change?” (available in preprint and discussed on Ted’s blog). Hats off to Ted and Jordan, who have contributed something really remarkable here, not just in their provocative arguments but in the data they have made usable by others. It’s one thing to circulate data and code so that someone can in principle re-run your scripts—though that is already a lot—and quite another to make the data accessible enough for fresh analyses. The latter is a very demanding standard, too demanding for everyone to meet, I think. But it is what is required to let others build directly on your results. Even more importantly, it’s what’s needed to make research results pedagogically available. As I argue in an essay I’m working on now, any quantitative methods pedagogy should—must—lean heavily on the results of research. In the immortal words of Miriam Posner, “It’s just awful trying to find a humanities dataset”: one of the best ways to address this challenge would be to make good research data available in recirculable, easily accessible form.

So consider this post partly pedagogical in intent: I want to show that Ted and Jordan’s replication repository is already an excellent dataset and could be the basis for a lesson in automatic classification. What I want to emphasize here is that their work allows us to breeze right past the data-wrangling and straight into the analytical substance. This may not be entirely obvious from their python code, so I’m going to try to make it clearer by doing the whole thing in R instead.

Rather than give all the technical detail, I’ll only show the R code where it makes a point about the analysis. The full R markdown source for this post is in this repository. Feel free to skim right past the R listings anyway. This is a literary theory and data analysis post that happens to have some software details in it, not a programming post. Here we go…

I’m going to call Ted and Jordan U&S (Underwood and Sellers). First I downloaded U&S’s replication repository. Loading the metadata as provided by U&S is straightforward. Reading in their files with feature counts for their chosen volumes is also not too hard to do from scratch, but I already did some of that work in my dfrtopics package and applied it to files like these in an earlier blogpost, so I can do that task in a few lines:

library(readr)      # read_delim, read_csv
library(dplyr)
library(stringr)    # str_c, str_detect
library(dfrtopics)  # read_wordcounts, wordcounts_Matrix
library(glmnet)     # regularized regression, used below

read_hathi <- function (f) read_delim(f,
    delim="\t", quote="", escape_backslash=F, na="",
    col_names=F, col_types="ci")
# derive each volume's feature-count filename from its ID in the metadata
fs <- meta %>%
    transmute(id, filename=str_c(id, ".poe.tsv")) %>%
    mutate(filename=file.path("paceofchange", "poems", filename))
# check that every ID has a corresponding feature-count file
stopifnot(all(file.exists(fs$filename)))
# load 'em up
counts <- read_wordcounts(fs$filename, fs$id, read_hathi)

Now let’s start by seeing whether we can reproduce U&S’s principal model (and the principal figure that goes with it). They choose only the 3200 word features with the highest document frequencies, skipping punctuation features.1 Here too dfrtopics has convenience functions that get us more concisely from the data frame of feature counts to a sparse document-term matrix (documents in rows):

rank_cutoff <- 3200
keep_features <- counts %>%
    group_by(word) %>%
    summarize(doc_freq=n()) %>%
    # take extra headroom, since some features are filtered out below
    top_n(2 * rank_cutoff, doc_freq) %>%
    # drop punctuation features: keep word characters, allowing internal
    # apostrophes (see note 1)
    filter(str_detect(word, "^\\w(\\w|')*$")) %>%
    top_n(rank_cutoff, doc_freq)
pruned_counts <- counts %>%
    filter(word %in% keep_features$word) %>%
    wordcounts_Matrix()

We actually have 3210 features, a superset of U&S’s, because we have allowed for ties at the bottom of the frequency list. If the results are sensitive to this, that’s something we’d like to know, so let’s continue without discarding those extra features.2
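(As a quick check, keep_features has one row per retained feature, so counting its rows confirms the figure just quoted.)

nrow(keep_features)
## [1] 3210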

Now we have to do the logistic regression modeling, trying to match the choices of U&S. We’ll use L2 regularization (“ridge regression”), as they do by accepting the scikit-learn default.

Then we will obtain a predicted probability of being reviewed for each document as follows:

  1. Discard all documents by the target document’s author (including the target itself).
  2. Construct a model of the remaining documents.
  3. Use this model to predict the probability of the target document being reviewed.

This is a kind of cross-validation, but the goal is not to assess the model’s overall performance but, as we will see, to search for unexplained variation. The holding-out rule is also interestingly asymmetric: we generate a prediction for the held-out volume that doesn’t incorporate information about the volume itself, including information about its author’s style. U&S do this so they can make non-circular predictions for every volume in their set. But it’s important to see that the resulting “predicted probabilities” of being reviewed are not all predictions from the same logistic regression model.3

To do regularized regression in R, we use the glmnet package. I found the package vignette very helpful as a guide to the basic nuts and bolts. The package workhorse function is glmnet, which takes a matrix of predictors and a vector of response values. Logistic regression is specified by the parameter family="binomial" (which by default uses the logit link function we want).4 We choose the L2 penalty by setting alpha=0. There are 633 authors in our set, so that is the number of models we have to construct. Next, U&S normalize feature counts as (slightly less than?) proportions of each document’s total feature count, so we do the same, and then standardize them; glmnet will do the standardization for us. Finally, for predictions, we fix the regularization strength at a constant. U&S give their choice as C = 0.00007, which they report as maximizing overall predictive accuracy, but scikit-learn’s C is not defined the same way as glmnet’s lambda. If I am reading the glmnet and scikit-learn documentation right, the translation is lambda = 1 / (C * N), where N is the number of observations.

# normalize counts to (near-)proportions of each document's total,
# as U&S do; not sure why the extra 0.001 in the denominator
feats <- rescale_rows(pruned_counts, 1 / (rowSums(pruned_counts) + 0.001))

Here is the modeling step.5

# U&S's choice of regularization constant (?)
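# (lambda = 1 / (C * N); with C = 0.00007 and N = 720 volumes, roughly 19.8)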
lambda_us <- 1 / (0.00007 * nrow(feats))
n_auth <- nlevels(meta$author)
ms <- vector("list", n_auth)
names(ms) <- levels(meta$author)
predicted_probs <- numeric(nrow(feats))
prog <- progress_estimated(nrow(feats))
for (i in 1:nrow(feats)) {
    au <- as.character(meta$author[i])
    if (is.null(ms[[au]])) {
        mask <- meta$author != au
        ms[[au]] <- glmnet(
            x=feats[mask, ],
            y=meta$recept[mask],
            family="binomial",
            alpha=0,
            standardize=T,
            lambda=lambda_us # docs say don't do this,
        )                    # but...
    }
    predicted_probs[i] <- predict(ms[[au]],
        feats[i, , drop=F], s=lambda_us,
        type="response")
    prog$tick()
    prog$print()
}

We now have what we need to try to reproduce U&S’s key figure (fig. 1, p. 8). The date information is not included in the model, but it is used in the plot.
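(The figure itself is generated in the post’s R markdown source; a rough sketch of a similar plot, not the post’s actual plotting code, using firstpub for the x-axis, might run as follows.)

library(ggplot2)
plot_frame <- data.frame(year=meta$firstpub,
                         prob=predicted_probs,
                         set=meta$recept)
ggplot(plot_frame, aes(year, prob, color=set)) +
    geom_point() +
    # a linear time-trend line (the post's figure may compute its line differently)
    geom_smooth(aes(group=1), method="lm", se=FALSE, color="black") +
    labs(x="year of first publication",
         y="predicted probability of being reviewed")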

The visual impression of this plot is very similar to U&S’s figure. The key features of the plot that U&S emphasize are here: mostly correct classifications (556/720 = 77.2%), plus an upward-sloping time-trend line. In fact I think that most of the predictions are a close match to U&S’s. Voilà: a reproduction.

Now that we know we’ve successfully found our way to U&S’s ballpark, we can examine the effects of different modeling choices. Let’s start with the question of the elusive lambda by fitting a model to all the data at varying values of lambda. We’ll let glmnet choose the sequence of lambda values and compute a classification error rate at each one using 20-fold cross-validation:

nfolds <- 20
m_cv <- cv.glmnet(feats, meta$recept, family="binomial",
                  nfolds=nfolds,
                  type.measure="class", alpha=0)
# choice of lambda
m_cv$lambda.min
## [1] 3.751426
# which achieves the following CV classification error rate
min(m_cv$cvm)
## [1] 0.2125

The error-rate number is not the proportion of mispredictions made by the chosen model but (concentrate, Goldstone) the 20-fold cross-validation estimate of that rate. Notice the difference here in the use of the CV procedure: whereas U&S calculate a different statistic (namely, a prediction at a different held-out data point) for every one of the models fitted in their leave-one-author-out procedure, here glmnet calculates the same statistic each time, the classification error rate, to give us a reasonable estimate of the model’s performance. We can see how U&S’s lambda choice compares using a diagnostic plot:
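(That plot, too, is drawn in the R markdown source; glmnet’s built-in plot method for cv.glmnet objects gives an equivalent view. A sketch, marking U&S’s lambda for comparison:)

plot(m_cv)                        # CV error rate against log(lambda)
abline(v=log(lambda_us), lty=2)   # U&S's fixed choice of lambda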

As this plot suggests, the best lambda is rather smaller than U&S’s choice; then again, the difference between the corresponding error rates is just about one standard deviation, so only moderately significant. But describing the predictive power of the model is not the only thing we want to do.

U&S seek to interpret the extreme-valued coefficients of the model—the features having the largest influence, positive or negative, on the odds of being reviewed.6 Taking the model selected by cross-validation, we obtain this list of the most positively-influential coefficients (standardized):

feature coefficient
sake 0.0129978
whereon 0.0118395
hurrying 0.0117622
what 0.0113139
hers 0.0112498
sign 0.0111858
smooth 0.0109115
seasons 0.0108949

and negatively:

feature coefficient
i'll -0.0139030
mission -0.0133464
tis -0.0126298
greet -0.0125402
wondrous -0.0120224
i've -0.0108331
throng -0.0106055
anxious -0.0104825
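(A sketch of one way to extract such lists from the cross-validated fit; the exact code is in the R markdown source. glmnet reports coefficients on the original scale, so to compare them on the standardized scale we multiply each one by its feature’s standard deviation.)

# coefficients at the CV-chosen lambda, dropping the intercept
b <- coef(m_cv, s="lambda.min")[-1, 1]
# rescale to the standardized scale (feats columns and b are in the same order)
b_std <- b * apply(as.matrix(feats), 2, sd)
head(sort(b_std, decreasing=TRUE), 8)   # most positive
head(sort(b_std), 8)                    # most negative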

These lists of words are interesting, but they don’t seem to me to be as crisply interpretable as the lists reported by U&S. Now compare the list of positively-charged coefficients generated by fixing lambda at the value selected by U&S:

feature coefficient
eyes 0.0045595
hollow 0.0040137
black 0.0040054
sake 0.0039791
slow 0.0039136
whereon 0.0038528
wind 0.0037451
brows 0.0036608

This closely reproduces U&S’s list, where the tendency to monosyllabic concretion really jumps out. But it’s no longer so clear that these are the words we should use to interpret the model. As U&S rightly note, we shouldn’t lean too much on any particular set of words picked out this way. The modeling process has limited power. In cases where we have many more predictors than data points, regularization can help improve out-of-sample predictive accuracy by trading increased bias for reduced variance. But it can’t make multicollinearity in the data go away: in this setting, according to Gareth James et al.’s Introduction to Statistical Learning, “at most, we can hope to assign large regression coefficients to variables that are correlated with the variables that truly are predictive of the outcome” (243).

Researchers should be very cautious about moving from good classification performance to interpreting lists of highly-weighted words. I’ve seen quite a bit of this going around, but it seems to me that it’s very easy to lose sight of how many sources of variability there are in those lists. Literary scholars love getting a lot from details, but statistical models are designed to get the overall picture right, usually by averaging away the variability in the detail.

Finally, we can follow U&S in looking for the worst misfits of the model. First the “random” volumes with highest odds of being reviewed in the model:

author                     title                                                           year   response
Brooke, Rupert,            The collected poems of Rupert Brooke                            1915   0.80
Symonds, John Addington,   New and old                                                     1880   0.77
Dobson, Austin,            Vignettes in rhyme and vers de société (now first collected)    1873   0.75
Thomas, Edward,            Last poems                                                      1918   0.75
<blank>                    Georgian poetry, 1911-1912                                      1912   0.75

Changing lambda does not change this list, and it confirms two observations made by U&S: first, except for the Dobson, all of these “random” volumes are immediately recognizable as high-prestige titles. Second, they belong to the late end of the period. Now let’s consider reviewed volumes with lowest modeled odds:

author                  title                        year   response
Earle, Pliny            Marathon                     1841   0.2403877
Overton, Charles.       Ecclesia anglicana;          1833   0.2870532
Jones, John,            Attempts in verse            1831   0.3138384
Bangs, John Kendrick,   The foothills of Parnassus   1914   0.3191844
Conder, Josiah,         Star in the East;            1824   0.3233777

I am only prepared to say that these titles are not yet familiar to me.
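(For reference, a sketch of how such misfit lists can be pulled out, using the leave-one-author-out predicted probabilities computed above; I am assuming the recept column distinguishes "random" from "reviewed" volumes and that meta carries a title column, as the tables suggest.)

misfits <- meta %>%
    select(author, title, firstpub, recept) %>%
    mutate(response=predicted_probs)
# "random" volumes the model judges most likely to have been reviewed
misfits %>% filter(recept == "random") %>%
    top_n(5, response) %>% arrange(desc(response))
# reviewed volumes with the lowest predicted probabilities
misfits %>% filter(recept == "reviewed") %>%
    top_n(5, -response) %>% arrange(response)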

Inconclusive thoughts on the modeling choices

The beautifully available data make it possible to investigate what would be lost and gained by making different choices in the analysis. They also make it possible to build on the work that U&S have done. U&S make a remarkable observation about their not-quite-so-simple model: it makes errors that systematically depend on the year of publication. The later the publication date, the more likely it is to misclassify a “random” poetry volume as “reviewed.”

Predictions about literary prestige are skewed across time, presumably, because the difference between unreviewed and reviewed volumes is always analogous to the difference between works at the beginning and end of each period: so the best solution the model can find always has an upward slope (18–19).

They show this by fitting a regression line of the predicted odds against year of publication. I find it hard to interpret that trend line, because almost every prediction comes from a (slightly) different model. Let’s instead consider the model of all the data. Even though it performs well, we can still plot its misclassification rate against the publication date of volumes.
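(This plot also comes from the R markdown source; one rough way to sketch the same idea, using in-sample class predictions from the all-data model at the CV-chosen lambda and binning volumes by decade:)

pred_class <- as.vector(predict(m_cv, feats, s="lambda.min", type="class"))
miss_by_date <- meta %>%
    mutate(miss=(pred_class != as.character(recept)),
           decade=10 * (firstpub %/% 10)) %>%
    group_by(decade, recept) %>%
    summarize(miss_rate=mean(miss))
ggplot(miss_by_date, aes(decade, miss_rate, color=recept)) +
    geom_line() +
    labs(y="misclassification rate")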

These error rates are computed on the same data the model was fitted to, so they make the model look better than it is, but even so we can see the way that misclassifications of the “random” set are worst at the end of the century. That is to say that we should be able to improve the model predictions just by throwing the publication year in as a feature:

m_time <- cv.glmnet(cbind(feats, meta$firstpub),
                    meta$recept, family="binomial",
                    nfolds=nfolds,
                    type.measure="class", alpha=0)
# choice of lambda
m_time$lambda.min
## [1] 2.049133
# which achieves the following CV classification error rate
min(m_time$cvm)
## [1] 0.2055556
# with s.d.
m_time$cvsd[which.min(m_time$cvm)]
## [1] 0.01840364

Possibly (not strongly) better. As U&S point out, a linear time trend isn’t really interpretable, since the corpus is balanced over time between random and reviewed volumes. This tells us there is unexplained variation, which would need, perhaps, to be modeled by including some kind of feedback effect. That sounds hard but worth doing.

From here we could go on to try the other demographic variables included by U&S, but I’ll leave that aside. The last thing I want to glance at is the size of the feature set. We have a lot of features. How many could we drop without degrading the performance of the classifier? One way to find out would be to use the lasso penalty, which by design drives many coefficients to zero, instead of the ridge penalty. Alternatively, we could just throw away infrequent features by handfuls and see how the model does. The lasso is easy, so I’ll go with that.

m_cvl <- cv.glmnet(feats, meta$recept, family="binomial",
                   nfolds=nfolds,
                   type.measure="class", alpha=1)
# non-zero coefficients
m_cvl$nzero[which.min(m_cvl$cvm)]
## s50 
## 219
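(The corresponding error estimate and its spread can be read off the same object, mirroring the earlier calls.)

# minimum CV classification error rate, and its standard deviation
min(m_cvl$cvm)
m_cvl$cvsd[which.min(m_cvl$cvm)]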

At the cost of a moderately increased minimum CV error rate—0.23, s.d. 0.017—this proposes a model with many fewer predictors. The only moral here is, again, that there is a lot of correlation among the word frequencies, and hence a lot of redundant information.

And here we come up against the limits of the word lists themselves for understanding what U&S rightly emphasize is a boundary drawn by social processes. Assuming that we have managed to dodge all the dangers of overfitting, all the reasonable (though not awe-inspiring) performance of the word-based classifier tells us is that these combinations of word frequencies correlate with whatever factors might explain the difference between the reviewed volumes and the randomly-sampled ones. Everything we know about literary reception tells us that some of those factors are not textual (relating instead to the author’s and the publisher’s characteristics, the overall configuration of the book trade, etc.), even though some of those non-textual factors may themselves have verbal correlates. Then, though lists of words lend themselves most easily to interpretation in terms of diction, lexical choice is presumably conditioned by many other phenomena that are mediated in the text of poems, like genre, theme, and meter. In other words: are the “literary standards” examined here the linguistic criteria which earn legitimacy, or are they the verbal habits of those who are legitimate? I’m sure U&S would answer “both.” Disentangling the two will require modeling in which the non-linguistic stands a chance of emerging as an explanatory factor.

Appendix: a non-reproduction

I wanted to check that my feature counts matched U&S’s by using their list of model coefficients to reproduce their list of estimated probabilities. But I did not quite succeed:

# load U&S's model coefficients
us_coef <- read_csv("paceofchange/results/mainmodelcoefficients.csv", 
                    col_names=c("word", "coef", "rescaled_coef"))
# load U&S's model predictions
us_predict <- read_csv("paceofchange/results/mainmodelpredictions.csv")
# calculate a predicted probability from predictors x and coefficients b:
# the inverse link function (here, the inverse logit) of the linear predictor
inverse_link <- function (x, b) 1 / (1 + exp(-sum(x * b)))
computed_probs <- apply(
      # standardize feats to mean 0, sd 1; reorder rows to match U&S's
      # predictions; select their columns, dropping our extra features
      scale(feats)[us_predict$volid, us_coef$word],
      1,
      inverse_link,
      b=us_coef$coef / 100
)
all.equal(unname(computed_probs), us_predict$logistic, tol=1e-4)
## [1] "Mean relative difference: 0.03731868"

The match is not exact, and a few numbers are quite different. I am not sure what’s going on here. There’s no intercept term in these calculations, though there is probably one in the model U&S used. But because of the balanced design of the corpus, that intercept should make almost no difference. I am docking U&S’s Reproducibility Grade down from 100 to 99.95 for the undocumented rescaling of coefficients by 100 in their outputs.

[Minor edits 1/4/16–: revision history here.]


  1. Despite the use of isalpha() there, U&S’s feature list includes words with apostrophes in them. To get their feature list I had to take account of this possibility explicitly. PSA to pythonians: don’t use isalpha in your reproducible code. Explicitly match on a regular expression, and have fun importing regex. ↩︎

  2. I did try to check that I was working with the same counts of the 3200 features in common. See the end of the post. ↩︎

  3. I suppose it would be possible to come up with a more elaborate multilevel model in which one first draws an author, and then draws word probabilities, but in the present case, where most authors only appear once in the set, it might make more sense just to throw out a few more volumes. ↩︎

  4. A note because for the longest time I have found this confusing. It’s called “binomial” because we model the response variable (that is, reviewed or random) as a draw from a binomial distribution—a (biased) coin flip, where the chance of being reviewed, that is, the expectation of the variable’s distribution, depends on the counts of the features for each case. ↩︎

  5. We specify the regularization constant lambda in the call to glmnet, even though the documentation for glmnet warns us that specifying a single lambda is much slower than computing a “regularization path” (that is, trying a sequence of lambda values) and then doing the prediction for fixed lambda afterwards. In this case doing the regularization paths takes about 50 times as long. I don’t know why. ↩︎

  6. Here the question of scaling returns. U&S examine the coefficients on the standardized scale. This means that a bigger coefficient on a word corresponds to a bigger increase in predicted log-odds per unit standard deviation of that word’s frequencies. This is what has been optimized by the modeling process (scaling matters in regularized regression). If instead we are interested in comparing contributions to log-odds per unit frequency, we want the unstandardized coefficients. This doesn’t seem to make a great deal of difference, at least when it comes to looking at the largest-magnitude coefficients. I imagine this is down to regularization, which shouldn’t let really rare and overdispersed words run away with the prize. An alternative strategy, which I won’t pursue here, would be to switch to the lasso to pick out features. ↩︎