dfrtopics, hold the dfr

DH, kludgetastic

It’s gratifying, and a little frightening, when someone else uses your own code. Jonathan Goodwin has built on my dfrtopics and dfr-browser code-blobs to produce a fascinating visualization of topics in fiction 1920–1922, derived by modeling the genre-specific word frequencies data set from HathiTrust. He’s given a nice description of his process as well. In the process, Jonathan revealed some unnecessarily restrictive assumptions built into my code. He solved the problem by modifying my code: all praise to him! But then I felt bad and wanted to make it possible for others to go further without having to dig into the Area X that is my code. So I made a few adjustments to my new version of dfrtopics. Here are some notes on using the updated version to cope with the issues Jonathan found in processing and modeling the Hathi data.

The data is a series of moderately-sized files, the biggest of which are tarballs of word counts, one for each volume. In what follows I have saved the downloaded files into a folder called htrc-genre.

Let’s now prepare a slice of these files for modeling. First we load some handy packages from the “Hadleyverse”:

library("readr")      # read_csv, read_delim
library("dplyr")      # %>%, select, mutate, filter
library("stringr")    # str_c, str_detect, str_replace_all
library("lubridate")  # parse_date_time, year


Now the metadata supplied with the dataset is not, of course, formatted in the same (eccentric) way as what we get from JSTOR. So we will fall back on ordinary file-reading functions. Jonathan figured out how to read this in by just treating everything as a string, so that’s what we’ll do too. read_csv is from the readr package and gives us a little speed boost, but for this to work we have to tell it we want 19 string columns with a col_types argument:

md <- read_csv("htrc-genre/fiction_metadata.csv",
               col_names=T, col_types=str_c(rep("c", 19), collapse=""))

It makes correct guesses about the uses of quotation marks.

The only expectation dfrtopics really has about the metadata is that it has a unique identifier in a column called id. We actually have several unique identifiers here, but Jonathan showed that we can work with the htid, so let’s make that our id:

md <- md %>%
    select(-recordid, -oclc) %>%
    rename(id=htid)

This ID corresponds to the file names of individual word-count files, so we can now match filenames to metadata easily by just adding ".tsv" to the end of an ID.
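As a minimal sketch of that matching, here is the same construction in base R on a throwaway directory. The IDs and the `demo_dir` folder are made up for illustration; only the "append .tsv to the ID" convention comes from the data set.

```r
# Hypothetical IDs standing in for values of the id column
demo_ids <- c("vol.0001", "vol.0002", "vol.0003")

# A temporary folder playing the role of the extracted tarball
demo_dir <- file.path(tempdir(), "demo-counts")
dir.create(demo_dir, showWarnings=FALSE)

# Pretend only the first two volumes were actually extracted
invisible(file.create(file.path(demo_dir, paste0(demo_ids[1:2], ".tsv"))))

# ID -> filename is just "append .tsv"
demo_files <- file.path(demo_dir, paste0(demo_ids, ".tsv"))

# Any ID whose file is missing shows up here
missing_ids <- demo_ids[!file.exists(demo_files)]
missing_ids
```

An empty `missing_ids` means every metadata row has a word-count file to go with it.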

dfrtopics’s time-series functions also expect to find document dates, as Date objects, in a pubdate column. This transformation is optional if we’re not interested in time series, but just to show how it’s done, here’s how we would take the year given in the supplied date field and make Dates using a lubridate function.

md <- md %>%
    mutate(date=parse_date_time(date, "%Y")) %>%
    rename(pubdate=date)

Let’s just work with texts from the year 1895.

md <- md %>%
    filter(year(pubdate) == 1895)

The next data-processing step is to consider the status of reprints. That cannot be done automatically in any easy way, though we might try to guess reprints by looking at near-duplicates in metadata or in wordcounts. Jonathan’s approach was to examine the list of titles by hand—which, in my view, is probably a lot faster and better than any automatic approach. But since this post is focused on the mechanics of using my package, I’m going to skip this step. The results will therefore reflect fiction books in HathiTrust dated to 1895 rather than new books in the same category. [Sept. 25. Corrected an error left over from an earlier draft where I said we were working on five years of books. The model in this example is just of one year’s worth.]

Since we have blithely ignored this issue, we are ready to load the corpus. I extracted the tarball into a folder fiction_1895-1899.

fs <- md %>%
    transmute(id,
              filename=file.path("htrc-genre/fiction_1895-1899", id)) %>%
    mutate(filename=str_c(filename, ".tsv"))

Let’s verify that we have all the files we want:

all(file.exists(fs$filename))
## [1] TRUE

dfrtopics has a function for loading these files into memory all together, read_wordcounts. By default it is customized to read files from JSTOR Data for Research, which are CSVs. But that isn’t quite the format this data is in:

readLines(fs$filename[1], n=3)
## [1] ",\t3258"   ".\t2260"   "the\t1709"

This is tab-separated and lacks a header. Thanks to Jonathan, though, I have updated dfrtopics to be able to cope with this case. I’ve taken a slightly more generalized approach. read_wordcounts now accepts a reader parameter which should be a function that takes a filename and returns a two-column data frame. The idea is that you, the package user, will figure out how to read in a single file, and then hand off the method to read_wordcounts for the bulk operation.1

read_hathi <- function (f) read_delim(f,
    delim="\t", quote="", escape_backslash=F, na="",
    col_names=F, col_types="ci")

read_delim is the readr version of read.table. We have to make sure it doesn’t try to parse any punctuation. We also have to work around a bug in the current readr by ruling out zero-length files in advance:

fs <- fs %>%
    filter(file.info(filename)$size != 0)

[Sept. 25. The next release of readr should fix this issue.]

We can test our read_hathi method:

read_hathi(fs$filename[1]) %>% head()
## # A tibble: 6 x 2
##      X1    X2
##   <chr> <int>
## 1     ,  3258
## 2     .  2260
## 3   the  1709
## 4     i  1597
## 5    to  1336
## 6   and  1233
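
To make the division of labor concrete, here is a rough base-R sketch of what a bulk reader built around a reader function might do. `read_counts_sketch` and `read_tsv_base` are hypothetical names, not the dfrtopics implementation, and the output column names `id`, `word`, and `weight` are assumptions based on the columns used later in this post.

```r
# A single-file reader: takes a filename, returns a two-column data frame
read_tsv_base <- function(f) {
    read.table(f, sep="\t", quote="", comment.char="", header=FALSE,
               col.names=c("word", "weight"),
               colClasses=c("character", "integer"))
}

# The bulk operation: apply the reader to each file, tag rows with the
# matching ID, and stack the results
read_counts_sketch <- function(files, ids, reader) {
    stopifnot(length(files) == length(ids))
    per_file <- lapply(seq_along(files), function(i) {
        x <- reader(files[i])
        data.frame(id=rep(ids[i], nrow(x)),
                   word=x[[1]], weight=x[[2]],
                   stringsAsFactors=FALSE)
    })
    do.call(rbind, per_file)
}

# Two tiny made-up count files in the Hathi layout
demo_files <- c(tempfile(fileext=".tsv"), tempfile(fileext=".tsv"))
writeLines(c(",\t3", "the\t2"), demo_files[1])
writeLines(c("the\t5", "whale\t1"), demo_files[2])

demo_counts <- read_counts_sketch(demo_files, c("a", "b"), read_tsv_base)
demo_counts
```

Because the reader is an ordinary function argument, switching to another file layout only means writing a different single-file reader.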

So far, we haven’t even loaded dfrtopics. But now we will, first ensuring that Java has a big enough memory allocation for modeling.

options(java.parameters="-Xmx4g")  # example heap size; adjust to your machine
library("dfrtopics")


Now we use the extended read_wordcounts, which requires filenames, matching IDs, and our reader function:

counts <- read_wordcounts(fs$filename, fs$id, read_hathi)

[Sept. 24. If you tried dfrtopics v0.2 in the last week, note a backwards-incompatible change in the newest version: the second parameter here is a vector of IDs, not a function for converting filenames to IDs.]

This is a “small data” analysis, since we are assuming we can manipulate the entire corpus in memory.2 The next few steps, which operate on this 15362570-row data frame, take several minutes each.

We are immediately going to trim this down by removing rare words and stopwords. I am borrowing here from Jonathan’s work. Let’s discard stopwords. A customized stoplist would be much better, but for demonstration purposes let’s stick with a default stoplist, with just a couple of additions:

stoplist <- readLines(file.path(path.package("dfrtopics"),
                                "stoplist", "stoplist.txt"))
stoplist <- c("'s", "n't", "said", "says", stoplist)
counts <- counts %>%
    wordcounts_remove_stopwords(stoplist)

Let’s get rid of any token that is all non-word characters (again, a questionable choice—more careful postprocessing might be in order):

counts <- counts %>%
    filter(str_detect(word, "\\w"))
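
Here is a quick check of what this filter keeps, using base R's `grepl` as a stand-in for `str_detect` (the two can differ on non-ASCII characters, so treat this as a rough guide). The tokens are examples chosen for illustration.

```r
demo_tokens <- c(",", ".", "--", "the", "'s", "n't", "mr.", "1895")

# TRUE wherever the token contains at least one word character
keep <- grepl("\\w", demo_tokens)

kept <- demo_tokens[keep]
kept
```

Punctuation-only tokens drop out, but fragments like "'s" and "n't" pass the test, which is why they were added to the stoplist above.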

Jonathan noted a problem with the ligature characters ﬁ and ﬂ. We can convert them to the ordinary two-letter sequences here:

counts <- counts %>%
    mutate(word=str_replace_all(word, "\ufb01", "fi")) %>%  # U+FB01, fi ligature
    mutate(word=str_replace_all(word, "\ufb02", "fl"))      # U+FB02, fl ligature

And now let’s get rid of all but the 20000-odd most frequent features:

counts <- counts %>%
    wordcounts_remove_rare(20000)

The most frequent remaining word types, by the bye, are:

wordcounts_word_totals(counts) %>%
    top_n(10, weight) %>%
    arrange(desc(weight))

word      weight
man       332797
little    281853
time      249873
know      229252
old       215765
two       203552
never     202500
made      199534
mr.       196236
good      195139

Now we have to reformat these counts into MALLET’s InstanceList format:

ilist <- counts %>%
    wordcounts_texts() %>%

The token.regex is a subtlety worth noting. The data from HathiTrust include punctuation at the start of or within tokens (it looks like). The importance of the apostrophe in dialect writing suggests we ought to see what happens if we keep it.
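
To see why the token pattern matters, compare two patterns on a made-up line of dialect. This uses base R regular expressions purely to illustrate the principle; MALLET applies a Java regular expression, and the patterns below are examples, not necessarily the one used for this model.

```r
demo_text <- "an' he wuz goin' ter git 'em"

# Pattern 1: runs of letters only, so apostrophes are thrown away
letters_only <- regmatches(demo_text,
                           gregexpr("[[:alpha:]]+", demo_text))[[1]]

# Pattern 2: runs of non-whitespace, so apostrophes stay attached
non_space <- regmatches(demo_text,
                        gregexpr("[^[:space:]]+", demo_text))[[1]]

letters_only
non_space
```

Under the first pattern "an'" collapses into "an" and "'em" into "em", merging dialect forms with ordinary words; the second keeps them distinct.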

At this point we should save these to disk so we don’t have to recreate them all the time when we model:

write_instances(ilist, "fiction1895.mallet")

(We could now reclaim the memory being taken up by counts, if we didn’t have any further use for these bags of words except as MALLET inputs.)

Next comes the modeling step. Jonathan tried 200 topics and 200 iterations. Since the quality of this work is certainly no more than half as good as his, I will do 100 topics. I just note that we can make the model exactly reproducible by setting the random seed as well:

m <- train_model("fiction1895.mallet", # or, equivalently, ilist
    n_topics=100,
    seed=1895)  # example seed; any fixed integer makes the run reproducible

Half an hour or so later, we can save this to disk, with

write_mallet_model(m, "fiction1895-k100-v20K")

and reload it with

m <- load_mallet_model_directory("fiction1895-k100-v20K",
    load_topic_words=T)  # the topic-word matrix is needed for the displays below

[Sept. 25. Adding:] One extra note on loading models from disk—and, just in case anyone is following these directions step by step, note that of course you don’t have to save and reload the model at all; you can always work directly with the result from train_model. I wanted to show how to save, however, because you don’t need to rerun the model from scratch every time. load_mallet_model_directory doesn’t load metadata automatically; it is not part of the model. So we have to load it separately here, using our read_csv command from above and recreating the id and date columns (obviously I should have written it as a function):

metadata(m) <- read_csv("htrc-genre/fiction_metadata.csv",
    col_names=T, col_types=str_c(rep("c", 19), collapse="")) %>%
    mutate(id=htid, pubdate=str_c(date, "-01-01"))

(Only metadata rows corresponding to modeled documents will be kept.)
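
Since that recipe really does want to be a function, here is one possible version. `read_hathi_metadata` is a hypothetical helper, not part of dfrtopics. As an assumption for brevity, it asks readr for all-character columns with `cols(.default=col_character())` instead of spelling out 19 "c"s. The small CSV written at the end is made-up data, there only to show the function running.

```r
library("readr")
library("dplyr")
library("stringr")

# Read HathiTrust-style metadata as strings and add the id and pubdate
# columns that dfrtopics looks for
read_hathi_metadata <- function(path) {
    read_csv(path, col_names=TRUE,
             col_types=cols(.default=col_character())) %>%
        mutate(id=htid, pubdate=str_c(date, "-01-01"))
}

# Try it on a tiny made-up file with a quoted, comma-containing title
demo_csv <- tempfile(fileext=".csv")
writeLines(c("htid,date,title",
             "abc.123,1895,\"A Tale, Retold\""),
           demo_csv)
demo_md <- read_hathi_metadata(demo_csv)
demo_md
```

With that in hand, the assignment above would shrink to `metadata(m) <- read_hathi_metadata("htrc-genre/fiction_metadata.csv")`.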

As for the model itself, I’ve stuck a list of topic top words at the end of the post. And here is a questionable diagram:

topic_scaled_2d(m, n_words=500) %>%
    plot_topic_scaled(labels=topic_labels(m, n=3))

For interactive browsing with dfr-browser, we need to export the data to the format expected by dfr-browser. Jonathan has created a modified version to work with the hathi-formatted metadata. Some such tweaking is necessary to adjust the browser’s expectations from journal articles to fiction volumes; Jonathan also forked dfrtopics to modify the export_browser_data function. For less expert users, I wanted to show the following alternate approach that doesn’t require any forking or tweaking of this kind. The only extra step needed is in the metadata export. If we run

export_browser_data(m, "browser1895", supporting_files=T, overwrite=T)

using the most recent dfrtopics version, we get all the files we asked for, with a warning about missing columns in the metadata. [July 19, 2016. Updates to dfrtopics changed the parameters to export_browser_data. Changed accordingly here.] A quick and dirty solution is to “cheat” the expectations of the package by giving it the metadata columns it’s looking for (these are listed in ?export_browser_data.)

metadata(m) <- metadata(m) %>%
    mutate(
        journaltitle=imprint,   # no "journal" but let's stick publisher here
        volume="",              # these are expected but we'll leave them blank
        issue="",
        pagerange="")
export_browser_data(m, "browser1895/data", supporting_files=F, overwrite=T)

This gives a scrappy browser version of the model with titles and dates. Author names have been mangled and the bibliography is all wrong, and the time series display is meaningless. I haven’t put this online because the above does not involve any serious data-cleaning, much less substantive checking about the indicativeness of topics, and the resulting browser is less than ideal. But it at least lets you look around at topics, words, and documents. How about that Balzac, anyway?

A less dirty solution to the metadata export issue would be:

  1. Call export_browser_data.
  2. Overwrite the junk meta.csv.zip from (1) with metadata formatted as you wish.3
  3. Modify dfr-browser to process and display the metadata appropriately, as Jonathan did.

I’m working on making step (3) easier, but in the meanwhile I hope these notes show how to handle data-variations on the bag-of-words theme.

Here are the generated topic top words for that year’s worth of published fiction. [Sept. 25. Adding this:] Just to highlight another feature of dfrtopics, I want to show how to change the weighting used to select “top words.” topic_labels(m) is meant for convenience, and it always ranks words by their raw weights within a topic. To penalize widespread words, we should transform the full topic-word matrix using a scoring scheme. tw_blei_lafferty implements a “relevance”-type scoring from Blei and Lafferty’s review article on topic models. It is used as follows:

top_words(m, 8, weighting=tw_blei_lafferty(m)) %>%
    group_by(topic) %>%
    summarize(`most relevant words`=str_c(word, collapse=" "))

topic most relevant words
1 old prue great rip ordener world pausanias little
2 sir nay hath man men standish saxon good
3 captain major colonel sergeant lieutenant general officer men
4 captain boat ship deck island sail crew vessel
5 mr. gideon herrick man captain cried pitman carthew
6 replied captain sir board ship vanslyperken father boat
7 baron crevel hulot madame marneffe baroness francs monsieur
8 poet given author nickname book english king literary
9 1’ mr. mrs. edmonstone violet think little dr.
10 mrs. think mr. know ’m eve hilliard thought
11 king prince princess huon queen sir knight quoth
12 moll dawson thyrza sanchez godwin coleridge egremont mr.
13 tlie king bodhisatta arnaud greatorex bede liis master
14 wyman devereux nixon man sophy mrs. hooper francisco
15 life love god heart nature years soul women
16 mr. mrs. boffin sir miss dear leicester wegg
17 white bang girl honour devil shot quoth god
18 squire dat nickleby old squeers just todhetley got
19 god jesus chrysostom man bishop church christian men
20 sir lord man old know lady say mr.
21 davies mrs. farrar cranston fenton captain leale ormsby
22 earl lavender lord fleetwood ormont aminta lady carinthia
23 man eyes face came looked woman back seemed
24 thuillier madame monsieur peyrade francs cerizet cibot schmucke
25 dombey philippe florence captain uncle flore major toots
26 temple hamilton hereward almina rigou michaud tonsard ’ll
27 modeste eila love mamy mother madame canalis dumay
28 ’ll ’d mrs. ’re ’ve ’m know mr.
29 pole canon wargrave mrs. shimna miss countess mr.
30 little time young made too just make first
31 mr. captain men think faulkner go get army
32 madame monsieur man francs old two love young
33 birotteau mme francs popinot monsieur madame tillet cesarine
34 an’ ter ’d mrs. goin’ ‘em yer git
35 love life woman women london loved husband years
36 grandet pierrette nanon mademoiselle atwood michu beaulieu rogron
37 chattaway ashwoode mr. trevlyn o’connor sir wetherell ’ll
38 frau frere herr raffaello verneuil mademoiselle ekkehard camillo
39 valley snow came shot camp went yards two
40 ship came men shore captain told great go
41 knight answered sir lady man father castle master
42 duke answered men tommasino man lord horse lady
43 madame monsieur m. emperor mademoiselle bonaparte paris lefebvre
44 mr. mrs. pickwick sir miss replied weller gentleman
45 mrs. mr. miss know think little go thought
46 graslin bryn japanese inoya fawcett japan selden sauviat
47 little adolphe miss know uncle coryse yes chiffon
48 mother wife father husband woman child daughter house
49 dred captain mr. ‘11 ship boat man vidal
50 mr. mrs. ’ll man go know get ’ve
51 mowgli jungle nial oona bagheera came till sorcha
52 shore island boat friday great found two made
53 lord bernicia sir lady miss colambre pomfret mr.
54 morse farrell chorker boys hillson men lieutenant castle
55 mrs. dodd bassett sir little came man mr.
56 king sir everard lord scotland earl cromwell colonel
57 indians indian mr. river fort men white camp
58 vince prince keawe man gondremark seraphina cried keola
59 somerset mr. man challoner merrick cried vandeleur prince
60 mother little father went came go child again
61 monsieur madame sallenauve mortsauf luigia l’estorade rastignac 1’estorade
62 ormond claes balthazar ulick sir connal mcknight annaly
63 marius man cosette valjean ammalat cavallo landes seidel
64 sir maitland say mr. never ’d ’ll man
65 8vo cloth crown 6d extra illustrated illustrations a.
66 wuz sez ‘em jorkaway jest veryan holt sech
67 king queen sire majesty replied crichton madame duke
68 austin colonel godefroid norton hertford tip newville fan
69 mr. peregrine honour person fortune great mrs. pickle
70 replied eyes exclaimed answered cried suddenly face moment
71 elfride knight trilby swancourt billee taffy svengali smith
72 sancho quixote knight 1 cervantes quoth dulcinea panza
73 lucian vizard burley calvert severne daisy man kike
74 mr. mrs. miss little know sir yon peggotty
75 griffith gaunt valerio derues hugues lamotte ryder replied
76 rendalen kallem fru life s6nya aniuta milla always
77 little old came went great go man mother
78 monsieur madame woman man francs bixiou paris mademoiselle
79 pecksniff mrs. sir yon mr. man miss know
80 mrs. mr. miss sir went know came go
81 bathsheba sue boldwood troy oak tess woman arabella
82 king knights gervaise count moors wulf granada great
83 peyton answered carminella know sigmund think emmeran mother
84 dorrit clennam olive strang myrtle meagles miss thornham
85 nature neuchamp stafford honour gentleman therefore old state
86 mr. hazel wardlaw rolleston penfold wylie sweetheart man
87 sir lady money told say know house knew
88 ursule monsieur ginevra madame minoret francs savinien georges
89 rae caesar ’ll grannie man ’m ’ve aw
90 hath king men man spake went came great
91 love angel woman life countess vicar heart eyes
92 bishop ewan deemster davy thorkell man sunlocks face
93 old boys school dr. great doctor college father
94 man benassis genestas old good two nothing woman
95 shee love theagenes hath good cariclia god selfe
96 an’ a’ hae minister tae auld weel ken
97 love heart pan know life man dear soul
98 mr. man think i. lord thought came catriona
99 man went men came two go time got
100 calyste madame camusot monsieur baron beatrix woman corentin

Which suggests quite a few topics localized to single texts. Hmm.
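
For the curious, the Blei and Lafferty score mentioned above multiplies a word's probability within a topic by the log of the ratio between that probability and the geometric mean of the word's probability across all topics. Here is a base-R sketch on a toy matrix. `term_score_sketch` is a hypothetical helper, not dfrtopics's own code, which may smooth or normalize differently, and the weights are invented.

```r
# tw: topics in rows, words in columns, strictly positive weights
term_score_sketch <- function(tw) {
    p <- tw / rowSums(tw)     # word probabilities within each topic
    log_p <- log(p)
    # subtract each word's mean log-probability over topics, i.e. divide
    # by its geometric mean across topics, then weight by p
    p * sweep(log_p, 2, colMeans(log_p))
}

tw <- matrix(c(5, 1, 4,
               5, 4, 1),
             nrow=2, byrow=TRUE,
             dimnames=list(c("topic1", "topic2"),
                           c("man", "ship", "love")))
scores <- term_score_sketch(tw)

raw_top <- names(which.max(tw["topic2", ]))        # ranked by raw weight
scored_top <- names(which.max(scores["topic2", ])) # ranked by score
c(raw=raw_top, scored=scored_top)
```

In this toy case "man" is equally probable in both topics, so its score is zero in each, and the second topic's top word shifts from "man" to "ship".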

  1. Not that the latter is anything fancy: just a loop and a bind_rows.
  2. Memory management in this context is a bit tricky, and you can hit limits before you really should because of the greedy habits of both Java and R. You can push things further by separating the data-loading, data-modeling, and data-analysis steps into separate R sessions, passing data from one to the next by saving files to disk. The greed is made worse by the fact that there is apparently no way to recover memory from Java except to restart R. And then I couldn’t figure out any way to make use of mallet and rJava in dfrtopics except to Depend on them, which means they are obligatorily loaded as soon as you load dfrtopics. If you aren’t using the MALLET parts of the package in a session, set a small Java heap size before loading it and then you will at least not be giving up too much to the JVM. [Note, July 20, 2016. Since dfrtopics v0.2.4, MALLET is only loaded when it is needed; if you aren’t loading instances or training models there is no need to worry about the Java heap size setting.] All of this is a tradeoff for the relative convenience and concision of R for data analysis. Relative.
  3. It’s simplest to output meta.csv and then zip this file, either on the command line or in R with zip("meta.csv.zip", "meta.csv", flags="-9Xj").