It’s gratifying, and a little frightening, when someone else uses your own code. Jonathan Goodwin has built on my dfrtopics and dfr-browser code-blobs to produce a fascinating visualization of topics in fiction 1920–1922, derived by modeling the genre-specific word frequencies data set from HathiTrust. He’s given a nice description of his process as well. In the process, Jonathan revealed some unnecessarily restrictive assumptions built into my code. He solved the problem by modifying my code: all praise to him! But then I felt bad and wanted to make it possible for others to go further without having to dig into the Area X that is my code. So I made a few adjustments to my new version of dfrtopics. Here are some notes on using the updated version to cope with the issues Jonathan found in processing and modeling the Hathi data.
The data is a series of moderately-sized files, the biggest of which are tarballs of word counts, one for each volume. In what follows I have saved the downloaded files into a folder called htrc-genre.
Let’s now prepare a slice of these files for modeling. First we load some handy packages from the “Hadleyverse”:
library(stringr)
library(readr)
library(dplyr)
library(lubridate)
Now the metadata supplied with the dataset is not, of course, formatted in the same (eccentric) way as what we get from JSTOR. So we will fall back on ordinary file-reading functions. Jonathan figured out how to read this in by just treating everything as a string, so that’s what we’ll do too. read_csv is from the readr package and gives us a little speed boost, but for this to work we have to tell it we want 19 string columns with a col_types argument:
md <- read_csv("htrc-genre/fiction_metadata.csv",
col_names=T, col_types=str_c(rep("c", 19), collapse=""))
read_csv makes correct guesses about the uses of quotation marks.
The only expectation dfrtopics really has about the metadata is that it has a unique identifier in a column called id. We actually have several unique identifiers here, but Jonathan showed that we can work with the htid, so let’s make that our id:
md <- md %>%
select(-recordid, -oclc) %>%
rename(id=htid)
This ID corresponds to the file names of individual word-count files, so we can now match filenames to metadata easily by just adding ".tsv" to the end of an ID.
dfrtopics’s time-series functions also expect to find document dates, as Date objects, in a pubdate column. This transformation is optional if we’re not interested in time series, but just to show how it’s done, here’s how we would take the year given in the supplied date field and make Dates using a lubridate function.
md <- md %>%
mutate(date=parse_date_time(date, "%Y")) %>%
rename(pubdate=date)
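If what is wanted is specifically a vector of Date objects, a base R route also works. This is only a minimal sketch, assuming the date field holds four-digit years as strings; the sample values are made up.

```r
# Toy year strings standing in for the metadata's date column (made up)
years <- c("1895", "1896", "1899")

# Treat each year as January 1 of that year and convert to Date
pubdate <- as.Date(paste0(years, "-01-01"))

class(pubdate)         # the class of the converted vector
format(pubdate, "%Y")  # recovers the original year strings
```

The January 1 convention is arbitrary, but it is the same one used further down when the metadata is reloaded from disk.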
Let’s just work with texts from the year 1895.
md <- md %>%
filter(year(pubdate) == 1895)
The next data-processing step is to consider the status of reprints. That cannot be done automatically in any easy way, though we might try to guess reprints by looking at near-duplicates in metadata or in wordcounts. Jonathan’s approach was to examine the list of titles by hand—which, in my view, is probably a lot faster and better than any automatic approach. But since this post is focused on the mechanics of using my package, I’m going to skip this step. The results will therefore reflect fiction books in HathiTrust dated to 1895 rather than new books in the same category. [Sept. 25. Corrected an error left over from an earlier draft where I said we were working on five years of books. The model in this example is just of one year’s worth.]
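For anyone who does want a rough first pass before checking titles by hand, near-duplicate metadata can be flagged with a normalized author-and-title key. The following is a sketch in base R on made-up records, not anything dfrtopics provides; only the column names title and author mirror the real metadata, and the normalization is deliberately crude.

```r
# Made-up records; only the column names title and author mirror the metadata
books <- data.frame(
  id     = c("a1", "a2", "a3", "a4"),
  author = c("Dickens, Charles", "Dickens, Charles",
             "Hardy, Thomas", "Dickens, Charles"),
  title  = c("Dombey and Son.", "Dombey and son",
             "Jude the Obscure", "Little Dorrit"),
  stringsAsFactors = FALSE
)

# Crude normalization: lowercase, drop punctuation, squeeze whitespace
normalize <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)
  trimws(gsub("\\s+", " ", x))
}

key <- paste(normalize(books$author), normalize(books$title), sep = " | ")

# Flag every record whose key occurs more than once
books$dup_candidate <- key %in% key[duplicated(key)]
books[books$dup_candidate, c("id", "title")]
```

Run against the full metadata table rather than a single year's slice, a key like this could also surface titles that appear under earlier dates. The output is only a list of candidates for the kind of hand inspection Jonathan did.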
Since we have blithely ignored this issue, we are ready to load the corpus. I extracted the tarball into a folder fiction_1895-1899.
fs <- md %>%
transmute(id,
filename=file.path("htrc-genre/fiction_1895-1899", id)) %>%
mutate(filename=str_c(filename, ".tsv"))
Let’s verify that we have all the files we want:
all(file.exists(fs$filename))
## [1] TRUE
dfrtopics has a function for loading these files into memory all together, read_wordcounts. By default it is customized to read files from JSTOR Data for Research, which are CSVs. But that isn’t quite the format this data is in:
readLines(fs$filename[1], n=3)
## [1] ",\t3258" ".\t2260" "the\t1709"
This is tab-separated and lacks a header. Thanks to Jonathan, though, I have updated dfrtopics to be able to cope with this case. I’ve taken a slightly more generalized approach. read_wordcounts now accepts a reader parameter which should be a function that takes a filename and returns a two-column data frame. The idea is that you, the package user, will figure out how to read in a single file, and then hand off the method to read_wordcounts for the bulk operation.1
read_hathi <- function (f) read_delim(f,
delim="\t", quote="", escape_backslash=F, na="",
col_names=F, col_types="ci")
read_delim is the readr version of read.table. We have to make sure it doesn’t try to parse any punctuation. We also have to work around the bugs in the current readr by ruling out zero-length files in advance:
fs <- fs %>%
filter(file.info(filename)$size != 0)
[Sept. 25. The next release of readr should fix this issue.]
We can test our read_hathi method:
read_hathi(fs$filename[1]) %>% head()
## # A tibble: 6 x 2
## X1 X2
## <chr> <int>
## 1 , 3258
## 2 . 2260
## 3 the 1709
## 4 i 1597
## 5 to 1336
## 6 and 1233
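To make the division of labor between a per-file reader and the bulk operation concrete, here is a rough sketch of the loop-and-bind pattern in base R. It is not the package’s actual code: read_tsv_counts and read_many are hypothetical names, the reader uses base read.table rather than readr, and the toy files are made up.

```r
# Hypothetical per-file reader: tab-separated, no header, word then count
read_tsv_counts <- function(f) {
  read.table(f, sep = "\t", quote = "", comment.char = "",
             header = FALSE, colClasses = c("character", "integer"),
             na.strings = "", stringsAsFactors = FALSE)
}

# Hypothetical bulk reader: apply `reader` to each file, tag the rows with
# the matching id, and stack the pieces into one long data frame
read_many <- function(files, ids, reader) {
  stopifnot(length(files) == length(ids))
  pieces <- vector("list", length(files))
  for (i in seq_along(files)) {
    x <- reader(files[i])
    pieces[[i]] <- data.frame(id = rep(ids[i], nrow(x)),
                              word = x[[1]], weight = x[[2]],
                              stringsAsFactors = FALSE)
  }
  do.call(rbind, pieces)
}

# Two toy word-count files written to a temporary directory
d <- tempfile("counts")
dir.create(d)
f <- file.path(d, c("doc1.tsv", "doc2.tsv"))
writeLines(c(",\t3", "the\t2"), f[1])
writeLines(c("the\t5", "whale\t1"), f[2])

toy_counts <- read_many(f, c("doc1", "doc2"), read_tsv_counts)
toy_counts
```

The point of the design is that only the small reader function has to know anything about the file format; the bulk function just needs two columns back, word first and count second.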
So far, we haven’t even loaded dfrtopics. But now we will, first ensuring that Java has a big enough memory allocation for modeling.
# Java is HUNGRY, SO HUNGRY
options(java.parameters="-Xmx4g")
library(dfrtopics)
Now we use the extended read_wordcounts, which requires filenames, matching IDs, and our reader function:
counts <- read_wordcounts(fs$filename, fs$id, read_hathi)
[Sept. 24. If you tried dfrtopics v0.2 in the last week, note a backwards-incompatible change in the newest version: the second parameter here is a vector of IDs, not a function for converting filenames to IDs.]
This is a “small data” analysis, since we are assuming we can manipulate the entire corpus in memory.2 The next few steps, which operate on this 15362570-row data frame, take several minutes each.
We are immediately going to trim this down by removing rare words and stopwords. I am borrowing here from Jonathan’s work. Let’s discard stopwords. A customized stoplist would be much better, but for demonstration purposes let’s stick with a default stoplist, with just a couple of additions:
stoplist <- readLines(file.path(path.package("dfrtopics"),
"stoplist", "stoplist.txt"))
stoplist <- c("'s", "n't", "said", "says", stoplist)
counts <- counts %>%
wordcounts_remove_stopwords(stoplist)
Let’s get rid of any token that is all non-word characters (again, a questionable choice—more careful postprocessing might be in order):
counts <- counts %>%
filter(str_detect(word, "\\w"))
Jonathan noted a problem with the ligature characters ﬁ (U+FB01) and ﬂ (U+FB02). We can convert them to ordinary letter pairs here:
counts <- counts %>%
mutate(word=str_replace_all(word, "\ufb01", "fi")) %>%
mutate(word=str_replace_all(word, "\ufb02", "fl"))
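To see what these two cleanup steps do, it helps to try them on a handful of tokens. The sketch below uses made-up tokens and the base R functions grepl and gsub as stand-ins for str_detect and str_replace_all, so it runs without any packages; the ligatures are written as Unicode escapes so they survive copying and pasting.

```r
# Made-up tokens, including two that contain ligature characters
tokens <- c("the", "--", "'s", "...", "\ufb01re", "re\ufb02ect")

# Keep tokens that contain at least one word character
kept <- tokens[grepl("\\w", tokens)]

# Expand the ligatures U+FB01 and U+FB02 into plain letter pairs
kept <- gsub("\ufb01", "fi", kept, fixed = TRUE)
kept <- gsub("\ufb02", "fl", kept, fixed = TRUE)
kept
```

Note that a token like 's survives the first filter because it contains a letter; only tokens made up entirely of punctuation are dropped.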
And now let’s get rid of all but the 20000-odd most frequent features:
counts <- counts %>%
wordcounts_remove_rare(20000)
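The idea behind this kind of pruning is to rank words by their total count over the whole corpus and keep the top n. Here is a sketch of that idea in base R on made-up counts; keep_top is a hypothetical helper, and the assumption that ties at the cutoff are all retained (which would explain "20000-odd") is mine rather than something checked against the package source.

```r
# Made-up long-format counts: one row per (document, word) pair
toy <- data.frame(
  id     = c("d1", "d1", "d1", "d2", "d2", "d2"),
  word   = c("man", "whale", "sea", "man", "sea", "ship"),
  weight = c(4L, 1L, 2L, 3L, 2L, 1L),
  stringsAsFactors = FALSE
)

# Total each word over the whole corpus
totals <- tapply(toy$weight, toy$word, sum)

# Keep words whose total ranks within the top n; ties at the cutoff all stay
keep_top <- function(counts, totals, n) {
  ranks <- rank(-totals, ties.method = "min")
  counts[counts$word %in% names(totals)[ranks <= n], ]
}

keep_top(toy, totals, 2)  # keeps only the rows for "man" and "sea"
keep_top(toy, totals, 3)  # "ship" and "whale" tie for third, so both stay
```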
The most frequent remaining word types, by the bye, are:
wordcounts_word_totals(counts) %>%
top_n(10, weight) %>%
arrange(desc(weight)) %>%
knitr::kable()
word | weight |
---|---|
man | 332797 |
little | 281853 |
time | 249873 |
know | 229252 |
old | 215765 |
two | 203552 |
never | 202500 |
made | 199534 |
mr. | 196236 |
good | 195139 |
ilist <- counts %>%
wordcounts_texts() %>%
make_instances(token.regex="\\S+")
The token.regex is a subtlety worth noting. The data from HathiTrust include punctuation at the start of or within tokens (it looks like). The importance of the apostrophe in dialect writing suggests we ought to see what happens if we keep it.
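The difference a token pattern makes is easy to see on a scrap of dialect. This sketch uses base R regular-expression matching on a made-up phrase; the tokenize helper is hypothetical, and the letters-only pattern is there purely for contrast, not as a claim about any package default.

```r
# Hypothetical helper: return every match of `pattern` in `text`
tokenize <- function(text, pattern) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

s <- "an' he wuz goin' ter git 'em"

# Any run of non-whitespace is a token, so apostrophes are kept
with_punct <- tokenize(s, "\\S+")

# Letters only: the apostrophes fall away
letters_only <- tokenize(s, "[A-Za-z]+")

with_punct
letters_only
```

With the first pattern, an' and 'em remain distinct word types; with the second they collapse into an and em, which would merge the dialect forms with unrelated words.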
At this point we should save these to disk so we don’t have to recreate them all the time when we model:
write_instances(ilist, "fiction1895.mallet")
(We could now reclaim the memory being taken up by counts, if we didn’t have any further use for these bags of words except as MALLET inputs.)
Next comes the modeling step. Jonathan tried 200 topics and 200 iterations. Since the quality of this work is certainly no more than half as good as his, I will do 100 topics. I just note that we can make the model exactly reproducible by setting the random seed as well:
m <- train_model("fiction1895.mallet", # or, equivalently, ilist
n_topics=100,
n_iters=200,
seed=18951899,
metadata=md)
Half an hour or so later, we can save this to disk, with
write_mallet_model(m, "fiction1895-k100-v20K")
and reload it with
m <- load_mallet_model_directory("fiction1895-k100-v20K",
load_topic_words=T)
[Sept. 25. Adding:] One extra note on loading models from disk. Just in case anyone is following these directions step by step: of course you don’t have to save and reload the model at all; you can always work directly with the result from train_model. I wanted to show how to save, however, because you don’t need to rerun the model from scratch every time. load_mallet_model_directory doesn’t load metadata automatically; it is not part of the model. So we have to load it separately here, using our read_csv command from above and recreating the id and pubdate columns (obviously I should have written it as a function):
metadata(m) <- read_csv("htrc-genre/fiction_metadata.csv",
col_names=T, col_types=str_c(rep("c", 19), collapse="")) %>%
mutate(id=htid, pubdate=str_c(date, "-01-01"))
(Only metadata rows corresponding to modeled documents will be kept.)
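Since the metadata-reading steps are needed twice, they can be wrapped in a small function. The following is a sketch, not part of dfrtopics: read_hathi_metadata is a hypothetical name, and it swaps in base R’s read.csv with colClasses="character" for the read_csv call, which has the side benefit of not needing the column count spelled out. The three-column toy file is made up and far narrower than the real table.

```r
# Hypothetical helper: read Hathi-style metadata with every column as a
# string, then add the id and pubdate columns expected downstream
read_hathi_metadata <- function(path) {
  md <- read.csv(path, colClasses = "character", check.names = FALSE)
  md$id <- md$htid
  md$pubdate <- paste0(md$date, "-01-01")
  md
}

# A made-up three-column file standing in for the real metadata
tmp <- tempfile(fileext = ".csv")
writeLines(c('htid,date,title',
             'abc.001,1895,"A Tale, Retold"',
             'abc.002,1895,Another Tale'), tmp)

toy_md <- read_hathi_metadata(tmp)
toy_md[, c("id", "pubdate", "title")]
```

With a helper like this, the reloading step would presumably reduce to assigning read_hathi_metadata("htrc-genre/fiction_metadata.csv") to metadata(m).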
As for the model itself, I’ve stuck a list of topic top words at the end of the post. And here is a questionable diagram:
topic_scaled_2d(m, n_words=500) %>%
plot_topic_scaled(labels=topic_labels(m, n=3))

For interactive browsing with dfr-browser, we need to export the data to the format expected by dfr-browser. Jonathan has created a modified version to work with the hathi-formatted metadata. Some such tweaking is necessary to adjust the browser’s expectations from journal articles to fiction volumes; Jonathan also forked dfrtopics to modify the export_browser_data function. For less expert users, I wanted to show the following alternate approach that doesn’t require any forking or tweaking of this kind. The only extra step needed is in the metadata export. If we run
export_browser_data(m, "browser1895", supporting_files=T, overwrite=T)
using the most recent dfrtopics version, we get all the files we asked for, with a warning about missing columns in the metadata. [July 19, 2016. Updates to dfrtopics changed the parameters to export_browser_data. Changed accordingly here.] A quick and dirty solution is to “cheat” the expectations of the package by giving it the metadata columns it’s looking for (these are listed in ?export_browser_data).
metadata(m) <- metadata(m) %>%
transmute(
id,
title,
author,
journaltitle=imprint, # no "journal" but let's stick publisher here
volume="", # these are expected but we'll leave them blank
issue="",
pubdate,
pagerange=totalpages)
export_browser_data(m, "browser1895/data", supporting_files=F, overwrite=T)
This gives a scrappy browser version of the model with titles and dates. Author names have been mangled and the bibliography is all wrong, and the time series display is meaningless. I haven’t put this online because the above does not involve any serious data-cleaning, much less substantive checking about the indicativeness of topics, and the resulting browser is less than ideal. But it at least lets you look around at topics, words, and documents. How about that Balzac, anyway?
A less dirty solution to the metadata export issue would be:
1. Call export_browser_data.
2. Overwrite the junk meta.csv.zip from (1) with metadata formatted as you wish.3
3. Modify dfr-browser to process and display the metadata appropriately, as Jonathan did.
I’m working on making step (3) easier, but in the meanwhile I hope these notes show how to handle data-variations on the bag-of-words theme.
Here are the generated topic top words for that year’s worth of published fiction. [Sept. 25. Adding this:] Just to highlight another feature of dfrtopics, I want to show how to change the weighting used to select “top words.” topic_labels(m) is meant for convenience, and it always ranks words by their raw weights within a topic. To penalize widespread words, we should transform the full topic-word matrix using a scoring scheme. tw_blei_lafferty implements a “relevance”-type scoring from Blei and Lafferty’s review article on topic models. It is used as follows:
top_words(m, 8, weighting=tw_blei_lafferty(m)) %>%
group_by(topic) %>%
summarize(`most relevant words`=str_c(word, collapse=" ")) %>%
knitr::kable()
topic | most relevant words |
---|---|
1 | old prue great rip ordener world pausanias little |
2 | sir nay hath man men standish saxon good |
3 | captain major colonel sergeant lieutenant general officer men |
4 | captain boat ship deck island sail crew vessel |
5 | mr. gideon herrick man captain cried pitman carthew |
6 | replied captain sir board ship vanslyperken father boat |
7 | baron crevel hulot madame marneffe baroness francs monsieur |
8 | poet given author nickname book english king literary |
9 | 1' mr. mrs. edmonstone violet think little dr. |
10 | mrs. think mr. know 'm eve hilliard thought |
11 | king prince princess huon queen sir knight quoth |
12 | moll dawson thyrza sanchez godwin coleridge egremont mr. |
13 | tlie king bodhisatta arnaud greatorex bede liis master |
14 | wyman devereux nixon man sophy mrs. hooper francisco |
15 | life love god heart nature years soul women |
16 | mr. mrs. boffin sir miss dear leicester wegg |
17 | white bang girl honour devil shot quoth god |
18 | squire dat nickleby old squeers just todhetley got |
19 | god jesus chrysostom man bishop church christian men |
20 | sir lord man old know lady say mr. |
21 | davies mrs. farrar cranston fenton captain leale ormsby |
22 | earl lavender lord fleetwood ormont aminta lady carinthia |
23 | man eyes face came looked woman back seemed |
24 | thuillier madame monsieur peyrade francs cerizet cibot schmucke |
25 | dombey philippe florence captain uncle flore major toots |
26 | temple hamilton hereward almina rigou michaud tonsard 'll |
27 | modeste eila love mamy mother madame canalis dumay |
28 | 'll 'd mrs. 're 've 'm know mr. |
29 | pole canon wargrave mrs. shimna miss countess mr. |
30 | little time young made too just make first |
31 | mr. captain men think faulkner go get army |
32 | madame monsieur man francs old two love young |
33 | birotteau mme francs popinot monsieur madame tillet cesarine |
34 | an' ter 'd mrs. goin' 'em yer git |
35 | love life woman women london loved husband years |
36 | grandet pierrette nanon mademoiselle atwood michu beaulieu rogron |
37 | chattaway ashwoode mr. trevlyn o'connor sir wetherell 'll |
38 | frau frere herr raffaello verneuil mademoiselle ekkehard camillo |
39 | valley snow came shot camp went yards two |
40 | ship came men shore captain told great go |
41 | knight answered sir lady man father castle master |
42 | duke answered men tommasino man lord horse lady |
43 | madame monsieur m. emperor mademoiselle bonaparte paris lefebvre |
44 | mr. mrs. pickwick sir miss replied weller gentleman |
45 | mrs. mr. miss know think little go thought |
46 | graslin bryn japanese inoya fawcett japan selden sauviat |
47 | little adolphe miss know uncle coryse yes chiffon |
48 | mother wife father husband woman child daughter house |
49 | dred captain mr. '11 ship boat man vidal |
50 | mr. mrs. 'll man go know get 've |
51 | mowgli jungle nial oona bagheera came till sorcha |
52 | shore island boat friday great found two made |
53 | lord bernicia sir lady miss colambre pomfret mr. |
54 | morse farrell chorker boys hillson men lieutenant castle |
55 | mrs. dodd bassett sir little came man mr. |
56 | king sir everard lord scotland earl cromwell colonel |
57 | indians indian mr. river fort men white camp |
58 | vince prince keawe man gondremark seraphina cried keola |
59 | somerset mr. man challoner merrick cried vandeleur prince |
60 | mother little father went came go child again |
61 | monsieur madame sallenauve mortsauf luigia l'estorade rastignac 1'estorade |
62 | ormond claes balthazar ulick sir connal mcknight annaly |
63 | marius man cosette valjean ammalat cavallo landes seidel |
64 | sir maitland say mr. never 'd 'll man |
65 | 8vo cloth crown 6d extra illustrated illustrations a. |
66 | wuz sez 'em jorkaway jest veryan holt sech |
67 | king queen sire majesty replied crichton madame duke |
68 | austin colonel godefroid norton hertford tip newville fan |
69 | mr. peregrine honour person fortune great mrs. pickle |
70 | replied eyes exclaimed answered cried suddenly face moment |
71 | elfride knight trilby swancourt billee taffy svengali smith |
72 | sancho quixote knight 1 cervantes quoth dulcinea panza |
73 | lucian vizard burley calvert severne daisy man kike |
74 | mr. mrs. miss little know sir yon peggotty |
75 | griffith gaunt valerio derues hugues lamotte ryder replied |
76 | rendalen kallem fru life s6nya aniuta milla always |
77 | little old came went great go man mother |
78 | monsieur madame woman man francs bixiou paris mademoiselle |
79 | pecksniff mrs. sir yon mr. man miss know |
80 | mrs. mr. miss sir went know came go |
81 | bathsheba sue boldwood troy oak tess woman arabella |
82 | king knights gervaise count moors wulf granada great |
83 | peyton answered carminella know sigmund think emmeran mother |
84 | dorrit clennam olive strang myrtle meagles miss thornham |
85 | nature neuchamp stafford honour gentleman therefore old state |
86 | mr. hazel wardlaw rolleston penfold wylie sweetheart man |
87 | sir lady money told say know house knew |
88 | ursule monsieur ginevra madame minoret francs savinien georges |
89 | rae caesar 'll grannie man 'm 've aw |
90 | hath king men man spake went came great |
91 | love angel woman life countess vicar heart eyes |
92 | bishop ewan deemster davy thorkell man sunlocks face |
93 | old boys school dr. great doctor college father |
94 | man benassis genestas old good two nothing woman |
95 | shee love theagenes hath good cariclia god selfe |
96 | an' a' hae minister tae auld weel ken |
97 | love heart pan know life man dear soul |
98 | mr. man think i. lord thought came catriona |
99 | man went men came two go time got |
100 | calyste madame camusot monsieur baron beatrix woman corentin |
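For readers curious about what the weighting behind this table does, the term score in Blei and Lafferty’s article multiplies each topic-word probability by the log of its ratio to the geometric mean of that word’s probabilities across all topics. Below is a minimal sketch on a made-up matrix; term_score is a hypothetical function, not the package’s code, and it assumes all probabilities are strictly positive (as they would be after smoothing).

```r
# Made-up topic-word probabilities: 2 topics (rows) by 3 words (columns)
b <- rbind(c(0.5, 0.4, 0.1),
           c(0.5, 0.1, 0.4))
colnames(b) <- c("man", "ship", "love")

# Term score: b * (log b - mean over topics of log b), computed per word.
# Subtracting the column mean of the logs is the same as dividing by the
# geometric mean of the word's probabilities across topics.
term_score <- function(b) {
  lb <- log(b)
  b * sweep(lb, 2, colMeans(lb))
}

scores <- term_score(b)
round(scores, 3)
```

In this toy case "man" is equally probable in both topics, so its score is zero in each, while "ship" and "love" score positively only in the topic where they are concentrated. That is the sense in which the scheme penalizes widespread words.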
1. Not that the latter is anything fancy: just a loop and a bind_rows. ↩︎
2. Memory management in this context is a bit tricky, and you can hit limits before you really should because of the greedy habits of both Java and R. You can push things further by separating the data-loading, data-modeling, and data-analysis steps into separate R sessions, passing data from one to the next by saving files to disk. The greed is made worse by the fact that there is apparently no way to recover memory from Java except to restart R. And then I couldn’t figure out any way to make use of mallet and rJava in dfrtopics except to Depend on them, which means they are obligatorily loaded as soon as you load dfrtopics. If you aren’t using the MALLET parts of the package in a session, set a small Java heap size before loading it and then you will at least not be giving up too much to the JVM. [Note, July 20, 2016. Since dfrtopics v0.2.4, MALLET is only loaded when it is needed; if you aren’t loading instances or training models there is no need to worry about the Java heap size setting.] All of this is a tradeoff for the relative convenience and concision of R for data analysis. Relative. ↩︎
3. It’s simplest to output meta.csv and then zip this file, either on the command line or in R with zip("meta.csv.zip", "meta.csv", flags="-9Xj"). ↩︎