In my line of work—and in my social-media universe—I seem to see a lot of rankings of categories: most frequent words in a text corpus, most frequent genres in a bibliographic listing, most frequent nationalities in a group of authors… I’ve often wondered how much uncertainty lies behind such rankings, but I had no idea how to quantify it. Then I caught a link (via @coulmont) to a short essay on the subject: Michael Höhle, “Rank Uncertainty: Why the ‘Most Popular’ Baby Names Might Not Be the Most Popular” (available as an OA preprint), the basics of which I was able to follow, thanks to its clear exposition and accompanying R code.

Höhle explains that even when one has a full census of the names of all the babies born in a year, there is still uncertainty associated with ranking the names: the true propensity in the population to give a particular name is not fully revealed by the actual total. Höhle suggests we can begin by modeling the uncertainty arising from the fact that (1) even if the birth rate is fixed, there are “chance” variations in the number of births, and (2) even if the probabilities of names are fixed, each draw from the bag of names has some randomness (little Jacob might well have been William). The rank uncertainty can then be found by simulating a year’s worth of names many times and finding the popularity ranks in each simulation.
This seemed like a useful method to have ready the next time I look at a “top 10 list” for evidence—a way to replace listicles with statilisticles. And the data Höhle uses, the UK Office for National Statistics lists of baby names and sexes for 2015, has recently been made available in a convenient R package by Thomas Leeper. So I decided to try to understand the point by reproducing the uncertainties in Höhle’s essay. Here’s my code:
```r
library(tidyverse)
library(doMC)
library(foreach)
registerDoMC(cores=4)

# data: US and UK
library(babynames)
library(ukbabynames)

library(knitr) # for kable

# leave nothing to pseudo-chance!
set.seed(20170623)
```
The simulation is a simple hierarchical model: the number of individuals is first drawn from a Poisson distribution with mean $N$, and then each individual’s category membership is drawn from a multinomial with a vector of category probabilities $p$.
```r
simulate <- function (N, p) rpois(1, N) %>%
    rmultinom(1, size=., prob=p)
```
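Just to see it working, here’s a quick check with some made-up probabilities (the numbers are invented for illustration, not taken from the name data):

```r
# toy check: about 1000 individuals split across three categories
# with probabilities 0.5, 0.3, 0.2
simulate(1000, c(0.5, 0.3, 0.2))
```

The result is a one-column matrix of counts whose total wobbles around 1000 from run to run, which is just what the Poisson step is there for.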
We’ll repeat the simulation many times using `foreach::times` and then use the percentile method to obtain bootstrapped confidence intervals for the ranks. Here is my function for this, closely modeled on Höhle’s, but with some effort to make it something you could use for a non-baby-name application. One subtlety is that in the case of names, the ranks of interest are actually ranks within sexes, so we have to keep track of this grouping factor. I’ll do that by making the ranking function a parameter `r` and passing in a function that respects this grouping (the default `r` parameter does not).[^1]
```r
bootstrap_ranks <- function (x,  # data frame with counts in rows
        var,                     # name of column with counts
        r=function (x) min_rank(desc(x[[var]])),
        rank_threshold=Inf,
        level=0.95,              # confidence level
        n_reps=1000              # no. of simulations
        ) {
    # ignore grouping on x for these parameters
    x <- ungroup(x)
    N <- sum(x[[var]])
    p <- x[[var]] / N
    # data ranks
    ranks <- r(x)
    # rank mask
    keep <- ranks <= rank_threshold

    # parametric bootstrap: sample ranks (then apply mask)
    ranks_boot <- times(n_reps) %dopar% {
        x[[var]] <- simulate(N, p)
        list(r(x)[keep])
    }
    # tack data on as first replicate
    ranks_boot <- c(list(ranks[keep]), ranks_boot)
    # transpose to a list of ranks for each name
    ranks_boot <- ranks_boot %>%
        transpose() %>%
        map(flatten_int)

    # return rows from x with simulated ranks and CIs
    x %>% filter(keep) %>%
        mutate(rank_samples=ranks_boot) %>%
        # add percentile CIs; type=3 following Höhle
        mutate(ci=map(rank_samples, quantile,
                      prob=c((1 - level) / 2, 1 - (1 - level) / 2), type=3))
}
```
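As a quick smoke test before the real data, here’s a call on some made-up counts (the items and numbers below are invented; with 400 against 390 I’d expect the top two rank intervals to overlap, while the bottom two should stay put):

```r
toy <- tibble(item=c("a", "b", "c", "d"), n=c(400, 390, 150, 60))
toy %>%
    bootstrap_ranks(var="n", n_reps=1000) %>%
    select(item, n, ci)
```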
Now let’s try to reproduce Höhle’s table with UK names. To set this up we need a gender-specific ranking function and a function for extracting what he calls the “uncertainty-corrected rank,” UCR, which is the lower bound of the bootstrap confidence interval for the rank—that is to say, the upper bound of our estimate for the name’s position in the league table. If two names have the same UCR we can’t reject the hypothesis that their ranks are the same at the given confidence level.[^2]
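To make the UCR concrete, here’s what the interval looks like for a handful of invented rank samples (nothing to do with the real simulation): the percentile method just picks off order statistics of the sampled ranks, and the UCR is the lower one.

```r
# toy illustration: eight made-up sampled ranks for a single name
toy_ranks <- c(4L, 5L, 4L, 6L, 7L, 5L, 4L, 6L)
quantile(toy_ranks, probs=c(0.025, 0.975), type=3) # lower bound (UCR) is 4
```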
```r
# function for ranking names within sexes: returns a vector
name_rank <- . %>%
    group_by(sex) %>%
    mutate(rank=min_rank(desc(n))) %>%
    pull(rank) # dplyr 0.7

name_bootstraps <- . %>%
    filter(year == 2015) %>%
    select(name, sex, n) %>%
    group_by(sex) %>%
    mutate(p=n / sum(n)) %>%
    # simulate
    bootstrap_ranks(var="n", r=name_rank,
                    rank_threshold=16, level=0.95, n_reps=10000)

K_actual <- 697788 # according to Höhle

ucr_table <- . %>%
    # extract CI lower bound
    mutate(ucr=map_int(ci, 1)) %>%
    # format table
    arrange(sex, ucr) %>%
    mutate(p=scales::percent(p), gsp=scales::percent(n / K_actual)) %>%
    filter(ucr <= 14) %>%
    select(sex, name, UCR=ucr, `% of sex`=p, `% of all`=gsp) %>%
    kable()
```
The simulation with 10000 replicates takes a couple of minutes on my computer.
```r
ukbabynames %>% name_bootstraps() %>% ucr_table()
```
sex | name | UCR | % of sex | % of all |
---|---|---|---|---|
F | Amelia | 1 | 1.68% | 0.739% |
F | Olivia | 2 | 1.58% | 0.695% |
F | Emily | 3 | 1.26% | 0.558% |
F | Isla | 4 | 1.13% | 0.498% |
F | Ava | 4 | 1.11% | 0.489% |
F | Ella | 6 | 0.98% | 0.434% |
F | Jessica | 6 | 0.95% | 0.421% |
F | Isabella | 7 | 0.93% | 0.412% |
F | Mia | 7 | 0.92% | 0.407% |
F | Poppy | 7 | 0.91% | 0.404% |
F | Sophie | 8 | 0.90% | 0.398% |
F | Sophia | 9 | 0.89% | 0.393% |
F | Lily | 10 | 0.88% | 0.389% |
F | Grace | 11 | 0.86% | 0.381% |
F | Evie | 14 | 0.82% | 0.362% |
M | Oliver | 1 | 2.08% | 0.995% |
M | Jack | 2 | 1.61% | 0.770% |
M | Harry | 2 | 1.59% | 0.761% |
M | George | 4 | 1.46% | 0.698% |
M | Jacob | 4 | 1.46% | 0.695% |
M | Charlie | 4 | 1.45% | 0.692% |
M | Noah | 7 | 1.24% | 0.594% |
M | William | 7 | 1.23% | 0.585% |
M | Thomas | 7 | 1.22% | 0.584% |
M | Oscar | 7 | 1.22% | 0.583% |
M | James | 10 | 1.17% | 0.561% |
M | Muhammad | 12 | 1.12% | 0.535% |
M | Henry | 12 | 1.07% | 0.513% |
M | Alfie | 13 | 1.06% | 0.507% |
M | Leo | 13 | 1.04% | 0.497% |
M | Joshua | 14 | 1.02% | 0.486% |
Now let’s look at analogous results for the US in 2015, making use of the `babynames` R package:
```r
us_ranks <- babynames %>% name_bootstraps()
us_ranks %>% ucr_table()
```
sex | name | UCR | % of sex | % of all |
---|---|---|---|---|
F | Emma | 1 | 1.150% | 2.92% |
F | Olivia | 2 | 1.105% | 2.80% |
F | Sophia | 3 | 0.979% | 2.48% |
F | Ava | 4 | 0.920% | 2.33% |
F | Isabella | 5 | 0.876% | 2.22% |
F | Mia | 6 | 0.838% | 2.12% |
F | Abigail | 7 | 0.696% | 1.76% |
F | Emily | 8 | 0.663% | 1.68% |
F | Charlotte | 9 | 0.640% | 1.62% |
F | Harper | 10 | 0.579% | 1.47% |
F | Madison | 10 | 0.567% | 1.44% |
F | Amelia | 11 | 0.554% | 1.40% |
F | Elizabeth | 12 | 0.546% | 1.38% |
F | Sofia | 12 | 0.545% | 1.38% |
M | Noah | 1 | 1.028% | 2.80% |
M | Liam | 2 | 0.963% | 2.62% |
M | Mason | 3 | 0.871% | 2.37% |
M | Jacob | 4 | 0.833% | 2.27% |
M | William | 4 | 0.833% | 2.27% |
M | Ethan | 6 | 0.789% | 2.15% |
M | James | 6 | 0.774% | 2.11% |
M | Alexander | 7 | 0.762% | 2.07% |
M | Michael | 8 | 0.754% | 2.05% |
M | Benjamin | 10 | 0.717% | 1.95% |
M | Elijah | 10 | 0.712% | 1.94% |
M | Daniel | 10 | 0.706% | 1.92% |
M | Aiden | 10 | 0.705% | 1.92% |
M | Logan | 14 | 0.677% | 1.84% |
M | Matthew | 14 | 0.666% | 1.81% |
```r
rank_plot <- function (x, s) {
    x <- x %>% filter(sex==s) %>%
        mutate(pop_rank=map_int(rank_samples, 1),
               name=reorder(name, pop_rank))
    csets <- x %>%
        select(name, rank_samples) %>%
        unnest(rank_samples) %>%
        group_by(name) %>%
        count(rank_samples)
    cis <- x %>%
        select(name, pop_rank, ci) %>%
        # integers, you know?
        mutate(lb=map_int(ci, 1) - 0.5, ub=map_int(ci, 2) + 0.5)

    ggplot(csets) +
        geom_raster(aes(name, rank_samples, fill=n, alpha=n)) +
        scale_fill_gradient(low="white", high="black", guide="none") +
        # alpha scaling because my website doesn't have a white background
        # and I'm fussy
        scale_alpha_continuous(range=c(0, 1), guide="none") +
        geom_errorbar(data=cis, aes(name, ymin=lb, ymax=ub), color="blue") +
        geom_point(data=cis, aes(name, pop_rank), color="blue") +
        theme(axis.text.x=element_text(angle=45, hjust=1),
              panel.grid=element_blank()) +
        xlab(NULL) + ylab("rank")
}
```
```r
rank_plot(us_ranks, "F")
rank_plot(us_ranks, "M")
```
As always when I try to learn something with statistics, what I learn is that I know less than I thought I knew. Use my code, and tell your friends!
Edited, 6/23/17: Set the random seed (reproducibility! the UCR of “Grace” in the UK actually did fluctuate). To get exactly the graphics here you’d also need the custom R markdown format I created to work with Hugo: hugormd. Also replaced an abuse of the term “confidence sets”; I have no confidence when talking about confidence.
[^1]: I tried to come up with a fully idiomatic “tidyverse” way of doing this in dplyr and modelr that would respect the grouping seamlessly, but I couldn’t. Subsetting the data frame `x` by a variable name `var` is also not idiomatic, but I am too lazy to figure out “tidy evaluation” right now. And no doubt this function could be improved in terms of efficiency too. There is an implicit copying of `x` at every simulation run when we modify the `var` column—but I think recent versions of R don’t actually copy over the unchanged columns.

[^2]: If but not only if: for that we need the lower bounds too, visualized further on.