# Statilisticles

kludgetastic

In my line of work—and in my social-media universe—I seem to see a lot of ranking of categories: most frequent words in a text corpus, most frequent genres in a bibliographic listing, most frequent nationalities in a group of authors… I’ve often wondered how much uncertainty lies behind rankings but had no idea how to quantify this uncertainty. I caught a link (via @coulmont) to a short essay on the subject: Michael Höhle, “Rank Uncertainty: Why the ‘Most Popular’ Baby Names Might Not Be the Most Popular” (available in OA preprint), the basics of which I was able to follow, thanks to its clear exposition and accompanying R code. Höhle explains that even when one has a full census of the names of all the babies born in a year, there is still uncertainty associated with ranking the names: the true propensity in the population to give a particular name is not fully revealed by the actual total. Höhle suggests we can begin by modeling the uncertainty arising from the fact that (1) even if the birth rate is fixed, there are “chance” variations in the number of births and (2) even if the probabilities of names are fixed, each draw from the bag of names has some randomness (little Jacob might well have been William). The rank uncertainty can then be found by simulating a year’s worth of names many times and finding popularity ranks in each simulation.

This seemed like a useful method to be able to use the next time I look at a “top 10 list” for evidence—a way to replace listicles with statilisticles. And the data Höhle uses, the British Office for National Statistics lists of baby names and sexes for 2015, has recently been made available in a convenient R package by Thomas Leeper. So I decided to try to understand the point by reproducing the uncertainties in Höhle’s essay. Here’s my code:

``````library(tidyverse)
library(doMC)
library(foreach)
registerDoMC(cores=4)

# data: US and UK
library(babynames)
library(ukbabynames)

library(knitr) # for kable

# leave nothing to pseudo-chance!
set.seed(20170623)``````

The simulation is a simple hierarchical model: the number of individuals is first drawn from a Poisson distribution with mean `N`, and then each individual’s category membership is drawn from a multinomial with a vector of category probabilities `p`.

``````simulate <- function (N, p) rpois(1, N) %>%
rmultinom(1, size=., prob=p)``````

We’ll repeat the simulation many times using `doMC::times` and then use the percentile method to obtain bootstrapped confidence intervals for the ranks. Here is my function for this, closely modeled on Höhle’s, but with some effort to make this something you could use for a non-baby-name application. One subtlety is that in the case of names, the ranks of interest are actually ranks within sexes, so we have to keep track of this grouping factor. I’ll do that by making the ranking function a parameter `r` and passing in a function that respects this grouping (the default `r` parameter does not).1

``````bootstrap_ranks <- function (x, # data frame with counts in rows
var, # name of column with counts
r=function (x) min_rank(desc(x[[var]])),
rank_threshold=Inf,
level=0.95, # confidence level
n_reps=1000 # no. of simulations
) {
# ignore grouping on x for these parameters
x <- ungroup(x)
N <- sum(x[[var]])
p <- x[[var]] / N

# data ranks
ranks <- r(x)
keep <- ranks <= rank_threshold

# parametric bootstrap: sample ranks (then apply mask)
ranks_boot <- times(n_reps) %dopar% {
x[[var]] <- simulate(N, p)
list(r(x)[keep])
}

# tack data on as first replicate
ranks_boot <- c(list(ranks[keep]), ranks_boot)

# transpose to a list of ranks for each name
ranks_boot <- ranks_boot %>%
transpose() %>%
map(flatten_int)

# return rows from with simulated ranks and CIs
x %>% filter(keep) %>%
mutate(rank_samples=ranks_boot) %>%
# add percentile CIs; type=3 following Höhle
mutate(ci=map(rank_samples, quantile,
prob=c((1 - level) / 2, 1 - (1 - level) / 2), type=3))
}``````

Now let’s try to reproduce Höhle’s table with UK names. To set this up we need a gender-specific ranking function and a function for extracting what he calls the “uncertainty-corrected rank,” UCR, which is the lower bound of the bootstrap confidence interval for the rank—that is to say, the upper bound of our estimate for the name’s position in the league table. If two names have the same UCR we can’t reject the hypothesis that their ranks are the same at the given confidence level.2

``````# function for ranking names within sexes: returns a vector
name_rank <- . %>%
group_by(sex) %>%
mutate(rank=min_rank(desc(n))) %>%
pull(rank) # dplyr 0.7

name_bootstraps <- . %>%
filter(year == 2015) %>%
select(name, sex, n) %>%
group_by(sex) %>%
mutate(p=n / sum(n)) %>%
# simulate
bootstrap_ranks(var="n", r=name_rank,
rank_threshold=16, level=0.95, n_reps=10000)

K_actual <- 697788 # according to Höhle

ucr_table <- . %>%
# extract CI lower bound
mutate(ucr=map_int(ci, 1)) %>%
# format table
arrange(sex, ucr) %>%
mutate(p=scales::percent(p), gsp=scales::percent(n / K_actual)) %>%
filter(ucr <= 14) %>%
select(sex, name, UCR=ucr, `% of sex`=p, `% of all`=gsp) %>%
kable()``````

The simulation with 10000 replicates takes a couple of minutes on my computer.

``ukbabynames %>% name_bootstraps() %>% ucr_table()``
sex name UCR % of sex % of all
F Amelia 1 1.68% 0.739%
F Olivia 2 1.58% 0.695%
F Emily 3 1.26% 0.558%
F Isla 4 1.13% 0.498%
F Ava 4 1.11% 0.489%
F Ella 6 0.98% 0.434%
F Jessica 6 0.95% 0.421%
F Isabella 7 0.93% 0.412%
F Mia 7 0.92% 0.407%
F Poppy 7 0.91% 0.404%
F Sophie 8 0.90% 0.398%
F Sophia 9 0.89% 0.393%
F Lily 10 0.88% 0.389%
F Grace 11 0.86% 0.381%
F Evie 14 0.82% 0.362%
M Oliver 1 2.08% 0.995%
M Jack 2 1.61% 0.770%
M Harry 2 1.59% 0.761%
M George 4 1.46% 0.698%
M Jacob 4 1.46% 0.695%
M Charlie 4 1.45% 0.692%
M Noah 7 1.24% 0.594%
M William 7 1.23% 0.585%
M Thomas 7 1.22% 0.584%
M Oscar 7 1.22% 0.583%
M James 10 1.17% 0.561%
M Henry 12 1.07% 0.513%
M Alfie 13 1.06% 0.507%
M Leo 13 1.04% 0.497%
M Joshua 14 1.02% 0.486%
The rankings do indeed reproduce Höhle. The proportions given by Höhle as "gender-specific relative frequencies" are the proportions of the name-sex combination in the whole population; to reproduce these we have to use his denominator. Höhle gives the total number of individuals as 697788 whereas the total in `ukbabynames` is 641216. Leeper mentions that his dataset is restricted to names occuring at least three times, whereas Höhle seems to have had the full list.

Now let’s look at analogous results for the US in 2015, making use of the babynames R package:

``````us_ranks <- babynames %>% name_bootstraps()
us_ranks %>% ucr_table()``````
sex name UCR % of sex % of all
F Emma 1 1.150% 2.92%
F Olivia 2 1.105% 2.80%
F Sophia 3 0.979% 2.48%
F Ava 4 0.920% 2.33%
F Isabella 5 0.876% 2.22%
F Mia 6 0.838% 2.12%
F Abigail 7 0.696% 1.76%
F Emily 8 0.663% 1.68%
F Charlotte 9 0.640% 1.62%
F Harper 10 0.579% 1.47%
F Amelia 11 0.554% 1.40%
F Elizabeth 12 0.546% 1.38%
F Sofia 12 0.545% 1.38%
M Noah 1 1.028% 2.80%
M Liam 2 0.963% 2.62%
M Mason 3 0.871% 2.37%
M Jacob 4 0.833% 2.27%
M William 4 0.833% 2.27%
M Ethan 6 0.789% 2.15%
M James 6 0.774% 2.11%
M Alexander 7 0.762% 2.07%
M Michael 8 0.754% 2.05%
M Benjamin 10 0.717% 1.95%
M Elijah 10 0.712% 1.94%
M Daniel 10 0.706% 1.92%
M Aiden 10 0.705% 1.92%
M Logan 14 0.677% 1.84%
M Matthew 14 0.666% 1.81%
Höhle has a nice visualization of rank spread too. As a variation, we'll display the bootstrapped rank sampling distributions directly in a heatmap, overlaying the confidence intervals.[^3] Höhle does plots for UK names. Here are plots for US baby names from 2015:
``````rank_plot <- function (x, s) {
x <- x %>% filter(sex==s) %>%
mutate(pop_rank=map_int(rank_samples, 1),
name=reorder(name, pop_rank))

csets <- x %>%
select(name, rank_samples) %>%
unnest(rank_samples) %>%
group_by(name) %>%
count(rank_samples)

cis <- x %>%
select(name, pop_rank, ci) %>%
# integers, you know?
mutate(lb=map_int(ci, 1) - 0.5, ub=map_int(ci, 2) + 0.5)

ggplot(csets) +
geom_raster(aes(name, rank_samples, fill=n, alpha=n)) +
# alpha scaling because my website doesn't have a white background
# and I'm fussy
scale_alpha_continuous(range=c(0, 1), guide="none") +
geom_errorbar(data=cis, aes(name, ymin=lb, ymax=ub), color="blue") +
geom_point(data=cis, aes(name, pop_rank), color="blue") +
theme(axis.text.x=element_text(angle=45, hjust=1),
panel.grid=element_blank()) +
xlab(NULL) + ylab("rank")
}

rank_plot(us_ranks, "F")``````
``rank_plot(us_ranks, "M")``
1. I tried to come up with a fully idiomatic “tidyverse” way of doing this in dplyr and modelr that would respect the grouping seamlessly but I couldn’t. Subsetting the data frame `x` by a variable name `var` is also not idiomatic, but I am too lazy to figure out “tidy evaluation” right now. And no doubt this function could be improved in terms of efficiency too. There is an implicit copying of `x` at every simulation run when we modify the `var` column—but I think recent versions of R don’t actually copy over the unchanged columns. ↩︎