Statilisticles

kludgetastic

In my line of work—and in my social-media universe—I seem to see a lot of ranking of categories: most frequent words in a text corpus, most frequent genres in a bibliographic listing, most frequent nationalities in a group of authors… I’ve often wondered how much uncertainty lies behind rankings but had no idea how to quantify this uncertainty. I caught a link (via @coulmont) to a short essay on the subject: Michael Höhle, “Rank Uncertainty: Why the ‘Most Popular’ Baby Names Might Not Be the Most Popular” (available in OA preprint), the basics of which I was able to follow, thanks to its clear exposition and accompanying R code. Höhle explains that even when one has a full census of the names of all the babies born in a year, there is still uncertainty associated with ranking the names: the true propensity in the population to give a particular name is not fully revealed by the actual total. Höhle suggests we can begin by modeling the uncertainty arising from the fact that (1) even if the birth rate is fixed, there are “chance” variations in the number of births and (2) even if the probabilities of names are fixed, each draw from the bag of names has some randomness (little Jacob might well have been William). The rank uncertainty can then be found by simulating a year’s worth of names many times and finding popularity ranks in each simulation.

This seemed like a useful method to be able to use the next time I look at a “top 10 list” for evidence—a way to replace listicles with statilisticles. And the data Höhle uses, the British Office for National Statistics lists of baby names and sexes for 2015, has recently been made available in a convenient R package by Thomas Leeper. So I decided to try to understand the point by reproducing the uncertainties in Höhle’s essay. Here’s my code:

library(tidyverse)
library(doMC)
library(foreach)
registerDoMC(cores=4)

# data: US and UK
library(babynames)
library(ukbabynames)

library(knitr) # for kable

# leave nothing to pseudo-chance!
set.seed(20170623)

The simulation is a simple hierarchical model: the number of individuals is first drawn from a Poisson distribution with mean N, and then each individual’s category membership is drawn from a multinomial with a vector of category probabilities p.

simulate <- function (N, p) rpois(1, N) %>%
    rmultinom(1, size=., prob=p)

We’ll repeat the simulation many times using doMC::times and then use the percentile method to obtain bootstrapped confidence intervals for the ranks. Here is my function for this, closely modeled on Höhle’s, but with some effort to make this something you could use for a non-baby-name application. One subtlety is that in the case of names, the ranks of interest are actually ranks within sexes, so we have to keep track of this grouping factor. I’ll do that by making the ranking function a parameter r and passing in a function that respects this grouping (the default r parameter does not).1

bootstrap_ranks <- function (x, # data frame with counts in rows
                             var, # name of column with counts
                             r=function (x) min_rank(desc(x[[var]])),
                             rank_threshold=Inf,
                             level=0.95, # confidence level
                             n_reps=1000 # no. of simulations
                             ) {
    # ignore grouping on x for these parameters
    x <- ungroup(x)
    N <- sum(x[[var]])
    p <- x[[var]] / N

    # data ranks
    ranks <- r(x)
    # rank mask
    keep <- ranks <= rank_threshold

    # parametric bootstrap: sample ranks (then apply mask)
    ranks_boot <- times(n_reps) %dopar% {
        x[[var]] <- simulate(N, p)
        list(r(x)[keep])
    }

    # tack data on as first replicate
    ranks_boot <- c(list(ranks[keep]), ranks_boot)

    # transpose to a list of ranks for each name
    ranks_boot <- ranks_boot %>%
        transpose() %>%
        map(flatten_int)

    # return rows from with simulated ranks and CIs
    x %>% filter(keep) %>%
        mutate(rank_samples=ranks_boot) %>%
        # add percentile CIs; type=3 following Höhle
        mutate(ci=map(rank_samples, quantile,
                      prob=c((1 - level) / 2, 1 - (1 - level) / 2), type=3))
}

Now let’s try to reproduce Höhle’s table with UK names. To set this up we need a gender-specific ranking function and a function for extracting what he calls the “uncertainty-corrected rank,” UCR, which is the lower bound of the bootstrap confidence interval for the rank—that is to say, the upper bound of our estimate for the name’s position in the league table. If two names have the same UCR we can’t reject the hypothesis that their ranks are the same at the given confidence level.2

# function for ranking names within sexes: returns a vector
name_rank <- . %>%
    group_by(sex) %>% 
    mutate(rank=min_rank(desc(n))) %>%
    pull(rank) # dplyr 0.7

name_bootstraps <- . %>%
    filter(year == 2015) %>%
    select(name, sex, n) %>%
    group_by(sex) %>%
    mutate(p=n / sum(n)) %>%
    # simulate
    bootstrap_ranks(var="n", r=name_rank,
                    rank_threshold=16, level=0.95, n_reps=10000)

K_actual <- 697788 # according to Höhle

ucr_table <- . %>%
    # extract CI lower bound
    mutate(ucr=map_int(ci, 1)) %>%
    # format table
    arrange(sex, ucr) %>%
    mutate(p=scales::percent(p), gsp=scales::percent(n / K_actual)) %>%
    filter(ucr <= 14) %>%
    select(sex, name, UCR=ucr, `% of sex`=p, `% of all`=gsp) %>%
    kable()

The simulation with 10000 replicates takes a couple of minutes on my computer.

ukbabynames %>% name_bootstraps() %>% ucr_table()
sex name UCR % of sex % of all
F Amelia 1 1.68% 0.739%
F Olivia 2 1.58% 0.695%
F Emily 3 1.26% 0.558%
F Isla 4 1.13% 0.498%
F Ava 4 1.11% 0.489%
F Ella 6 0.98% 0.434%
F Jessica 6 0.95% 0.421%
F Isabella 7 0.93% 0.412%
F Mia 7 0.92% 0.407%
F Poppy 7 0.91% 0.404%
F Sophie 8 0.90% 0.398%
F Sophia 9 0.89% 0.393%
F Lily 10 0.88% 0.389%
F Grace 11 0.86% 0.381%
F Evie 14 0.82% 0.362%
M Oliver 1 2.08% 0.995%
M Jack 2 1.61% 0.770%
M Harry 2 1.59% 0.761%
M George 4 1.46% 0.698%
M Jacob 4 1.46% 0.695%
M Charlie 4 1.45% 0.692%
M Noah 7 1.24% 0.594%
M William 7 1.23% 0.585%
M Thomas 7 1.22% 0.584%
M Oscar 7 1.22% 0.583%
M James 10 1.17% 0.561%
M Muhammad 12 1.12% 0.535%
M Henry 12 1.07% 0.513%
M Alfie 13 1.06% 0.507%
M Leo 13 1.04% 0.497%
M Joshua 14 1.02% 0.486%
The rankings do indeed reproduce Höhle. The proportions given by Höhle as "gender-specific relative frequencies" are the proportions of the name-sex combination in the whole population; to reproduce these we have to use his denominator. Höhle gives the total number of individuals as 697788 whereas the total in `ukbabynames` is 641216. Leeper mentions that his dataset is restricted to names occuring at least three times, whereas Höhle seems to have had the full list.

Now let’s look at analogous results for the US in 2015, making use of the babynames R package:

us_ranks <- babynames %>% name_bootstraps()
us_ranks %>% ucr_table()
sex name UCR % of sex % of all
F Emma 1 1.150% 2.92%
F Olivia 2 1.105% 2.80%
F Sophia 3 0.979% 2.48%
F Ava 4 0.920% 2.33%
F Isabella 5 0.876% 2.22%
F Mia 6 0.838% 2.12%
F Abigail 7 0.696% 1.76%
F Emily 8 0.663% 1.68%
F Charlotte 9 0.640% 1.62%
F Harper 10 0.579% 1.47%
F Madison 10 0.567% 1.44%
F Amelia 11 0.554% 1.40%
F Elizabeth 12 0.546% 1.38%
F Sofia 12 0.545% 1.38%
M Noah 1 1.028% 2.80%
M Liam 2 0.963% 2.62%
M Mason 3 0.871% 2.37%
M Jacob 4 0.833% 2.27%
M William 4 0.833% 2.27%
M Ethan 6 0.789% 2.15%
M James 6 0.774% 2.11%
M Alexander 7 0.762% 2.07%
M Michael 8 0.754% 2.05%
M Benjamin 10 0.717% 1.95%
M Elijah 10 0.712% 1.94%
M Daniel 10 0.706% 1.92%
M Aiden 10 0.705% 1.92%
M Logan 14 0.677% 1.84%
M Matthew 14 0.666% 1.81%
Höhle has a nice visualization of rank spread too. As a variation, we'll display the bootstrapped rank sampling distributions directly in a heatmap, overlaying the confidence intervals.[^3] Höhle does plots for UK names. Here are plots for US baby names from 2015:
rank_plot <- function (x, s) {
    x <- x %>% filter(sex==s) %>% 
        mutate(pop_rank=map_int(rank_samples, 1),
               name=reorder(name, pop_rank))

    csets <- x %>%
        select(name, rank_samples) %>% 
        unnest(rank_samples) %>%
        group_by(name) %>%
        count(rank_samples)

    cis <- x %>%
        select(name, pop_rank, ci) %>%
        # integers, you know?
        mutate(lb=map_int(ci, 1) - 0.5, ub=map_int(ci, 2) + 0.5)

    ggplot(csets) + 
        geom_raster(aes(name, rank_samples, fill=n, alpha=n)) +
        scale_fill_gradient(low="white", high="black", guide="none") +
        # alpha scaling because my website doesn't have a white background
        # and I'm fussy
        scale_alpha_continuous(range=c(0, 1), guide="none") +
        geom_errorbar(data=cis, aes(name, ymin=lb, ymax=ub), color="blue") +
        geom_point(data=cis, aes(name, pop_rank), color="blue") +
        theme(axis.text.x=element_text(angle=45, hjust=1),
              panel.grid=element_blank()) +
        xlab(NULL) + ylab("rank")
}

rank_plot(us_ranks, "F")
2015 US female baby name ranks, with uncertainties

2015 US female baby name ranks, with uncertainties

rank_plot(us_ranks, "M")
2015 US male baby name ranks, with uncertainties

2015 US male baby name ranks, with uncertainties

As always when I try to learn something with statistics, what I learn is that I know less than I thought I knew. Use my code, and tell your friends!

Edited, 6/23/17: Set the random seed (reproducibility! the UCR of “Grace” in the UK actually did fluctuate). To get exactly the graphics here you’d also need the custom R markdown format I created to work with Hugo: hugormd. Also replaced an abuse of the term “confidence sets”; I have no confidence when talking about confidence.


  1. I tried to come up with a fully idiomatic “tidyverse” way of doing this in dplyr and modelr that would respect the grouping seamlessly but I couldn’t. Subsetting the data frame x by a variable name var is also not idiomatic, but I am too lazy to figure out “tidy evaluation” right now. And no doubt this function could be improved in terms of efficiency too. There is an implicit copying of x at every simulation run when we modify the var column—but I think recent versions of R don’t actually copy over the unchanged columns. ↩︎

  2. If but not only if: for that we need the lower bounds too, visualized further on. ↩︎