Sitting around the living room over the weekend, someone asked “Which names that are also names of countries are most common?” We all gave guesses, and then I gave in to the urge to reach for my laptop and generate an answer that was authoritative enough for our purposes.
The data I can easily access answers a question more like: which names that are also English names of countries were most common in the USA from 1880 to 2017? Before you read further, try to guess!
Setup
We’ll use two CRAN packages to help answer this question:
- babynames, which contains a data frame of the same name holding counts of babies by name by year in the US.
- countrycode, which contains a data frame, codelist, that includes, among other things, (English) names of countries.
install.packages(c("countrycode", "babynames"))
Also, of course:
install.packages("tidyverse")
Loading those up:
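The loading step presumably looks something like this, assuming all three packages installed cleanly:

```r
# Attach the tidyverse (dplyr, ggplot2, and friends) plus the two data packages.
library(tidyverse)
library(babynames)
library(countrycode)
```

After this, the babynames and codelist data frames are available by name.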
Here’s what the data on baby names looks like:
babynames
# A tibble: 1,924,665 × 5
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1880 F     Mary       7065 0.0724
 2  1880 F     Anna       2604 0.0267
 3  1880 F     Emma       2003 0.0205
 4  1880 F     Elizabeth  1939 0.0199
 5  1880 F     Minnie     1746 0.0179
 6  1880 F     Margaret   1578 0.0162
 7  1880 F     Ida        1472 0.0151
 8  1880 F     Alice      1414 0.0145
 9  1880 F     Bertha     1320 0.0135
10  1880 F     Sarah      1288 0.0132
# ℹ 1,924,655 more rows
And the country data:
codelist
# A tibble: 291 × 624
   ar5      cctld continent  country.name.de     country.name.de.regex
   <chr>    <chr> <chr>      <chr>               <chr>
 1 ASIA     .af   Asia       Afghanistan         afghan
 2 EIT      .al   Europe     Albanien            albanien
 3 MAF      .dz   Africa     Algerien            algerien
 4 ASIA     .as   Oceania    Amerikanisch-Samoa  ^(?=.*amerik).*samoa
 5 OECD1990 .ad   Europe     Andorra             andorra
 6 MAF      .ao   Africa     Angola              angola
 7 LAM      .ai   Americas   Anguilla            anguill?a
 8 LAM      .aq   Antarctica Antarktis           ^(?!.*franz).*antark…
 9 LAM      .ag   Americas   Antigua und Barbuda antigua
10 LAM      .ar   Americas   Argentinien         argentin
# ℹ 281 more rows
# ℹ 619 more variables: country.name.en <chr>,
# country.name.en.regex <chr>, country.name.fr <chr>,
# country.name.fr.regex <chr>, country.name.it <chr>,
# country.name.it.regex <chr>, cow.name <chr>, cowc <chr>,
# cown <dbl>, currency <chr>, dhs <chr>, ecb <chr>, eu28 <chr>,
# eurocontrol_pru <chr>, eurocontrol_statfor <chr>, …
country.name.en is our ticket.
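A quick way to eyeball that column before leaning on it (just a sketch; english_names is a name I made up for illustration):

```r
library(countrycode)

# codelist ships with countrycode; country.name.en holds the English names.
english_names <- codelist$country.name.en

# Peek at the first few entries.
head(english_names)

# Check a few candidate baby names against the list (exact string match).
c("Jordan", "Georgia", "Mary") %in% english_names
```

Because the match is on exact strings, only names spelled just like the English country name will count.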
The reveal
We can use some of the core verbs from dplyr to answer our question:
top_names <-
  babynames %>%
  # only keep names that are also English country names.
  filter(name %in% codelist$country.name.en) %>%
  # then, for each name...
  group_by(name) %>%
  # take the sum of the counts across years, and...
  summarize(n = sum(n)) %>%
  # show the highest counts at the top.
  arrange(desc(n))
top_names
# A tibble: 64 × 2
   name          n
   <chr>     <int>
 1 Jordan   499903
 2 Chad     240662
 3 Georgia  151662
 4 Israel    60698
 5 Kenya     26199
 6 India     22154
 7 Unknown   18723
 8 Malaysia   7444
 9 Ireland    5350
10 China      4523
# ℹ 54 more rows
Was your guess in the top three?[1]
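If the chain of verbs is hard to follow on nearly two million rows, the same steps on a tiny hand-made table may help. The toy_babies and toy_countries objects below are invented for illustration and are not part of either package:

```r
library(dplyr)

# Hypothetical counts: two years of "Jordan", one each of "Chad" and "Mary".
toy_babies <- tibble::tribble(
  ~year, ~name,    ~n,
  2000,  "Jordan", 10,
  2001,  "Jordan", 15,
  2000,  "Chad",    7,
  2001,  "Mary",  100
)

# Hypothetical stand-in for codelist$country.name.en.
toy_countries <- c("Jordan", "Chad", "Peru")

toy_top <-
  toy_babies %>%
  # drop rows whose name is not a country name
  filter(name %in% toy_countries) %>%
  # collapse the per-year rows into one total per name
  group_by(name) %>%
  summarize(n = sum(n)) %>%
  # biggest totals first
  arrange(desc(n))

toy_top
```

That leaves Jordan with 25 and Chad with 7. Mary is filtered out, and Peru never shows up because no rows carry that name.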
Bonus points
Now that we’re here, of course, we have to get up to a little tomfoolery. How are those top names trending?
babynames %>%
  # only keep names that are in the top 5 from the previous result
  filter(name %in% top_names$name[1:5]) %>%
  # take a weighted average of the proportions across sex
  group_by(year, name) %>%
  summarize(prop = sum(n) / sum(n / prop), .groups = "drop") %>%
  mutate(pct = prop * 100) %>%
  # plot that feller
  ggplot() +
  aes(x = year, y = pct, color = name) +
  geom_line() +
  labs(x = "Year", y = "Percent", color = "Name") +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  # note: ggplot2 >= 3.5.0 prefers legend.position = "inside" plus
  # legend.position.inside = c(0.15, 0.75) over a numeric legend.position
  theme(legend.position = c(0.15, 0.75))
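The sum(n) / sum(n / prop) step is doing more work than it looks like. Since prop is n divided by the total number of babies of that sex in that year, n / prop backs out that total, so the ratio is the share of all babies (both sexes) with the name. Here is a small sketch with made-up numbers; the toy tibble and its values are hypothetical, not drawn from babynames:

```r
library(dplyr)

# Hypothetical counts for one name in one year, split by sex.
toy <- tibble::tibble(
  year = c(2000, 2000),
  sex  = c("F", "M"),
  name = c("Jordan", "Jordan"),
  n    = c(10, 30),
  prop = c(0.01, 0.02)  # share within each sex
)

combined <-
  toy %>%
  group_by(year, name) %>%
  summarize(
    # n / prop recovers the total babies of each sex (1000 and 1500 here)
    prop_weighted = sum(n) / sum(n / prop),
    # for contrast: the unweighted mean of the two proportions
    prop_simple = mean(prop),
    .groups = "drop"
  )

combined
```

The weighted value works out to 40 / 2500 = 0.016, a bit above the simple mean of 0.015, because the larger group carries more weight.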
I initially found it somewhat surprising that the proportion of Georgias varied so much up until 1920, and then suddenly didn’t. The only way I could explain that sudden shift would be a huge increase in the number of people generally right around 1920 (more people, less noisy proportion), but that didn’t seem reasonable to me.
Turns out, that definitely happened:
babynames %>%
  group_by(year) %>%
  summarize(n = sum(n)) %>%
  ggplot() +
  aes(x = year, y = n) +
  geom_line() +
  labs(x = "Year", y = "Number of Babies") +
  scale_y_continuous(labels = scales::comma)
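The “more people, less noisy proportion” hunch can be poked at with a quick simulation. This is only a sketch under a simple binomial assumption, and the true share and cohort sizes are numbers I picked out of thin air:

```r
set.seed(1)

# Hypothetical true share of babies given some name.
true_prop <- 0.001

# Two made-up cohort sizes: a small one and one ten times larger.
small_cohort <- 2e5
large_cohort <- 2e6

# Simulate 200 "years" at each size and convert counts to proportions.
prop_small <- rbinom(200, size = small_cohort, prob = true_prop) / small_cohort
prop_large <- rbinom(200, size = large_cohort, prob = true_prop) / large_cohort

# The spread of the observed proportion shrinks as the cohort grows.
sd(prop_small)
sd(prop_large)
```

Under this model the standard deviation scales with one over the square root of the cohort size, so ten times as many babies should give a line roughly a third as wiggly.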
I know very little about the history of the country that I live in. Alas, happy holidays, y’all.🎄
Footnotes
[1] I added this footnote so that you’d end up near the comments section. Let me know.