Which names that are also names of countries are most common?

Sitting around the living room over the weekend, someone asked “Which names that are also names of countries are most common?” We all gave guesses, and then I gave in to the urge to reach for my laptop and generate an answer that was authoritative enough for our purposes.

The data I can easily access answers a question more like: which names that are also English names of countries were most common in the USA from 1880 to 2017? Before you read further, try to guess!

Setup

We’ll use two CRAN packages to help answer this question:

babynames, which contains a data frame of the same name with counts of babies by name and year in the US.
countrycode, which contains a data frame that includes, among other things, (English) names of countries.

install.packages(c("countrycode", "babynames"))

Also, of course:

install.packages(c("tidyverse"))

Loading those up:

library(countrycode)
library(babynames)
library(tidyverse)

Here’s what the data on baby names looks like:

babynames

# A tibble: 1,924,665 × 5
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1880 F     Mary       7065 0.0724
 2  1880 F     Anna       2604 0.0267
 3  1880 F     Emma       2003 0.0205
 4  1880 F     Elizabeth  1939 0.0199
 5  1880 F     Minnie     1746 0.0179
 6  1880 F     Margaret   1578 0.0162
 7  1880 F     Ida        1472 0.0151
 8  1880 F     Alice      1414 0.0145
 9  1880 F     Bertha     1320 0.0135
10  1880 F     Sarah      1288 0.0132
# ℹ 1,924,655 more rows

And the country data:

codelist

# A tibble: 291 × 624
   ar5      cctld continent  country.name.de     country.name.de.regex
   <chr>    <chr> <chr>      <chr>               <chr>                
 1 ASIA     .af   Asia       Afghanistan         afghan               
 2 EIT      .al   Europe     Albanien            albanien             
 3 MAF      .dz   Africa     Algerien            algerien             
 4 ASIA     .as   Oceania    Amerikanisch-Samoa  ^(?=.*amerik).*samoa 
 5 OECD1990 .ad   Europe     Andorra             andorra              
 6 MAF      .ao   Africa     Angola              angola               
 7 LAM      .ai   Americas   Anguilla            anguill?a            
 8 LAM      .aq   Antarctica Antarktis           ^(?!.*franz).*antark…
 9 LAM      .ag   Americas   Antigua und Barbuda antigua              
10 LAM      .ar   Americas   Argentinien         argentin             
# ℹ 281 more rows
# ℹ 619 more variables: country.name.en <chr>,
#   country.name.en.regex <chr>, country.name.fr <chr>,
#   country.name.fr.regex <chr>, country.name.it <chr>,
#   country.name.it.regex <chr>, cow.name <chr>, cowc <chr>,
#   cown <dbl>, currency <chr>, dhs <chr>, ecb <chr>, eu28 <chr>,
#   eurocontrol_pru <chr>, eurocontrol_statfor <chr>, …

country.name.en is our ticket.

The reveal

We can use some of the core verbs from dplyr to answer our question:

top_names <-
  babynames %>%
  # only keep names that are also English country names.
  filter(name %in% codelist$country.name.en) %>%
  # then, for each name...
  group_by(name) %>%
  # take the sum of the counts across years, and...
  summarize(n = sum(n)) %>%
  # show the highest counts at the top.
  arrange(desc(n))

top_names

# A tibble: 64 × 2
   name          n
   <chr>     <int>
 1 Jordan   499903
 2 Chad     240662
 3 Georgia  151662
 4 Israel    60698
 5 Kenya     26199
 6 India     22154
 7 Unknown   18723
 8 Malaysia   7444
 9 Ireland    5350
10 China      4523
# ℹ 54 more rows

Was your guess in the top three?¹
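If the pipeline above feels opaque, here is a minimal sketch of the same four verbs on a tiny made-up dataset. The objects toy_babies, toy_countries, and toy_top are hypothetical stand-ins (not part of babynames or countrycode), and the counts are invented for illustration. It also shows that %in% matches whole strings exactly, so a multi-word country name like "United States" simply never matches a baby name.

```r
library(dplyr)

# Hypothetical toy data in the same shape as babynames (minus prop).
toy_babies <- tibble(
  year = c(2000, 2000, 2001, 2001, 2001, 2001),
  sex  = c("F", "M", "F", "M", "F", "M"),
  name = c("Georgia", "Chad", "Georgia", "Chad", "Mary", "Jordan"),
  n    = c(10L, 20L, 15L, 8L, 100L, 40L)
)

# Hypothetical stand-in for codelist$country.name.en.
toy_countries <- c("Georgia", "Chad", "Jordan", "United States")

toy_top <-
  toy_babies %>%
  # exact, case-sensitive match: "Mary" is dropped here
  filter(name %in% toy_countries) %>%
  # one group per remaining name
  group_by(name) %>%
  # add up the counts across years and sexes
  summarize(n = sum(n)) %>%
  # largest totals first
  arrange(desc(n))

toy_top
```

In this toy version, Mary has by far the most babies but drops out at the filter step, which is the whole point of the first verb.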

Bonus points

Now that we’re here, of course, we have to get up to a little tomfoolery. How are those top names trending?

babynames %>%
  # only keep names that are in the top 5 from the previous result
  filter(name %in% top_names$name[1:5]) %>%
  # take a weighted average of the proportions across sex
  group_by(year, name) %>%
  summarize(prop = (sum(n) / sum(n/prop))) %>%
  mutate(pct = prop * 100) %>%
  # plot that feller
  ggplot() +
  aes(x = year, y = pct, color = name) +
  geom_line() +
  labs(x = "Year", y = "Percent", color = "Name") +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
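  # place the legend inside the panel (this spelling needs ggplot2 >= 3.5.0;
  # older versions use legend.position = c(0.15, 0.75) instead)
  theme(legend.position = "inside", legend.position.inside = c(0.15, 0.75))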

Percentage of US babies given each of the top five names that are also English names of countries. Georgia was the most popular by far (around 0.2%) until the 1910s or so. Chad surged around 1975, peaking at about 0.4% of babies, and Jordan surged around 1990, peaking at about 0.5%, but both surges were short-lived.
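The sum(n) / sum(n / prop) step in the code above may deserve a word. As I understand the babynames documentation, prop is a name's share within its sex for that year, so n / prop recovers the total number of babies of that sex, and dividing summed counts by summed totals gives the pooled share across both sexes. Here is a rough sketch with invented numbers; toy_year and pooled are hypothetical names, and the column meaning is my assumption rather than something checked here.

```r
library(dplyr)

# Hypothetical single year: 30 of 1,000 girls and 10 of 2,000 boys
# share a name (numbers invented for illustration).
toy_year <- tibble(
  year = c(1990, 1990),
  sex  = c("F", "M"),
  name = c("Jordan", "Jordan"),
  n    = c(30, 10),
  prop = c(0.03, 0.005)
)

pooled <-
  toy_year %>%
  group_by(year, name) %>%
  summarize(
    babies = sum(n),         # babies with this name, both sexes
    births = sum(n / prop),  # implied total births, both sexes
    prop   = babies / births,
    .groups = "drop"
  )

pooled
```

With these numbers the pooled share is 40 out of 3,000 babies, or about 1.3%, whereas simply averaging the two proportions would give 1.75% and overstate the smaller group's influence.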

I initially found it somewhat surprising that there was so much variation in the proportion of Georgias up until 1920, and then suddenly not. The only way I could explain that sudden shift would be a huge overall increase in the number of people right around 1920 (more people, less noisy proportion), but that didn’t seem reasonable to me.

Turns out, that definitely happened:

babynames %>% 
  group_by(year) %>% 
  summarize(n = sum(n)) %>% 
  ggplot() + 
  aes(x = year, y = n) + 
  geom_line() +
  labs(x = "Year", y = "Number of Babies") +
  scale_y_continuous(labels = scales::comma)

A line plot showing the number of babies per year in the US from 1880 to 2017. The key point here is that the number of babies per year skyrocketed from about 500,000 to about 2,500,000 between 1910 and 1920.

Note that this isn’t necessarily the number of babies born per year in the U.S., as the data includes only those names/sexes/years corresponding to at least 5 babies.
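To put a rough number on the "more people, less noisy proportion" hunch, here is a back-of-the-envelope sketch. It assumes a simple binomial model, a name given to 0.2% of babies, and the approximate yearly totals read off the plot; p, births, se, and noise_ratio are illustrative names and values, not anything estimated from the data.

```r
# Assumed, illustrative values: a name with a 0.2% share, and the
# approximate yearly totals before and after the jump.
p <- 0.002
births <- c(before = 500000, after = 2500000)

# Standard error of an observed proportion under a binomial model.
se <- sqrt(p * (1 - p) / births)

# How many times noisier the earlier years would be.
noise_ratio <- se[["before"]] / se[["after"]]
noise_ratio
```

Under those assumptions, a fivefold jump in babies shrinks the noise by a factor of sqrt(5), a bit more than 2, whatever the name's share is. Sampling noise is surely not the whole story, but it points in the right direction.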

I know very little about the history of the country that I live in. Alas, happy holidays, y’all.🎄

Footnotes

¹ I added this footnote so that you’d end up near the comments section. Let me know.

Reuse

CC BY-SA 4.0