Analyzing my own music listening data with R and the tidyverse (2023)

Aside from exchanging playlists with my partner every once in a while, I’m not much of a Spotify user. Around this time every year, though, all of my friends start posting their Spotify Wrapped, and I get jealous, as the platform that I listen to music on doesn’t have anything like it. It does collect data about me, of course (it’s 2023!), so last year I got to wondering whether I could make a lo-fi knockoff of Wrapped using R, the tidyverse, and the data that I have access to. You already know:

library(tidyverse)

If you’re an R user and a listener of local files on the Mac Music app, this post is for you.🎁

Importing the data

In the Mac Music app, navigate to:

Music app > File > Library > Export Library

…to export a .xml file. Last year, I griped about how much of a pain in the ass it was to tidy the resulting output. This year, we can all just install the package I wrote last year and forget about our woes:

pak::pak("simonpcouch/wrapped")

The wrapped package contains a function, wrap_library(), to tidy that .xml file into a tabular data frame.

library(wrapped)

wrapped <- wrap_library("data/Library.xml", 2022:2023)

wrapped

# A tibble: 12,545 × 8
      id track_title         artist album genre date_added skip_count play_count
   <int> <chr>               <chr>  <chr> <chr> <date>          <dbl>      <dbl>
 1 11729 Atom                Mediu… Heal… Indi… 2023-02-11         15        234
 2 11862 Reelin'             Matt … Ever… Indi… 2023-03-24         19        208
 3 11732 Gimme Back My Soul  Mediu… Heal… Indi… 2023-02-11         11        195
 4 12179 Swim                Noah … If T… Sing… 2023-07-21         11        191
 5 11733 Never Learned To D… Mediu… Heal… Indi… 2023-02-11         12        161
 6 12088 Get The Girl        Seafo… Get … Coun… 2023-06-16         26        159
 7 11855 Everything's Fine   Matt … Ever… Indi… 2023-03-24         12        153
 8 11656 Never Learned To D… Mediu… Neve… Indi… 2022-12-26          6        143
 9 12388 Desert Land         Matt … Dese… Indi… 2023-10-22         11        141
10 12097 Given               Justi… Dayd… R&B/… 2023-06-16         11        132
# ℹ 12,535 more rows
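For a sense of what "tidying" means here: the exported file is an Apple property list, where each track is a <dict> of alternating <key> and value elements. The following is a minimal sketch of the general idea using xml2, not how wrap_library() is actually implemented. The tiny XML string is made up and the helper track_to_row() is hypothetical.

```r
library(xml2)
library(dplyr)

# A made-up, heavily trimmed stand-in for an exported Library.xml.
toy_xml <- '<plist version="1.0">
<dict>
  <key>Tracks</key>
  <dict>
    <key>1</key>
    <dict>
      <key>Name</key><string>Song One</string>
      <key>Artist</key><string>Artist A</string>
      <key>Play Count</key><integer>12</integer>
    </dict>
    <key>2</key>
    <dict>
      <key>Name</key><string>Song Two</string>
      <key>Artist</key><string>Artist B</string>
    </dict>
  </dict>
</dict>
</plist>'

doc <- read_xml(toy_xml)

# Each track is a <dict> nested two levels below the top-level <dict>.
tracks <- xml_find_all(doc, "/plist/dict/dict/dict")

# Hypothetical helper: inside a track, every <key> is immediately
# followed by its value element, so pair them up into a one-row tibble.
track_to_row <- function(track) {
  keys   <- xml_text(xml_find_all(track, "./key"))
  values <- xml_text(xml_find_all(track, "./key/following-sibling::*[1]"))
  as_tibble(as.list(setNames(values, keys)))
}

# Tracks that lack a key (here, "Play Count") get NA in that column.
toy_library <- bind_rows(lapply(tracks, track_to_row))
toy_library
```

Every value comes back as text in this sketch, so a real version would still need to convert counts and dates to proper types.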

After that, Spotify Wrapped is just group_by() %>% summarize() %>% arrange() in a trench coat.🧥

For easier printing in this blog post, I’ll drop the id column and rearrange the rest so that play counts come first and the date added and skip count come last:

wrapped <- 
  wrapped %>%
  select(-id) %>%
  relocate(date_added, skip_count, .after = everything()) %>%
  relocate(play_count, .before = everything())

wrapped

# A tibble: 12,545 × 7
   play_count track_title            artist    album genre date_added skip_count
        <dbl> <chr>                  <chr>     <chr> <chr> <date>          <dbl>
 1        234 Atom                   Medium B… Heal… Indi… 2023-02-11         15
 2        208 Reelin'                Matt Cor… Ever… Indi… 2023-03-24         19
 3        195 Gimme Back My Soul     Medium B… Heal… Indi… 2023-02-11         11
 4        191 Swim                   Noah Gun… If T… Sing… 2023-07-21         11
 5        161 Never Learned To Dance Medium B… Heal… Indi… 2023-02-11         12
 6        159 Get The Girl           Seaforth  Get … Coun… 2023-06-16         26
 7        153 Everything's Fine      Matt Cor… Ever… Indi… 2023-03-24         12
 8        143 Never Learned To Dance Medium B… Neve… Indi… 2022-12-26          6
 9        141 Desert Land            Matt Cor… Dese… Indi… 2023-10-22         11
10        132 Given                  Justin N… Dayd… R&B/… 2023-06-16         11
# ℹ 12,535 more rows

Analyzing it

Top songs

The output is already arranged in descending order by play count, so we can just print the first few rows:

wrapped %>%
  select(track_title, artist, play_count) %>%
  head()

# A tibble: 6 × 3
  track_title            artist         play_count
  <chr>                  <chr>               <dbl>
1 Atom                   Medium Build          234
2 Reelin'                Matt Corby            208
3 Gimme Back My Soul     Medium Build          195
4 Swim                   Noah Gundersen        191
5 Never Learned To Dance Medium Build          161
6 Get The Girl           Seaforth              159

Medium! Build!
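Printing the first few rows only works because the data happens to be sorted already. If you'd rather not rely on that, slice_max() picks the top rows by a column regardless of the current order. A minimal sketch on made-up rows (the toy tibble is not from my library):

```r
library(dplyr)

# Made-up rows, deliberately out of order.
toy <- tibble(
  track_title = c("Song A", "Song B", "Song C", "Song D"),
  artist      = c("Artist 1", "Artist 2", "Artist 1", "Artist 3"),
  play_count  = c(3, 40, 12, 25)
)

top_songs <- toy %>%
  select(track_title, artist, play_count) %>%
  # Keep the 3 rows with the largest play_count, sorted high to low.
  slice_max(play_count, n = 3, with_ties = FALSE)

top_songs
```

Setting with_ties = FALSE caps the result at exactly n rows, even if several songs share a play count.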

Top artists

wrapped %>%
  group_by(artist) %>%
  summarize(play_count = sum(play_count, na.rm = TRUE)) %>%
  arrange(desc(play_count)) %>%
  head()

# A tibble: 6 × 2
  artist         play_count
  <chr>               <dbl>
1 Medium Build         1921
2 Matt Corby           1622
3 Justin Nozuka        1058
4 Noah Gundersen        907
5 Patrick Droney        569
6 Mac Ayres             546

group_by() %>% summarize()! I told you!

I will fly to Australia to see Matt Corby play live if I have to.
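If the trench coat feels too long, dplyr's count() with a wt argument should collapse the same three steps into one call. A small sketch comparing the two on made-up data (the toy tibble is an assumption for illustration):

```r
library(dplyr)

# Made-up rows, not from my library.
toy <- tibble(
  artist     = c("Artist A", "Artist B", "Artist A", "Artist C"),
  play_count = c(10, 4, 7, 2)
)

# The long form used throughout this post.
long_form <- toy %>%
  group_by(artist) %>%
  summarize(play_count = sum(play_count)) %>%
  arrange(desc(play_count))

# count() sums the wt column per group; sort = TRUE orders high to low.
short_form <- toy %>%
  count(artist, wt = play_count, sort = TRUE, name = "play_count")

short_form
```

I stick with the long form in this post since it makes each step explicit.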

Top genres

One of my first steps after buying a new record is to edit its metadata to fit into one of a few pre-defined genres. Many of these categorizations are sort of silly as a result, but it does make for a nice summary:

wrapped %>%
  group_by(genre) %>%
  summarize(play_count = sum(play_count, na.rm = TRUE)) %>%
  arrange(desc(play_count)) %>%
  head(5)

# A tibble: 5 × 2
  genre                  play_count
  <chr>                       <dbl>
1 Indie/Alternative            5337
2 Singer-Songwriter/Folk       3937
3 R&B/Soul                     2855
4 Country                      2258
5 Indie Pop                     971

Sort of confused by the existence of the “Indie Pop” category.😕 Definitely need to clean up some of those entries.

Tip

You can selectively use the n argument to head() to hide things that you’re embarrassed about.

Top albums

wrapped %>%
  group_by(album, artist) %>%
  summarize(play_count = sum(play_count, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(play_count)) %>%
  head()

# A tibble: 6 × 3
  album                        artist         play_count
  <chr>                        <chr>               <dbl>
1 Everything's Fine            Matt Corby           1217
2 Health - EP                  Medium Build          971
3 Never Learned To Dance       Medium Build          819
4 Daydreams and Endless Nights Justin Nozuka         736
5 If This Is The End           Noah Gundersen        598
6 Comfortable Enough           Mac Ayres             428
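Grouping by both album and artist, rather than album alone, matters whenever two artists have albums with the same title: grouping by album alone would pool their plays. A minimal sketch with made-up albums shows the difference:

```r
library(dplyr)

# Made-up rows: two different artists each have a "Greatest Hits".
toy <- tibble(
  album      = c("Greatest Hits", "Greatest Hits", "Greatest Hits", "Debut"),
  artist     = c("Artist A", "Artist A", "Artist B", "Artist B"),
  play_count = c(5, 6, 20, 3)
)

# Album alone: both "Greatest Hits" records are merged into one row.
by_album_only <- toy %>%
  group_by(album) %>%
  summarize(play_count = sum(play_count, na.rm = TRUE))

# Album and artist: each record keeps its own total.
by_album_and_artist <- toy %>%
  group_by(album, artist) %>%
  summarize(play_count = sum(play_count, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(play_count))

by_album_and_artist
```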

Bonus points

There are a couple of summarizations that Wrapped doesn’t do that I’m curious about.

Top song by month

I don’t have the right level of observation to see which songs I listened to the most every month, but I do have a variable giving the date I added a given song. We can use that information to find the top songs by month added:

wrapped %>%
  mutate(month = month(date_added)) %>%
  group_by(month) %>%
  summarize(
    track_title = track_title[which.max(play_count)], 
    artist = artist[which.max(play_count)]
  ) %>%
  head(11)

# A tibble: 11 × 3
   month track_title            artist           
   <dbl> <chr>                  <chr>            
 1     1 Sad Song               Brandon Ratcliff 
 2     2 Atom                   Medium Build     
 3     3 Reelin'                Matt Corby       
 4     4 Be Yourself            Wilder Woods     
 5     5 tennessee is mine      Alana Springsteen
 6     6 Get The Girl           Seaforth         
 7     7 Swim                   Noah Gundersen   
 8     8 You Take The High Road Bruno Major      
 9     9 Better Days            Noah Gundersen   
10    10 Desert Land            Matt Corby       
11    11 PANIC ATTACK           Clinton Kane
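One caveat: month() keeps only the month number, and date_added in my data reaches back to at least December 2022, so songs added in the same calendar month of different years land in the same group. A variant that keys on year and month instead, sketched on made-up rows (the toy tibble and the month_added column name are assumptions for illustration):

```r
library(dplyr)

# Made-up rows spanning two Decembers.
toy <- tibble(
  track_title = c("Song A", "Song B", "Song C", "Song D"),
  artist      = c("Artist 1", "Artist 2", "Artist 3", "Artist 1"),
  date_added  = as.Date(c("2022-12-26", "2023-12-01", "2023-02-11", "2023-02-20")),
  play_count  = c(143, 20, 234, 195)
)

top_by_month_added <- toy %>%
  # "2022-12" and "2023-12" stay separate, unlike month(date_added).
  mutate(month_added = format(date_added, "%Y-%m")) %>%
  group_by(month_added) %>%
  # Keep the single most-played song within each year-month.
  slice_max(play_count, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(month_added, track_title, artist, play_count)

top_by_month_added
```

Using slice_max() here also keeps the whole winning row, so there's no need to repeat which.max() for every column.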

Top artist by genre

wrapped %>%
  group_by(genre, artist) %>%
  summarize(play_count = sum(play_count, na.rm = TRUE), .groups = "drop") %>%
  group_by(genre) %>%
  summarize(
    artist = artist[which.max(play_count)],
    play_count = play_count[which.max(play_count)]
  ) %>%
  arrange(desc(play_count)) %>%
  head()

# A tibble: 6 × 3
  genre                  artist            play_count
  <chr>                  <chr>                  <dbl>
1 Indie/Alternative      Matt Corby              1459
2 R&B/Soul               Justin Nozuka           1005
3 Indie Pop              Medium Build             971
4 Singer-Songwriter/Folk Noah Gundersen           598
5 Country                Alana Springsteen        538
6 Bluegrass              Mighty Poplar            369

Moved on

How many albums in my library did I not listen to at all this year? (I reset the play count for all of my library to zero each time I do this analysis.)

wrapped %>%
  group_by(album, artist) %>%
  summarize(play_count = sum(play_count, na.rm = TRUE), .groups = "drop") %>%
  filter(play_count == 0) %>%
  count()

# A tibble: 1 × 1
      n
  <int>
1  1195

That number is a lot bigger than I thought.😬
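A raw count is easier to read next to the total number of albums. Here's a minimal sketch, on made-up data, that reports both plus the share of unplayed albums (the toy tibble and the output column names are assumptions):

```r
library(dplyr)

# Made-up rows; NA stands in for a track with no recorded play count.
toy <- tibble(
  album      = c("Album A", "Album A", "Album B", "Album C", "Album D"),
  artist     = c("Artist 1", "Artist 1", "Artist 2", "Artist 3", "Artist 3"),
  play_count = c(0, 12, 0, NA, 0)
)

unplayed <- toy %>%
  group_by(album, artist) %>%
  summarize(play_count = sum(play_count, na.rm = TRUE), .groups = "drop") %>%
  summarize(
    n_albums   = n(),                    # albums in the library
    n_unplayed = sum(play_count == 0),   # albums with zero total plays
    share      = mean(play_count == 0)   # proportion never played
  )

unplayed
```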

Reuse

CC BY-SA 4.0