Analyzing my music listening data with Databot

It’s Spotify Wrapped season.

Published

December 3, 2025

It’s Spotify Wrapped season, which means that everyone I follow on instagram is posting screenshots on their stories and I need to export my iTunes Library metadata as an .xml file and analyze it with the tidyverse. (If you’re new here, I do a little group_by() %>% summarize() on my own music listening data each year.)

This year, though, I wondered how well Databot would be able to wrangle the .xml file that iTunes can export into tidy data. Databot is an AI agent in Positron that dramatically accelerates exploratory data analysis; you can see both the code it writes and its output, and it regularly stops working to check in on where you’d like to head next.

After firing up Databot, I pointed it at the raw listening data and asked it to figure it out:

A screenshot of the beginning of a conversation with databot, where I say 'Read blog/2025-12-03-wrapped-databot/Library.xml and turn it into a tibble' and the agent begins working.

After reading a snippet of the file, it realized what it was looking at and drafted some code to load it into a tibble. A minute or so later, I was looking at my bootleg Spotify Wrapped analysis. My top songs:

library_tibble |>
  filter(!is.na(`Play Count`)) |>
  arrange(desc(`Play Count`)) |>
  select(Name, Artist, `Play Count`)
#> # A tibble: 3,380 × 3
#>    Name                       Artist           `Play Count`
#>    <chr>                      <chr>                   <dbl>
#>  1 WE COULD BE                MARO                      243
#>  2 when it gets dark          Humble the Great          223
#>  3 something that we can't..  Humble the Great          204
#>  4 I SEE IT COMING            MARO                      196
#>  5 Rituals                    Watchhouse                177
#>  6 LIFELINE                   MARO                      173
#>  7 find your own              Humble the Great          171
#>  8 Amy                        Yana                      165
#>  9 KID AGAIN                  Jon Bellion               163
#> 10 september lightning        Humble the Great          162

Most listened to artists:

library_tibble |>
  filter(!is.na(`Play Count`)) |>
  group_by(Artist) |>
  summarise(
    total_plays = sum(`Play Count`),
    num_tracks = n(),
    .groups = "drop"
  ) |>
  arrange(desc(total_plays)) |>
  head(10)
#> # A tibble: 10 × 3
#>    Artist           total_plays num_tracks
#>    <chr>                  <dbl>      <int>
#>  1 Noah Gundersen          2428        107
#>  2 MARO                    1877         39
#>  3 Humble the Great        1671         22
#>  4 half•alive              1449         22
#>  5 Yana                    1407         13
#>  6 Medium Build            1406         42
#>  7 Watchhouse              1333         32
#>  8 Jon Bellion             1233         48
#>  9 Kelsea Ballerini        1186         32
#> 10 Valley                  1161         64

Most listened to albums:

library_tibble |>
  filter(!is.na(`Play Count`)) |>
  group_by(Album, Artist) |>
  summarise(
    total_plays = sum(`Play Count`),
    num_tracks = n(),
    .groups = "drop"
  ) |>
  arrange(desc(total_plays)) |>
  head(10)
#> # A tibble: 10 × 4
#>    Album                 Artist    total_plays num_tracks
#>    <chr>                 <chr>           <dbl>      <int>
#>  1 i don't mind floating Humble t…        1633         10
#>  2 Rituals               Watchhou…        1228         13
#>  3 PATTERNS (DELUXE)     Kelsea B…        1119         20
#>  4 FATHER FIGURE         Jon Bell…        1111         14
#>  5 Persona               half•ali…         946         11
#>  6 Country               Medium B…         812         12
#>  7 Outline               Sam Lynch         777          9
#>  8 Lifeline (ft. Nasaya) MARO              665         10
#>  9 Exit Strategy         Yana              625          6
#> 10 ft. NASAYA            MARO              612          3

If you’re interested in reading the full transcript (there’s not much more to it), Databot supports exporting the conversation as history, so I’ve included it here:

Databot conversation: convert Apple Music Library.xml and analyze most-played songs

User

Read blog/2025-12-03-wrapped-databot/Library.xml and turn it into a tibble

Databot

I’ll read the XML file and convert it to a tibble for you.

I can see this is an Apple Music/iTunes Library XML file with track information. The file is quite large (over 510,000 lines!), but I can see the structure - it’s a plist format with tracks stored as nested dictionaries. I notice you already have xml_data loaded in your session. Let me convert this XML file to a tibble:

library(tidyverse)
library(xml2)

# Read the XML file
xml_data <- read_xml("blog/2025-12-03-wrapped-databot/Library.xml")

# Find all track dictionaries
tracks <- xml_find_all(xml_data, "//dict/key[text()='Tracks']/following-sibling::dict[1]/dict")

# Function to extract track information from a dict node
extract_track_info <- function(track) {
  keys <- xml_find_all(track, "./key")
  values <- xml_find_all(track, "./key/following-sibling::*[1]")
  
  # Get key names and values
  key_names <- xml_text(keys)
  key_values <- map2(values, xml_name(values), function(node, type) {
    if (type == "integer") return(as.numeric(xml_text(node)))
    if (type == "true") return(TRUE)
    if (type == "false") return(FALSE)
    return(xml_text(node))
  })
  
  # Create named list
  setNames(key_values, key_names)
}

# Extract all tracks and convert to tibble
library_tibble <- map(tracks, extract_track_info) |>
  map(as_tibble) |>
  bind_rows()

library_tibble
# A tibble: 14,876 × 47
   `Track ID` Name        Artist Album Genre Kind    Size
        <dbl> <chr>       <chr>  <chr> <chr> <chr>  <dbl>
 1       2698 Fixin' to … Bob D… Bob … Sing… MPEG… 3.56e6
 2       2700 In My Time… Bob D… Bob … Sing… MPEG… 3.98e6
 3       2702 House of t… Bob D… Bob … Sing… MPEG… 7.80e6
 4       2704 Pretty Peg… Bob D… Bob … Sing… MPEG… 5.03e6
 5       2706 Song to Wo… Bob D… Bob … Sing… MPEG… 4.08e6
 6       2708 Gospel Plow Bob D… Bob … Sing… MPEG… 2.75e6
 7       2710 Desolation… Bob D… High… Sing… MPEG… 2.73e7
 8       2712 Ballad Of … Bob D… High… Sing… MPEG… 1.43e7
 9       2714 Tombstone … Bob D… High… Sing… MPEG… 1.43e7
10       2716 Girl From … Bob D… The … Sing… MPEG… 8.14e6
# ℹ 14,866 more rows
# ℹ 40 more variables: `Total Time` <dbl>,
#   `Track Number` <dbl>, `Date Modified` <chr>,
#   `Date Added` <chr>, `Bit Rate` <dbl>,
#   `Sample Rate` <dbl>, Normalization <dbl>,
#   `Artwork Count` <dbl>, `Persistent ID` <chr>,
#   `Track Type` <chr>, Location <chr>, …
# ℹ Use `print(n = ...)` to see more rows

Perfect! I’ve successfully converted the Library.xml file into a tibble. The resulting tibble contains 14,876 tracks with 47 columns of information including:

  • Track ID, Name, Artist, Album, Genre
  • Size (file size in bytes)
  • Total Time (duration in milliseconds)
  • Track Number, Date Modified, Date Added
  • Bit Rate, Sample Rate
  • Location (file path)
  • And many more metadata fields

The tibble is now stored in the library_tibble variable in your R session. You can see it starts with Bob Dylan tracks and contains various audio file metadata from your Apple Music library.

User

Is there a variable for listen count? What song did I listen to most?

Databot

I’ll check if there’s a listen count variable in the tibble and find your most-listened-to song.

# Look at all column names
colnames(library_tibble)
 [1] "Track ID"             "Name"                
 [3] "Artist"               "Album"               
 [5] "Genre"                "Kind"                
 [7] "Size"                 "Total Time"          
 [9] "Track Number"         "Date Modified"       
[11] "Date Added"           "Bit Rate"            
[13] "Sample Rate"          "Normalization"       
[15] "Artwork Count"        "Persistent ID"       
[17] "Track Type"           "Location"            
[19] "File Folder Count"    "Library Folder Count"
 [ reached 'max' / getOption("max.print") -- omitted 27 entries ]
# Print all column names without truncation
print(colnames(library_tibble), max = 100)
 [1] "Track ID"              "Name"                 
 [3] "Artist"                "Album"                
 [5] "Genre"                 "Kind"                 
 [7] "Size"                  "Total Time"           
 [9] "Track Number"          "Date Modified"        
[11] "Date Added"            "Bit Rate"             
[13] "Sample Rate"           "Normalization"        
[15] "Artwork Count"         "Persistent ID"        
[17] "Track Type"            "Location"             
[19] "File Folder Count"     "Library Folder Count" 
[21] "Album Artist"          "Composer"             
[23] "Disc Number"           "Disc Count"           
[25] "Year"                  "Sort Album"           
[27] "Sort Name"             "Track Count"          
[29] "Play Count"            "Play Date"            
[31] "Play Date UTC"         "Sort Artist"          
[33] "Skip Count"            "Skip Date"            
[35] "Volume Adjustment"     "Sort Album Artist"    
[37] "Release Date"          "Purchased"            
[39] "Explicit"              "Disliked"             
[41] "Compilation"           "Part Of Gapless Album"
[43] "File Type"             "Work"                 
[45] "Sort Composer"         "Comments"             
[47] "Start Time"           

Great! There is a “Play Count” variable. Let me find your most-listened-to song:

# Find the song(s) with the most plays
library_tibble |>
  filter(!is.na(`Play Count`)) |>
  arrange(desc(`Play Count`)) |>
  select(Name, Artist, Album, `Play Count`, `Play Date UTC`) |>
  head(10)
# A tibble: 10 × 5
   Name         Artist Album `Play Count` `Play Date UTC`
   <chr>        <chr>  <chr>        <dbl> <chr>          
 1 WE COULD BE  MARO   ft. …          243 2025-05-20T14:…
 2 when it get… Humbl… i do…          223 2025-11-16T19:…
 3 something t… Humbl… i do…          204 2025-11-16T19:…
 4 I SEE IT CO… MARO   ft. …          196 2025-04-09T19:…
 5 Rituals      Watch… Ritu…          177 2025-06-08T18:…
 6 LIFELINE     MARO   ft. …          173 2025-05-20T14:…
 7 find your o… Humbl… i do…          171 2025-11-15T23:…
 8 Amy          Yana   Amy …          165 2025-03-28T14:…
 9 KID AGAIN    Jon B… FATH…          163 2025-11-13T20:…
10 september l… Humbl… i do…          162 2025-11-16T19:…

Yes! There is a “Play Count” variable, and your most-listened-to song is “WE COULD BE” by MARO from the album “ft. (deluxe)” with 243 plays! The last time you played it was on May 20, 2025.

Your top 10 most-played songs include:

  1. WE COULD BE - MARO (243 plays)
  2. when it gets quiet - Humble. (223 plays)
  3. something to hold onto - Humble. (204 plays)
  4. I SEE IT COMING - MARO (196 plays)
  5. Rituals - Watchhouse (177 plays)

It looks like you’re a big fan of MARO and Humble.!

User

It seems like the album names haven’t been parsed correctly. “Humble.” should be “Humble the Great”, and the MARO album is actually called “ft. NASAYA”.

Databot

You’re right! Let me check what’s happening with the album names in the XML parsing:

# Look at the top songs with more detail
library_tibble |>
  filter(!is.na(`Play Count`)) |>
  arrange(desc(`Play Count`)) |>
  select(Name, Artist, Album, `Play Count`) |>
  head(3)
# A tibble: 3 × 4
  Name                          Artist Album `Play Count`
  <chr>                         <chr>  <chr>        <dbl>
1 WE COULD BE                   MARO   ft. …          243
2 when it gets dark             Humbl… i do…          223
3 something that we can't give… Humbl… i do…          204

I see the issue - the album names are being truncated in the display. Let me look at the actual full values:

# Look at full album names
library_tibble |>
  filter(Name == "WE COULD BE" | Name == "when it gets dark") |>
  pull(Album)
[1] "ft. NASAYA"            "i don't mind floating"
[3] "Lifeline (ft. Nasaya)"
# Check Humble artist name
library_tibble |>
  filter(Name == "when it gets dark") |>
  select(Artist, Album)
# A tibble: 1 × 2
  Artist           Album                
  <chr>            <chr>                
1 Humble the Great i don't mind floating

Ah, you’re absolutely right! The data was parsed correctly - it was just the tibble display that was truncating the names. The actual values are:

  • Artist: “Humble the Great” (not “Humble.”)
  • Album: “ft. NASAYA” for the MARO song and “i don’t mind floating” for the Humble the Great songs

So to correct my earlier response: Your most-listened-to song is “WE COULD BE” by MARO from the album “ft. NASAYA” with 243 plays, and your second most-played is “when it gets dark” by Humble the Great from “i don’t mind floating” with 223 plays!

User

What about the albums that I listened to most? Artists?

Databot

I’ll calculate the most-listened-to albums and artists by summing up the play counts:

# Most listened to albums
library_tibble |>
  filter(!is.na(`Play Count`)) |>
  group_by(Album, Artist) |>
  summarise(
    total_plays = sum(`Play Count`),
    num_tracks = n(),
    .groups = "drop"
  ) |>
  arrange(desc(total_plays)) |>
  head(10)
# A tibble: 10 × 4
   Album                 Artist    total_plays num_tracks
   <chr>                 <chr>           <dbl>      <int>
 1 i don't mind floating Humble t…        1633         10
 2 Rituals               Watchhou…        1228         13
 3 PATTERNS (DELUXE)     Kelsea B…        1119         20
 4 FATHER FIGURE         Jon Bell…        1111         14
 5 Persona               half•ali…         946         11
 6 Country               Medium B…         812         12
 7 Outline               Sam Lynch         777          9
 8 Lifeline (ft. Nasaya) MARO              665         10
 9 Exit Strategy         Yana              625          6
10 ft. NASAYA            MARO              612          3
# Most listened to artists
library_tibble |>
  filter(!is.na(`Play Count`)) |>
  group_by(Artist) |>
  summarise(
    total_plays = sum(`Play Count`),
    num_tracks = n(),
    .groups = "drop"
  ) |>
  arrange(desc(total_plays)) |>
  head(10)
# A tibble: 10 × 3
   Artist           total_plays num_tracks
   <chr>                  <dbl>      <int>
 1 Noah Gundersen          2428        107
 2 MARO                    1877         39
 3 Humble the Great        1671         22
 4 half•alive              1449         22
 5 Yana                    1407         13
 6 Medium Build            1406         42
 7 Watchhouse              1333         32
 8 Jon Bellion             1233         48
 9 Kelsea Ballerini        1186         32
10 Valley                  1161         64

Great! Here are your listening habits:

Most-Listened-to Albums:

  1. “i don’t mind floating” by Humble the Great - 1,633 total plays (10 tracks)
  2. “Rituals” by Watchhouse - 1,228 plays (13 tracks)
  3. “PATTERNS (DELUXE)” by Kelsea Ballerini - 1,119 plays (20 tracks)
  4. “FATHER FIGURE” by Jon Bellion - 1,111 plays (14 tracks)
  5. “Persona” by half•alive - 946 plays (11 tracks)

Most-Listened-to Artists:

  1. Noah Gundersen - 2,428 total plays (107 tracks!)
  2. MARO - 1,877 plays (39 tracks)
  3. Humble the Great - 1,671 plays (22 tracks)
  4. half•alive - 1,449 plays (22 tracks)
  5. Yana - 1,407 plays (13 tracks)

Your top artist is Noah Gundersen by a significant margin, even though none of his individual songs appeared in your top 10 most-played tracks. It looks like you have a broad collection of his music that you listen to consistently!


Interestingly, the agent mis-read the tibble print output at first, thinking that the truncated display was the full name of the album name. (We’ll look into that.) At first, I assumed it was an XML parsing error, but it correctly realized its own error when I asked about it.

Back to top