One tøp song

2022-05-08

I'm a die-hard fan of twenty øne piløts (did you know they're a two-piece band?). You can tell from the fact that I take the trouble to stylize the band name with ø's, even in its acronym, tøp. So don't expect neutrality from this blog post.

The band and its members, Tyler Joseph and Josh Dun, are known for a Grammy and two all-gold records on RIAA, but to me they're irrelevant (the awards, not the members). I like the vibe of their songs and especially the lyrics. For example, take a look at the insightful final lines from Pet Cheetah (Trench) that build up to a pumping crescendo:

Pet cheetah, cheetah
Pet cheetah, cheetah
Pet cheetah, cheetah
Pet cheetah, cheetah
Pet cheetah, cheetah
Pet cheetah, cheetah
Pet cheetah, cheetah
Pet cheetah, cheetah
Pet cheetah, cheetah
Pet cheetah, cheetah
Pet cheetah, cheetah
Pet cheetah, cheetah
Pet cheetah, cheetah
Pet cheetah, cheetah
Pet cheetah, cheetah
Pet cheetah, cheetah

Whatever you say, I think it's a one-of-a-kind song that discusses making music for a fanbase (yes they make a lot of meta songs like this). The lines above are simple, but unique as well. Among all tøp songs, this is the only one that features the word "pet", and also "cheetah", just like how Nico And The Niners is the only one with "Nico" and "Niners". Wait, that's not right, because "Nico" appears earlier in the album, in the second verse of Morph.

This got me thinking: How many words are there that appear in only one twenty øne piløts song? And to make the effort pay off, can I turn this into a fun game for other tøp fans to play?

If you're impatient, you may skip all the procedures and technicalities and go straight to the results. Everyone else, please enjoy the ride.

Step 1: Download the lyrics

This wasn't as easy as it seemed, but it wasn't too hard either. I picked azlyrics.com as the lyric provider because it works without JavaScript and serves machine-readable HTML. So I went ahead and curl'd a random page.

$ curl https://www.azlyrics.com/lyrics/twentyonepilots/truce.html
<html>
<head><title>302 Found</title></head>
<body>
<center><h1>302 Found</h1></center>
<hr><center>nginx</center>
</body>
</html>

OK, time to man curl for the option to follow redirections. It's -L, btw. (HTML prettified)

$ curl -L https://www.azlyrics.com/lyrics/twentyonepilots/truce.html
<!DOCTYPE html>
<html lang="en">
  <head>
    <!-- some meta tags -->
    <title>AZLyrics - request for access</title>
    <!-- some stylesheets -->
    <!-- some <IE9 compat scripts -->
    <!-- jquery and the like -->
    <!-- recaptcha script -->
  </head>

  <body>
    <nav>...</nav>
    <!-- a commented out banner -->

    <!-- a few nested divs -->
            Access denied.
    <!-- end nested divs -->

    <!-- a commented out block with the note "bot ban" -->

    <!-- footer -->
  </body>
</html>

Damn, that's pretty… nasty, but it's exactly how I expected it to go. Now, I've done a lot of web scraping, so I know it's possible to fake a few HTTP headers to give curl some human skin. The most common headers are:

  • Referer
  • Cookie
  • User-Agent

So I tried them one by one. User-Agent worked.

$ curl -L -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0' \
    https://www.azlyrics.com/lyrics/twentyonepilots/truce.html

What is extra funny though, is that the server accepts even an empty UA string.

$ curl -L -H 'User-Agent: ' https://www.azlyrics.com/lyrics/twentyonepilots/truce.html

I couldn't resist:

$ curl -L -H 'User-Agent: definitely not curl' https://www.azlyrics.com/lyrics/twentyonepilots/truce.html

Above the lyrics an HTML comment reads:

Usage of azlyrics.com content by any third-party lyrics provider is prohibited by our licensing agreement. Sorry about that.

This won't stop me because I can't read.

So long story short, I curl'd the twenty øne piløts index page and BeautifulSoup'd all the title-URL pairs, which were once again curl'd and BeautifulSoup'd.
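
A minimal sketch of that index-to-links step, using only the standard library as a stand-in for curl and BeautifulSoup. The names LinkCollector, build_request and song_links are made up for illustration, and the rule that song pages contain "/lyrics/" in their URL is an assumption about the index page's markup, not something taken from the real script.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request


class LinkCollector(HTMLParser):
    """Collect (title, url) pairs from <a href="..."> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pairs = []
        self._href = None  # href of the <a> we are currently inside
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Assumption: song pages live somewhere under /lyrics/
            if href and "/lyrics/" in href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append(("".join(self._text).strip(), self._href))
            self._href = None


def build_request(url, user_agent="definitely not curl"):
    # Same idea as `curl -H 'User-Agent: ...'`: only that header is set.
    return Request(url, headers={"User-Agent": user_agent})


def song_links(index_html, base_url):
    """Parse an already-downloaded index page into (title, url) pairs."""
    parser = LinkCollector(base_url)
    parser.feed(index_html)
    return parser.pairs
```

Passing the result of build_request to urllib.request.urlopen would do the actual download; song_links only needs the HTML as a string, so it can be tried on a saved page.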

Soon I had a directory full of [song title].txt, but not all of them were useful. A few songs are not technically part of tøp's canon discography (some fans are gonna disagree on this one but I don't care), like the Elvis cover Can't Help Falling In Love, which is just a YouTube video of Tyler singing in the street; another one, Coconut Sharks In The Water, although well-known among fans, was only performed once for comical effect in 2011. In the end, I included their six studio albums and five singles, totaling 79 songs.

On to step 2!

Step 2: Look for every word

This is the core part of the project. I knew it would be impossible to do by hand, so I sat down to write an algorithm in Python. It goes like this in pseudocode:

lyrics = dict()
for song in all_songs:
    lyrics[song] = read(song + ".txt").split_words()

for song in all_songs:
    other_songs = list(s in all_songs such that s != song)
    for word in lyrics[song]:
        found = False  # reset for every word
        for other_song in other_songs:
            if lyrics[other_song].includes(word):
                found = True

        if not found:
            append("results.txt", song + "\t" + word)

The latter block has three nested for loops. To optimize it a bit, I read all the files beforehand, split each one into individual words, then threw them into a set to remove duplicates. As for the third for loop, I could have called break right after found = True, but instead I resorted to the magic of list comprehension (variable names and structure taken from the pseudocode above):

        if not any([(word in lyrics[o]) for o in other_songs]):
            append("results.txt", song + "\t" + word)

I like to imagine Python optimized this one for me, but I'm not sure. Anyway, even if it didn't, this shouldn't be too bad. Plus, I like one-liners.
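
For what it's worth, any() only stops at the first match when it is fed a generator; with the square brackets, the whole list is built first. The sketch below sidesteps the question by counting, in one pass, how many songs each word occurs in. It is an alternative to the nested loops, not the real script, and the function name one_song_words is made up. It assumes lyrics maps each song title to a set of words.

```python
from collections import Counter


def one_song_words(lyrics):
    """lyrics maps song title -> set of words; returns sorted (song, word) pairs."""
    # Every value is a set, so a word adds at most 1 per song to the tally.
    songs_per_word = Counter(word for words in lyrics.values() for word in words)
    return sorted(
        (song, word)
        for song, words in lyrics.items()
        for word in words
        if songs_per_word[word] == 1  # the word occurs in this song only
    )
```

Each word is looked at once while counting and once while filtering, instead of being compared against every other song.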

When splitting words, they are converted to lowercase. Punctuation marks and suffixes like 's and 'd are removed, but I forgot to remove 've. Fortunately there weren't many of them, so I removed them by hand.
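
A rough sketch of that splitting step, with 've handled from the start. The two regular expressions are guesses at what counts as a word and a suffix; the real script may well differ, and split_words is just a name borrowed from the pseudocode.

```python
import re

# Suffixes the post says were stripped, plus 've, which it says was forgotten.
SUFFIX = re.compile(r"['’](?:s|d|ve)$")
# Assumption: a word is a run of letters, digits, apostrophes and hyphens.
TOKEN = re.compile(r"[a-z0-9][a-z0-9'’-]*")


def split_words(text):
    """Return the set of lowercase words in text, without punctuation."""
    words = set()
    for token in TOKEN.findall(text.lower()):
        # Drop the suffix first, then any stray quote or hyphen at the edges.
        token = SUFFIX.sub("", token).strip("'’-")
        if token:
            words.add(token)
    return words
```

Contractions such as "don't" are left alone, since only the three listed suffixes are removed.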

You can read the real source code here: data/one_song_words.py

Step 3: Dedupe

The previous step brought about a problem. The script I wrote treated inflections as separate words, e.g. "vibe" (Chlorine), "vibes" and "vibing" (The Outside). So I wrote a script to find most of them.

The script reports occurrences of the following inflections of word:

word + "s",
word + "es",
word + "d",
word + "ed",
word + "ing",

and also in reverse, if word is already inflected:

re.sub("s$", "", word),
re.sub("es$", "", word),
re.sub("d$", "", word),
re.sub("ed$", "", word),
re.sub("ing$", "", word),

When I ran it, it caught most of the offenders, like "vibe" vs. "vibes", but not subtler ones like "vibing" (stripping "ing" leaves "vib", not "vibe"). I ended up removing those by hand again, but it's possible I missed some.

Why didn't I just tell the script to remove the inflections automatically? Because there were false positives. For example, "sing" (Bandito and many others) and "singed" (Leave The City) are not the same thing. Other examples include "to" and "toes", "she" and "shed", "not" and "notes", "even" and "evening", etc. Also, although some pairs are of the same origin, they're pretty different semantically, like "weathered" (Chlorine) and "weather" (Good Day and Migraine). Leaving these alone, I axed everything else from my list.
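
A small sketch of the report-only approach described above: it applies the same five suffixes in both directions and lists the pairs for a human to judge. The function names are made up, and this is a reconstruction from the description, not the contents of data/dedupe.py.

```python
import re

SUFFIXES = ("s", "es", "d", "ed", "ing")


def inflection_candidates(word):
    """Forms of word with a suffix added or stripped, excluding word itself."""
    forms = {word + suffix for suffix in SUFFIXES}
    forms |= {re.sub(suffix + "$", "", word) for suffix in SUFFIXES}
    forms.discard(word)
    forms.discard("")
    return forms


def suspicious_pairs(words):
    """Report (shorter, longer) pairs for manual review; nothing is deleted."""
    vocabulary = set(words)
    pairs = set()
    for word in vocabulary:
        for other in inflection_candidates(word) & vocabulary:
            pairs.add(tuple(sorted((word, other), key=lambda w: (len(w), w))))
    return sorted(pairs)
```

Fed the words from this section, it flags "vibe"/"vibes" but also the false positives "sing"/"singed" and "to"/"toes", and it misses "vibing" entirely, which is why the final call stays with a person.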

Source code: data/dedupe.py

Step 4: Manual inspection

It was at this moment that I realized that I had forgotten about stuff like "[x10]" (Holding On To You) that marks a repeated line. There were some onomatopoeic words like "mm-mm" (Choker), too, and don't get me started on hyphenated words: there were "treehouse" (Forest) and "tree-house" (Stressed Out). Words like "migraine", which comes from a song titled Migraine, are too easy for a game, so they are not included either. I also capitalized proper nouns like "Monday", and removed trailing periods and commas from every line I could find. In retrospect it would have been easier if I had sanitized the lyric files from the beginning. At this moment there are 1,002 words left, but I don't know if there's more to knock out. I doubt anyone will notice.

Here's a fun story: after I deployed the app (yes there'll be a web app at the end) on r/twentyonepilots, one player reported an incorrect lyric from Migraine:

A difficult to be, stop feasting lumber-down trees

At first glance this lyric seemed unfamiliar to me, and it definitely isn't grammatically correct. I checked multiple sources: on azlyrics of course it's this one, but on Genius it says otherwise:

A difficult beast feasting on burnt down trees

Oops, better go check out the description from the official audio on Fueled By Ramen's (tøp's label, FBR for short) YouTube channel:

a difficult to be, stop feasting lumber down trees

And this video at 14:40 on Warner Music Japan's channel with Japanese and English subtitles:

燒け落ちた木々貪り食う、気難しい野獸
A difficult beast feasting on burnt down trees

Well, I tried.

So to settle this the only thing I could do was find out by myself. I grabbed WrightP's Official Acapella version and extracted that bit with Audacity. I slowed it down 50%, and it sounds like this:

Let me explain what I heard:

A difficult-a beast-a feasting-on bur- down trees

The "ng" sound between "feasting" and "on" is audible. There is no "l" sound as in "lumber-down", and there is no /ɒ/ or /ɑ/ sound following "st", which rules out "stop".

That settles it: Genius and WMG Japan are right, azlyrics and FBR are wrong. I suspect that azlyrics got its lyrics from FBR in the first place.

Track-word pairs: data/track_words

Step 5: Generating a dataset

Now that I have a 1000-something-line-long file of tab-separated track titles and unique words, it's time to generate a dataset for the game. Since I'll be producing a web game, the language is gonna be JavaScript, so the dataset will be in JSON. The first challenge is that we need to know the line each word came from. This way if the player fails to recall it, we'll show them the line and they will go "hmm, yeah, Tyler really did sing this". But you see, my step 2 script completely scrambled the lyrics. So I wrote another Python script to "grep" them from the giant heap of txt files. It was pretty easy, and moments later I had a JSON file structured like this:

[
  {
    "track": "Redecorate",
    "word": "blankets",
    "lines": [
      "Then one night she got cold with no blankets on her bed",
      "Blankets over mirrors, she tends to like it"
    ]
  },
  {...},{...},...
]
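
The "grep" step could look roughly like the sketch below. It is a guess at the approach rather than a copy of data/mkjson.py; lines_with_word and make_entries are hypothetical names, and lyrics_by_track is assumed to map a track title to the full text of its lyric file.

```python
import re


def lines_with_word(lyric_text, word):
    """Return the distinct lines of lyric_text that contain word as a whole word."""
    pattern = re.compile(
        r"(?<![\w'-])" + re.escape(word) + r"(?![\w-])", re.IGNORECASE
    )
    found = []
    for line in lyric_text.splitlines():
        line = line.strip()
        # Repeated lines (choruses) are kept only once.
        if line and pattern.search(line) and line not in found:
            found.append(line)
    return found


def make_entries(pairs, lyrics_by_track):
    """Turn (track, word) pairs into objects shaped like the JSON above."""
    return [
        {"track": track, "word": word,
         "lines": lines_with_word(lyrics_by_track[track], word)}
        for track, word in pairs
    ]
```

The match ignores case, so "blankets" also finds the line that starts with "Blankets".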

I should try to shrink the 135kB (kilo, not kibi) dataset. First, the prettyprint was unnecessary, so let's do away with it. It instantly went down to 99kB. However, having everything on one line makes batch editing in vim a huge pain, and every launch took seconds. So as a compromise I inserted a linebreak after every word object, so for x words there would be (x+2) lines including the brackets. 1kB well spent. The JSON file is now a neat 100kB, which is a 26% optimization compared to the initial 135kB.

However, as I was coding JavaScript I realized that, since we're using the dataset as a JavaScript object, we don't have to play by JSON's rules. This means no more double quotes around keys! Each word object has 6 double quotes, 6 times 1000 is… 6kB! That's right, we just shrank the dataset to 94kB. Now that's a 30% optimization. All by frugal management of whitespace.
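
To make the arithmetic concrete, here is a sketch that measures the three layouts discussed above and recomputes the percentages. The separators setting is an assumption about how the compact file was produced, and the function names are invented for this example.

```python
import json


def dataset_sizes(entries):
    """Byte sizes of three serializations of the same list of word objects."""
    pretty = json.dumps(entries, indent=2, ensure_ascii=False)
    compact = json.dumps(entries, separators=(",", ":"), ensure_ascii=False)
    # One object per line: x objects give x + 2 lines including the brackets.
    per_line = "[\n" + ",\n".join(
        json.dumps(e, separators=(",", ":"), ensure_ascii=False) for e in entries
    ) + "\n]"
    layouts = [("pretty", pretty), ("compact", compact), ("per_line", per_line)]
    return {name: len(text.encode("utf-8")) for name, text in layouts}


def saving(before_kb, after_kb):
    """Percentage saved, rounded to the nearest whole percent."""
    return round(100 * (before_kb - after_kb) / before_kb)
```

With the figures from the text, saving(135, 100) gives 26 and saving(135, 94) gives 30. The one-object-per-line layout costs exactly one newline per object plus one, which matches the roughly 1kB overhead for about 1,000 words.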

Later I found it would be better if I tagged the album to each word, but it would be super redundant. So instead, I placed lists of tracks in each album inside another JS file that is loaded alongside the words.
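
The idea, sketched in Python rather than JavaScript: keep one album-to-tracks table and derive the track-to-album lookup from it, instead of repeating the album name in every word object. The shape of the table is an assumption about albums.js, and invert_albums is a made-up name.

```python
def invert_albums(albums):
    """Turn {album: [tracks]} into {track: album} for direct lookup."""
    return {track: album for album, tracks in albums.items() for track in tracks}


# Excerpt only; these three tracks are the ones this post places on Trench.
ALBUMS = {"Trench": ["Morph", "Nico And The Niners", "Pet Cheetah"]}
ALBUM_OF = invert_albums(ALBUMS)
```

A word object then only needs its track title, and ALBUM_OF["Pet Cheetah"] answers "Trench".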

JSON generator script: data/mkjson.py

Datasets: data/words.json, words.js, and albums.js

Step 6: Design the game UI

I thought I despised the "mobile first" approach, but it turns out what I hated was the "mobile only" garbage. [Mobile Wikipedia] actually works remarkably well on desktop. What I'm doing is so much simpler than Wikipedia. The page contains the following fundamental elements:

  • the word
  • textbox for user input
  • candidate list
  • controls

I swear, the desktop version works just as smoothly as on mobile (although I failed to center a few elements).

Desktop UI

Mobile UI

▲ Notice that "twenty øne piløts" are joined with non-breaking spaces

And my absolute favorite thing here is the candidate list. I wouldn't expect anyone to type "House Of Gold" in its entirety, would I? Of course there should be some sort of search suggestion. The candidate list I implemented tries to match user input against the beginning of each song title, as well as acronyms. For example "hot" gives you Holding On To You. A hack was written for Heavydirtysoul so that "hds" would match it.
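
The matching rule can be sketched like this, in Python for consistency with the other scripts even though the game itself runs JavaScript. Matching on a prefix of the acronym (so that "hot" finds HOTY) and the ALIASES table are assumptions about how index.js behaves.

```python
# Hand-written aliases for titles the acronym rule cannot handle.
ALIASES = {"hds": "Heavydirtysoul"}


def acronym(title):
    """First letter of each space-separated word, lowercased."""
    return "".join(word[0] for word in title.split()).lower()


def candidates(query, titles):
    """Titles whose beginning or acronym starts with the query."""
    q = query.strip().lower()
    if not q:
        return []
    return [
        title for title in titles
        if title.lower().startswith(q)
        or acronym(title).startswith(q)
        or ALIASES.get(q) == title
    ]
```

With the titles Holding On To You, House Of Gold and Heavydirtysoul, "hot" narrows the list down to Holding On To You, "ho" keeps the first two, and "hds" finds Heavydirtysoul through the alias.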

Oh, I almost forgot: the three buttons are twenty øne piløts-themed.

The classic |-/ logo: blue vertical bar, black dash, and red slash

▲ Former tøp logo from the Regional at Best era

Step 7: Game logic

From this point on there are no more repetitive chores, and I can finally focus on making a game. The concept is simple: the player tries to guess the song that a word came from.

Let me enumerate the steps in which the player would interact with my game:

  1. Game shows random word taken from dataset
  2. Player types track title into textbox, confirms
  3. Game indicates correct answer, shows album and line
  4. Player clicks Next, go to 1

The player might not be always right. In that case the flow would be:

  1. Game shows random word taken from dataset
  2. Player types track title into textbox, confirms
  3. Game indicates wrong answer
  4. Player tries again, go to 2; or clicks Next, go to 1

We need some hint mechanism so a clueless player has a chance of recalling something.

  1. Game shows random word taken from dataset
  2. Player does nothing, or makes incorrect guesses
  3. Player clicks Hint
  4. Game reveals some information about correct answer unless hints are depleted. Go to 2

I wanted this game to be as pressure-free as possible. Therefore, players can skip a word or reveal the answer at any time, and there are no scorekeeping counters or timers. Every 50 guesses, the game reminds the player to take a rest.
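
The three flows above boil down to a tiny bit of state per word. Here is a sketch in Python (the real thing lives in index.js); the Round class, the order and content of the hints, and should_remind are all hypothetical.

```python
class Round:
    """One word to guess. Hints are handed out in order until none are left."""

    def __init__(self, entry, hints):
        self.entry = entry        # {"track": ..., "word": ..., "lines": [...]}
        self.hints = list(hints)  # what a hint reveals is left to the caller
        self.solved = False

    def guess(self, title):
        """Return True on a correct guess; wrong guesses can simply be retried."""
        correct = title.strip().lower() == self.entry["track"].lower()
        if correct:
            self.solved = True
        return correct

    def hint(self):
        """Return the next hint, or None once hints are depleted."""
        return self.hints.pop(0) if self.hints else None


def should_remind(guess_count, every=50):
    """True on the 50th, 100th, ... guess: time to suggest a break."""
    return guess_count > 0 and guess_count % every == 0
```

Skipping a word or showing the answer needs no extra state: the game just starts a new Round.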

Source code: index.js

Step 8: Debugging

The game was designed to run offline. The server, if any, is there just to send you the HTML, stylesheet, and JavaScript for datasets and the game itself. This means it is possible to do everything in a file:// browser tab.

Because the web game is designed "mobile first" (but in a good way), I tested the UI extensively with and without DevTools mobile emulator, and on my phone. This way I figured out what interactions worked best on both keyboard and touchscreen.

As to the JavaScript, I did not exactly enjoy writing it, but it wasn't hellish suffering either. I no longer "hate" JavaScript; I just want to stay away from it from now on. I would describe my code as pretty type-safe… until it isn't.

Step 9: Visualizing and having fun with the dataset

No, it's not about fancy charts or scatter plots. I just thought it would be helpful if we could display all the words in a table, so I made a webpage for that. Fun fact: I gave up indentation for all the <tr> tags. Otherwise there would be 28 bytes * 1,002 rows ≈ 28kB of wasted data.

Table of a few tracks, words that only appear in each one, and respective lines

Then I thought, "hey, what if I pulled up a list of most frequently used English words and compared that to those I found?" So I downloaded a list from Wiktionary titled Frequency lists/TV/2006/1-1000 which is the top 1000 words used in "a collection of TV and movie scripts/transcripts" as of 2006. This time though, I made more use of Unix tools. It worked like this (the 1000-word list was saved in file 1000):

$ cut -f2 tracks_words | sort > /tmp/top  # extract word from "track<tab>word"
$ sort 1000 > /tmp/freq
$ comm -12 /tmp/top /tmp/freq  # find common words between the two files
ahead
anybody
anyway
...
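
For comparison, the same intersection in Python is a one-liner over two sets. This is only a sketch of the equivalent logic; common_words is a made-up name, and like comm it compares words case-sensitively.

```python
def common_words(track_word_lines, frequent_words):
    """Python counterpart of `cut -f2 | sort` followed by `comm -12`."""
    # Second tab-separated field of every "track<tab>word" line.
    ours = {
        line.split("\t")[1].strip()
        for line in track_word_lines
        if "\t" in line
    }
    return sorted(ours & {word.strip() for word in frequent_words})
```

Both arguments can be open file objects, since iterating over a file yields its lines.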

And here we have the 88 words that also show up in the frequency list:

Table of a few words, and the track they are in

I ran some more stupid analysis on the dataset and found that the only song that had absolutely no unique word is Truce (a bad day to the Truce fans out there, eh?), and the songs closest to zero are Before You Start Your Day and Trees, contributing 2 each. The figures go all the way up to 51: Neon Gravestones, which is basically a rapped-out essay, has the most expansive vocabulary among all tøp songs. I wrote all my interesting findings in the trivia section for players to discover.
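
Counts like these fall out of the track-word pairs with a Counter. A minimal sketch, assuming pairs is a list of (track, word) tuples and all_tracks is the full track list, which is needed so that a song with zero unique words still shows up; the function name is invented.

```python
from collections import Counter


def unique_word_counts(pairs, all_tracks):
    """Number of one-song words per track, including tracks with none."""
    counts = Counter(track for track, _word in pairs)
    # A Counter answers 0 for tracks that never appear in pairs.
    return {track: counts[track] for track in all_tracks}
```

Taking min() and max() over the resulting dict by value then gives the tracks with the fewest and the most one-song words.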

The scripts I used to generate HTML: data/mkhtml_all.py, and data/mkhtml/freq.py

The HTML: words.html

Step 10: Deployment

The only thing it takes to deploy a static website is scp. rsync if you have lots of data. Let's calculate how much data we have to transfer.

File        Size (kB)
index.html  9.6
words.html  88
index.css   1.7
index.js    6.5
words.js    94
albums.js   2.3
img/*.jpg   115.4
Total       317.5

Incidentally, this is how much my game will consume from a player's data plan. I think it's small enough for anyone.

Results

On April 19, 2022, I published a version I thought was stable enough to r/twentyonepilots. It got reasonably popular. You can play it here: One tøp song

Here's a demo video (2.0 MiB):

The source code (MIT) is here. If you want, you can download lyrics to your favorite artists' songs and generate your own dataset to play with. A redditor considered Taylor Swift, and I'm looking forward to their progress.

In conclusion, I think I did a pretty good job at extracting, representing, and toying with data, but the process left a lot to be improved. NLP connoisseurs are gonna be mad at me for not using this and that library, and some Unix guru might be capable of rewriting my Python scripts with sed, awk, and jq. I do not care. The final product is one of my better interactive web designs, made with no framework and minimal assets. The game is not designed to be addictive, unlike $insertGameNameHere. It is, after all, just for fun; in the disclaimer I wrote that the game is "not a tool for gatekeeping." That's how things are supposed to work.