A couple of months ago, I wrote myself a tool that takes a text file of song lyrics and generates an image showing two things: how frequently each word appears in the song (like a word cloud, where more frequent words are larger), and which words follow which (unlike a word cloud, it has arrows between the words).
After trying it on quite a few different songs, I came up with the idea of feeding it a very repetitive song, such as the road trip song 99 Bottles of Beer.
Yesterday, I decided to post this image to the Reddit r/dataisbeautiful community, and it received a lot of interest. Some people have asked how I created an image like this, and this post is my attempt to answer.
While I’ll try to keep from getting too technical, one thing we need to understand is that this song lyric image is a directed graph.
Simplified, a directed graph is a bunch of nodes (the circles, each with a unique word of the song) and edges (the arrows showing how the words are related).
For example, an edge (arrow) from “99” → “bottles” means that “99” comes just before “bottles” in the song lyrics.
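To make that concrete, here is a minimal sketch of how a directed graph could be held in code: just a list of (source, target) pairs. This is an illustration only, not taken from my script; the names `edges`, `nodes`, and `successors` are made up for this example.

```python
# A tiny directed graph for the phrase "99 bottles of beer".
# Each tuple is one edge (arrow): (source word, target word).
edges = [
    ("99", "bottles"),
    ("bottles", "of"),
    ("of", "beer"),
]

# The nodes are simply every unique word that appears in some edge.
nodes = sorted({word for edge in edges for word in edge})


def successors(word):
    """Return the words that directly follow `word` (hypothetical helper)."""
    return [target for source, target in edges if source == word]


print(nodes)
print(successors("99"))
```

Asking for the successors of "99" returns the words its arrows point to, which is the same thing the image shows visually.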
I can create a directed graph with a (free!) tool like yEd Graph Editor, which lets me draw nodes (circles) and drag edges (arrows) between them.
So with this alone, I could create an entire song lyrics graph, but it would take a very long time – there are thousands of words in all ninety-nine verses of the song, so I’d have to draw thousands of arrows.
yEd files are in a format called GraphML. Here’s a sample of a very simple graph, and the GraphML that describes it:
Lines 1–6 tell us that this is the start of a GraphML document, and lines 17–18 end the document. What we care most about is the nodes (lines 8–11) and edges (lines 13–15).
You can see that each <node> has an id. Each <edge> has a source (where the arrow comes from) and target (where the arrow points to), and they use those same node ids. So, for example, <edge source="99" target="bottles"/> means “draw an arrow from the node with an id of 99 to the node with an id of bottles.”
Notice that each node can have multiple edges, so we only need to define each word as a node once – even though “bottles” is used hundreds of times throughout the song, we only need a single node with an id of bottles, and then we can refer to it with as many edges as we need.
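As a rough sketch of how those <node> and <edge> elements could be produced from code, the snippet below uses Python's standard xml.etree.ElementTree module. It is not the code from my script: the function name `build_graphml` and the graph id "G" are assumptions for illustration, and it writes plain GraphML only, without any of the yEd extensions.

```python
import xml.etree.ElementTree as ET

# The standard GraphML namespace.
GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def build_graphml(nodes, edges):
    """Return a GraphML string with one <node> per id and one <edge> per pair."""
    root = ET.Element("graphml", {"xmlns": GRAPHML_NS})
    graph = ET.SubElement(root, "graph", {"id": "G", "edgedefault": "directed"})

    # Each word is defined as a node exactly once.
    for node_id in nodes:
        ET.SubElement(graph, "node", {"id": node_id})

    # Edges refer back to nodes by id, so a node can be reused many times.
    for source, target in edges:
        ET.SubElement(graph, "edge", {"source": source, "target": target})

    return ET.tostring(root, encoding="unicode")


print(build_graphml(["99", "bottles"], [("99", "bottles"), ("bottles", "99")]))
```

Both edges in the usage line point at the same two nodes, which shows why each word only has to be declared once.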
Effectively, what I need to do is create a script which will loop through the lyrics text and create a <node> for each unique word. Then I need to go back through the lyrics and, looking at each pair of adjacent words, create an <edge> between them.
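Those two passes can be sketched in a few lines of Python. This is a simplified illustration rather than my actual script: the function name `lyrics_to_graph` is made up, and the tokenizing rule (lowercase everything, keep letters, digits, and apostrophes) and the choice to keep only one edge per unique word pair are assumptions.

```python
import re


def lyrics_to_graph(lyrics):
    """Collect unique words as nodes and unique adjacent word pairs as edges."""
    # Assumed tokenizer: lowercase, keep runs of letters, digits, apostrophes.
    words = re.findall(r"[a-z0-9']+", lyrics.lower())

    # First pass: one node per unique word, in order of first appearance.
    nodes = list(dict.fromkeys(words))

    # Second pass: pair each word with the one after it, dropping repeats.
    edges = list(dict.fromkeys(zip(words, words[1:])))
    return nodes, edges


nodes, edges = lyrics_to_graph(
    "99 bottles of beer on the wall, 99 bottles of beer."
)
print(nodes)
print(edges)
```

Even in this single line of the song, "99", "bottles", "of", and "beer" each appear twice but produce only one node apiece, and the repeated pairs collapse into a single edge.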
The resulting code is my song-lyrics-graph Python script. It’s built using the basic concept above, though it has some additional features too – plain vanilla GraphML doesn’t allow things like specifying the size of nodes, but yEd adds extensions to the GraphML document that let me do that.
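To give a feel for the node-sizing part, here is a small sketch that counts how often each word appears and turns that count into a diameter. The formula is an assumption for illustration, not necessarily what my script does: `node_sizes`, `min_size`, and `scale` are hypothetical names, and the square-root scaling is just one reasonable choice. Writing the sizes into the yEd-specific parts of the GraphML file is left out here.

```python
import math
import re
from collections import Counter


def node_sizes(lyrics, min_size=30.0, scale=10.0):
    """Map each unique word to a diameter that grows with its frequency."""
    # Same assumed tokenizer as before: lowercase words, digits, apostrophes.
    words = re.findall(r"[a-z0-9']+", lyrics.lower())
    counts = Counter(words)

    # Square-root scaling keeps very common words from dwarfing the rest.
    return {
        word: min_size + scale * math.sqrt(count)
        for word, count in counts.items()
    }


sizes = node_sizes("beer beer beer beer wall")
print(sizes)
```

With these assumed defaults, a word that appears four times gets a diameter of 50 and a word that appears once gets 40, so frequent words stand out without growing linearly.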
As long as Python is installed on your computer and you’ve downloaded my script, you can drag and drop a .txt file of song lyrics onto the song_lyrics_graph.py file, and it will generate a .graphml file with a directed graph of your song.
My script does generate all the nodes and edges, but it doesn’t position them in a pretty layout – the file it generates will just have all the nodes on top of each other.
Fortunately, yEd has a layout engine that will try to figure out a good arrangement of the nodes. Open the Layout menu, and you’ll see a large selection of layouts to choose from.
For most songs, I’ve found that the Tree / Balloon layout works best, though you can certainly experiment with the others.
When you select Layout / Tree / Balloon, a settings dialog appears.
Again, you can play around with the settings to try to make the graph look good, but these are the settings I usually use.
Click OK, and yEd will arrange your nodes as it sees fit.