Creating Song Lyrics Graphs

A couple of months ago, I wrote myself a tool which could take a text file of song lyrics and generate an image showing how frequently each word appeared in the song (like a word cloud, where more frequent words were larger), and which words followed which words (unlike a word cloud, since it had arrows between the words).

After trying it on quite a few different songs, I came up with the idea of feeding it a very repetitive song, such as the road trip song 99 Bottles of Beer.

A directed graph of the lyrics of "99 Bottles of Beer," with words as nodes and edges between subsequent lyrics.

Yesterday, I decided to post this image to the Reddit r/dataisbeautiful community, and it received a lot of interest. I’ve had some people ask how I created an image like this, which this post will try to answer.

Directed Graphs

While I’ll try to keep from getting too technical, one thing we need to understand is that this song lyric image is a directed graph.

Simplified, a directed graph is a bunch of nodes (the circles, each with a unique word of the song) and edges (the arrows showing how the words are related).

For example, an edge (arrow) from “99” → “bottles” means that “99” comes just before “bottles” in the song lyrics.

I can create a directed graph with a (free!) tool like yEd Graph Editor, which lets me draw nodes (circles) and drag edges (arrows) between them.

The first two lines of the song – “99 bottles of beer on the wall / 99 bottles of beer” – in graph form

So with this alone, I could create an entire song lyrics graph, but it would take a very long time – there are thousands of words in all ninety-nine verses of the song, so I’d have to draw thousands of arrows.

Automatically Generating a yEd Graph

To save time, I want to be able to take a text file of song lyrics and automatically convert it into a yEd document.

yEd files are in a format called GraphML. Here’s a sample of a very simple graph, and the GraphML that describes it:

A simple graph, with the following nodes: 99, bottles, of, beer. Arrows join these nodes in that order.

Lines 1–6 tell us that this is the start of a GraphML document, and lines 17–18 end the document. What we care most about is the nodes (lines 8–11) and edges (lines 13–15).

You can see that each <node> has an id. Each <edge> has a source (where the arrow comes from) and target (where the arrow points to), and they use those same node ids. So, for example, <edge source="99" target="bottles"/> means “draw an arrow from the node with an id of 99 to the node with an id of bottles.”

Notice that each node can have multiple edges, so we only need to define each word as a node once – even though “bottles” is used hundreds of times throughout the song, we only need a single node with an id of bottles, and then we can refer to it with as many edges as we need.

Effectively, what I need to do is create a script which will loop through the lyrics text and create a <node> for each unique word. Then I need to go back through the lyrics and, looking at each pair of adjacent words, create an <edge> between them.

The resulting code is my song-lyrics-graph Python script. It’s built using the basic concept above, though it has some additional features too – plain vanilla GraphML doesn’t allow things like specifying the size of nodes, but yEd adds extensions to the GraphML document that let me do that.

As long as Python is installed on your computer and you’ve downloaded my script, you can drag and drop a .txt file of song lyrics onto the song_lyrics_graph.py file, and it will generate a .graphml file with a directed graph of your song.

yEd Layouts

My script does generate all the nodes and edges, but it doesn’t position them in a pretty layout – the file it generates will just have all the nodes on top of each other.

Screenshot of yEd

Fortunately, yEd has a layout engine that will try to figure out a good arrangement of the nodes. Open the Layout menu, and you’ll see a large selection of layouts to choose from.

Screenshot of yEd's Layout menu

For most songs, I’ve found out that the Tree / Balloon layout seems to work best, though you can certainly experiment with the others.

When you select Layout / Tree / Balloon, a set

Screenshot of the yEd Balloon Layout settings menu. Root Node Policy: Weighted Center Root. Routing Style for Non-Tree Edges: Straight-Line. Preferred Child Wedge: 200. Preferred Root Wedge: 360. Minimal Edge Length: 10. Compactness Factor: 0.5. Place Children Interleaved and Straighten Chains is checked, all other checkboxes are unchecked.

Again, you can play around with the settings to try to make the graph look good, but these are the settings I usually use.

Click OK, and yEd will arrange your nodes as it sees fit.

From there, you can export your graph as a .png image by using the File / Export menu!

Generating GPX and KML maps with Ruby on Rails

While working on my various projects, I’ve dealt with various types of maps.

My Flight Historian plots flight data using the Great Circle Mapper tool. These maps are simple to generate from my flight data (I just have to pass it a plain text collection of airport codes) and easy to embed in my website. However, because they are static images, they can’t be easily panned, zoomed, or otherwise manipulated in the way that modern map websites and apps can.

A sample Great Circle Mapper map of my flights in 2018.
(Map generated by Paul Bogard using the Great Circle Mapper – copyright © Karl L. Swartz)

On the other hand, my driving maps require too much detail for a single image, so I create them in Google Earth, which lets me manipulate the view as much as I need to. The driving data is a bit more complicated than my flight log data; while my flight log represents the abstract shortest distance straight line between two airports (and thus only requires specifying the airport at each end), a single drive can involve tens of thousands of coordinates that can be joined together, connect-the-dots style, to show the actual driving route taken.

A sample driving data map using Google Earth

Fortunately, all those coordinates are automatically generated and saved by my car’s GPS navigation unit in a file format called GPX (GPS Exchange Format), which is an XML-based file format which contains (among other things) latitude/longitudes sampled, in the case of my particular GPS, once per second.

Sample from a GPX file from a recent trip

Google Earth doesn’t use GPX format (though it can import it); instead, it uses a format called KML (Keyhole Markup Language, from back when Google Earth was Keyhole EarthViewer). KML is also an XML-based format, so conceptually it’s similarly a collection of coordinates that can all be joined together, with its own slightly different style.

The same set of latitudes, longitudes, and altitudes in KML format. (Note that KML reverses the order of longitude and latitude.)

But while GPX and KML can be used to represent complicated route shapes, they don’t have to be. These formats are both just as capable of taking a pair of points on the globe and drawing the shortest line between them. With that in mind, I decided to try to have Flight Historian automatically generate KML and GPX versions of my flight map, which would let me show my flight routes in Google Earth and Google Maps.

Continue reading “Generating GPX and KML maps with Ruby on Rails”

My “Worst” Layovers

Flying out of a smaller city like Dayton, I’m used to having flight layovers on the way to nearly everywhere I travel. While any layover is going to lengthen a trip, one of the most common complaints I hear from traveling companions is when a layover forces them to fly east to go west, or vice versa.

Traveling east (DAY–IAD) to go west (TUL)

[All maps in this post are generated by Paul Bogard using the Great Circle Mapper – copyright © Karl L. Swartz]

I started thinking about a way to quantify how bad a layover was, and ultimately decided that it would be best to compare the sum of the (great circle) distances for each of the flights flown compared to the (great circle) distance of a direct flight from the origin to the destination:

{ratio}_{layover} = \dfrac{distance_1+distance_2+\ldots+distance_n}{distance_\text{direct}}

This would give me a ratio of how much further I flew than I needed to, where a higher ratio would mean a worse layover. A ratio of 2 would mean I flew twice as far as I needed to, a ratio of 3 would mean three times as far, and so on. A ratio of 1 would mean a layover didn’t add any extra distance at all.

My Worst 5 Layovers

Since I keep track of all of my flight data, I can use this ratio to determine my worst layovers.

Note: these layovers are the “worst” in a mathematical sense only –
the ones that add the most distance relative to the shortest theoretical distance. None of these were subjectively bad – the worst in that sense would probably be awarded to some of the weather/mechanical IRROPS that added extra unplanned layovers and days to my travel time. My intent is not to complain that any of the below are bad, but just to come up with an interesting way to quantify some of my flight data.

#5 Worst: Nashville–Charlotte–Dayton

Flown: 697 mi · Direct: 293 mi · Ratio: 2.379

A lot of my bad layovers come from trips that are just on the threshold where either driving or flying could make sense (for me, about a six hour drive). Because these are some of the shortest direct distances I fly, any deviation in the layover tends to greatly increase the length of the trip. In this case, the route was about 2.3 times longer than a direct flight would have been.

#4 Worst: St Louis–Charlotte–Dayton (both directions)

Flown: 944 mi · Direct: 338 mi · Ratio: 2.793

Similarly, St. Louis is right on the drive/fly threshold for me.

There used to be a direct flight between Dayton and St. Louis back when American Airlines was still operating St. Louis as a hub it inherited from TWA, but now it takes a layover to get there. Usually I can at least go through Chicago O’Hare which is more direct (ratio of 1.467), but occasionally I end up having to fly through Charlotte to get a flight at the right time of day.

#3 Worst: Dayton–Dallas/Fort Worth–Boston

Flown: 2,419 mi · Direct: 707 mi · Ratio: 3.421

This is the one that I thought would be my worst layover. I got this trip for free with frequent flier miles, so I wasn’t going to complain too much about the routing, but I’ve always thought this was a pretty ridiculous-looking map.

#2 Worst: Milwaukee–Atlanta–Dayton

Flown: 1,103 mi · Direct: 283 mi · Ratio: 3.898

All I can guess is that it was probably the cheapest flight available when I booked it, and I wasn’t a very experienced traveler at the time.

#1 Worst: Des Moines–Houston–Wichita

Flown: 1,346 mi · Direct: 335 mi · Ratio: 4.018

This is one of my few short trips that didn’t start at home. I had a work trip where I had to be in Des Moines for the first half of the week, and Wichita the second half. Again, it’s about a six hour drive between the two cities, but with as out of the way as this layover turned out to be (more than quadrupling my distance traveled!), I might have been better off driving.

My Best 5 Layovers

#5 Best: Chicago–Toronto–Munich

Flown: 4,561 mi · Direct: 4,517 mi · Ratio: 1.010

While there are nonstop flights available between Chicago and Munich, I booked this route on frequent flier miles and had to take a layover to do so. That said, it only added a percent to the length of the trip (and at least made the transatlantic flight slightly shorter), so it worked out fine.

#4 Best: Dayton–Denver–Burbank

Flown: 1,930 mi · Direct: 1,911 mi · Ratio: 1.010

Dayton doesn’t have any direct flights to west coast airports (in fact, Denver is the longest direct flight from Dayton), so this routing was pretty decent to get to Burbank.

#3 Best: Charleston–Charlotte–Dayton

Flown: 538 mi · Direct: 536 mi · Ratio: 1.004

Normally my job had me flying United when I went to Charleston, so I had a lot of layovers at Washington Dulles. However, my very first return flight from Charleston was right after United had merged their reservation system with Continental. They were having a lot of issues and my flight got cancelled, so United ended up putting me on a US Airways flight through Charlotte, which was a better layover anyway.

#2 Best: Dayton–Chicago–Seattle (both directions)

Flown: 1,955 mi · Direct: 1,952 mi · Ratio: 1.002

This route was what I expected my best layover to be, and it looks like I was only one place off. The stop in Chicago only adds two tenths of a percent to the length of this route.

#1 Best: Chicago–Cleveland–New York

Flown: 738 mi · Direct: 738 mi · Ratio: 1.000

So while obviously a trip with a layover is still going to take longer than a direct flight, this is about the best layover you can get: any increase in distance for the layover is within the rounding error, and the stop didn’t add a single extra mile.

Interestingly enough, this trip section was part of the same trip that had my second-worst layover of Milwaukee–Atlanta–Dayton, shown above.

Methodology

My flight log’s table of flights contains a trip_id and a trip_section number for that trip, and since layovers are going to be contained within trip sections, I needed to first determine every unique trip_id and trip_section combinations in my flight log:

Flight.all.map{|f| [f.trip_id, f.trip_section]}.uniq

Then I used that to create an array of trip sections, each entity of which contained an array of pairs of airport codes (for example, [["DAY","CLT"],["CLT","STL"]]):

.map{|ts| Flight.where(trip_id: ts.first, trip_section: ts.last).order(:departure_utc).map{|f| [f.origin_airport.iata_code, f.destination_airport.iata_code]}}

Once I had that, I used uniq to remove duplicate routes. Since there was no point in evaluating direct flights (e.g., routes with just a single flight), I also used a select block to keep only routes that had more than one flight:

.uniq.select{|f| f.count > 1}

So now that I had a collection of trip sections with layovers, I had to calculate their total distance, and the direct distance between the first flight’s origin and the last flight’s destination.

Every Airport in my flight log has a latitude and longitude stored, and my flight log already has a Route.distance_by_iata(iata1, iata2) method to find the great circle distance between two airport codes (using the haversine formula).

To get the total trip section route distance flown, I used a map command to create an array of trip distances, and a reduce command to sum them (assuming ts is the array of flight airport code pairs in a trip section):

rd = ts.map{|f| Route.distance_by_iata(f.first, f.last)}.reduce(0, :+)

Best distance (direct flight distance) is easier, since I just need to run the distance calculation on the first flight’s first airport, and the last flight’s last airport:

bd = Route.distance_by_iata(ts.first.first, ts.last.last)

So combining these, we can use a map on the collection of trip sections to create an array of hashes of trip section routes, distances, best distances, and ratios:

.map{|ts| rd = ts.map{|f| Route.distance_by_iata(f.first, f.last)}.reduce(0, :+); bd = Route.distance_by_iata(ts.first.first, ts.last.last); {route: ts.map{|f| f.first}.push(ts.last.last).join("-"), route_distance: rd, best_distance: bd, ratio: (rd.to_f/bd.to_f).round(3)}}

And sort it by ratio descending:

.sort_by{|f| -f[:ratio]}

Combining these all into a single statement:

output = Flight.all.map{|f| [f.trip_id, f.trip_section]}.uniq.map{|ts| Flight.where(trip_id: ts.first, trip_section: ts.last).order(:departure_utc).map{|f| [f.origin_airport.iata_code, f.destination_airport.iata_code]}}.uniq.select{|f| f.count > 1}.map{|ts| rd = ts.map{|f| Route.distance_by_iata(f.first, f.last)}.reduce(0, :+); bd = Route.distance_by_iata(ts.first.first, ts.last.last); {route: ts.map{|f| f.first}.push(ts.last.last).join("-"), route_distance: rd, best_distance: bd, ratio: (rd.to_f/bd.to_f).round(3)}}.sort_by{|f| -f[:ratio]}

Running it on my flight log provided me my results:

And for ease of comparison, I decided to convert it into CSV-formatted output so I could import it into Excel:

output.map{|f| puts f[:route] + "," + f[:route_distance].to_s + "," + f[:best_distance].to_s + "," + f[:ratio].to_s + "\n"}

With that, I had all the information I needed to create my 5 best and 5 worst layovers list.

[Edit on 11 Feb 2019: I have now updated Flight Historian so that trip section pages with a layover show the layover ratio.]

State Abbreviations Graph

In a recent chat that I participated in, we were discussing US two-letter state abbreviations that were one letter off of each other (e.g., NY and NJ).

After that discussion, I was curious about whether it would be possible to step from any state abbreviation to any other by changing one letter at a time, using only valid states along the way. My first step was to determine if there were any state abbreviations which didn’t share a first or last letter with any other states, so I wrote a simple Ruby script to test that.

state_codes = %w(AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY)
state_codes.sort!
one_letter_changes = Hash.new()
state_codes.each do |sc|
  one_letter_changes[sc] = state_codes.select{|s| sc != s && (sc[0] == s[0] || sc[1] == s[1])}
  puts "#{sc}: #{one_letter_changes[sc].join(", ")}"
end
Matching States

So every state had at least one other state it could go to. Texas (TX) had the fewest, with only Tennessee (TN); Massachusetts (MA) had the most, as quite a few state codes start with M or end with A.

Now I needed to find out if all the states would connect to each other, or if there would be several distinct “neighborhoods” of states. I decided to do this visually by creating a graph, using the output of my script to draw the connections:

State One Letter Changes
Graph with US state abbreviations as the vertices, green lines connecting state abbreviations with the same first letter, and blue lines connecting state abbreviations with the same second letter

Based on this graph, it is possible for any state abbreviation to change to any other state abbreviation!

I was also curious about the number of steps needed to go between any pair of state abbreviations, so I wrote a path distance algorithm based on Dijkstra’s algorithm (but with each path having equal weight) to find the shortest number of hops between any pair:

state_codes = %w(AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY)
state_codes.sort!
one_letter_changes = Hash.new()
state_codes.each do |sc|
  one_letter_changes[sc] = state_codes.select{|s| sc != s && (sc[0] == s[0] || sc[1] == s[1])}
end

def path_distance(graph, source)
  vertexes = graph.keys
  return nil unless vertexes.include?(source)
  
  distance = Hash.new()
  previous_vertex = Hash.new()
  arbitrarily_large_distance = vertexes.length
  unvisited_vertices = Array.new

  vertexes.each do |v|
    distance[v] = arbitrarily_large_distance
    previous_vertex[v] = nil
    unvisited_vertices.push(v)
  end
  distance[source] = 0;

  while(unvisited_vertices.any?)
    min_distance_vertex = unvisited_vertices.min_by{|v| distance[v]}
    
    graph[min_distance_vertex].each do |neighbor|
      alt = distance[min_distance_vertex] + 1
      if alt < distance[neighbor]
        distance[neighbor] = alt
        previous_vertex[neighbor] = min_distance_vertex
      end
    end
    unvisited_vertices -= [min_distance_vertex]
  end

  return distance

end

state_codes.each do |code|
  values = path_distance(one_letter_changes, code).sort_by{|k,v| k}.reject{|k,v| k > code}.map{|d| d[1]}
  puts "#{code}  #{values.join("  ")}"
end
puts "   #{state_codes.join(" ")}"
State to State

Based on the results, the highest number of hops is 6 – so every state abbreviation can be changed into any other state abbreviation in at most six steps!

Switching Flight Historian to ICAO Regions

Early on during the development of Flight Historian, I realized that I’d have to do some filtering of my maps by region. Most of my travel is within the United States, so a world map of all of my flights left the United States as an unreadable mess of lines. Thus, I gave Flight Historian the ability to toggle between world maps (all flights) and CONUS maps (flights within the CONtiguous United States – that is, the United States except for Alaska, Hawaii, and territories).

Because of peculiarities with how the Great Circle Mapper generates maps, showing region maps wasn’t as simple as setting a map center and zoom level. Instead, I had to know which airports were inside the CONUS and which ones were outside. The easiest solution was to add an is_conus attribute to my Airports table, which would be set to true for CONUS airports and false for OCONUS (Outside CONUS) airports. Once I had that, I could set the world map to use every airport, and the CONUS map to show only airports where is_conus was true.

This worked well enough when I was only showing two regions (world and CONUS). But as I traveled, I realized I was going to want to zoom in on other regions (for example, Europe) as well, which meant that I’d have to have some way match airports to other regions.

Continue reading “Switching Flight Historian to ICAO Regions”

Flight Log Version 2.1: Import Flights from Digital Boarding Passes

In general, my Flight Historian has been a big time saver for me as far as tracking my flights – instead of manually generating reports and maps from an Excel file, I can simply add flights to a database and let it do all the work. However, as I’ve started tracking more details about my flights over time, the task of entering the flights has become less simple.

Screenshot 2017-04-09 21.59.30
There are currently 24 fields to fill out in the flight log. Not every field is required, but it can still take several minutes to fill out a flight.

Since I’d been working on parsing boarding pass barcode data, it seemed like a logical next step to write some sort of scanner that would read a boarding pass barcode and import the data as a new flight. Then one of my Twitter followers had a suggestion:

Continue reading “Flight Log Version 2.1: Import Flights from Digital Boarding Passes”

Creating Multiple Flash Messages in Ruby on Rails

On my Flight Historian application, a number of my pages make use of the flash and flash.now session messages capability for errors, warnings, successes, and informational messages. However, some of those pages needed to have multiple messages of the same type (e.g., multiple warnings), which flash didn’t allow me to do. Additionally, I had some views that were generating status messages of their own (for example, if a collection was empty on a page that had multiple collections), and so I ended up with several ways to generate messages that didn’t output consistent HTML.

Continue reading “Creating Multiple Flash Messages in Ruby on Rails”

Parsing Boarding Pass Dates in Ruby

The bar codes on paper or electronic boarding passes contain a good deal of data about a given flight. One of my goals for Flight Historian is to allow me to add a new flight by scanning the bar code, but in order to do that, I need to write a Ruby parser for the data in these boarding passes. This parser will accept bar code data, and return a collection of field names, values, and its interpretation of what those values mean.

One of the more difficult challenges I’m running into, though, is interpreting the date of the flight from the bar code.

Continue reading “Parsing Boarding Pass Dates in Ruby”

I’ve Rehosted my Portfolio

My Portfolio was the first non-tutorial Ruby on Rails application I created, and while it was functional, there were a lot of things I did wrong.

When I looked for somewhere to host my website, I was still looking at it from a PHP mindset – find a server somewhere that will run Rails, and FTP files up to that host as I create or update them. This meant that I wasn’t taking advantage of Git’s capability to push updates to a production server. Additionally, hosting a Rails application this way meant I was stuck on whatever version of Rails was present on that server – and the server I selected back in 2012 is still running Rails 3.2.11. With all of that, it was time to move my Portfolio to a more modern host.

I’d originally written my Flight Historian as a component of my Portfolio. Earlier this year, I’d split it off into its own separate application, and hosted that new application on Heroku. It was time to make this move for my Portfolio as well.

My portfolio still had a lot of vestigial code from the Flight Log, and there were a lot of changes I wanted to make to the layout and structure of the site (I particularly wanted to move the projects front and center to the home page of the site), so I decided to take the opportunity for a complete rewrite as a new rails application.

The new Portfolio can be found at the same old address:

https://www.pbogard.com/