The Python scientific stack is fairly mature, and there are libraries for a variety of use cases, including machine learning and data analysis. Data visualization is an important part of exploring data and communicating results, but in the past Python has lagged a bit behind other tools such as R in this area.
Luckily, many new Python data visualization libraries have been created in the past few years to close the gap. matplotlib has emerged as the main data visualization library, but there are also libraries such as vispy, bokeh, seaborn, pygal, folium, and networkx that either build on matplotlib or have functionality that it doesn’t support.
In this post, we’ll use a real-world dataset, and use each of these libraries to make visualizations. As we do that, we’ll discover what areas each library is best in, and how to leverage the Python data visualization ecosystem most effectively.
At Dataquest, we’ve built an interactive course that teaches you about Python data visualization tools. If you want to learn in more depth, check it out here.
Exploring the dataset
Before we dive into visualizing the data, let’s take a quick look at the dataset we’ll be working with. We’ll be using route, airport, and airline data from openflights. Each row in the route data corresponds to an airline route between two airports. Each row in the airport data corresponds to an airport in the world, and has information about it. Each row in the airline data represents a single airline.
We first read in the data:
The data doesn’t have column headers, so we add them in by assigning to the columns attribute. We want to read every column in as a string – this will make comparing across dataframes easier later, when we want to match rows based on id. We do this by setting the dtype parameter when reading in the data.
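As a minimal sketch of the loading step described above, the snippet below reads two of the airport rows shown in this post from an in-memory buffer. The column names and the idea of a file such as airports.csv are assumptions made for illustration, not necessarily the names used in the original code.

```python
import io
import pandas as pd

# Two rows copied from the airport table in this post. A real run would pass
# a file path (for example a hypothetical "airports.csv") instead of a buffer.
raw = io.StringIO(
    "1,Goroka,Goroka,Papua New Guinea,GKA,AYGA,-6.081689,145.391881,5282,10,U,Pacific/Port_Moresby\n"
    "2,Madang,Madang,Papua New Guinea,MAG,AYMD,-5.207083,145.788700,20,10,U,Pacific/Port_Moresby\n"
)

# header=None: the data has no header row. dtype=str: keep every value as text.
airports = pd.read_csv(raw, header=None, dtype=str)

# Assumed column names, chosen to match the values in each column.
airports.columns = [
    "id", "name", "city", "country", "code", "icao",
    "latitude", "longitude", "altitude", "offset", "dst", "timezone",
]

print(airports.head())
```

Because every column is read as a string, ids from different dataframes can be compared directly without worrying about mismatched numeric types.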
We can take a quick look at each dataframe:
Airports:
| 0 | 1 | Goroka | Goroka | Papua New Guinea | GKA | AYGA | -6.081689 | 145.391881 | 5282 | 10 | U | Pacific/Port_Moresby |
| 1 | 2 | Madang | Madang | Papua New Guinea | MAG | AYMD | -5.207083 | 145.788700 | 20 | 10 | U | Pacific/Port_Moresby |
| 2 | 3 | Mount Hagen | Mount Hagen | Papua New Guinea | HGU | AYMH | -5.826789 | 144.295861 | 5388 | 10 | U | Pacific/Port_Moresby |
| 3 | 4 | Nadzab | Nadzab | Papua New Guinea | LAE | AYNZ | -6.569828 | 146.726242 | 239 | 10 | U | Pacific/Port_Moresby |
| 4 | 5 | Port Moresby Jacksons Intl | Port Moresby | Papua New Guinea | POM | AYPY | -9.443383 | 147.220050 | 146 | 10 | U | Pacific/Port_Moresby |

Airlines:
| 1 | 2 | 135 Airways | \N | NaN | GNL | GENERAL | United States | N |
| 2 | 3 | 1Time Airline | \N | 1T | RNX | NEXTIME | South Africa | Y |
| 3 | 4 | 2 Sqn No 1 Elementary Flying Training School | \N | NaN | WYT | NaN | United Kingdom | N |
| 4 | 5 | 213 Flight Unit | \N | NaN | TFU | NaN | Russia | N |
We can do a variety of interesting explorations with each dataset individually, but it’s through combining them that we’ll see the most gains. Pandas will aid us as we do our analysis because it can easily filter dataframes or apply functions across them. We’ll dive into a few interesting metrics, such as analyzing airlines and routes.
Before we can do so, we need to do a bit of data cleaning:
This line ensures that we have only numeric data in the column being cleaned.
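One way such a cleaning step might look is sketched below. The column name airline_id and the toy values are assumptions; the `\N` marker is the missing-value placeholder visible in the airline table above.

```python
import pandas as pd

# Tiny stand-in for the routes dataframe. "airline_id" is an assumed column
# name and the values are made up for illustration.
routes = pd.DataFrame({
    "airline_id": ["410", "\\N", "2009"],
    "dest_id": ["2990", "2962", "2965"],
})

# Keep only the rows whose airline id consists entirely of digits,
# which drops the "\N" placeholder.
routes = routes[routes["airline_id"].str.isdigit()]

print(routes)
```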
Making a histogram
Now that we understand the structure of the data, we can go ahead and start making plots to explore it. For our first plot, we’ll use matplotlib. matplotlib is a relatively low-level plotting library in the Python stack, so it generally takes more commands to make nice-looking plots than it does with other libraries. On the other hand, you can make almost any kind of plot with matplotlib. It’s very flexible, but that flexibility comes at the cost of verbosity.
We’ll first make a histogram showing the distribution of route lengths by airlines. A histogram divides all the route lengths into ranges (or “bins”), and counts how many routes fall into each range. This can tell us whether airlines fly more short routes or more long ones.
In order to do this, we need to first calculate route lengths. The first step is a distance formula. We’ll use haversine distance, which calculates the distance between latitude, longitude pairs.
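A minimal sketch of the haversine formula follows. It assumes a mean Earth radius of 6371 km and accepts strings or numbers, since the data in this post is read in as strings; the function name and argument order are illustrative choices.

```python
import math

EARTH_RADIUS_KM = 6371  # mean Earth radius; an assumed constant

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two (lon, lat) points."""
    # Convert to floats first (the values may be strings), then to radians.
    lon1, lat1, lon2, lat2 = (math.radians(float(v)) for v in (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Goroka to Madang, using the coordinates from the airport table above.
print(haversine("145.391881", "-6.081689", "145.788700", "-5.207083"))
```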
Then we can make a function that calculates the distance between the source and dest airports for a single route. To do this, we need to get the source and destination airport ids (such as dest_id) from the routes dataframe, then match them up with the id column in the airports dataframe to get the latitude and longitude of those airports. Then, it’s just a matter of doing the calculation. Here’s the function:
The function can fail if there’s an invalid value in the source or destination id columns (such as dest_id), so we’ll add in a try/except block to catch these.
Finally, we’ll use pandas to apply the distance calculation function across the routes dataframe. This will give us a pandas series containing all the route lengths. The route lengths are all in kilometers.
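The lookup, the try/except guard, and the apply call described above can be sketched together as follows. The column name source_id, the function name calc_dist, and the choice to return NaN for a failed lookup are all assumptions made for this illustration; the toy routes include one deliberately invalid id.

```python
import math
import pandas as pd

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in km, assuming a 6371 km Earth radius."""
    lon1, lat1, lon2, lat2 = (math.radians(float(v)) for v in (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))

# Two airports from the table in this post, with every value kept as a string.
airports = pd.DataFrame({
    "id": ["1", "2"],
    "latitude": ["-6.081689", "-5.207083"],
    "longitude": ["145.391881", "145.788700"],
})

# Toy routes; "source_id" is an assumed column name, and the second row
# has an invalid destination id on purpose.
routes = pd.DataFrame({"source_id": ["1", "1"], "dest_id": ["2", "\\N"]})

def calc_dist(row):
    """Return the length of one route in km, or NaN if a lookup fails."""
    try:
        source = airports[airports["id"] == row["source_id"]].iloc[0]
        dest = airports[airports["id"] == row["dest_id"]].iloc[0]
        return haversine(source["longitude"], source["latitude"],
                         dest["longitude"], dest["latitude"])
    except (IndexError, ValueError):
        # IndexError: no airport matched the id. ValueError: bad coordinate.
        return float("nan")

# axis=1 passes each row of the dataframe to calc_dist.
route_lengths = routes.apply(calc_dist, axis=1)
print(route_lengths)
```

Returning NaN keeps failed lookups distinguishable from genuinely short routes; they can be removed with dropna before plotting.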
Now that we have a series of route lengths, we can create a histogram, which will bin the values into ranges and count how many routes fall into each range:
We import the matplotlib plotting functions with import matplotlib.pyplot as plt. We then set up matplotlib to show plots in an IPython notebook with %matplotlib inline. Finally, we can make a histogram with plt.hist(route_lengths, bins=20). As we can see, airlines fly more short routes than long routes.
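Here is a small sketch of the histogram step that runs outside a notebook. The route lengths are made-up stand-in values, and the Agg backend takes the place of %matplotlib inline so that no display is needed.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; in a notebook, %matplotlib inline plays this role
import matplotlib.pyplot as plt

# Made-up route lengths in kilometers, standing in for the real series.
route_lengths = [120, 340, 410, 560, 800, 950, 1200, 2500, 5400, 9800]

fig, ax = plt.subplots()
# hist returns the count per bin, the bin edges, and the drawn bars.
counts, bin_edges, _ = ax.hist(route_lengths, bins=20)
ax.set_xlabel("Route length (km)")
ax.set_ylabel("Number of routes")
```

With 20 bins there are 21 bin edges, and the counts add up to the number of routes.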
We can make a similar plot with seaborn, a higher-level plotting library for Python. Seaborn builds on matplotlib and makes certain types of plots, usually having to do with statistical work, simpler. We can use the distplot function to plot a histogram with a kernel density estimate on top of it. A kernel density estimate is a curve – essentially a smoothed version of the histogram that makes patterns easier to see.
As you can see, seaborn also has nicer default styles than matplotlib. Seaborn doesn’t have its own version of every matplotlib plot, but it’s a quick way to get good-looking plots that go into more depth than default matplotlib charts. It’s also a good library if you need to do more statistical work.
Histograms are great, but maybe we want to see the average route length by airline. We can instead use a bar chart, with one bar per airline showing the average length of its routes. This will let us see which carriers are regional, and which are international. We can use pandas, a Python data analysis library, to figure out the average route length per airline.
We first make a new dataframe with the route lengths and the airline ids. We split route_length_df into groups based on the airline_id, essentially making one dataframe per airline. We then use the pandas aggregate function to take the mean of the length column in each airline dataframe, and recombine each result into a new dataframe. We then sort the dataframe so that the airlines with the longest average routes come first.
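The split, aggregate, and sort steps can be sketched as below. The values are made up, and sorting by average length in descending order is an assumption about the intended ordering.

```python
import pandas as pd

# Made-up route lengths (km) paired with airline ids.
route_length_df = pd.DataFrame({
    "length": [100.0, 300.0, 5000.0, 7000.0, 900.0],
    "airline_id": ["1", "1", "2", "2", "3"],
})

# Group the rows by airline, then take the mean length within each group.
airline_route_lengths = route_length_df.groupby("airline_id").aggregate({"length": "mean"})

# Put the airlines with the longest average routes first.
airline_route_lengths = airline_route_lengths.sort_values("length", ascending=False)

print(airline_route_lengths)
```

The result is indexed by airline_id, with a single length column holding each airline's average.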
We can then plot this out with matplotlib:
The plt.bar method plots each airline against the average route length it flies.
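A small sketch of the bar chart follows, using made-up averages. Bars are positioned by row number, which is one simple way to lay them out and is an assumption rather than the original code.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up average route lengths (km), one row per airline.
airline_route_lengths = pd.DataFrame(
    {"length": [6000.0, 900.0, 200.0]},
    index=pd.Index(["2", "3", "1"], name="airline_id"),
)

fig, ax = plt.subplots()
# One bar per airline: x is the row position, height is the average length.
bars = ax.bar(range(len(airline_route_lengths)), airline_route_lengths["length"])
ax.set_xlabel("Airline (row position)")
ax.set_ylabel("Average route length (km)")
```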
The problem with the plot above is that we can’t easily see which airline has what route length. In order to do this, we’ll need to be able to see the axis labels. This is a bit tough since there are so many airlines. One way to make this easier to work with is to make the plot interactive, which will allow us to zoom in and out to see the labels. We can use the bokeh library for this – it makes it simple to make interactive, zoomable plots.
To use bokeh, we’ll need to preprocess our data first:
The code above will get the names for each row in airline_route_lengths, and add in the name column, which contains the name of each airline. We also add in the id column so we can do this lookup (the apply function doesn’t pass in an index).
Finally, we reset the index column to have all unique values. Bokeh doesn’t work properly without this.
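The preprocessing described above might look like the following sketch. The airline names come from the airline table in this post; the lengths and the helper name lookup_name are made up for illustration.

```python
import pandas as pd

# Two airlines from the table in this post.
airlines = pd.DataFrame({"id": ["2", "3"], "name": ["135 Airways", "1Time Airline"]})

# Made-up average route lengths, indexed by airline id as after a groupby.
airline_route_lengths = pd.DataFrame(
    {"length": [1200.0, 850.0]},
    index=pd.Index(["3", "2"], name="airline_id"),
)

def lookup_name(row):
    """Return the airline name for a row's id, or an empty string if missing."""
    try:
        return airlines.loc[airlines["id"] == row["id"], "name"].iloc[0]
    except IndexError:
        return ""

# Copy the index into a column so the row passed to apply carries the id.
airline_route_lengths["id"] = airline_route_lengths.index
airline_route_lengths["name"] = airline_route_lengths.apply(lookup_name, axis=1)

# Replace the index with unique consecutive integers.
airline_route_lengths.index = range(len(airline_route_lengths))

print(airline_route_lengths)
```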
Now, we can move on to the charting piece:
We call output_notebook to set up bokeh to show a plot in an IPython notebook. Then, we make a bar plot, using our dataframe and certain columns. Finally, the show function shows the plot.
With this plot, we can zoom in and see which airlines fly the longest routes. The image above makes the labels look crunched together, but they are much easier to see as you zoom in.
Horizontal bar charts
Pygal is a Python charting library that makes attractive charts quickly. We can use it to make a breakdown of routes by length. We’ll first divide our routes into short, medium, and long, and calculate what percentage of our route lengths falls into each category.
We can then plot each one as a bar in a pygal horizontal bar chart:
Above, we first create an empty chart. Then, we add elements, including a title and bars. Each bar is passed a percentage value (out of 100) showing how common that type of route is.
Finally, we render the chart to a file, and use IPython’s SVG display capabilities to load and show the file. This plot looks quite a bit nicer than the default matplotlib charts, but we did need to write more code to create it. Pygal may be good for small presentation-quality graphics.
Scatter plots enable us to compare columns of data. We can make a simple scatter plot to compare airline id number to length of airline names:
First we calculate the length of each name by using the pandas apply method. This gives the number of characters in each airline name.
We then make a scatter plot comparing the airline ids to the name lengths using matplotlib. When we plot, we convert the id column of airlines to an integer type. If we don’t do this, the plot won’t work, as it needs numeric values on the x-axis. We can see that quite a few of the longer names appear in the earlier ids. This may mean that airlines founded earlier tend to have longer names.
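The two steps above can be sketched with the four airlines shown in the table earlier in this post. The Agg backend is only there so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Airline ids and names copied from the airline table in this post.
airlines = pd.DataFrame({
    "id": ["2", "3", "4", "5"],
    "name": [
        "135 Airways",
        "1Time Airline",
        "2 Sqn No 1 Elementary Flying Training School",
        "213 Flight Unit",
    ],
})

# Number of characters in each airline name.
name_lengths = airlines["name"].apply(lambda x: len(str(x)))

fig, ax = plt.subplots()
# The ids were read as strings, so convert them to integers for the x-axis.
ax.scatter(airlines["id"].astype(int), name_lengths)
ax.set_xlabel("Airline id")
ax.set_ylabel("Name length (characters)")
```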
We can verify this hunch using seaborn. Seaborn has an augmented version of a scatterplot, a joint plot, that shows how correlated the two variables are, as well as the individual distributions of each.
The above plot shows that there isn’t any real correlation between the two variables – the r squared value is low.
Our data is inherently a good fit for mapping – we have latitude and longitude pairs for airports, and for source and destination airports.
The first map we can make is one that shows all the airports all over the world. We can do this with the basemap extension to matplotlib. This enables drawing world maps and adding points, and is very customizable.
In the above code, we first draw a map of the world, using a mercator projection. A mercator projection is a way to project the curved surface of the globe onto a 2-d surface. Then, we draw the airports on top of the map, using red dots.
The problem with the above map is that it’s hard to see where each airport is – they just kind of merge into a red blob in areas with high airport density.
Just like with bokeh, there’s an interactive mapping library, folium, we can use to zoom into the map and help us find individual airports.
Folium uses leaflet.js to make a fully interactive map. You can click on each airport to see the name in the popup. A screenshot is shown above, but the actual map is much more impressive. Folium also lets you modify options pretty extensively to make nicer markers, or add more things to the map.
Drawing great circles
It would be pretty cool to see all the air routes on a map. Luckily, we can use basemap to do this. We’ll draw great circles connecting source and destination airports. Each circle will show the route of a single airliner. Unfortunately, there are so many routes that showing them all would be a mess. Instead, we’ll show the first
The above code will draw a map, then draw the routes on top of it. We add in some filters to prevent overly long routes from obscuring the others.
Drawing network diagrams
The final exploration we’ll do is drawing a network diagram of airports. Each airport will be a node in the network, and we’ll draw edges between nodes if there’s a route between the airports. If there are multiple routes, we’ll add to the edge weight, to show that the airports are more connected. We’ll use the networkx library to do this.
First, we’ll need to compute the edge weights between airports.
Once the above code finishes running, the weights dictionary contains every edge between two airports that has a weight of 2 or more. So any airports that are connected by 2 or more routes will appear.
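One simple way to compute such edge weights is sketched below with a Counter. This is an illustrative alternative, not the original code: routes in either direction between the same two airports count toward one edge, and only edges with a weight of 2 or more are kept. The airport id pairs are made up.

```python
from collections import Counter

# Made-up (source id, destination id) pairs standing in for the routes data.
route_pairs = [("1", "2"), ("2", "1"), ("1", "3"), ("2", "4"), ("4", "2"), ("2", "4")]

# Sort each pair so that A->B and B->A map to the same undirected edge.
counts = Counter(tuple(sorted(pair)) for pair in route_pairs)

# Keep only the edges backed by at least two routes.
weights = {edge: n for edge, n in counts.items() if n >= 2}

print(weights)
```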
Now, we need to draw the graph.
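A minimal sketch of the drawing step with networkx follows. The weights dictionary, its tuple keys, and the styling values are assumptions made for illustration; a spring layout is just one of several layouts networkx offers.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs outside a notebook
import networkx as nx

# Made-up edge weights keyed by (airport id, airport id).
weights = {("1", "2"): 2, ("2", "4"): 3}

graph = nx.Graph()
for (source, dest), weight in weights.items():
    # add_edge creates both nodes if they are not in the graph yet.
    graph.add_edge(source, dest, weight=weight)

# Compute node positions, then draw nodes and edges on the current axes.
pos = nx.spring_layout(graph, seed=0)
nx.draw_networkx_nodes(graph, pos, node_color="red", node_size=100)
# Thicker lines for airport pairs connected by more routes.
edge_widths = [graph[a][b]["weight"] for a, b in graph.edges()]
nx.draw_networkx_edges(graph, pos, width=edge_widths)
```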
There has been a proliferation of Python libraries for data visualization, and it’s possible to make almost any kind of visualization. Most libraries build on matplotlib and make certain use cases simpler. If you want to learn in more depth how to visualize data using matplotlib, seaborn, and other tools, check out our interactive course here.