SQL, pronounced "sequel" (or S-Q-L, if you prefer), is a critical tool for data scientists. In fact, it's arguably the most important language for getting a job in data. In this tutorial, we're going to dig into SQL basics from the perspective of a total beginner to get you up and running with this crucial skill.
Let's start by answering a simple question:
What is SQL?
SQL stands for Structured Query Language. A query language is a kind of programming language that's designed to facilitate retrieving specific information from databases, and that's exactly what SQL does. To put it simply, SQL is the language of databases.
That matters because most companies store their data in databases. And while there are many types of databases (like MySQL, PostgreSQL, Microsoft SQL Server), most of them speak SQL, so once you've got SQL basics under your belt, you'll be able to work with any of them.
Even if you're planning to do your analysis with another language like Python, at most companies, chances are you'll need to use SQL to retrieve the data you need from the company's database. As of this writing, there are more than 80,000 SQL jobs listed on Indeed in the US alone.
So let's get started with learning SQL!
(If you'd prefer to learn interactively, writing and running SQL queries from your browser, you should check out our SQL Fundamentals course, which you can start for free!)
In this tutorial we'll be working with a dataset from the bike-sharing service Hubway, which includes data on over 1.5 million trips made with the service.
We'll start by looking a little bit at databases, what they are and why we use them, before starting to write some queries of our own in SQL.
If you'd like to follow along you can download the
hubway.db file here (130 MB).
SQL Basics: Relational Databases
A relational database is a database that stores related information across multiple tables and allows you to query information in more than one table at the same time.
It's easier to understand how this works by thinking through an example. Imagine you're a business and you want to keep track of your sales information. You could set up a spreadsheet in Excel with all of the information you want to keep track of as separate columns: Order number, date, amount due, shipment tracking number, customer name, customer address, and customer phone number.
This setup would work fine for tracking the information you need to begin with, but as you start to get repeat orders from the same customer you'll find that their name, address and phone number gets stored in multiple rows of your spreadsheet.
As your business grows and the number of orders you're tracking increases, this redundant data will take up unnecessary space and generally decrease the efficiency of your sales tracking system. You might also run into issues with data integrity. There's no guarantee, for example, that every field will be populated with the correct data type or that the name and address will be entered exactly the same way every time.
With a relational database, like the one in the above diagram, you avoid all of these issues. You could set up two tables, one for orders and one for customers. The 'customers' table would include a unique ID number for each customer, along with the name, address and phone number we were already tracking. The 'orders' table would include your order number, date, amount due, tracking number and, instead of a separate field for each item of customer data, it would have a column for the customer ID.
This enables us to pull up all of the customer info for any given order, but we only have to store it once in our database rather than listing it out again for every single order.
Our Data Set
Let's start by taking a look at our database. The database has two tables,
stations. To begin with, we'll just look at the
trips table. It contains the following columns:
id— A unique integer that serves as a reference for each trip
duration— The duration of the trip, measured in seconds
start_date— The date and time the trip began
start_station— An integer that corresponds to the
idcolumn in the
stationstable for the station the trip started at
end_date— The date and time the trip ended
end_station— The 'id' of the station the trip ended at
bike_number— Hubway's unique identifier for the bike used on the trip
sub_type— The subscription type of the user.
"Registered"for users with a membership,
"Casual"for users without a membership
zip_code— The zip code of the user (only available for registered members)
birth_date— The birth year of the user (only available for registered members)
gender— The gender of the user (only available for registered members)
With this information and the SQL commands we'll learn shortly, here are some questions that we'll try to answer over the course of this post:
- What was the duration of the longest trip?
- How many trips were taken by 'registered' users?
- What was the average trip duration?
- Do registered or casual users take longer trips?
- Which bike was used for the most trips?
- What is the average duration of trips by users over the age of 30?
The SQL commands we'll use to answer these questions are:
Installation and Setup
For the purposes of this tutorial, we will be using a database system called SQLite3. SQLite has come as part of Python from version 2.5 onwards, so if you have Python installed you'll almost certainly have SQLite as well. Python and the SQLite3 library can easily be installed and set up with Anaconda if you don't already have them.
Using Python to run our SQL code allows us to import the results into a Pandas dataframe to make it easier to display our results in an easy to read format. It also means we can perform further analysis and visualization on the data we pull from the database, although that will be beyond the scope of this tutorial.
Alternatively, if we don't want to use or install Python, we can run SQLite3 from the command line. Simply download the "precompiled binaries" from the SQLite3 web page and use the following code to open the database:
~$ sqlite hubway.db SQLite version 3.14.0 2016-07-26 15:17:14Enter ".help" for usage hints.sqlite>
From here we can just type in the query we want to run and we will see the data returned in our terminal window.
An alternative to using the terminal is to connect to the SQLite database via Python. This would allow us to use a Jupyter notebook, so that we could see the results of our queries in a neatly formatted table.
To do this, we'll define a function that takes our query (stored as a string) as an input and shows the result as a formatted dataframe:
Of course, we don't have to use Python with SQL. If you're an R programmer already, our SQL Fundamentals for R Users course would be a great place to start.
import sqlite3 import pandas as pd db = sqlite3.connect('hubway.db') def run_query(query): return pd.read_sql_query(query,db)
The first command we'll work with is
SELECT will be the foundation of almost every query we write - it tells the database which columns we want to see. We can either specify columns by name (separated by commas) or use the wildcard
* to return every column in the table.
In addition to the columns we want to retrieve, we also have to tell the database which table to get them from. To do this we use the keyword
FROM followed by the name of the table. For example, if we wanted to see the
bike_number for every trip in the
trips table, we could use the following query:
SELECT start_date, bike_number FROM trips;
In this example, we started with the
SELECT command so that the database knows we want it to find us some data. Then we told the database we were interested in the
bike_number columns. Finally we used
FROM to let the database know that the columns we want to see are part of the
One important thing to be aware of when writing SQL queries is that we'll want to end every query with a semicolon (
;). Not every SQL database actually requires this, but some do, so it's best to form this habit.
The next command we need to know before we start to run queries on our Hubway database is
LIMIT simply tells the database how many rows you want it to return.
SELECT query we looked at in the previous section would return the requested information for every row in the
trips table, but sometimes that could mean a lot of data. We might not want all of it. If, instead, we wanted to see the
bike_number for the first five trips in the database, we could add
LIMIT to our query as follows:
SELECT start_date, bike_number FROM trips LIMIT 5;
We simply added the
LIMIT command and then a number representing the number of rows we want to be returned. In this instance we used 5, but you can replace that with any number to get the appropriate amount of data for the project you're working on.
We will use
LIMIT a lot in our queries on the Hubway database in this tutorial — the
trips table contains over a 1.5 million rows of data and we certainly don't need to display all of them!
Let's run our first query on the Hubway database. First we will store our query as a string and then use the function we defined earlier to run it on the database. Take a look at the following example:
query = 'SELECT * FROM trips LIMIT 5;' run_query(query)
|0||1||9||2011-07-28 10:12:00||23||2011-07-28 10:12:00||23||B00468||Registered||'97217||1976.0||Male|
|1||2||220||2011-07-28 10:21:00||23||2011-07-28 10:25:00||23||B00554||Registered||'02215||1966.0||Male|
|2||3||56||2011-07-28 10:33:00||23||2011-07-28 10:34:00||23||B00456||Registered||'02108||1943.0||Male|
|3||4||64||2011-07-28 10:35:00||23||2011-07-28 10:36:00||23||B00554||Registered||'02116||1981.0||Female|
|4||5||12||2011-07-28 10:37:00||23||2011-07-28 10:37:00||23||B00554||Registered||'97214||1983.0||Female|
This query uses
* as a wildcard instead of specifying columns to return. This means the
SELECT command has given us every column in the
trips table. We also used the
LIMIT function to restrict the output to the first five rows of the table.
You will often see that people capitalize the commmand keywords in their queries (a convention that we'll follow throughout this tutorial) but this is mostly a matter of preference. This capitalization makes the code easier to read, but it doesn't actually affect the code's function in any way. If you prefer to write your queries with lowercase commands, the queries will still execute correctly.
Our previous example returned every column in the
trips table. If we were only interested in the
start_date columns, we could replace the wildcard with the column names as follows:
query = 'SELECT duration, start_date FROM trips LIMIT 5' run_query(query)
The final command we need to know before we can answer the first of our questions is
ORDER BY. This command allows us to sort the database on a given column.
To use it, we simply specify the name of the column we would like to sort on. By default,
ORDER BY sorts in ascending order. If we would like to specify which order the database should be sorted, we can add the keyword
ASC for ascending order or
DESC for descending order.
For example, if we wanted to sort the
trips table from the shortest
duration to the longest we could add the following line to our query:
ORDER BY duration ASC
ORDER BY commands in our repertoire, we can now attempt to answer our first question: What was the duration of the longest trip?
To answer this question, it's helpful to break it down into sections and identify which commands we will need to address each part.
First we need to pull the information from the
duration column of the
trips table. Then, to find which trip is the longest, we can sort the
duration column in descending order. Here's how we might work this through to come up with a query that will get the information we're looking for:
SELECTto retrieve the
ORDER BYto sort the
durationcolumn and use the
DESCkeyword to specify that you want to sort in descending order
LIMITto restrict the output to 1 row
Using these commands in this way will return the single row with the longest duration, which will provide us the answer to our question.
One more thing to note — as your queries add more commands and get more complicated, you may find it easier to read if you separate them onto multiple lines. This, like capitalization, is a matter of personal preference. It doesn't affect how the code runs (the system just reads the code from the beginning until it reaches the semicolon), but it can make your queries clearer and easier to follow. In Python, we can separate a string onto multiple lines by using triple quote marks.
Let's go ahead and run this query and find out how long the longest trip lasted.
query = ''' SELECT duration FROM trips ORDER BY duration DESC LIMIT 1; ''' run_query(query)
Now we know that the longest trip lasted 9999 seconds, or a little over 166 minutes. With a maximum value of 9999, however, we don't know whether this is really the length of the longest trip or if the database was only set up to allow a four digit number.
If it's true that particularly long trips are being cut short by the database, then we might expect to see a lot of trips at 9999 seconds where they reach the limit. Let's try running the same query as before, but adjust the
LIMIT to return the 10 highest durations to see if that's the case:
query = ''' SELECT durationFROM trips ORDER BY duration DESC LIMIT 10 ''' run_query(query)
What we see here is that there aren't a whole bunch of trips at 9999, so it doesn't look like we're cutting off the top end of our durations, but it's still difficult to tell whether that's the real length of the trip or just the maximum allowed value.
Hubway charges additional fees for rides over 30 minutes (somebody keeping a bike for 9999 seconds would have to pay an extra $25 in fees) so it's plausible that they decided 4 digits would be sufficient to track the majority of rides.
The previous commands are great for pulling out sorted information for particular columns, but what if there is a specific subset of the data we want to look at? That's where
WHERE comes in. The
WHERE command allows us to use a logical operator to specify which rows should be returned. For example you could use the following command to return every trip taken with bike
WHERE bike_number = "B00400"
You'll also notice that we use quote marks in this query. That's because the
bike_number is stored as a string. If the column contained numeric data types the quote marks would not be necessary.
Let's write a query that uses
WHERE to return every column in the
trips table for each row with a
duration longer than 9990 seconds:
query = ''' SELECT * FROM trips WHERE duration > 9990; ''' run_query(query)
|0||4768||9994||2011-08-03 17:16:00||22||2011-08-03 20:03:00||24||B00002||Casual|
|1||8448||9991||2011-08-06 13:02:00||52||2011-08-06 15:48:00||24||B00174||Casual|
|2||11341||9998||2011-08-09 10:42:00||40||2011-08-09 13:29:00||42||B00513||Casual|
|3||24455||9995||2011-08-20 12:20:00||52||2011-08-20 15:07:00||17||B00552||Casual|
|4||55771||9994||2011-09-14 15:44:00||40||2011-09-14 18:30:00||40||B00139||Casual|
|5||81191||9993||2011-10-03 11:30:00||22||2011-10-03 14:16:00||36||B00474||Casual|
|6||89335||9997||2011-10-09 02:30:00||60||2011-10-09 05:17:00||45||B00047||Casual|
|7||124500||9992||2011-11-09 09:08:00||22||2011-11-09 11:55:00||40||B00387||Casual|
|8||133967||9996||2011-11-19 13:48:00||4||2011-11-19 16:35:00||58||B00238||Casual|
|9||147451||9996||2012-03-23 14:48:00||35||2012-03-23 17:35:00||33||B00550||Casual|
|10||315737||9995||2012-07-03 18:28:00||12||2012-07-03 21:15:00||12||B00250||Registered||'02120||1964||Male|
|11||319597||9994||2012-07-05 11:49:00||52||2012-07-05 14:35:00||55||B00237||Casual|
|12||416523||9998||2012-08-15 12:11:00||54||2012-08-15 14:58:00||80||B00188||Casual|
|13||541247||9999||2012-09-26 18:34:00||54||2012-09-26 21:21:00||54||T01078||Casual|
As we can see, this query returned 14 different trips, each with a duration of 9990 seconds or more. Something that stands out about this query is that all but one of the results has a
"Casual". Perhaps this is an indication that
"Registered" users are more aware of the extra fees for long trips. Maybe Hubway could do a better job of conveying their pricing structure to Casual users to help them avoid overage charges.
We can already see how even a beginner-level command of SQL can help us answer business questions and find insights in our data.
WHERE, we can also combine multiple logical tests in our
WHERE clause using
OR. If, for example, in our previous query we had only wanted to return the trips with a
duration over 9990 seconds that also had a
sub_type of Registered, we could use
AND to specify both conditions.
Here's another personal preference recommendation: use parentheses to separate each logical test, as demonstrated in the code block below. This isn't strictly required for the code to function, but parentheses make your queries easier to understand as you increase the complexity.
Let's run that query now. We already know it should only return one result, so it should be easy to check that we've got it right:
query = ''' SELECT * FROM trips WHERE (duration >= 9990) AND (sub_type = "Registered") ORDER BY duration DESC; ''' run_query(query)
|0||315737||9995||2012-07-03 18:28:00||12||2012-07-03 21:15:00||12||B00250||Registered||'02120||1964.0||Male|
The next question we set out at the beginning of the post is "How many trips were taken by 'registered' users?" To answer it, we could run the same query as above and modify the
WHERE expression to return all of the rows where
sub_type is equal to
'Registered' and then count them up.
However, SQL actually has a built-in command to do that counting for us,
COUNT allows us to shift the calculation to the database and save us the trouble of writing additional scripts to count up results. To use it, we simply include
COUNT(column_name) instead of (or in addition to) the columns you want to
SELECT, like this:
SELECT COUNT(id) FROM trips
In this instance, it doesn't matter which column we choose to count because every column should have data for each row in our query. But sometimes a query might have missing (or "null") values for some rows. If we're not sure whether a column contains null values we can run our
COUNT on the
id column — the
id column is never null, so we can be sure our count won't have missed anything.
We can also use
COUNT(*) to count up every row in our query. It's worth noting that sometimes we might actually want to run
COUNT on a column with null values. For example, we might want to know how many rows in our database have missing values for a column.
Let's take a look at a query to answer our question. We can use
SELECT COUNT(*) to count up the total number of rows returned and
WHERE sub_type = "Registered" to make sure we only count up the trips taken by Registered users.
query = ''' SELECT COUNT(*)FROM trips WHERE sub_type = "Registered"; ''' run_query(query)
This query worked, and has returned the answer to our question. But the column heading isn't particularly descriptive. If someone else were to look at this table, they wouldn't be able to understand what it meant.
If we want to make our results more readable, we can use
AS to give our output an alias (or nickname). Let's re-run the previous query but give our column heading an alias of
Total Trips by Registered Users:
query = ''' SELECT COUNT(*) AS "Total Trips by Registered Users" FROM trips WHERE sub_type = "Registered"; ''' run_query(query)
|Total Trips by Registered Users|
COUNT is not the only mathematical trick SQL has up its sleeves. We can also use
MAX to return the sum, average, minimum and maximum of a column respectively. These, along with
COUNT, are known as aggregate functions.
So to answer our third question, "What was the average trip duration?", we can use the
AVG function on the
duration column (and, once again, use
AS to give our output column a more descriptive name):
query = ''' SELECT AVG(duration) AS "Average Duration" FROM trips; ''' run_query(query)
It turns out that the average trip duration is 912 seconds, which is about 15 minutes. This makes some sense, since we know that Hubway charges extra fees for trips over 30 minutes. The service is designed for riders to take short, one-way trips.
What about our next question, do registered or casual users take longer trips? We already know one way to answer this question — we could run two
SELECT AVG(duration) FROM trips queries with
WHERE clauses that restrict one to
"Registered" and one to
Let's do it a different way, though. SQL also includes a way to answer this question in a single query, using the
GROUP BY command.
GROUP BY separates rows into groups based on the contents of a particular column and allows us to perform aggregate functions on each group.
To get a better idea of how this works, let's take a look at the
gender column. Each row can have one of three possible values in the
Null (missing; we don't have
gender data for casual users).
When we use
GROUP BY, the database will separate out each of the rows into a different group based on the value in the
gender column, in much the same way that we might separate a deck of cards into different suits. We can imagine making two piles, one of all the males, one of all the females.
Once we have our two separate piles, the database will perform any aggregate functions in our query on each of them in turn. If we used
COUNT, for example, the query would count up the number of rows in each pile and return the value for each separately.
Let's walk through exactly how to write a query to answer our question of whether registered or casual users take longer trips.
- As with each of our queries so far, we'll start with
SELECTto tell the database which information we want to see. In this instance, we'll want
- We'll also include
GROUP BY sub_typeto separate out our data by subscription type and calculate the averages of registered and casual users separately.
Here's what the code looks like when we put it all together:
query = ''' SELECT sub_type, AVG(duration) AS "Average Duration" FROM trips GROUP BY sub_type; ''' run_query(query)
That's quite a difference! On average, registered users take trips that last around 11 minutes whereas casual users are spending almost 25 minutes per ride. Registered users are likely taking shorter, more frequent trips, possibly as part of their commute to work. Casual users, on the other hand, are spending around twice as long per trip.
It's possible that casual users tend to come from demographics (tourists, for example) that are more inclined to take longer trips make sure they get around and see all the sights. Once we've discovered this difference in the data, there are many ways the company might be able to investigate it to better understand what's causing it.
For the purposes of this tutorial, however, let's move on. Our next question was which bike was used for the most trips?. We can answer this using a very similar query. Take a look at the following example and see if you can figure out what each line is doing — we'll go through it step by step afterwards so you can check you got it right:
query = ''' SELECT bike_number as "Bike Number", COUNT(*) AS "Number of Trips" FROM trips GROUP BY bike_number ORDER BY COUNT(*) DESC LIMIT 1; ''' run_query(query)
|Bike Number||Number of Trips|
As you can see from the output, bike
B00490 took the most trips. Let's run through how we got there:
- The first line is a
SELECTclause to tell the database we want to see the
bike_numbercolumn and a count of every row. It also uses
ASto tell the database to display each column with a more useful name.
- The second line uses
FROMto specify that the data we're looking for is in the
- The third line is where things start to get a little tricky. We use
GROUP BYto tell the
COUNTfunction on line 1 to count up each value for
- On line four we have an
ORDER BYclause to sort the table in descending order and make sure our most-used bike is at the top.
- Finally we use
LIMITto restrict the output to the first row, which we know will be the bike that was used in the highest number of trips because of how we sorted the data on line four.
Our final question is a little more tricky than the others. We want to know the average duration of trips by registered members over the age of 30.
We could just figure out the year in which 30 year olds were born in our heads and then plug it in, but a more elegant solution is to use arithmetic operations directly within our query. SQL allows us to use
/ to perform an arithmetic operation on an entire column at once.
query = ''' SELECT AVG(duration) FROM trips WHERE (2017 - birth_date) > 30; ''' run_query(query)
So far we've been looking at queries that only pull data from the
trips table. However, one of the reasons SQL is so powerful is that it allows us to pull data from multiple tables in the same query.
Our bike-sharing database contains a second table,
stations table contains information about every station in the Hubway network and includes an
id column that is referenced by the
Before we start to work through some real examples from this database, though, let's look back at the hypothetical order tracking database from earlier. In that database we had two tables,
customers, and they were connected by the
Let's say we wanted to write a query that returned the
name for every order in the database. If they were both stored in the same table, we could use the following query:
SELECT order_number, name FROM orders;
order_number column and the
name column are stored in two different tables, so we have to add a few extra steps. Let's take a moment to think through the additional things the database will need to know before it can return the information we want:
- Which table is the
- Which table is the
- How is the information in the
orderstable connected to the information in the
To answer the first two of these questions, we can include the table names for each column in our
SELECT command. The way we do this is simply to write the table name and column name separated by a
.. For example, instead of
SELECT order_number, name we would write
SELECT orders.order_number, customers.name. Adding the table names here helps the database to find the columns we're looking for by telling it which table to look in for each.
To tell the database how the
customers tables are connected, we use
JOIN specifies which tables should be connected and
ON specifies which columns in each table are related.
We're going to use an inner join, which means that rows will only be returned where there is a match in the columns specified in
ON. In this example, we will want to use
JOIN on whichever table we didn't include in the
FROM command. So we can either use
FROM orders INNER JOIN customers or
FROM customers INNER JOIN orders.
As we discussed earlier, these tables are connected on the
customer_id column in each table. Therefore, we will want to use
ON to tell the database that these two columns refer to the same thing like this:
ON orders.customer_ID = customers.customer_id
Once again we use the
. to make sure the database knows which table each of these columns is in. So when we put all of this together, we get a query that looks like this:
SELECT orders.order_number, customers.name FROM orders INNER JOIN customers ON orders.customer_id = customers.customer_id
This query will return the order number of every order in the database along with the customer name that is associated with each.
Returning to our Hubway database, we can now write some queries to see
JOIN in action.
Before we get started, we should take a look at the rest of the columns in the
stations table. Here's a query to show us the first 5 rows so we can see what the
stations table looks like:
query = ''' SELECT * FROM stations LIMIT 5; ''' run_query(query)
|0||3||Colleges of the Fenway||Boston||42.340021||-71.100812|
|1||4||Tremont St. at Berkeley St.||Boston||42.345392||-71.069616|
|2||5||Northeastern U / North Parking Lot||Boston||42.341814||-71.090179|
|3||6||Cambridge St. at Joy St.||Boston||42.361284999999995||-71.06514|
id— A unique identifier for each station (corresponds to the
end_stationcolumns in the
station— The station name
municipality— The municipality that the station is in (Boston, Brookline, Cambridge or Somerville)
lat— The latitude of the station
lng— The longitude of the station
- Which stations are most frequently used for round trips?
- How many trips start and end in different municipalities?
Like before, we'll try to answer some questions in the data, starting with which station is the most frequent starting point? Let's work through it step by step:
- First we want to use
SELECTto return the
stationcolumn from the
stationstable and the
COUNTof the number of rows.
- Next we specify the tables we want to
JOINand tell the database to connect them
start_stationcolumn in the
tripstable and the
idcolumn in the
- Then we get into the meat of our query - we
stationcolumn in the
stationstable so that our
COUNTwill count up the number of trips for each station separately
- Finally we can
LIMITthe output to a manageable number of results
query = ''' SELECT stations.station AS "Station", COUNT(*) AS "Count" FROM trips INNER JOIN stations ON trips.start_station = stations.idGROUP BY stations.stationORDER BY COUNT(*) DESC LIMIT 5; ''' run_query(query)
|0||South Station - 700 Atlantic Ave.||56123|
|1||Boston Public Library - 700 Boylston St.||41994|
|2||Charles Circle - Charles St. at Cambridge St.||35984|
|3||Beacon St / Mass Ave||35275|
|4||MIT at Mass Ave / Amherst St||33644|
If you're familiar with Boston, you'll understand why these are the most popular stations. South Station is one of the main commuter rail stations in the city, Charles Street runs along the river close to some nice scenic routes, and Boylston and Beacon streets are right downtown near a number of office buildings.
The next question we'll look at is which stations are most frequently used for round trips? We can use much the same query as before. We will
SELECT the same output columns and
JOIN the tables in the same way, but this time we'll add a
WHERE clause to restrict our
COUNT to trips where the
start_station is the same as the
query = '''SELECT stations.station AS "Station", COUNT(*) AS "Count" FROM trips INNER JOIN stations ON trips.start_station = stations.id WHERE trips.start_station = trips.end_station GROUP BY stations.station ORDER BY COUNT(*) DESC LIMIT 5; ''' run_query(query)
|0||The Esplanade - Beacon St. at Arlington St.||3064|
|1||Charles Circle - Charles St. at Cambridge St.||2739|
|2||Boston Public Library - 700 Boylston St.||2548|
|3||Boylston St. at Arlington St.||2163|
|4||Beacon St / Mass Ave||2144|
As we can see, a number of these stations are the same as the previous question but the amounts are much lower. The busiest stations are still the busiest stations, but the lower numbers overall suggest that people are typically using Hubway bikes to get from point A to point B rather than cycling around for a while before returning to where they started.
There is one significant difference here — the Esplande, which was not one of the overall busiest stations from our first query, appears to be the busiest for round trips. Why? Well, a picture is worth a thousand words. This certainly looks like a nice spot for a bike ride:
On to the next question: how many trips start and end in different municipalities? This question takes things a step further. We want to know how many trips start and end in a different
municipality. To achieve this, we need to
trips table to the
stations table twice. Once
start_station column and then
In order to do this, we have to create an alias for the
stations table so that we are able to differentiate between data that relates to the
start_station and data that relates to the
end_station. We can do this in exactly the same way we've been creating aliases for individual columns to make them display with a more intuitive name, using
For example we can use the following code to
stations table to the
trips table using an alias of 'start'. We can then combine 'start' with our column names using
. to refer to data that comes from this specific
JOIN (rather than the second
JOIN we will do
INNER JOIN stations AS start ON trips.start_station = start.id
Here's what the final query will look like when we run it. Note that we've used
<> to represent "is not equal to", but
!= would also work.
query = ''' SELECT COUNT(trips.id) AS "Count" FROM trips INNER JOIN stations AS start ON trips.start_station = start.id INNER JOIN stations AS end ON trips.end_station = end.id WHERE start.municipality <> end.municipality; ''' run_query(query)
This shows that about 300,000 out of 1.5 million trips (or 20%) ended in a different municipality than they started — further evidence that people mostly use Hubway bicycles for relatively short journeys rather than longer trips between towns.
If you've made it this far, congratulations! You've begun to master the basics of SQL. We have covered a number of important commands,
GROUP BY and
JOIN, as well as aggregate and arithmetic functions. These will give you a strong foundation to build on as you continue your SQL journey.
After finishing this beginner SQL tutorial, you should now be able to pick up a database you find interesting and write queries to pull out information. A good first step might be to continue working with the Hubway database to see what else you can find out. Here are some other questions you might want to try and answer:
- How many trips incurred additional fees (lasted longer than 30 minutes)?
- Which bike was used for the longest total time?
- Did registered or casual users take more round trips?
- Which municipality had the longest average duration?
If you would like to take things a step further, check out our interactive SQL courses, which cover everything you'll need to know from beginning to advanced-level SQL for data analyst and data scientist jobs. Look under the SQL menu on our data science courses page for a full listing of all our interactive SQL course offerings.
You also might want to read our post about exporting the data from your SQL queries into Pandas or check out our SQL Intermediate Tutorial.
Professional Pensions Analyst, Amateur Photographer, Boston-Area Sports Fan