October 3, 2016

Working with SQLite Databases using Python and Pandas

SQLite is a database engine that makes it simple to store and work with relational data. Much like the csv format, SQLite stores data in a single file that can be easily shared with others. Most programming languages and environments have good support for working with SQLite databases. Python is no exception, and a library to access SQLite databases, called sqlite3, has been included with Python since version 2.5.

In this post, we’ll walk through how to use sqlite3 to create, query, and update databases. We’ll also cover how to simplify working with SQLite databases using the pandas package. We’ll be using Python 3.5, but this same approach should work with Python 2.

If you want to learn the basics of SQL, you might like to checkout our SQL basics blogpost first.

Before we get started, let’s take a quick look at the data we’ll be working with. We’ll be looking at airline flight data, which contains information on airlines, airports, and routes between airports. Each route represents a repeated flight that an airline flies between a source and a destination airport.

All of the data is in a SQLite database called flights.db, which contains three tables — airports, airlines, and routes. You can download the data here.

Here are two rows from the airlines table:

	id	name	alias	iata	icao	callsign	country	active
10	11	4D Air	\N	NaN	QRT	QUARTET	Thailand	N
11	12	611897 Alberta Limited	\N	NaN	THD	DONUT	Canada	N

As you can see above, each row is a different airline, and each column is a property of that airline, such as name, and country. Each airline also has a unique id, so we can easily look it up when we need to.

Here are two rows from the airports table:

	id	name	city	country	code	icao	latitude	longitude	altitude	offset	dst	timezone
0	1	Goroka	Goroka	Papua New Guinea	GKA	AYGA	-6.081689	145.391881	5282	10	U	Pacific/Port_Moresby
1	2	Madang	Madang	Papua New Guinea	MAG	AYMD	-5.207083	145.7887	20	10	U	Pacific/Port_Moresby

As you can see, each row corresponds to an airport, and contains information on the location of the airport. Each airport also has a unique id, so we can easily look it up.

Here are two rows from the routes table:

	airline	airline_id	source	source_id	dest	dest_id	codeshare	stops	equipment
0	2B	410	AER	2965	KZN	2990	NaN	0	CR2
1	2B	410	ASF	2966	KZN	2990	NaN	0	CR2

Each route contains an airline_id, which the id of the airline that flies the route, as well as source_id, which is the id of the airport that the route originates from, and dest_id, which is the id of the destination airport for the flight.

Now that we know what kind of data we’re working with, let’s start by connecting to the database and running a query.

Querying database rows in Python

In order to work with a SQLite database from Python, we first have to connect to it. We can do that using the connect function, which returns a Connection object:

import sqlite3
conn = sqlite3.connect("flights.db")

Once we have a Connection object, we can then create a Cursor object. Cursors allow us to execute SQL queries against a database:

cur = conn.cursor()

Once we have a Cursor object, we can use it to execute a query against the database with the aptly named execute method. The below code will fetch the first 5 rows from the airlines table:

cur.execute("select * from airlines limit 5;")

You may have noticed that we didn’t assign the result of the above query to a variable. This is because we need to run another command to actually fetch the results. We can use the fetchall method to fetch all of the results of a query:

results = cur.fetchall()
print(results)

[(0, '1', 'Private flight', '\\N', '-', None, None, None, 'Y'), (1, '2', '135 Airways', '\\N', None, 'GNL', 'GENERAL', 'United States', 'N'), (2, '3', '1Time Airline', '\\N', '1T', 'RNX', 'NEXTIME', 'South Africa', 'Y'), (3, '4', '2 Sqn No 1 Elementary Flying Training School', '\\N', None, 'WYT', None, 'United Kingdom', 'N'), (4, '5', '213 Flight Unit', '\\N', None, 'TFU', None, 'Russia', 'N')]

As you can see, the results are formatted as a list of tuples. Each tuple corresponds to a row in the database that we accessed. Dealing with data this way is fairly painful. We’d need to manually add column heads, and manually parse the data. Luckily, the pandas library has an easier way, which we’ll look at in the next section.

Before we move on, it’s good practice to close Connection objects and Cursor objects that are open. This prevents the SQLite database from being locked. When a SQLite database is locked, you may be unable to update the database, and may get errors. We can close the Cursor and the Connection like this:

cur.close()
conn.close()

Mapping airports

Using our newfound knowledge of queries, we can create a plot that shows where all the airports in the world are. First, we query latitudes and longitudes:

import sqlite3
conn = sqlite3.connect("flights.db")
cur = conn.cursor()
coords = cur.execute("""
select cast(longitude as float),
cast(latitude as float)
from airports;""").fetchall()

The above query will retrieve the latitude and longitude columns from airports, and convert both of them to floats. We then call the fetchall method to retrieve them.

We then need to setup our plotting by importing matplotlib, the primary plotting library for Python. Combined with the basemap package, this allows us to create maps only using Python.

We first need to import the libraries:

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

Then, we setup our map, and draw the continents and coastlines that will form the background of our map:

m = Basemap(
projection='merc',
llcrnrlat=-80,
urcrnrlat=80,
llcrnrlon=-180,
urcrnrlon=180,
lat_ts=20,
resolution='c')
m.drawcoastlines()
m.drawmapboundary()

Finally, we plot the coordinates of each airport onto the map. We retrieved a list of tuples from the SQLite database. The first element in each tuple is the longitude of the airport, and the second is the latitude. We’ll convert the longitudes and latitudes into their own lists, and then plot them on the map:

x, y = m(
[l[0] for l in coords],
[l[1] for l in coords])
m.scatter(
x,
y,
1, marker='o',
color='red')

We end up with a map that shows every airport in the world:

As you may have noticed, working with data from the database is a bit painful. We needed to remember which position in each tuple corresponded to what database column, and manually parse out individual lists for each column. Luckily, the pandas library gives us an easier way to work with the results of SQL queries.

Reading results into a pandas DataFrame

We can use the pandas read_sql_query function to read the results of a SQL query directly into a pandas DataFrame. The below code will execute the same query that we just did, but it will return a DataFrame. It has several advantages over the query we did above:

It doesn’t require us to create a Cursor object or call fetchall at the end.
It automatically reads in the names of the headers from the table.
It creates a DataFrame, so we can quickly explore the data.

import pandas as pd
import sqlite3
conn = sqlite3.connect("flights.db")
df = pd.read_sql_query("select * from airlines limit 5;", conn)
df

	index	id	name	alias	iata	icao	callsign	country	active
0	0	1	Private flight	\N	–	None	None	None	Y
1	1	2	135 Airways	\N	None	GNL	GENERAL	United States	N
2	2	3	1Time Airline	\N	1T	RNX	NEXTIME	South Africa	Y
3	3	4	2 Sqn No 1 Elementary Flying Training School	\N	None	WYT	None	United Kingdom	N
4	4	5	213 Flight Unit	\N	None	TFU	None	Russia	N

As you can see, we get a nicely formatted DataFrame as the result. We could easily manipulate the columns:

df["country"]

0 None
1 United States
2 South Africa
3 United Kingdom
4 Russia
Name: country, dtype: object

It’s highly recommended to use the read_sql_query function when possible.

Mapping routes

Now that we know how to read queries into pandas DataFrames, we can create a map of every airline route in the world. We first start by querying the data. The below query will:

Get the latitude and longitude for the source airport for each route.
Get the latitude and longitude for the destination airport for each route.
Convert all the coordinate values to floats.
Read the results into a DataFrame, and store them to the variable routes.

routes = pd.read_sql_query("""
select cast(sa.longitude as float) as source_lon,
cast(sa.latitude as float) as source_lat,
cast(da.longitude as float) as dest_lon,
cast(da.latitude as float) as dest_lat
from routes
inner join airports sa on sa.id = routes.source_id
inner join airports da on da.id = routes.dest_id;
""",
conn)

We then setup our map:

m = Basemap(
projection='merc',
llcrnrlat=-80,
urcrnrlat=80,
llcrnrlon=-180,
urcrnrlon=180,
lat_ts=20,
resolution='c'
)
m.drawcoastlines()

We iterate through the first 3000 rows, and draw them. The below code will:

Loop through the first 3000 rows in routes.
Figure out if the route is too long.
If the route isn’t too long:
- Draw a circle between the origin and the destination.

for name, row in routes[:3000].iterrows():
if abs(row["source_lon"] - row["dest_lon"]) < 90:
# Draw a great circle between source and dest airports.
m.drawgreatcircle(
row["source_lon"],
row["source_lat"],
row["dest_lon"],
row["dest_lat"],
linewidth=1,
color='b'
)

We end up with the following map:

The above is much more efficient when we use pandas to turn the results of the SQL query into a DataFrame, instead of working with the raw results from sqlite3.

Now that we know how to query database rows, let’s move on to modifying them.

Modifying database rows

We can use the sqlite3 package to modify a SQLite database by inserting, updating, or deleting rows. Creating the Connection is the same for this as it is when you’re querying a table, so we’ll skip that part.

Inserting rows with Python

To insert a row, we need to write an INSERT query. The below code will add a new row to the airlines table. We specify 9 values to insert, one for each column in airlines. This will add a new row to the table.

cur = conn.cursor()
cur.execute("insert into airlines values (6048, 19846, 'Test flight', '', '', null, null, null, 'Y')")

If you try to query the table now, you actually won’t see the new row yet. Instead you’ll see that a file was created called flights.db-journal. flights.db-journal is storing the new row until you’re ready to commit it to the main database, flights.db.

SQLite doesn’t write to the database until you commit a transaction. A transaction consists of 1 or more queries that all make changes to the database at once. This is designed to make it easier to recover from accidental changes, or errors. Transactions allow you to run several queries, then finally alter the database with the results of all of them. This ensures that if one of the queries fails, the database isn’t partially updated.

A good example is if you have two tables, one of which contains charges made to people’s bank accounts (charges), and another which contains the dollar amount in the bank accounts (balances). Let’s say a bank customer, Roberto, wants to send $50 to his sister, Luisa. In order to make this work, the bank would need to:

Create a row in charges that says $50 is being taken from Roberto’s account and sent to Luisa.
Update Roberto’s row in the balances table and remove $50.
Update Luisa’s row in the balances table and add $50.

These will require three separate SQL queries to update all of the tables. If a query fails, we’ll be stuck with bad data in our database. For example, if the first two queries work, then the third fails, Roberto will lose his money, but Luisa won’t get it. Transactions mean that the main database isn’t updated unless all the queries succeed. This prevents the system from getting into a bad state, where customers lose their money.

By default, sqlite3 opens a transaction when you do any query that modifies the database. You can read more about it here. We can commit the transaction, and add our new row to the airlines table, using the commit method:

conn.commit()

Now, when we query flights.db, we’ll see the extra row that contains our test flight:

pd.read_sql_query("select * from airlines where id=19846;", conn)

	index	id	name	alias	iata	icao	callsign	country	active
0	1	19846	Test flight			None	None	None	Y

Passing parameters into a query

In the last query, we hardcoded the values we wanted to insert into the database. Most of the time, when you insert data into a database, it won’t be hardcoded, it will be dynamic values you want to pass in. These dynamic values might come from downloaded data, or might come from user input.

When working with dynamic data, it might be tempting to insert values using Python string formatting:

cur = conn.cursor()
name = "Test Flight"
cur.execute("insert into airlines values (6049, 19847, {0}, '', '', null, null, null, 'Y')".format(name))
conn.commit()

You want to avoid doing this! Inserting values with Python string formatting makes your program vulnerable to SQL Injection attacks. Luckily, sqlite3 has a straightforward way to inject dynamic values without relying on string formatting:

cur = conn.cursor()
values = ('Test Flight', 'Y')
cur.execute("insert into airlines values (6049, 19847, ?, '', '', null, null, null, ?)", values)
conn.commit()

Any ? value in the query will be replaced by a value in values. The first ? will be replaced by the first item in values, the second by the second, and so on. This works for any type of query. This created a SQLite parameterized query, which avoids SQL injection issues.

Updating rows

We can modify rows in a SQLite table using the execute method:

cur = conn.cursor()
values = ('USA', 19847)
cur.execute("update airlines set country=? where id=?", values)
conn.commit()

We can then verify that the update happened:

pd.read_sql_query("select * from airlines where id=19847;", conn)

	index	id	name	alias	iata	icao	callsign	country	active
0	6049	19847	Test Flight			None	None	USA	Y

Deleting rows

Finally, we can delete the rows in a database using the execute method:

cur = conn.cursor()values = (19847, )
cur.execute("delete from airlines where id=?", values)conn.commit()

We can then verify that the deletion happened, by making sure no rows match our query:

pd.read_sql_query("select * from airlines where id=19847;", conn)

	index	id	name	alias	iata	icao	callsign	country	active

Creating tables

We can create tables by executing a SQL query. We can create a table to represent each daily flight on a route, with the following columns:

id — integer
departure — date, when the flight left the airport
arrival — date, when the flight arrived at the destination
number — text, the flight number
route_id — integer, the id of the route the flight was flying

cur = conn.cursor()
cur.execute("create table daily_flights (id integer, departure date, arrival date, number text, route_id integer)")conn.commit()

Once we create a table, we can insert data into it normally:

cur.execute("insert into daily_flights values (1, '2016-09-28 0:00', '2016-09-28 12:00', 'T1', 1)")
conn.commit()

When we query the table, we’ll now see the row:

pd.read_sql_query("select * from daily_flights;", conn)

	id	departure	arrival	number	route_id
0	1	2016-09-28 0:00	2016-09-28 12:00	T1	1

Creating tables with pandas

The pandas package gives us a much faster way to create tables. We just have to create a DataFrame first, then export it to a SQL table. First, we’ll create a DataFrame:

from datetime import datetime
df = pd.DataFrame(
[[1, datetime(2016, 9, 29, 0, 0) ,
datetime(2016, 9, 29, 12, 0), 'T1', 1]],
columns=["id", "departure", "arrival", "number", "route_id"])

Then, we’ll be able to call the to_sql method to convert df to a table in a database. We set the keep_exists parameter to replace to delete and replace any existing tables named daily_flights:

df.to_sql("daily_flights", conn, if_exists="replace")

We can then verify that everything worked by querying the database:

pd.read_sql_query("select * from daily_flights;", conn)

	index	id	departure	arrival	number	route_id
0	0	1	2016-09-29 00:00:00	2016-09-29 12:00:00	T1	1

Altering tables with Pandas

One of the hardest parts of working with real-world data science is that the data you have per record changes often. Using our airline example, we may decide to add an airplanes field to the airlines table that indicates how many airplanes each airline owns. Luckily, there’s a way to alter a table to add columns in SQLite:

cur.execute("alter table airlines add column airplanes integer;")

Note that we don’t need to call commit — alter table queries are immediately executed, and aren’t placed into a transaction. We can now query and see the extra column:

pd.read_sql_query("select * from airlines limit 1;", conn)

	index	id	name	alias	iata	icao	callsign	country	active	airplanes
0	0	1	Private flight	\N	–	None	None	None	Y	None

Note that all the columns are set to null in SQLite (which translates to None in Python) because there aren’t any values for the column yet.

Altering tables with Pandas

It’s also possible to use Pandas to alter tables by exporting the table to a DataFrame, making modifications to the DataFrame, then exporting the DataFrame to a table:

df = pd.read_sql("select * from daily_flights", conn)
df["delay_minutes"] = None
df.to_sql("daily_flights", conn, if_exists="replace")

The above code will add a column called delay_minutes to the daily_flights table.

Next Steps

You should now have a good grasp on how to work with data in a SQLite database using Python and pandas. We covered querying databases, updating rows, inserting rows, deleting rows, creating tables, and altering tables. This covers all of the major SQL operations, and almost everything you’d work with on a day to day basis.

If you’re interested in learning more about how to work with Python and SQL, you can check out our interactive SQL course at Dataquest.

Here are some supplemental resources if you want to dive deeper:

Lastly, if you want to keep practicing, you can download the file we used in this blog post, flights.db, here.

Pandas

SQLite

Tutorials

Working with SQLite Databases using Python and Pandas

Querying database rows in Python

Mapping airports

Reading results into a pandas DataFrame

Mapping routes

Modifying database rows

Inserting rows with Python

Passing parameters into a query

Updating rows

Deleting rows

Creating tables

Creating tables with pandas

Altering tables with Pandas

Altering tables with Pandas

Next Steps

Intermediate SQL: PostgreSQL, Subqueries, and more!

SQL vs T-SQL: Understanding the Differences

Working with SQLite Databases using Python and Pandas

Querying database rows in Python

Mapping airports

Reading results into a pandas DataFrame

Mapping routes

Modifying database rows

Inserting rows with Python

Passing parameters into a query

Updating rows

Deleting rows

Creating tables

Creating tables with pandas

Altering tables with Pandas

Altering tables with Pandas

Next Steps

More learning resources

Intermediate SQL: PostgreSQL, Subqueries, and more!

SQL vs T-SQL: Understanding the Differences