Loading Data into Postgres using Python and CSVs

An introduction to Postgres with Python

Data storage is one of the most integral parts of a data system, if not the most integral. You will find hundreds of articles online detailing how to write insane SQL analysis queries, how to run complex machine learning algorithms on petabytes of training data, and how to build statistical models on thousands of rows in a database. The only problem is: no one mentions how you get the data stored in the first place.

Whether you are a data analyst, data scientist, data engineer, or even a web developer, it is important to know how to store and access your data. In this blog post, we are going to focus on a type of data storage called a relational database. Relational databases are the most common storage used for web content, large business storage, and, most relevant, for data platforms.

Specifically, we’ll be focusing on Postgres (or PostgreSQL), one of the biggest open source relational databases. We like Postgres due to its high stability, its easy availability on cloud providers (AWS, Google Cloud, etc.), and the fact that it is open source! Using the Python library psycopg2, we will run through an example of how you can create your own table from scratch and then load a data set into a locally running Postgres server.

This tutorial is based on our Dataquest Introduction to Postgres course, which is part of our Data Engineering Learning Path. The course goes into a lot more detail, allowing you to learn by doing as you write code.

Before we begin, you should make sure you have the necessary tools installed. The most important thing is to have a local version of Postgres installed on your computer. The Postgres wiki has an installation page with guides on the most popular operating systems.

Next, you should ensure that you have the psycopg2 library installed. If not, you can run:

pip install psycopg2

If the installation fails because the Postgres development headers are missing, the psycopg2-binary package provides precompiled wheels instead.

In our code examples, we will be using Python version 3.6 on a Mac or Linux OS. If you are running version 2.7 or are on a Windows machine, the commands should still be similar. With everything set up, let’s dive into connecting to your local Postgres server!

Postgres and the client-server model

If you have ever used a SQL engine in the past, you have most likely been introduced to SQLite, as it is one of the most common database engines. All your data is saved in a single file, making it portable and easy to develop with. Furthermore, SQLite contains most of the SQL commands that are used with bigger engines—so if you know SQLite, then you know the basics of every other SQL database.

However, there are drawbacks to using SQLite in a data production system. Because of its simple use case, SQLite was not built for multiple connections. SQLite only allows a single process to write to the database, making it difficult to share with multiple people and services.

Postgres, on the other hand, is a much more robust engine that is implemented as a server rather than a single file. As a server, Postgres accepts connections from clients who can request a SELECT, INSERT, or any other type of SQL query. This type of design is called a client-server model, where clients can interact with the server.

This model did not happen by accident; it’s actually a very common model that you are using right now to connect to www.dataquest.io/blog. Your computer, laptop, or whatever device you are using to read this post is the client requesting information from the www.dataquest.io server. Any time you access a website, your browser (the client) requests website data from the server.

In Postgres’ case, however, instead of requesting website data, the client connecting to the Postgres service will issue database-specific requests. These types of requests are implemented according to a defined protocol, which is a set of rules that both the client and server have agreed to. You can think of a protocol as the language that both the client and server use when the client requests and the server responds with data.

Using this model, Postgres can handle multiple connections to the database, solving the challenge of connecting multiple users to a single database. Also, as a server, Postgres can implement more advanced querying features with security measures in place. All these reasons make Postgres an excellent choice for data analysis.

Connecting to Postgres

If we wish to communicate to the Postgres server, we need to use a type of client that speaks the database protocol described earlier. We’ll use psycopg2, an open source library that implements the Postgres protocol. You can think of psycopg2 as being similar to connecting to a SQLite database using sqlite3.

Use the following example to connect to a Postgres database using psycopg2:

import psycopg2
conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")

Let’s describe these three parameters we passed in to the psycopg2.connect() method. First, it is necessary to specify a hostname host describing where your Postgres server is running. Because we are running a local version of Postgres, we use the default hostname localhost.

Then, we need to pass in the database name dbname and a user user. Because it supports multiple connections, Postgres uses multiple users and databases as a way to improve security and division of data. We use postgres for both values, since these are the defaults that Postgres is installed with.

Without these values, psycopg2 won’t know where you would like to connect and will fail. Remember: any time you want to connect to a Postgres server, whether on the web or locally, you have to pass in the appropriate connection parameters. Now that we are connected, it’s time to take advantage of Postgres’ features!
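As an aside, the connection string is just a whitespace-separated list of key=value pairs, and psycopg2.connect() also accepts the same parameters as keyword arguments. Here is a small sketch of how such a string could be assembled, using a hypothetical make_dsn helper (pure string handling, no server needed):

```python
def make_dsn(**params):
    # Join keyword parameters into a libpq-style connection string.
    # Caveat: values containing spaces or quotes would need escaping,
    # which this simple sketch does not handle.
    return " ".join(f"{key}={value}" for key, value in params.items())

dsn = make_dsn(host="localhost", dbname="postgres", user="postgres")
# dsn == "host=localhost dbname=postgres user=postgres"
# psycopg2.connect(dsn) is equivalent to passing the string directly,
# and psycopg2.connect(host=..., dbname=..., user=...) works as well.
```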

Interacting with the database

In the previous example, we opened a connection to Postgres by using the connect() method of the psycopg2 module. The connect() method takes in a sequence of parameters that the library uses to connect to the Postgres server. The return value of the connect() method is a Connection object.

The Connection object creates a client session with the database server, giving us a persistent channel to speak over. To issue commands against the database, you will also need to create another object, called the Cursor. A Cursor is created by the Connection object, and it is with the Cursor object that we will execute our commands.

To execute commands on the Postgres database, you call the execute method on the Cursor object, passing in the SQL command as a string. Here’s an example of how you could run this on a hypothetical notes table:

import psycopg2
conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
cur = conn.cursor()
cur.execute('SELECT * FROM notes')

In this example, the cur object calls the execute method and, if it is successful, returns None. To get the returned value (or values) from your query, you need to call one of two methods: fetchone() or fetchall(). The fetchone() method returns the first row of the result (or None), while the fetchall() method returns a list of each row in the result set, or an empty list [] if there are no rows.

import psycopg2
conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
cur = conn.cursor()
cur.execute('SELECT * FROM notes')
one = cur.fetchone()
all = cur.fetchall()
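psycopg2 implements Python’s DB-API 2.0 interface, the same one the built-in sqlite3 module uses, so you can observe the fetchone()/fetchall() behavior without a running Postgres server. A small side demonstration (the notes rows here are made up):

```python
import sqlite3

# sqlite3 implements the same DB-API 2.0 interface as psycopg2,
# so fetchone() and fetchall() behave the same way.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
cur.execute("INSERT INTO notes VALUES (1, 'first note')")
cur.execute("INSERT INTO notes VALUES (2, 'second note')")

cur.execute("SELECT * FROM notes ORDER BY id")
first = cur.fetchone()   # (1, 'first note') -- the first row
rest = cur.fetchall()    # [(2, 'second note')] -- only rows not yet fetched

cur.execute("SELECT * FROM notes WHERE id > 99")
empty = cur.fetchone()   # None -- no matching rows
```

Note that the cursor is stateful: once fetchone() consumes the first row, fetchall() returns only the rows that haven’t been fetched yet.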

With our current database, unfortunately, we don’t have any tables created. Without any tables, there is nothing interesting to query on. To fix that, let’s go ahead and create our first table!

Creating a table

Now that we have a basic understanding of how to connect and execute queries against a database, it’s time to create your first Postgres table. From the Postgres documentation, here’s how you can create a table in Postgres:

CREATE TABLE tableName(
   column1 dataType1 PRIMARY KEY,
   column2 dataType2,
   column3 dataType3,
   ...
);

Each column[n] is a placeholder for the column name, dataType[n] is the data type you want to store for that column, and PRIMARY KEY is an example of an optional constraint. While Postgres does not strictly require one, it is best practice to give every table a PRIMARY KEY column containing a unique set of values. Let’s now take a look at the CSV file we wish to load into the database (note that the CSV does not contain real users; the rows were randomly generated with a Python library called faker).

id    email              name               address
0     [email protected]    Anna Jackson       22871 Hannah Camp Armstrongton, NE 30037
1     [email protected]    Natalie Holloway   7357 Barbara Mission Mooremouth, HI 03826
2     [email protected]    Aaron Hull         362 Cox Bypass Suite 052 New Darrenmouth, IA 67749-2829

You can download this CSV file from the Dataquest servers, here. In the CSV file you just downloaded, user_accounts.csv, it looks like we have basically two different data types to account for: the integer type for the IDs, and strings for everything else. Postgres, like other relational databases, is type-sensitive, meaning you have to declare types for each column of the table you create. You can find a list of all the types in the Postgres documentation.
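Before writing the CREATE TABLE statement, you can sanity-check the columns and their likely types by peeking at the header and first data row with the csv module. The inline sample below stands in for user_accounts.csv (same shape, made-up row):

```python
import csv
import io

# Inline stand-in for user_accounts.csv (same shape, made-up row).
sample = io.StringIO(
    "id,email,name,address\n"
    '0,[email protected],Anna Jackson,"22871 Hannah Camp, Armstrongton, NE 30037"\n'
)
reader = csv.reader(sample)
header = next(reader)      # ['id', 'email', 'name', 'address']
first_row = next(reader)

# The id column parses cleanly as an integer; everything else stays text.
user_id = int(first_row[0])
```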

To create a table that fits our dataset, we will have to run the CREATE TABLE command with the columns in the same order as the CSV file and their respective types. Similar to running a SELECT query, we will write the command as a string and pass it to the execute() method. Here’s how it would look for this table:

import psycopg2
conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
cur = conn.cursor()
cur.execute("""
CREATE TABLE users(
    id integer PRIMARY KEY,
    email text,
    name text,
    address text
)
""")

SQL transactions

If you checked the postgres database now, you would notice that there actually isn’t a users table inside it. This isn’t a bug; it’s because of a concept called SQL transactions. In SQLite, every query you execute is immediately reflected as a change on the database. Postgres behaves differently.

With Postgres, we’re dealing with multiple users who could be changing the database at the same time. Let’s imagine a simple scenario where we’re keeping track of accounts for different customers of a bank. We could write a simple query to create a table for this:

CREATE TABLE accounts(
   id integer PRIMARY KEY,
   name text,
   balance float
);

Suppose the table contains the following two rows:

id    name    balance
1     Jim     100
2     Sue     200

Let’s say Sue gives 100 dollars to Jim. We could model this with two queries (note the single quotes: in Postgres, string literals take single quotes, while double quotes are reserved for identifiers):

UPDATE accounts SET balance = 200 WHERE name = 'Jim';

UPDATE accounts SET balance = 100 WHERE name = 'Sue';

In the above example, we add 100 dollars to Jim and remove 100 dollars from Sue. What if the second UPDATE statement has an error in it, the database fails, or another user issues a conflicting query? The first query would run properly, but the second would fail. That would result in the following:

id    name    balance
1     Jim     200
2     Sue     200

Jim would be credited 100 dollars, but 100 dollars would not be removed from Sue. This would cause the bank to lose money.

Transactions prevent this type of behavior by ensuring that all the queries in a transaction block are executed together. If any query in the block fails, the whole group fails, and no changes are made to the database at all.

Whenever we open a Connection in psycopg2, a new transaction is automatically created. All queries run up until the commit method is called are placed into the same transaction block. When commit is called, Postgres makes all of those changes permanent at once.

If we don’t want to apply the changes in the transaction block, we can call the rollback method to remove the transaction. Not calling either commit or rollback will cause the transaction to stay in a pending state, and will result in the changes not being applied to the database.
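The commit/rollback behavior is part of the DB-API that psycopg2 implements, so you can see it in miniature with the built-in sqlite3 module (an illustrative sketch, not Postgres itself):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.commit()

# An uncommitted insert can be undone with rollback()...
cur.execute("INSERT INTO users VALUES (1, 'Anna')")
conn.rollback()
cur.execute("SELECT COUNT(*) FROM users")
after_rollback = cur.fetchone()[0]   # 0 -- the insert was discarded

# ...while commit() makes the change permanent.
cur.execute("INSERT INTO users VALUES (1, 'Anna')")
conn.commit()
cur.execute("SELECT COUNT(*) FROM users")
after_commit = cur.fetchone()[0]     # 1 -- the insert persisted
```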

To commit our changes and create the users table from before, all we need to do is to run the commit() method at the end of our transaction. This is what it should look like now:

import psycopg2
conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
cur = conn.cursor()
cur.execute("""
CREATE TABLE users(
    id integer PRIMARY KEY,
    email text,
    name text,
    address text
)
""")
conn.commit()

Inserting the data

With our table created and committed, it’s time to load the CSV file into the database!

A common way of loading data into a Postgres table is to issue an INSERT command against the table. The INSERT command requires a table name to insert into and the sequence of values to insert. Here’s an example of an insert query on the users table:

INSERT INTO users VALUES (10, '[email protected]', 'Some Name', '123 Fake St.')

Using the INSERT command, we can insert into the users table using psycopg2. First, write the INSERT SQL command as a string for the execute() method. Then, format the string with all the values:

import psycopg2
conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
cur = conn.cursor()
insert_query = "INSERT INTO users VALUES {}".format("(10, '[email protected]', 'Some Name', '123 Fake St.')")
cur.execute(insert_query)
conn.commit()

Unfortunately, there is an issue with the format() string method: you have to do all the type formatting manually, like the single-quote escaping in '[email protected]'. Manual escaping is error-prone, and building queries this way also leaves you open to SQL injection.
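The failure mode is easy to see with a value that contains a single quote. No database needed; this is pure string handling:

```python
# A perfectly ordinary value breaks naive string formatting.
name = "Anna O'Brien"
query = "INSERT INTO users VALUES (10, '[email protected]', '{}', '123 Fake St.')".format(name)
# query is now:
#   INSERT INTO users VALUES (10, '[email protected]', 'Anna O'Brien', '123 Fake St.')
# The apostrophe in the name closes the string literal early, so Postgres
# would reject this statement -- and a maliciously crafted value could
# smuggle in arbitrary SQL (classic SQL injection).
```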

Fortunately, psycopg2 can perform this interpolation for you: you pass the values separately from the query, and quoting and escaping are handled automatically. This is the recommended way to call INSERT using psycopg2:

import psycopg2
conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
cur = conn.cursor()
cur.execute("INSERT INTO users VALUES (%s, %s, %s, %s)", (10, '[email protected]', 'Some Name', '123 Fake St.'))
conn.commit()

This type of insert automatically converts each Python value to the proper datatype expected by the table, and safely quotes and escapes every value for you. In our Postgres course, we cover further insert optimizations if you are interested, but for now let’s just work on inserting the CSV file.

We begin by loading the CSV file in Python using the csv module. Then, we’ll run the INSERT query for each row and commit the transaction:

import csv
import psycopg2

conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
cur = conn.cursor()

with open('user_accounts.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)  # Skip the header row.
    for row in reader:
        cur.execute(
            "INSERT INTO users VALUES (%s, %s, %s, %s)",
            row
        )
conn.commit()

While this accomplishes the task of loading in our data, it’s actually not the most efficient way of doing it. As you can see, we had to loop through every single row of the file and issue a separate INSERT for each one! Luckily for us, Postgres has a command built specifically for loading files into tables.

Copying the data

The Postgres command to load files directly into tables is called COPY. It takes in a file (like a CSV) and automatically loads the file into a Postgres table. Instead of creating the query and then running it through execute() like INSERT, psycopg2 has a method written solely for this query.

The method to load a file into a table is called copy_from. Like the execute() method, it is attached to the Cursor object. However, it differs quite a bit from the execute() method due to its parameters.

copy_from() requires a file object to load (without the header row), the name of the table to load into, and a delimiter (the keyword argument sep). After running commit(), the file’s contents are transferred into the table. This is the most efficient, and recommended, way to load CSV files into a Postgres table.

This is how we use copy_from() to load our file instead of looping INSERT commands:

import psycopg2

conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
cur = conn.cursor()
with open('user_accounts.csv', 'r') as f:
    # Notice that we don't need the `csv` module.
    next(f)  # Skip the header row.
    cur.copy_from(f, 'users', sep=',')
    
conn.commit()

It’s that simple! Finally, we have successfully loaded the user_accounts.csv file into our table using the optimal Postgres way. Let’s take a second to look back at what we’ve learned so far.

Summary

  • Postgres uses the client-server model to enable multiple connections to the database.
  • Using the popular psycopg2 library, we can use Python to connect to Postgres.
  • Postgres is type sensitive so we have to declare types on each of our columns.
  • Postgres uses SQL transactions to save the state of the database.
  • When inserting data, use psycopg2 string interpolation instead of .format().
  • The most efficient way to load files into Postgres tables is to use COPY, or psycopg2’s copy_from() cursor method.

Next Steps

If you’d like to learn more, this tutorial is based on our Dataquest Introduction to Postgres course, which is part of our Data Engineering Learning Path. The course builds on the material from this post and offers hands-on learning as you follow along writing code.

If you want to keep working with this table, here are a few more things you can do:

  • Delete the data using the DELETE command, then reload it.
  • Use more efficient types for the data you’re loading in. For example, Postgres has a lot more types than just TEXT for string data. We cover Postgres’ types extensively in a course mission, if you’re interested.
  • Try to export the table into another file using the COPY ... TO or cur.copy_to() method. In the course, we cover this and additional INSERT and COPY options to load in your data.