January 10, 2018

Postgres Internals: Building a Description Tool

In previous blog posts, we have described the Postgres database and ways to interact with it using Python. Those posts covered the basics, but if you want to work with databases in production systems, you need to know how to make your queries faster and more efficient. To understand what efficiency means in Postgres, it's important to learn how Postgres works under the hood. In this post, we will focus on more advanced concepts of Postgres and relational databases. To begin, we will learn how Postgres stores its own internal data for describing, debugging, and identifying bottlenecks in a system. Then, we'll use our knowledge of Postgres internals to build our own version of a database description tool in Python. As in our previous blog posts, we will be using the following tools:

  • A local version of Postgres (v9.2 or higher)
  • Python 3
  • The Python driver for Postgres, psycopg2

The data set we'll use comes from the United States Department of Housing and Urban Development, also known as HUD. We have packaged the files into a zip file that contains the data as CSVs, along with a Python 3 script (load_hud_tables.py) that copies the CSVs into your locally running Postgres. If you are running a Postgres server with non-default connection settings, you'll want to update the connection string in the script.

Using HUD's data set, we will work through a real-world example using commands available in every Postgres engine. Starting from a blank slate, we will investigate the tables and their datatypes, then explore the HUD tables using internal Postgres tables to build a detailed description of the database's contents.

To start, download and unzip the dq_postgres_internals.zip file. Go into the dq_postgres_internals/ directory, change the connection parameters in load_hud_tables.py, and then run the script. This will load the CSV files into your local Postgres instance. Once you have loaded the files into your Postgres server, we can connect to the database. Change the connection values if your local Postgres instance differs from the defaults. Throughout this post, any future reference to the object cur will be the following connected cursor:

import psycopg2

conn = psycopg2.connect(host="127.0.0.1", dbname="postgres", user="postgres", password="password")
cur = conn.cursor()
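
As a quick, optional sanity check that the connection works, you can run a trivial query before moving on. This is just a minimal sketch; nothing in it is specific to the HUD data:

# Optional: confirm the cursor can run a query against the server.
cur.execute("SELECT version()")
print(cur.fetchone())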

Investigating the tables

Now that we're connected, it's time to start exploring what tables we have available. First, we'll examine the Postgres internal tables, which provide details about our database. Let's begin by describing what these internal tables are. In every Postgres engine, there is a set of internal tables that Postgres uses to manage its entire structure. They are covered in the Postgres documentation under the groups information_schema and system catalogs, and they contain all the information about the data, table names, and types stored in a Postgres database. For example, when we use the cur.description attribute, it pulls information from these internal tables to display to the user. Unfortunately, there is no detailed schema available for our data set, so we need to create our own detailed description of what's included. Let's use an internal table from the information_schema to get a high-level overview of what tables are stored in the database. The internal table is called information_schema.tables, and we can see from the documentation that there are plenty of columns to choose from:

Name Data Type Description
table_catalog sql_identifier Name of the database that contains the table (always the current database)
table_schema sql_identifier Name of the schema that contains the table
table_name sql_identifier Name of the table
table_type character_data Type of the table: BASE TABLE for a persistent base table (the normal table type), VIEW for a view, FOREIGN TABLE for a foreign table, or LOCAL TEMPORARY for a temporary table
self_referencing_column_name sql_identifier Applies to a feature not available in PostgreSQL
reference_generation character_data Applies to a feature not available in PostgreSQL
user_defined_type_catalog sql_identifier If the table is a typed table, the name of the database that contains the underlying data type (always the current database), else null.
user_defined_type_schema sql_identifier If the table is a typed table, the name of the schema that contains the underlying data type, else null.
user_defined_type_name sql_identifier If the table is a typed table, the name of the underlying data type, else null.
is_insertable_into yes_or_no YES if the table is insertable into, NO if not (Base tables are always insertable into, views not necessarily.)
is_typed yes_or_no YES if the table is a typed table, NO if not
commit_action character_data If the table is a temporary table, then PRESERVE, else null. (The SQL standard defines other commit actions for temporary tables, which are not supported by PostgreSQL.)

At this point, we are only concerned with the names of the tables that are in the database. Taking a look at the table description above, there is a column table_name that exposes this information. Let's query that column and see what we're dealing with.

cur.execute("SELECT table_name FROM information_schema.tables ORDER BY table_name")
table_names = cur.fetchall()
for name in table_names:
    print(name)

Working with schemas

When you run the previous command, you'll notice that the output contains an overwhelming number of tables. Each one of these tables is found in the postgres database. The problem is: we're including internal tables in our results. In the output, you'll notice that many tables start with the prefix pg_*. Each one of these tables is part of the pg_catalog group of internal tables. These are the system catalog tables we described earlier. Then, as you may have guessed, there is also a group of tables under the information_schema name. These are harder to spot, though, as they follow no obvious prefix pattern. How do we know which tables are internal and which tables are user created? To answer that question, we need to describe schemas. When dealing with relational databases, the word schema is used as a general term with multiple meanings. You may have heard schema used to describe tables (their datatypes, names, columns, etc.), or schema as the blueprint of a database. These ambiguous meanings only serve to confuse us. In Postgres, however, the term schema is reserved for a specific purpose: schemas are used as namespaces for tables, with the distinct purpose of separating them into isolated groups or sets within a single database.

[Diagram: schemas acting as namespaces within a database]
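
To make the namespace idea concrete, you can list the schemas in your current database through the information_schema.schemata view. A minimal sketch using the cursor from earlier; the exact names returned depend on your setup:

# List every schema (namespace) defined in the current database.
cur.execute("SELECT schema_name FROM information_schema.schemata")
for schema in cur.fetchall():
    print(schema[0])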

Let's break this down further. Postgres uses the concept of databases to separate users and data within a Postgres server. When you create a database, you are creating an isolated environment where users can query only the tables found in that particular database. Here's an example: suppose the Department of Homeland Security (DHS) and HUD shared the same government Postgres server, but they wanted to separate their users and data. They would use a separate database for each agency. Now, when a user wants to connect to their data, they need to specify which database to connect to, and only within that database can they work with their tables. However, suppose some analysts wanted to run a cross-sectional analysis against citizenship data (the citizens table) and urban housing developments (the developments table). They would need to query both the table in the dhs database and the table in the hud database. With Postgres, though, this is not possible.


# Connect to the `dhs` database.
conn = psycopg2.connect(dbname='dhs')
cur = conn.cursor()
# This query works.
cur.execute('SELECT * FROM citizens') 
# This query will fail because `developments` is in the `hud` database.
cur.execute('SELECT * FROM developments')

What if we wanted to separate tables into distinct groups, but still allow cross-table querying? This is the perfect use case for schemas. Instead of separate databases, using a different schema for each agency separates the tables into namespaces while still allowing analysts to query both tables.


# Connect to the US Government database.
conn = psycopg2.connect(dbname='us_govt_data')
cur = conn.cursor()
# Using schemas, both queries work!
cur.execute('SELECT * FROM dhs.citizens')
cur.execute('SELECT * FROM hud.developments')
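
For context, here is a minimal sketch of how such a shared database might be set up, assuming a cursor connected to the us_govt_data database. The column definitions are hypothetical and are not part of the HUD data set:

# Create one schema (namespace) per agency inside the shared database.
cur.execute("CREATE SCHEMA dhs")
cur.execute("CREATE SCHEMA hud")
# Hypothetical tables, just to illustrate schema-qualified names.
cur.execute("CREATE TABLE dhs.citizens (id integer, name text)")
cur.execute("CREATE TABLE hud.developments (id integer, name text)")
conn.commit()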

This is how Postgres divides its internal tables (and user created tables as well!). When a database is created, three schemas are instantiated: pg_catalog (for the system catalog tables), information_schema (for the information schema tables), and public (the default schema for user created tables). Each time you issue a CREATE TABLE command in a database, Postgres assigns that table to the public schema by default. Now, returning to our problem from before, how can we separate user created tables from the internal tables? Take a look at the columns for information_schema.tables again, and check whether there is a column that can separate user created tables from internal tables.

Name Data Type Description
table_catalog sql_identifier Name of the database that contains the table (always the current database)
table_schema sql_identifier Name of the schema that contains the table
table_name sql_identifier Name of the table
table_type character_data Type of the table: BASE TABLE for a persistent base table (the normal table type), VIEW for a view, FOREIGN TABLE for a foreign table, or LOCAL TEMPORARY for a temporary table
self_referencing_column_name sql_identifier Applies to a feature not available in PostgreSQL
reference_generation character_data Applies to a feature not available in PostgreSQL
user_defined_type_catalog sql_identifier If the table is a typed table, the name of the database that contains the underlying data type (always the current database), else null.
user_defined_type_schema sql_identifier If the table is a typed table, the name of the schema that contains the underlying data type, else null.
user_defined_type_name sql_identifier If the table is a typed table, the name of the underlying data type, else null.
is_insertable_into yes_or_no YES if the table is insertable into, NO if not (Base tables are always insertable into, views not necessarily.)
is_typed yes_or_no YES if the table is a typed table, NO if not
commit_action character_data If the table is a temporary table, then PRESERVE, else null. (The SQL standard defines other commit actions for temporary tables, which are not supported by PostgreSQL.)

There's a column named table_schema that will fit our requirement. We can filter on this column to select all public tables. Here's how we can write the query:


cur.execute("SELECT table_name FROM information_schema.tables WHERE table_schema='public' ORDER BY table_name")
for table_name in cur.fetchall():
    name = table_name[0]
    print(name)

Describing the tables

From the output, there are only three tables we will be working with. These are:

homeless_by_coc
state_info
state_household_incomes

Using the table names, we can call the cur.description attribute to get a detailed look at the columns, types, and any other meta information for each table. Earlier, we learned that it is possible to issue a SELECT query and then call the description attribute to get the table information. But what if we wanted to do that inside a for loop, once for each table? Hopefully your first thought was not to use .format(). In previous posts, we have already covered the issues surrounding string interpolation with .format(). The answer is to interpolate the value safely with the mogrify() method, or with the second positional argument of the execute() method. Unfortunately, things are not that easy. This kind of parameter substitution is designed for values, not identifiers, so attempting to insert a table name this way, instead of a filter value, results in an error. Here's an example of this error:

table_name = "state_info"
bad_interpolation = cur.mogrify("SELECT * FROM %s LIMIT 0", [table_name])
# This builds the query: SELECT * FROM 'state_info' LIMIT 0
# Notice the single quotation marks around state_info.
cur.execute(bad_interpolation)  # Throws an error

From the code snippet, you may have noticed that using mogrify() on the table name "state_info" converts the name to a quoted Postgres string. This is the required behaviour for a value in a filter or an insert, but not for a table name. Instead, you have to use a class from the psycopg2.extensions module named AsIs.


from psycopg2.extensions import AsIs

table_name = "state_info"
proper_interpolation = cur.mogrify("SELECT * FROM %s LIMIT 0", [AsIs(table_name)])
cur.execute(proper_interpolation)  # Executes!

The table name in the SELECT query should not be string quoted. AsIs keeps the value as an unquoted piece of SQL instead of converting it to a quoted string. Using AsIs, we can check the description for each of the tables without having to write out each query by hand!


from psycopg2.extensions import AsIs

cur.execute("SELECT table_name FROM information_schema.tables WHERE table_schema='public'")
for table in cur.fetchall():
    table = table[0]
    # LIMIT 0 returns no rows but still populates cur.description.
    cur.execute("SELECT * FROM %s LIMIT 0", [AsIs(table)])
    print(cur.description, "\n")

Type code mappings

With each of the descriptions printed out, we now have a detailed look at the tables we will be working with. Here's a snippet of the output from the homeless_by_coc description:

(Column(name='id', type_code=23, display_size=None, internal_size=4, precision=None, scale=None, null_ok=None), Column(name='year', type_code=1082, display_size=None, internal_size=4, precision=None, scale=None, null_ok=None), Column(name='state', type_code=1042, display_size=None, internal_size=2, precision=None, scale=None, null_ok=None), Column(name='coc_number', type_code=1042, display_size=None, internal_size=8, precision=None, scale=None, null_ok=None), Column(name='coc_name', type_code=1043, display_size=None, internal_size=128, precision=None, scale=None, null_ok=None), Column(name='measures', type_code=1043, display_size=None, internal_size=64, precision=None, scale=None, null_ok=None), Column(name='count', type_code=23, display_size=None, internal_size=4, precision=None, scale=None, null_ok=None))

Having worked with the description attribute before, you should be comfortable with the metadata it provides. However, we are once again faced with an integer type_code instead of a human readable type. It takes too much mental overhead to remember which human readable types these codes represent (i.e. TEXT, INTEGER, or BOOLEAN). You can use psycopg2's type values to find the approximate type for each column, but we can do better than approximate. Using an internal table, we can accurately map the types for every column in the HUD tables. The internal table we will use comes from the system catalog schema, pg_catalog, and it is aptly named pg_type. We advise checking out the table description in the Postgres documentation, as it defines too many columns to list in this section. Within this table there are a lot of defined columns, many of which you do not need to be concerned about. One interesting thing to note, though, is that this table can be used to create your own Postgres types from scratch. For example, using this table, you could create a HEX type that only stores hexadecimal characters in your columns. Let's query pg_type and map each integer type code to its type name. The mapping should look something like this:


type_mappings = {
    16: 'bool',
    18: 'char',
    19: 'name',
    ...
}

Using a dictionary comprehension, we can write the following:


cur.execute("SELECT oid, typname 
FROM pg_catalog.pg_type")
type_mappings = {
    int(oid): typname
    for oid, typname in cur.fetchall()}
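
As a quick check, the mapping lines up with the type codes we saw in the homeless_by_coc description above (23 for the id column, 1043 for the measures column):

print(type_mappings[23])    # int4
print(type_mappings[1043])  # varchar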

Readable description types

Great! Now we have a mapping of every type code to its type name. Using the type_mappings dictionary, we can read off the types without looking them up in the documentation. Let's put all of this together and create our own table descriptions. We want to turn the description attribute from a list of tuples into something human readable. We'll assemble the output from the previous steps into a dictionary like this:


{
    "homeless_by_coc":
        {
            columns: [
                {
                    name: "id",
                    type: "int4",
                    internal_size: 4
                },
                {
                    name: "year",
                    type: "date",
                    internal_size: 4
                },
                {
                    name: "state",
                    type: "bpchar",
                    internal_size: 2
                },
                {
                    name: "coc_number",
                    type: "bpchar",
                    internal_size: 8
                },
                {
                    name: "coc_name",
                    type: "varchar",
                    internal_size: 128
                },
                {
                    name: "measures",
                    type: "varchar",
                    internal_size: 64
                },
                {
                    name: "count",
                    type: "int4",
                    internal_size: 4
                }
            ]
        }
    ...
}

Using type_mappings and table_names, the steps that will get us to the required result are:

  1. Loop through table_names with a table variable.
  2. Get the description attribute for the given table.
  3. Map the name of the table to a dictionary with a columns key.
  4. Recreate the columns list from the example above by iterating through the description and mapping the appropriate types.
cur.execute("SELECT oid, typname FROM pg_catalog.pg_type")
type_mappings = {
    int(oid): typname
    for oid, typname in cur.fetchall()}
cur.execute("SELECT table_name FROM information_schema.tables WHERE table_schema='public'")
table_names = [table[0] for table in cur.fetchall()]
readable_description = {}
for table in table_names:
    cur.execute("SELECT * FROM
    readable_description[table] = dict(
        columns=[
            dict(
                name=col.name,
                type=type_mappings[col.type_code],
                length=col.internal_size
            )
            for col in cur.description
        ]    )print(readable_description)

Number of rows

Things are starting to come together. Now, to complete our investigation, we want to provide some additional information about the rows in our tables. Let's add the number of rows in each table to our description. We can find the number of rows using the COUNT() aggregate function. This is very similar to SQLite's aggregate function, as well as to other implementations of the SQL syntax. If you want to learn more about Postgres' aggregate functions, they are all defined in the pg_catalog.pg_aggregate internal table. As a reminder, here's how you would use the COUNT() function in Postgres:

SELECT COUNT(*) FROM example_table
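
If you're curious about those aggregate functions, you can peek at the catalog directly. A minimal sketch using the same cursor (the aggfnoid column holds the name of each aggregate function):

cur.execute("SELECT aggfnoid FROM pg_catalog.pg_aggregate LIMIT 5")
print(cur.fetchall())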

We want the description table to look like:

{
    "homeless_by_coc":
        {
            columns: [
                {
                    name: "id",
                    type: "int4",
                    internal_size: 4
                },
                ...
                {
                    name: "count",
                    type: "int4",
                    internal_size: 4
                }
            ],
            total: 86529
        }
    ...
}

Instead of iterating through the table_names list, we'll iterate on the readable_description dictionary keys:


for table in readable_description.keys():
    cur.execute("SELECT COUNT(*) FROM
    readable_description[table]["total"] = cur.fetchone()

Sample rows

Finally, let's add some sample rows to the readable_description dictionary. Since the homeless_by_coc table has a lot of rows, we should add a LIMIT to each query. Even if the limit is higher than the number of rows available, the query will still execute.

SELECT * FROM example_table LIMIT 100

We'll add the LIMIT query within the same loop that retrieves the count. Instead of iterating over the keys twice, we can perform both operations within the same loop. However, we should be careful about when we call cur.fetchall(). We can lose a query's results if we fail to fetch them before issuing the next query. The cur.execute() command does not return results; it is the user's responsibility to fetch them. For example, the following code will only return the results of the LIMIT query and not the COUNT:


cur.execute("SELECT COUNT(*) FROM homeless_by_coc")
cur.execute("SELECT * FROM homless_by_coc LIMIT 100")
# Calling .fetchall() will only return the rows in the LIMIT query.
print(cur.fetchall())
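
Here's a corrected sketch of the same two queries, fetching each result before issuing the next query:

cur.execute("SELECT COUNT(*) FROM homeless_by_coc")
total = cur.fetchone()[0]
cur.execute("SELECT * FROM homeless_by_coc LIMIT 100")
sample_rows = cur.fetchall()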

Let's add those two queries into a single code block:


for table in readable_description.keys():
    cur.execute("SELECT COUNT(*) FROM
    readable_description[table]["total"] = cur.fetchone()
    cur.execute("SELECT * FROM
    readable_description[table]["sample_rows"] = cur.fetchall()

Putting it all together, we now have a general script that will return a human readable dictionary of all user created tables in a database.


import psycopg2
from psycopg2.extensions import AsIs

conn = psycopg2.connect(host="127.0.0.1", dbname="postgres", user="postgres", password="password")
cur = conn.cursor()

cur.execute("SELECT oid, typname FROM pg_catalog.pg_type")
type_mappings = {
    int(oid): typname
    for oid, typname in cur.fetchall()}

cur.execute("SELECT table_name FROM information_schema.tables WHERE table_schema='public'")
table_names = [table[0] for table in cur.fetchall()]

readable_description = {}
for table in table_names:
    cur.execute("SELECT * FROM %s LIMIT 0", [AsIs(table)])
    readable_description[table] = dict(
        columns=[
            dict(
                name=col.name,
                type=type_mappings[col.type_code],
                internal_size=col.internal_size
            )
            for col in cur.description
        ]
    )
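
If you also want the row counts and sample rows in the final output, the loop from the previous steps can simply follow the script above. This continuation just assembles the snippets we already wrote:

for table in readable_description.keys():
    cur.execute("SELECT COUNT(*) FROM %s", [AsIs(table)])
    readable_description[table]["total"] = cur.fetchone()[0]
    cur.execute("SELECT * FROM %s LIMIT 100", [AsIs(table)])
    readable_description[table]["sample_rows"] = cur.fetchall()

print(readable_description)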

Next steps

In this post, we went from having no knowledge of a database and its tables to a human readable description of the tables we will be working with. First, we learned about Postgres' internal tables and the concept behind schemas. Then, we applied that knowledge to build our own description dictionary from scratch. This post was adapted from a lesson in our Data Engineering track. The lesson is part of the course Optimizing Postgres Databases, where we expand on Postgres' internal tables to optimize the HUD tables and their queries. The course focuses on solving real-life scenarios that you will encounter in production-grade data analysis systems.

About the author

Spiro Sideris

A software engineer with a knack for economics and a love for mathematics.
