SQL Subqueries: A Beginner’s Guide (with Code Examples)
Every data scientist needs to know their way around a SQL database, including subqueries. Here’s an introduction.
In this article, we’ll cover the basics of SQL subqueries, their syntax, how they can be useful, and when and how to use them when querying a database.
This article assumes that you have some basic knowledge about selecting data using SQL, such as grouping data, aggregate functions, filtering, and basic joins.
What’s a Subquery
A subquery is nothing more than a query inside another query. We mostly use them to add a new column to the main query result, to create a filter, or to create a consolidated source from which to select the data.
The subquery will always be between parentheses, and it can appear in different places within the main query, depending on the objective — usually within the
WHERE cluases. Also, the number of subqueries is unlimited, which means you can have as many nested queries as you need.
To write some real SQL code, we’ll use the Chinook database as an example. This is a sample database available for multiple types of databases.
The database contains information about a fictional digital music shop, such as data about the artists, songs, playlists, music genres, and albums from the music shop, as well as information on the shop’s employees, customers, and purchases.
This is the schema of the database, so you can have a better understanding of how the queries we’ll write work:
Subquery to Create a New Column
The first use case of a subquery consists of using it to add a new column to your main query’s output. This is what the syntax will look like:
SELECT column_1, columns_2, (SELECT ... FROM table_2 GROUP BY 1) FROM table_1 GROUP BY 1
Let’s see a practical example.
Here we want to see the number of playlists in the app to which the users added each song.
The main query returns two columns: the name of the song and the number of playlists to which users added it. It’s the second column that demands a subquery. The subquery is necessary here because we have to match the
track_id assigned to the playlist with the
track_id in the track table and then count them for each track.
SELECT t.name, (SELECT count(playlist_id) FROM playlist_track pt WHERE pt.track_id = t.track_id ) as number_of_playlists FROM track t GROUP BY 1 ORDER BY number_of_playlists DESC LIMIT 50
We then get this output:
|A Midsummer Night’s Dream, Op.61 Incidental Music: No.7 Notturno||5|
|Aria Mit 30 Veränderungen, BWV 988 "Goldberg Variations": Aria||5|
|Carmina Burana: O Fortuna||5|
|Cavalleria Rusticana \ Act \ Intermezzo Sinfonico||5|
|Concerto for Piano No. 2 in F Minor, Op. 21: II. Larghetto||5|
|Concerto for Violin, Strings and Continuo in G Major, Op. 3, No. 9: I. Allegro||5|
|Das Lied Von Der Erde, Von Der Jugend||5|
|Die Walküre: The Ride of the Valkyries||5|
|Die Zauberflöte, K.620: "Der Hölle Rache Kocht in Meinem Herze"||5|
|Fantasia On Greensleeves||5|
|Intoitus: Adorate Deum||5|
|Jupiter, the Bringer of Jollity||5|
|Karelia Suite, Op.11: 2. Ballade (Tempo Di Menuetto)||5|
|Lamentations of Jeremiah, First Set \ Incipit Lamentatio||5|
|Metopes, Op. 29: Calypso||5|
|Miserere mei, Deus||5|
When creating a new column, a subquery might also be useful to perform some calculations. Of course, if this is the case, the output of the subquery must be a single number.
In the next query, we want to know the percentage of tracks of each genre in our database. The syntax is basically the same as in the last example, only that the subquery will be a part of creating the new column and not the entire new column.
For this task, we need to divide the number of songs in each genre by the total number of songs in that track table. We can easily access the total number of tracks with this query:
SELECT count(*) as total_tracks FROM track | total_tracks | |--------------| | 3503 |
We can find the total number of tracks per genre with the following query:
SELECT g.name as genre, count(t.track_id) as number_of_tracks FROM genre g INNER JOIN track t on g.genre_id = t.genre_id GROUP BY 1 ORDER BY 2 DESC
|Alternative & Punk||332|
|Sci Fi & Fantasy||26|
|Rock And Roll||12|
If we combine these two queries such that the first one will be the subquery, the output is the percentage of songs per genre:
SELECT g.name as genre, round(cast(count(t.track_id) as float) / (SELECT count(*) FROM track), 2) as perc FROM genre g INNER JOIN track t on g.genre_id = t.genre_id GROUP BY 1 ORDER BY 2 DESC
|Alternative & Punk||0.09|
|Sci Fi & Fantasy||0.01|
|Rock And Roll||0|
Subquery as Filtering
Using a SQL subquery as a filter to the main query is one of my favorite use cases. In this scenario, the subquery will be in the
WHERE clause, and we can use operators such as
< to filter, depending on the output of the subquery.
This is the syntax:
SELECT column_1, columns_2 FROM table_1 WHERE column_1 in (SELECT ... FROM table_2)
In our example, let’s say we want to know how many customers with at least U$100 spent on the store each employee is assigned to. Let’s do this in two steps.
First, let’s just get the number of customers for each employee. That’s a simple query.
SELECT employee_id, e.last_name, count(distinct customer_id) as number_of_customers FROM employee e INNER JOIN customer c on e.employee_id = c.support_rep_id GROUP BY 1,2 ORDER BY 3 DESC
This is the output:
Now, let’s see which clients have spent at least U$100 in the store. This is the query:
SELECT c.customer_id, round(sum(i.total), 2) as total FROM customer c INNER JOIN invoice i on c.customer_id = i.customer_id GROUP BY c.customer_id HAVING sum(i.total) > 100 ORDER BY 2 DESC
This is the output:
Now, in order to combine these two queries, the first one will be the main query, and the second one will be in the
WHERE clause to filter the main query.
This is how it works:
SELECT employee_id, e.last_name, count(distinct customer_id) as number_of_customers FROM employee e INNER JOIN customer c on e.employee_id = c.support_rep_id WHERE customer_id in ( SELECT c.customer_id FROM customer c INNER JOIN invoice i on c.customer_id = i.customer_id GROUP BY c.customer_id HAVING sum(i.total) > 100) GROUP BY 1, 2 ORDER BY 3 DESC
This is the final output:
Notice two important points:
- We removed the
total_purchasedcolumn when placing the query 2 in the
WHEREclause of the main query. That’s because we want this query to return only one column, which is the one the main query uses as a filter. Had we not done that, we would see an error message like this (depending on the version of SQL):
sub-select returns 2 columns - expected 1
- We used the
INoperator. As the name says, we wanted to check which customers were IN the list of columns with over U$100 in purchases.
To use a math operator such as
<>, the subquery should return a number, not a column. That’s not the case in this example, but we can easily adapt the code to work like that when necessary.
Subquery as a New Table
The last approach to using a SQL subquery that we’ll see in this article is using it to create a new, consolidated source from which you can extract data.
We use this approach when the main query becomes too complex and we want to keep our code readable and organized — and also when we will use this new source of data repetitively for different purposes and we don’t want to rewrite it over and over.
It usually looks like this:
SELECT column_1, column_2 FROM (SELECT ... FROM table_1 INNER JOIN table_2) WHERE column_1 > 100
As an example, this will be our subquery:
SELECT c.customer_id, c.last_name, c.country, c.state, count(i.customer_id) as number_of_purchases, round(sum(i.total), 2) as total_purchased, (SELECT count(il.track_id) n_tracks FROM invoice_line il INNER JOIN invoice i on i.invoice_id = il.invoice_id WHERE i.customer_id = c.customer_id ) as count_tracks FROM customer c INNER JOIN invoice i on i.customer_id = c.customer_id GROUP BY 1, 2, 3, 4 ORDER BY 6 DESC
Its result is this new table:
In this new table, we have consolidated the ID, last name, country, state, number of purchases, the total amount of dollars spent, and number of tracks purchased for each customer in the database.
Now, we can see which users in the United States have purchased at least 50 songs:
SELECT new_table.* FROM (SELECT c.customer_id, c.last_name, c.country, c.state, count(i.customer_id) as number_of_purchases, round(sum(i.total), 2) as total_purchased, (SELECT count(il.track_id) n_tracks FROM invoice_line il INNER JOIN invoice i on i.invoice_id = il.invoice_id WHERE i.customer_id = c.customer_id ) as count_tracks FROM customer c INNER JOIN invoice i on i.customer_id = c.customer_id GROUP BY 1, 2, 3, 4 ORDER BY 6 DESC) as new_table WHERE new_table.count_tracks >= 50 AND new_table.country = 'USA'
Notice that we just need to select the columns and apply the filters we want in the SQL subquery.
This is the output:
We can also see the number of users with at least 50 songs purchased per state:
SELECT state, count(*) FROM (SELECT c.customer_id, c.last_name, c.country, c.state, count(i.customer_id) as number_of_purchases, round(sum(i.total), 2) as total_purchased, (SELECT count(il.track_id) n_tracks FROM invoice_line il INNER JOIN invoice i on i.invoice_id = il.invoice_id WHERE i.customer_id = c.customer_id ) as count_tracks FROM customer c INNER JOIN invoice i on i.customer_id = c.customer_id GROUP BY 1, 2, 3, 4 ORDER BY 6 DESC) as new_table WHERE new_table.count_tracks >= 50 AND new_table.country = 'USA' GROUP BY new_table.state ORDER BY 2 desc
Now, we only need to add the aggregation function
count and the
GROUP BY clause. We keep working with the subquery as if it were a new source of data.
It’s also possible to use this new SQL table to do some math and select the top 10 users with the highest average amount of money spent by order:
SELECT customer_id, last_name, round(total_purchased / number_of_purchases, 2) as avg_purchase FROM (SELECT c.customer_id, c.last_name, c.country, c.state, count(i.customer_id) as number_of_purchases, round(sum(i.total), 2) as total_purchased, (SELECT count(il.track_id) n_tracks FROM invoice_line il INNER JOIN invoice i on i.invoice_id = il.invoice_id WHERE i.customer_id = c.customer_id ) as count_tracks FROM customer c INNER JOIN invoice i on i.customer_id = c.customer_id GROUP BY 1, 2, 3, 4 ORDER BY 6 DESC) as new_table ORDER BY 3 DESC LIMIT 10
We use two of the columns in the subquery to perfom the calculation and get this result:
There are many other ways to use the data in this subquery, depending on the user’s needs — or even to build a bigger subquery, if necessary.
If using the subquery for this purpose is too frequent, it might be interesting to create a view in the database, depending on its architecture, and use this new view as a new consolidated source. Check with your data engineering team!
SQL subqueries are a very important tool, not only for data scientists but also for anyone working with SQL on a regular basis — it’s definitely worth the time it takes to understand them.
In this article, we covered the following:
- How and when to use a subquery
- Using a subquery to create a new column in the main query
- Using a subquery as a filter
- Using a subquery as a new source of data