SQL Subqueries: A Beginner’s Guide (with Code Examples)
Every data scientist needs to know their way around a SQL database, including subqueries. Here’s an introduction.
In this article, we'll cover the basics of SQL subqueries, their syntax, how they can be useful, and when and how to use them when querying a database.
This article assumes that you have some basic knowledge about selecting data using SQL, such as grouping data, aggregate functions, filtering, and basic joins.
What's a Subquery
A subquery is nothing more than a query inside another query. We mostly use them to add a new column to the main query result, to create a filter, or to create a consolidated source from which to select the data.
The subquery will always be between parentheses, and it can appear in different places within the main query, depending on the objective — usually within the SELECT
, FROM
or WHERE
cluases. Also, the number of subqueries is unlimited, which means you can have as many nested queries as you need.
The Database
To write some real SQL code, we'll use the Chinook database as an example. This is a sample database available for multiple types of databases.
The database contains information about a fictional digital music shop, such as data about the artists, songs, playlists, music genres, and albums from the music shop, as well as information on the shop's employees, customers, and purchases.
This is the schema of the database, so you can have a better understanding of how the queries we'll write work:
Subquery to Create a New Column
The first use case of a subquery consists of using it to add a new column to your main query's output. This is what the syntax will look like:
SELECT column_1,
columns_2,
(SELECT
...
FROM table_2
GROUP BY 1)
FROM table_1
GROUP BY 1
Let's see a practical example.
Here we want to see the number of playlists in the app to which the users added each song.
The main query returns two columns: the name of the song and the number of playlists to which users added it. It's the second column that demands a subquery. The subquery is necessary here because we have to match the track_id
assigned to the playlist with the track_id
in the track table and then count them for each track.
SELECT t.name,
(SELECT
count(playlist_id)
FROM playlist_track pt
WHERE pt.track_id = t.track_id
) as number_of_playlists
FROM track t
GROUP BY 1
ORDER BY number_of_playlists DESC
LIMIT 50
We then get this output:
name | number_of_playlists |
---|---|
A Midsummer Night's Dream, Op.61 Incidental Music: No.7 Notturno | 5 |
Aria Mit 30 Veränderungen, BWV 988 "Goldberg Variations": Aria | 5 |
Ave Maria | 5 |
Carmen: Overture | 5 |
Carmina Burana: O Fortuna | 5 |
Cavalleria Rusticana \ Act \ Intermezzo Sinfonico | 5 |
Concerto for Piano No. 2 in F Minor, Op. 21: II. Larghetto | 5 |
Concerto for Violin, Strings and Continuo in G Major, Op. 3, No. 9: I. Allegro | 5 |
Das Lied Von Der Erde, Von Der Jugend | 5 |
Die Walküre: The Ride of the Valkyries | 5 |
Die Zauberflöte, K.620: "Der Hölle Rache Kocht in Meinem Herze" | 5 |
Fantasia On Greensleeves | 5 |
Intoitus: Adorate Deum | 5 |
Jupiter, the Bringer of Jollity | 5 |
Karelia Suite, Op.11: 2. Ballade (Tempo Di Menuetto) | 5 |
Koyaanisqatsi | 5 |
Lamentations of Jeremiah, First Set \ Incipit Lamentatio | 5 |
Metopes, Op. 29: Calypso | 5 |
Miserere mei, Deus | 5 |
Doing Math
When creating a new column, a subquery might also be useful to perform some calculations. Of course, if this is the case, the output of the subquery must be a single number.
In the next query, we want to know the percentage of tracks of each genre in our database. The syntax is basically the same as in the last example, only that the subquery will be a part of creating the new column and not the entire new column.
For this task, we need to divide the number of songs in each genre by the total number of songs in that track table. We can easily access the total number of tracks with this query:
SELECT
count(*) as total_tracks
FROM track
| total_tracks |
|--------------|
| 3503 |
We can find the total number of tracks per genre with the following query:
SELECT
g.name as genre,
count(t.track_id) as number_of_tracks
FROM genre g
INNER JOIN track t on g.genre_id = t.genre_id
GROUP BY 1
ORDER BY 2 DESC
genre | number_of_tracks |
---|---|
Rock | 1297 |
Latin | 579 |
Metal | 374 |
Alternative & Punk | 332 |
Jazz | 130 |
TV Shows | 93 |
Blues | 81 |
Classical | 74 |
Drama | 64 |
R&B/Soul | 61 |
Reggae | 58 |
Pop | 48 |
Soundtrack | 43 |
Alternative | 40 |
Hip Hop/Rap | 35 |
Electronica/Dance | 30 |
Heavy Metal | 28 |
World | 28 |
Sci Fi & Fantasy | 26 |
Easy Listening | 24 |
Comedy | 17 |
Bossa Nova | 15 |
Science Fiction | 13 |
Rock And Roll | 12 |
Opera | 1 |
If we combine these two queries such that the first one will be the subquery, the output is the percentage of songs per genre:
SELECT
g.name as genre,
round(cast(count(t.track_id) as float) / (SELECT count(*) FROM track), 2) as perc
FROM genre g
INNER JOIN track t on g.genre_id = t.genre_id
GROUP BY 1
ORDER BY 2 DESC
genre | perc |
---|---|
Rock | 0.37 |
Latin | 0.17 |
Metal | 0.11 |
Alternative & Punk | 0.09 |
Jazz | 0.04 |
TV Shows | 0.03 |
Blues | 0.02 |
Classical | 0.02 |
Drama | 0.02 |
R&B/Soul | 0.02 |
Reggae | 0.02 |
Alternative | 0.01 |
Easy Listening | 0.01 |
Electronica/Dance | 0.01 |
Heavy Metal | 0.01 |
Hip Hop/Rap | 0.01 |
Pop | 0.01 |
Sci Fi & Fantasy | 0.01 |
Soundtrack | 0.01 |
World | 0.01 |
Bossa Nova | 0 |
Comedy | 0 |
Opera | 0 |
Rock And Roll | 0 |
Science Fiction | 0 |
Subquery as Filtering
Using a SQL subquery as a filter to the main query is one of my favorite use cases. In this scenario, the subquery will be in the WHERE
clause, and we can use operators such as IN
, =
, <>
, >
, and <
to filter, depending on the output of the subquery.
This is the syntax:
SELECT
column_1,
columns_2
FROM table_1
WHERE column_1 in
(SELECT
...
FROM table_2)
In our example, let's say we want to know how many customers with at least U$100 spent on the store each employee is assigned to. Let's do this in two steps.
First, let's just get the number of customers for each employee. That's a simple query.
SELECT employee_id,
e.last_name,
count(distinct customer_id) as number_of_customers
FROM employee e
INNER JOIN customer c on e.employee_id = c.support_rep_id
GROUP BY 1,2
ORDER BY 3 DESC
This is the output:
employee_id | last_name | number_of_customers |
---|---|---|
3 | Peacock | 21 |
4 | Park | 20 |
5 | Johnson | 18 |
Now, let's see which clients have spent at least U$100 in the store. This is the query:
SELECT
c.customer_id,
round(sum(i.total), 2) as total
FROM customer c
INNER JOIN invoice i on c.customer_id = i.customer_id
GROUP BY c.customer_id
HAVING sum(i.total) > 100
ORDER BY 2 DESC
This is the output:
customer_id | total |
---|---|
5 | 144.54 |
6 | 128.7 |
46 | 114.84 |
58 | 111.87 |
1 | 108.9 |
13 | 106.92 |
34 | 102.96 |
Now, in order to combine these two queries, the first one will be the main query, and the second one will be in the WHERE
clause to filter the main query.
This is how it works:
SELECT employee_id,
e.last_name,
count(distinct customer_id) as number_of_customers
FROM employee e
INNER JOIN customer c on e.employee_id = c.support_rep_id
WHERE customer_id in (
SELECT
c.customer_id
FROM customer c
INNER JOIN invoice i on c.customer_id = i.customer_id
GROUP BY c.customer_id
HAVING sum(i.total) > 100)
GROUP BY 1, 2
ORDER BY 3 DESC
This is the final output:
employee_id | last_name | number_of_customers |
---|---|---|
3 | Peacock | 3 |
4 | Park | 3 |
5 | Johnson | 1 |
Notice two important points:
- We removed the
total_purchased
column when placing the query 2 in theWHERE
clause of the main query. That's because we want this query to return only one column, which is the one the main query uses as a filter. Had we not done that, we would see an error message like this (depending on the version of SQL):
sub-select returns 2 columns - expected 1
- We used the
IN
operator. As the name says, we wanted to check which customers were IN the list of columns with over U$100 in purchases.
To use a math operator such as =
, or <>
, the subquery should return a number, not a column. That's not the case in this example, but we can easily adapt the code to work like that when necessary.
Subquery as a New Table
The last approach to using a SQL subquery that we'll see in this article is using it to create a new, consolidated source from which you can extract data.
We use this approach when the main query becomes too complex and we want to keep our code readable and organized — and also when we will use this new source of data repetitively for different purposes and we don't want to rewrite it over and over.
It usually looks like this:
SELECT
column_1,
column_2
FROM
(SELECT
...
FROM table_1
INNER JOIN table_2)
WHERE column_1 > 100
As an example, this will be our subquery:
SELECT c.customer_id,
c.last_name,
c.country,
c.state,
count(i.customer_id) as number_of_purchases,
round(sum(i.total), 2) as total_purchased,
(SELECT
count(il.track_id) n_tracks
FROM invoice_line il
INNER JOIN invoice i on i.invoice_id = il.invoice_id
WHERE i.customer_id = c.customer_id
) as count_tracks
FROM customer c
INNER JOIN invoice i on i.customer_id = c.customer_id
GROUP BY 1, 2, 3, 4
ORDER BY 6 DESC
Its result is this new table:
customer_id | last_name | country | state | number_of_purchases | total_purchased | count_tracks |
---|---|---|---|---|---|---|
5 | Wichterlová | Czech Republic | None | 18 | 144.54 | 146 |
6 | Holý | Czech Republic | None | 12 | 128.7 | 130 |
46 | O'Reilly | Ireland | Dublin | 13 | 114.84 | 116 |
58 | Pareek | India | None | 13 | 111.87 | 113 |
1 | Gonçalves | Brazil | SP | 13 | 108.9 | 110 |
13 | Ramos | Brazil | DF | 15 | 106.92 | 108 |
34 | Fernandes | Portugal | None | 13 | 102.96 | 104 |
3 | Tremblay | Canada | QC | 9 | 99.99 | 101 |
42 | Girard | France | None | 11 | 99.99 | 101 |
17 | Smith | USA | WA | 12 | 98.01 | 99 |
50 | Muñoz | Spain | None | 11 | 98.01 | 99 |
53 | Hughes | United Kingdom | None | 11 | 98.01 | 99 |
57 | Rojas | Chile | None | 13 | 97.02 | 98 |
20 | Miller | USA | CA | 12 | 95.04 | 96 |
37 | Zimmermann | Germany | None | 10 | 94.05 | 95 |
22 | Leacock | USA | FL | 12 | 92.07 | 93 |
21 | Chase | USA | NV | 11 | 91.08 | 92 |
30 | Francis | Canada | ON | 13 | 91.08 | 92 |
26 | Cunningham | USA | TX | 12 | 86.13 | 87 |
36 | Schneider | Germany | None | 11 | 85.14 | 86 |
27 | Gray | USA | AZ | 9 | 84.15 | 85 |
2 | Köhler | Germany | None | 11 | 82.17 | 83 |
12 | Almeida | Brazil | RJ | 11 | 82.17 | 83 |
35 | Sampaio | Portugal | None | 16 | 82.17 | 83 |
55 | Taylor | Australia | NSW | 10 | 81.18 | 82 |
In this new table, we have consolidated the ID, last name, country, state, number of purchases, the total amount of dollars spent, and number of tracks purchased for each customer in the database.
Now, we can see which users in the United States have purchased at least 50 songs:
SELECT
new_table.*
FROM
(SELECT c.customer_id,
c.last_name,
c.country,
c.state,
count(i.customer_id) as number_of_purchases,
round(sum(i.total), 2) as total_purchased,
(SELECT
count(il.track_id) n_tracks
FROM invoice_line il
INNER JOIN invoice i on i.invoice_id = il.invoice_id
WHERE i.customer_id = c.customer_id
) as count_tracks
FROM customer c
INNER JOIN invoice i on i.customer_id = c.customer_id
GROUP BY 1, 2, 3, 4
ORDER BY 6 DESC) as new_table
WHERE
new_table.count_tracks >= 50
AND new_table.country = 'USA'
Notice that we just need to select the columns and apply the filters we want in the SQL subquery.
This is the output:
customer_id | last_name | country | state | number_of_purchases | total_purchased | count_tracks |
---|---|---|---|---|---|---|
17 | Smith | USA | WA | 12 | 98.01 | 99 |
20 | Miller | USA | CA | 12 | 95.04 | 96 |
22 | Leacock | USA | FL | 12 | 92.07 | 93 |
21 | Chase | USA | NV | 11 | 91.08 | 92 |
26 | Cunningham | USA | TX | 12 | 86.13 | 87 |
27 | Gray | USA | AZ | 9 | 84.15 | 85 |
18 | Brooks | USA | NY | 8 | 79.2 | 80 |
25 | Stevens | USA | WI | 10 | 76.23 | 77 |
16 | Harris | USA | CA | 8 | 74.25 | 75 |
28 | Barnett | USA | UT | 10 | 72.27 | 73 |
24 | Ralston | USA | IL | 8 | 71.28 | 72 |
23 | Gordon | USA | MA | 10 | 66.33 | 67 |
19 | Goyer | USA | CA | 9 | 54.45 | 55 |
We can also see the number of users with at least 50 songs purchased per state:
SELECT
state,
count(*)
FROM
(SELECT c.customer_id,
c.last_name,
c.country,
c.state,
count(i.customer_id) as number_of_purchases,
round(sum(i.total), 2) as total_purchased,
(SELECT
count(il.track_id) n_tracks
FROM invoice_line il
INNER JOIN invoice i on i.invoice_id = il.invoice_id
WHERE i.customer_id = c.customer_id
) as count_tracks
FROM customer c
INNER JOIN invoice i on i.customer_id = c.customer_id
GROUP BY 1, 2, 3, 4
ORDER BY 6 DESC) as new_table
WHERE
new_table.count_tracks >= 50
AND new_table.country = 'USA'
GROUP BY new_table.state
ORDER BY 2 desc
Now, we only need to add the aggregation function count
and the GROUP BY
clause. We keep working with the subquery as if it were a new source of data.
The output:
state | count(*) |
---|---|
CA | 3 |
AZ | 1 |
FL | 1 |
IL | 1 |
MA | 1 |
NV | 1 |
NY | 1 |
TX | 1 |
UT | 1 |
WA | 1 |
WI | 1 |
It's also possible to use this new SQL table to do some math and select the top 10 users with the highest average amount of money spent by order:
SELECT
customer_id,
last_name,
round(total_purchased / number_of_purchases, 2) as avg_purchase
FROM
(SELECT c.customer_id,
c.last_name,
c.country,
c.state,
count(i.customer_id) as number_of_purchases,
round(sum(i.total), 2) as total_purchased,
(SELECT
count(il.track_id) n_tracks
FROM invoice_line il
INNER JOIN invoice i on i.invoice_id = il.invoice_id
WHERE i.customer_id = c.customer_id
) as count_tracks
FROM customer c
INNER JOIN invoice i on i.customer_id = c.customer_id
GROUP BY 1, 2, 3, 4
ORDER BY 6 DESC) as new_table
ORDER BY 3 DESC
LIMIT 10
We use two of the columns in the subquery to perfom the calculation and get this result:
customer_id | last_name | avg_purchase |
---|---|---|
3 | Tremblay | 11.11 |
6 | Holý | 10.72 |
29 | Brown | 10.15 |
18 | Brooks | 9.9 |
37 | Zimmermann | 9.4 |
27 | Gray | 9.35 |
16 | Harris | 9.28 |
42 | Girard | 9.09 |
50 | Muñoz | 8.91 |
53 | Hughes | 8.91 |
There are many other ways to use the data in this subquery, depending on the user's needs — or even to build a bigger subquery, if necessary.
If using the subquery for this purpose is too frequent, it might be interesting to create a view in the database, depending on its architecture, and use this new view as a new consolidated source. Check with your data engineering team!
Conclusion
SQL subqueries are a very important tool, not only for data scientists but also for anyone working with SQL on a regular basis — it's definitely worth the time it takes to understand them.
In this article, we covered the following:
- How and when to use a subquery
- Using a subquery to create a new column in the main query
- Using a subquery as a filter
- Using a subquery as a new source of data