June 10, 2022

SQL Subqueries: A Beginner’s Guide (with Code Examples)

Every data scientist needs to know their way around a SQL database, including subqueries.

In this article, we'll cover the basics of SQL subqueries, their syntax, how they can be useful, and when and how to use them when querying a database.

This article assumes that you have some basic knowledge about selecting data using SQL, such as grouping data, aggregate functions, filtering, and basic joins. If you don't know the basics, please go through our Complete SQL Tutorial for Beginners before starting.

What's a Subquery

A subquery is nothing more than a query inside another query. We mostly use them to add a new column to the main query result, to create a filter, or to create a consolidated source from which to select the data.

The subquery will always be between parentheses, and it can appear in different places within the main query, depending on the objective — usually within the SELECT, FROM or WHERE cluases. Also, the number of subqueries is unlimited, which means you can have as many nested queries as you need.

The Database

To write some real SQL code, we'll use the Chinook database as an example. This is a sample database available for multiple types of databases.

The database contains information about a fictional digital music shop, such as data about the artists, songs, playlists, music genres, and albums from the music shop, as well as information on the shop's employees, customers, and purchases.

This is the schema of the database, so you can have a better understanding of how the queries we'll write work:

Subquery to Create a New Column

The first use case of a subquery consists of using it to add a new column to your main query's output. This is what the syntax will look like:


SELECT column_1,
       columns_2,
       (SELECT
            ...
        FROM table_2
        GROUP BY 1)
FROM table_1
GROUP BY 1

Let's see a practical example.

Here we want to see the number of playlists in the app to which the users added each song.

The main query returns two columns: the name of the song and the number of playlists to which users added it. It's the second column that demands a subquery. The subquery is necessary here because we have to match the track_id assigned to the playlist with the track_id in the track table and then count them for each track.


SELECT t.name,
    (SELECT 
         count(playlist_id)
    FROM playlist_track pt
    WHERE pt.track_id = t.track_id
    ) as number_of_playlists
FROM track t 
GROUP BY 1
ORDER BY number_of_playlists DESC
LIMIT 50

We then get this output:

name	number_of_playlists
A Midsummer Night's Dream, Op.61 Incidental Music: No.7 Notturno	5
Aria Mit 30 Veränderungen, BWV 988 "Goldberg Variations": Aria	5
Ave Maria	5
Carmen: Overture	5
Carmina Burana: O Fortuna	5
Cavalleria Rusticana \ Act \ Intermezzo Sinfonico	5
Concerto for Piano No. 2 in F Minor, Op. 21: II. Larghetto	5
Concerto for Violin, Strings and Continuo in G Major, Op. 3, No. 9: I. Allegro	5
Das Lied Von Der Erde, Von Der Jugend	5
Die Walküre: The Ride of the Valkyries	5
Die Zauberflöte, K.620: "Der Hölle Rache Kocht in Meinem Herze"	5
Fantasia On Greensleeves	5
Intoitus: Adorate Deum	5
Jupiter, the Bringer of Jollity	5
Karelia Suite, Op.11: 2. Ballade (Tempo Di Menuetto)	5
Koyaanisqatsi	5
Lamentations of Jeremiah, First Set \ Incipit Lamentatio	5
Metopes, Op. 29: Calypso	5
Miserere mei, Deus	5

Doing Math

When creating a new column, a subquery might also be useful to perform some calculations. Of course, if this is the case, the output of the subquery must be a single number.

In the next query, we want to know the percentage of tracks of each genre in our database. The syntax is basically the same as in the last example, only that the subquery will be a part of creating the new column and not the entire new column.

For this task, we need to divide the number of songs in each genre by the total number of songs in that track table. We can easily access the total number of tracks with this query:


SELECT 
    count(*) as total_tracks
FROM track

| total_tracks |
|--------------|
| 3503         |

We can find the total number of tracks per genre with the following query:


SELECT 
    g.name as genre,
    count(t.track_id) as number_of_tracks
FROM genre g
INNER JOIN track t on g.genre_id = t.genre_id
GROUP BY 1
ORDER BY 2 DESC

genre	number_of_tracks
Rock	1297
Latin	579
Metal	374
Alternative & Punk	332
Jazz	130
TV Shows	93
Blues	81
Classical	74
Drama	64
R&B/Soul	61
Reggae	58
Pop	48
Soundtrack	43
Alternative	40
Hip Hop/Rap	35
Electronica/Dance	30
Heavy Metal	28
World	28
Sci Fi & Fantasy	26
Easy Listening	24
Comedy	17
Bossa Nova	15
Science Fiction	13
Rock And Roll	12
Opera	1

If we combine these two queries such that the first one will be the subquery, the output is the percentage of songs per genre:


SELECT 
    g.name as genre,
    round(cast(count(t.track_id) as float) / (SELECT count(*) FROM track), 2) as perc
FROM genre g
INNER JOIN track t on g.genre_id = t.genre_id
GROUP BY 1
ORDER BY 2 DESC

genre	perc
Rock	0.37
Latin	0.17
Metal	0.11
Alternative & Punk	0.09
Jazz	0.04
TV Shows	0.03
Blues	0.02
Classical	0.02
Drama	0.02
R&B/Soul	0.02
Reggae	0.02
Alternative	0.01
Easy Listening	0.01
Electronica/Dance	0.01
Heavy Metal	0.01
Hip Hop/Rap	0.01
Pop	0.01
Sci Fi & Fantasy	0.01
Soundtrack	0.01
World	0.01
Bossa Nova	0
Comedy	0
Opera	0
Rock And Roll	0
Science Fiction	0

Subquery as Filtering

Using a SQL subquery as a filter to the main query is one of my favorite use cases. In this scenario, the subquery will be in the WHERE clause, and we can use operators such as IN, =, <>, >, and < to filter, depending on the output of the subquery.

This is the syntax:


SELECT
    column_1,
    columns_2
FROM table_1
WHERE column_1 in 
    (SELECT
        ...
    FROM table_2)

In our example, let's say we want to know how many customers with at least U$100 spent on the store each employee is assigned to. Let's do this in two steps.

First, let's just get the number of customers for each employee. That's a simple query.

SELECT employee_id,
       e.last_name,
       count(distinct customer_id) as number_of_customers
FROM employee e 
INNER JOIN customer c on e.employee_id = c.support_rep_id
GROUP BY 1,2
ORDER BY 3 DESC

This is the output:

employee_id	last_name	number_of_customers
3	Peacock	21
4	Park	20
5	Johnson	18

Now, let's see which clients have spent at least U$100 in the store. This is the query:


SELECT
    c.customer_id,
    round(sum(i.total), 2) as total
FROM customer c
INNER JOIN invoice i on c.customer_id = i.customer_id
GROUP BY c.customer_id
HAVING sum(i.total) > 100
ORDER BY 2 DESC

This is the output:

customer_id	total
5	144.54
6	128.7
46	114.84
58	111.87
1	108.9
13	106.92
34	102.96

Now, in order to combine these two queries, the first one will be the main query, and the second one will be in the WHERE clause to filter the main query.

This is how it works:


SELECT employee_id,
       e.last_name,
       count(distinct customer_id) as number_of_customers
FROM employee e 
INNER JOIN customer c on e.employee_id = c.support_rep_id
WHERE customer_id in (
        SELECT
            c.customer_id
        FROM customer c
        INNER JOIN invoice i on c.customer_id = i.customer_id
        GROUP BY c.customer_id
        HAVING sum(i.total) > 100)
GROUP BY 1, 2
ORDER BY 3 DESC

This is the final output:

employee_id	last_name	number_of_customers
3	Peacock	3
4	Park	3
5	Johnson	1

Notice two important points:

We removed the total_purchased column when placing the query 2 in the WHERE clause of the main query. That's because we want this query to return only one column, which is the one the main query uses as a filter. Had we not done that, we would see an error message like this (depending on the version of SQL):

    sub-select returns 2 columns - expected 1

We used the IN operator. As the name says, we wanted to check which customers were IN the list of columns with over U$100 in purchases.

To use a math operator such as =, or <>, the subquery should return a number, not a column. That's not the case in this example, but we can easily adapt the code to work like that when necessary.

Subquery as a New Table

The last approach to using a SQL subquery that we'll see in this article is using it to create a new, consolidated source from which you can extract data.

We use this approach when the main query becomes too complex and we want to keep our code readable and organized — and also when we will use this new source of data repetitively for different purposes and we don't want to rewrite it over and over.

It usually looks like this:


SELECT
    column_1,
    column_2
FROM
    (SELECT 
        ...
    FROM table_1
    INNER JOIN table_2)
WHERE column_1 > 100

As an example, this will be our subquery:


SELECT c.customer_id,
       c.last_name,
       c.country,
       c.state,
       count(i.customer_id) as number_of_purchases,
       round(sum(i.total), 2) as total_purchased,
       (SELECT 
            count(il.track_id) n_tracks
        FROM invoice_line il   
        INNER JOIN invoice i on i.invoice_id = il.invoice_id
        WHERE i.customer_id = c.customer_id
        ) as count_tracks
FROM customer c 
INNER JOIN invoice i on i.customer_id = c.customer_id
GROUP BY 1, 2, 3, 4
ORDER BY 6 DESC

Its result is this new table:

customer_id	last_name	country	state	number_of_purchases	total_purchased	count_tracks
5	Wichterlová	Czech Republic	None	18	144.54	146
6	Holý	Czech Republic	None	12	128.7	130
46	O'Reilly	Ireland	Dublin	13	114.84	116
58	Pareek	India	None	13	111.87	113
1	Gonçalves	Brazil	SP	13	108.9	110
13	Ramos	Brazil	DF	15	106.92	108
34	Fernandes	Portugal	None	13	102.96	104
3	Tremblay	Canada	QC	9	99.99	101
42	Girard	France	None	11	99.99	101
17	Smith	USA	WA	12	98.01	99
50	Muñoz	Spain	None	11	98.01	99
53	Hughes	United Kingdom	None	11	98.01	99
57	Rojas	Chile	None	13	97.02	98
20	Miller	USA	CA	12	95.04	96
37	Zimmermann	Germany	None	10	94.05	95
22	Leacock	USA	FL	12	92.07	93
21	Chase	USA	NV	11	91.08	92
30	Francis	Canada	ON	13	91.08	92
26	Cunningham	USA	TX	12	86.13	87
36	Schneider	Germany	None	11	85.14	86
27	Gray	USA	AZ	9	84.15	85
2	Köhler	Germany	None	11	82.17	83
12	Almeida	Brazil	RJ	11	82.17	83
35	Sampaio	Portugal	None	16	82.17	83
55	Taylor	Australia	NSW	10	81.18	82

In this new table, we have consolidated the ID, last name, country, state, number of purchases, the total amount of dollars spent, and number of tracks purchased for each customer in the database.

Now, we can see which users in the United States have purchased at least 50 songs:


SELECT  
    new_table.*
FROM
    (SELECT c.customer_id,
        c.last_name,
        c.country,
        c.state,
        count(i.customer_id) as number_of_purchases,
        round(sum(i.total), 2) as total_purchased,
        (SELECT 
                count(il.track_id) n_tracks
            FROM invoice_line il   
            INNER JOIN invoice i on i.invoice_id = il.invoice_id
            WHERE i.customer_id = c.customer_id
            ) as count_tracks
    FROM customer c 
    INNER JOIN invoice i on i.customer_id = c.customer_id
    GROUP BY 1, 2, 3, 4
    ORDER BY 6 DESC) as new_table
WHERE 
    new_table.count_tracks >= 50
    AND new_table.country = 'USA'

Notice that we just need to select the columns and apply the filters we want in the SQL subquery.

This is the output:

customer_id	last_name	country	state	number_of_purchases	total_purchased	count_tracks
17	Smith	USA	WA	12	98.01	99
20	Miller	USA	CA	12	95.04	96
22	Leacock	USA	FL	12	92.07	93
21	Chase	USA	NV	11	91.08	92
26	Cunningham	USA	TX	12	86.13	87
27	Gray	USA	AZ	9	84.15	85
18	Brooks	USA	NY	8	79.2	80
25	Stevens	USA	WI	10	76.23	77
16	Harris	USA	CA	8	74.25	75
28	Barnett	USA	UT	10	72.27	73
24	Ralston	USA	IL	8	71.28	72
23	Gordon	USA	MA	10	66.33	67
19	Goyer	USA	CA	9	54.45	55

We can also see the number of users with at least 50 songs purchased per state:


SELECT  
    state,
    count(*)
FROM
    (SELECT c.customer_id,
        c.last_name,
        c.country,
        c.state,
        count(i.customer_id) as number_of_purchases,
        round(sum(i.total), 2) as total_purchased,
        (SELECT 
                count(il.track_id) n_tracks
            FROM invoice_line il   
            INNER JOIN invoice i on i.invoice_id = il.invoice_id
            WHERE i.customer_id = c.customer_id
            ) as count_tracks
    FROM customer c 
    INNER JOIN invoice i on i.customer_id = c.customer_id
    GROUP BY 1, 2, 3, 4
    ORDER BY 6 DESC) as new_table
WHERE 
    new_table.count_tracks >= 50
    AND new_table.country = 'USA'  
GROUP BY new_table.state
ORDER BY 2 desc

Now, we only need to add the aggregation function count and the GROUP BY clause. We keep working with the subquery as if it were a new source of data.

The output:

state	count(*)
CA	3
AZ	1
FL	1
IL	1
MA	1
NV	1
NY	1
TX	1
UT	1
WA	1
WI	1

It's also possible to use this new SQL table to do some math and select the top 10 users with the highest average amount of money spent by order:


SELECT  
    customer_id,
    last_name,
    round(total_purchased / number_of_purchases, 2) as avg_purchase
FROM
    (SELECT c.customer_id,
        c.last_name,
        c.country,
        c.state,
        count(i.customer_id) as number_of_purchases,
        round(sum(i.total), 2) as total_purchased,
        (SELECT 
                count(il.track_id) n_tracks
            FROM invoice_line il   
            INNER JOIN invoice i on i.invoice_id = il.invoice_id
            WHERE i.customer_id = c.customer_id
            ) as count_tracks
    FROM customer c 
    INNER JOIN invoice i on i.customer_id = c.customer_id
    GROUP BY 1, 2, 3, 4
    ORDER BY 6 DESC) as new_table
ORDER BY 3 DESC
LIMIT 10

We use two of the columns in the subquery to perfom the calculation and get this result:

customer_id	last_name	avg_purchase
3	Tremblay	11.11
6	Holý	10.72
29	Brown	10.15
18	Brooks	9.9
37	Zimmermann	9.4
27	Gray	9.35
16	Harris	9.28
42	Girard	9.09
50	Muñoz	8.91
53	Hughes	8.91

There are many other ways to use the data in this subquery, depending on the user's needs — or even to build a bigger subquery, if necessary.

If using the subquery for this purpose is too frequent, it might be interesting to create a view in the database, depending on its architecture, and use this new view as a new consolidated source. Check with your data engineering team!

Conclusion

SQL subqueries are a very important tool, not only for data scientists but also for anyone working with SQL on a regular basis — it's definitely worth the time it takes to understand them.

In this article, we covered the following:

How and when to use a subquery
Using a subquery to create a new column in the main query
Using a subquery as a filter
Using a subquery as a new source of data

SQL Subqueries: A Beginner’s Guide (with Code Examples)

What's a Subquery

The Database

Subquery to Create a New Column

Doing Math

Subquery as Filtering

Subquery as a New Table

Conclusion

Top 20 SQL JOINs Interview Questions and Answers

How to Learn SQL: The Best Way

SQL Subqueries: A Beginner’s Guide (with Code Examples)

What's a Subquery

The Database

Subquery to Create a New Column

Doing Math

Subquery as Filtering

Subquery as a New Table

Conclusion

More learning resources

Top 20 SQL JOINs Interview Questions and Answers

How to Learn SQL: The Best Way