June 10, 2022

SQL Subqueries: A Beginner’s Guide (with Code Examples)

Every data scientist needs to know their way around a SQL database, including subqueries. Here’s an introduction.

In this article, we'll cover the basics of SQL subqueries, their syntax, how they can be useful, and when and how to use them when querying a database.

This article assumes that you have some basic knowledge about selecting data using SQL, such as grouping data, aggregate functions, filtering, and basic joins.

What's a Subquery

A subquery is nothing more than a query inside another query. We mostly use them to add a new column to the main query result, to create a filter, or to create a consolidated source from which to select the data.

The subquery will always be between parentheses, and it can appear in different places within the main query, depending on the objective — usually within the SELECT, FROM or WHERE cluases. Also, the number of subqueries is unlimited, which means you can have as many nested queries as you need.

The Database

To write some real SQL code, we'll use the Chinook database as an example. This is a sample database available for multiple types of databases. 

The database contains information about a fictional digital music shop, such as data about the artists, songs, playlists, music genres, and albums from the music shop, as well as information on the shop's employees, customers, and purchases.

This is the schema of the database, so you can have a better understanding of how the queries we'll write work:

chinook-schema.svg

Subquery to Create a New Column

The first use case of a subquery consists of using it to add a new column to your main query's output. This is what the syntax will look like:


SELECT column_1,
       columns_2,
       (SELECT
            ...
        FROM table_2
        GROUP BY 1)
FROM table_1
GROUP BY 1

Let's see a practical example.

Here we want to see the number of playlists in the app to which the users added each song.

The main query returns two columns: the name of the song and the number of playlists to which users added it. It's the second column that demands a subquery. The subquery is necessary here because we have to match the track_id assigned to the playlist with the track_id in the track table and then count them for each track.


SELECT t.name,
    (SELECT 
         count(playlist_id)
    FROM playlist_track pt
    WHERE pt.track_id = t.track_id
    ) as number_of_playlists
FROM track t 
GROUP BY 1
ORDER BY number_of_playlists DESC
LIMIT 50

We then get this output:

name number_of_playlists
A Midsummer Night's Dream, Op.61 Incidental Music: No.7 Notturno 5
Aria Mit 30 Veränderungen, BWV 988 "Goldberg Variations": Aria 5
Ave Maria 5
Carmen: Overture 5
Carmina Burana: O Fortuna 5
Cavalleria Rusticana \ Act \ Intermezzo Sinfonico 5
Concerto for Piano No. 2 in F Minor, Op. 21: II. Larghetto 5
Concerto for Violin, Strings and Continuo in G Major, Op. 3, No. 9: I. Allegro 5
Das Lied Von Der Erde, Von Der Jugend 5
Die Walküre: The Ride of the Valkyries 5
Die Zauberflöte, K.620: "Der Hölle Rache Kocht in Meinem Herze" 5
Fantasia On Greensleeves 5
Intoitus: Adorate Deum 5
Jupiter, the Bringer of Jollity 5
Karelia Suite, Op.11: 2. Ballade (Tempo Di Menuetto) 5
Koyaanisqatsi 5
Lamentations of Jeremiah, First Set \ Incipit Lamentatio 5
Metopes, Op. 29: Calypso 5
Miserere mei, Deus 5

Doing Math

When creating a new column, a subquery might also be useful to perform some calculations. Of course, if this is the case, the output of the subquery must be a single number.

In the next query, we want to know the percentage of tracks of each genre in our database. The syntax is basically the same as in the last example, only that the subquery will be a part of creating the new column and not the entire new column.

For this task, we need to divide the number of songs in each genre by the total number of songs in that track table. We can easily access the total number of tracks with this query:


SELECT 
    count(*) as total_tracks
FROM track

| total_tracks |
|--------------|
| 3503         |

We can find the total number of tracks per genre with the following query:


SELECT 
    g.name as genre,
    count(t.track_id) as number_of_tracks
FROM genre g
INNER JOIN track t on g.genre_id = t.genre_id
GROUP BY 1
ORDER BY 2 DESC
genre number_of_tracks
Rock 1297
Latin 579
Metal 374
Alternative & Punk 332
Jazz 130
TV Shows 93
Blues 81
Classical 74
Drama 64
R&B/Soul 61
Reggae 58
Pop 48
Soundtrack 43
Alternative 40
Hip Hop/Rap 35
Electronica/Dance 30
Heavy Metal 28
World 28
Sci Fi & Fantasy 26
Easy Listening 24
Comedy 17
Bossa Nova 15
Science Fiction 13
Rock And Roll 12
Opera 1

If we combine these two queries such that the first one will be the subquery, the output is the percentage of songs per genre:


SELECT 
    g.name as genre,
    round(cast(count(t.track_id) as float) / (SELECT count(*) FROM track), 2) as perc
FROM genre g
INNER JOIN track t on g.genre_id = t.genre_id
GROUP BY 1
ORDER BY 2 DESC
genre perc
Rock 0.37
Latin 0.17
Metal 0.11
Alternative & Punk 0.09
Jazz 0.04
TV Shows 0.03
Blues 0.02
Classical 0.02
Drama 0.02
R&B/Soul 0.02
Reggae 0.02
Alternative 0.01
Easy Listening 0.01
Electronica/Dance 0.01
Heavy Metal 0.01
Hip Hop/Rap 0.01
Pop 0.01
Sci Fi & Fantasy 0.01
Soundtrack 0.01
World 0.01
Bossa Nova 0
Comedy 0
Opera 0
Rock And Roll 0
Science Fiction 0

Subquery as Filtering

Using a SQL subquery as a filter to the main query is one of my favorite use cases. In this scenario, the subquery will be in the WHERE clause, and we can use operators such as IN, =, <>, >, and < to filter, depending on the output of the subquery.

This is the syntax:


SELECT
    column_1,
    columns_2
FROM table_1
WHERE column_1 in 
    (SELECT
        ...
    FROM table_2)

In our example, let's say we want to know how many customers with at least U$100 spent on the store each employee is assigned to. Let's do this in two steps.

First, let's just get the number of customers for each employee. That's a simple query.

SELECT employee_id,
       e.last_name,
       count(distinct customer_id) as number_of_customers
FROM employee e 
INNER JOIN customer c on e.employee_id = c.support_rep_id
GROUP BY 1,2
ORDER BY 3 DESC

This is the output:

employee_id last_name number_of_customers
3 Peacock 21
4 Park 20
5 Johnson 18

Now, let's see which clients have spent at least U$100 in the store. This is the query:


SELECT
    c.customer_id,
    round(sum(i.total), 2) as total
FROM customer c
INNER JOIN invoice i on c.customer_id = i.customer_id
GROUP BY c.customer_id
HAVING sum(i.total) > 100
ORDER BY 2 DESC

This is the output:

customer_id total
5 144.54
6 128.7
46 114.84
58 111.87
1 108.9
13 106.92
34 102.96

Now, in order to combine these two queries, the first one will be the main query, and the second one will be in the WHERE clause to filter the main query.

This is how it works:


SELECT employee_id,
       e.last_name,
       count(distinct customer_id) as number_of_customers
FROM employee e 
INNER JOIN customer c on e.employee_id = c.support_rep_id
WHERE customer_id in (
        SELECT
            c.customer_id
        FROM customer c
        INNER JOIN invoice i on c.customer_id = i.customer_id
        GROUP BY c.customer_id
        HAVING sum(i.total) > 100)
GROUP BY 1, 2
ORDER BY 3 DESC

This is the final output:

employee_id last_name number_of_customers
3 Peacock 3
4 Park 3
5 Johnson 1

Notice two important points:

  1. We removed the total_purchased column when placing the query 2 in the WHERE clause of the main query. That's because we want this query to return only one column, which is the one the main query uses as a filter. Had we not done that, we would see an error message like this (depending on the version of SQL):
    sub-select returns 2 columns - expected 1
  1. We used the IN operator. As the name says, we wanted to check which customers were IN the list of columns with over U$100 in purchases.

To use a math operator such as =, or <>, the subquery should return a number, not a column. That's not the case in this example, but we can easily adapt the code to work like that when necessary.

Subquery as a New Table

The last approach to using a SQL subquery that we'll see in this article is using it to create a new, consolidated source from which you can extract data.

We use this approach when the main query becomes too complex and we want to keep our code readable and organized — and also when we will use this new source of data repetitively for different purposes and we don't want to rewrite it over and over.

It usually looks like this:


SELECT
    column_1,
    column_2
FROM
    (SELECT 
        ...
    FROM table_1
    INNER JOIN table_2)
WHERE column_1 > 100

As an example, this will be our subquery:


SELECT c.customer_id,
       c.last_name,
       c.country,
       c.state,
       count(i.customer_id) as number_of_purchases,
       round(sum(i.total), 2) as total_purchased,
       (SELECT 
            count(il.track_id) n_tracks
        FROM invoice_line il   
        INNER JOIN invoice i on i.invoice_id = il.invoice_id
        WHERE i.customer_id = c.customer_id
        ) as count_tracks
FROM customer c 
INNER JOIN invoice i on i.customer_id = c.customer_id
GROUP BY 1, 2, 3, 4
ORDER BY 6 DESC

Its result is this new table:

customer_id last_name country state number_of_purchases total_purchased count_tracks
5 Wichterlová Czech Republic None 18 144.54 146
6 Holý Czech Republic None 12 128.7 130
46 O'Reilly Ireland Dublin 13 114.84 116
58 Pareek India None 13 111.87 113
1 Gonçalves Brazil SP 13 108.9 110
13 Ramos Brazil DF 15 106.92 108
34 Fernandes Portugal None 13 102.96 104
3 Tremblay Canada QC 9 99.99 101
42 Girard France None 11 99.99 101
17 Smith USA WA 12 98.01 99
50 Muñoz Spain None 11 98.01 99
53 Hughes United Kingdom None 11 98.01 99
57 Rojas Chile None 13 97.02 98
20 Miller USA CA 12 95.04 96
37 Zimmermann Germany None 10 94.05 95
22 Leacock USA FL 12 92.07 93
21 Chase USA NV 11 91.08 92
30 Francis Canada ON 13 91.08 92
26 Cunningham USA TX 12 86.13 87
36 Schneider Germany None 11 85.14 86
27 Gray USA AZ 9 84.15 85
2 Köhler Germany None 11 82.17 83
12 Almeida Brazil RJ 11 82.17 83
35 Sampaio Portugal None 16 82.17 83
55 Taylor Australia NSW 10 81.18 82

In this new table, we have consolidated the ID, last name, country, state, number of purchases, the total amount of dollars spent, and number of tracks purchased for each customer in the database.

Now, we can see which users in the United States have purchased at least 50 songs:


SELECT  
    new_table.*
FROM
    (SELECT c.customer_id,
        c.last_name,
        c.country,
        c.state,
        count(i.customer_id) as number_of_purchases,
        round(sum(i.total), 2) as total_purchased,
        (SELECT 
                count(il.track_id) n_tracks
            FROM invoice_line il   
            INNER JOIN invoice i on i.invoice_id = il.invoice_id
            WHERE i.customer_id = c.customer_id
            ) as count_tracks
    FROM customer c 
    INNER JOIN invoice i on i.customer_id = c.customer_id
    GROUP BY 1, 2, 3, 4
    ORDER BY 6 DESC) as new_table
WHERE 
    new_table.count_tracks >= 50
    AND new_table.country = 'USA'  

Notice that we just need to select the columns and apply the filters we want in the SQL subquery.

This is the output:

customer_id last_name country state number_of_purchases total_purchased count_tracks
17 Smith USA WA 12 98.01 99
20 Miller USA CA 12 95.04 96
22 Leacock USA FL 12 92.07 93
21 Chase USA NV 11 91.08 92
26 Cunningham USA TX 12 86.13 87
27 Gray USA AZ 9 84.15 85
18 Brooks USA NY 8 79.2 80
25 Stevens USA WI 10 76.23 77
16 Harris USA CA 8 74.25 75
28 Barnett USA UT 10 72.27 73
24 Ralston USA IL 8 71.28 72
23 Gordon USA MA 10 66.33 67
19 Goyer USA CA 9 54.45 55

We can also see the number of users with at least 50 songs purchased per state:


SELECT  
    state,
    count(*)
FROM
    (SELECT c.customer_id,
        c.last_name,
        c.country,
        c.state,
        count(i.customer_id) as number_of_purchases,
        round(sum(i.total), 2) as total_purchased,
        (SELECT 
                count(il.track_id) n_tracks
            FROM invoice_line il   
            INNER JOIN invoice i on i.invoice_id = il.invoice_id
            WHERE i.customer_id = c.customer_id
            ) as count_tracks
    FROM customer c 
    INNER JOIN invoice i on i.customer_id = c.customer_id
    GROUP BY 1, 2, 3, 4
    ORDER BY 6 DESC) as new_table
WHERE 
    new_table.count_tracks >= 50
    AND new_table.country = 'USA'  
GROUP BY new_table.state
ORDER BY 2 desc

Now, we only need to add the aggregation function count and the GROUP BY clause. We keep working with the subquery as if it were a new source of data.

The output:

state count(*)
CA 3
AZ 1
FL 1
IL 1
MA 1
NV 1
NY 1
TX 1
UT 1
WA 1
WI 1

It's also possible to use this new SQL table to do some math and select the top 10 users with the highest average amount of money spent by order:


SELECT  
    customer_id,
    last_name,
    round(total_purchased / number_of_purchases, 2) as avg_purchase
FROM
    (SELECT c.customer_id,
        c.last_name,
        c.country,
        c.state,
        count(i.customer_id) as number_of_purchases,
        round(sum(i.total), 2) as total_purchased,
        (SELECT 
                count(il.track_id) n_tracks
            FROM invoice_line il   
            INNER JOIN invoice i on i.invoice_id = il.invoice_id
            WHERE i.customer_id = c.customer_id
            ) as count_tracks
    FROM customer c 
    INNER JOIN invoice i on i.customer_id = c.customer_id
    GROUP BY 1, 2, 3, 4
    ORDER BY 6 DESC) as new_table
ORDER BY 3 DESC
LIMIT 10

We use two of the columns in the subquery to perfom the calculation and get this result:

customer_id last_name avg_purchase
3 Tremblay 11.11
6 Holý 10.72
29 Brown 10.15
18 Brooks 9.9
37 Zimmermann 9.4
27 Gray 9.35
16 Harris 9.28
42 Girard 9.09
50 Muñoz 8.91
53 Hughes 8.91

There are many other ways to use the data in this subquery, depending on the user's needs — or even to build a bigger subquery, if necessary.

If using the subquery for this purpose is too frequent, it might be interesting to create a view in the database, depending on its architecture, and use this new view as a new consolidated source. Check with your data engineering team!

Conclusion

SQL subqueries are a very important tool, not only for data scientists but also for anyone working with SQL on a regular basis — it's definitely worth the time it takes to understand them.

In this article, we covered the following:

  • How and when to use a subquery
  • Using a subquery to create a new column in the main query
  • Using a subquery as a filter
  • Using a subquery as a new source of data
Otávio Simões Silveira

About the author

Otávio Simões Silveira

Otávio is an economist and data scientist from Brazil. In his free time, he writes about Python and Data Science on the internet. You can find him at LinkedIn.