SQL Joins: Working with Databases
SQL joins don’t have to be this challenging!
When first learning SQL, it’s common to work with data in a single table. In the real world, databases generally have data in more than one table. If we want to be able to work with that data, we’ll have to combine multiple tables within a query. In this SQL joins post, we’ll learn how to use joins to select data from multiple tables.
We’ll assume that you know the fundamentals of working in SQL including filtering, sorting, aggregation, and subqueries. If you don’t, our SQL Fundamentals course teaches all of these concepts, and you can sign up and start that course for free.
The Factbook Database
We're going to be using a version of the CIA World Factbook (Factbook) database that has two tables. The first table is called facts
, and each row represents a country from the Factbook. Here are the first 5 rows of the facts
table:
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 |
2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.30 | 12.92 | 6.58 | 3.30 |
3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 |
4 | an | Andorra | 468 | 468 | 0 | 85580 | 0.12 | 8.13 | 6.96 | 0.00 |
5 | ao | Angola | 1246700 | 1246700 | 0 | 19625353 | 2.78 | 38.78 | 11.49 | 0.46 |
In addition to the facts
table is a second table called cities
which contains information on major urban areas from countries in the Factbook (for the rest of this post, we'll use the word 'cities' to mean the same as 'major urban areas'). Let's take a look at the first few rows of this table and a description of what each column represents:
id | name | population | capital | facts_id |
---|---|---|---|---|
1 | Oranjestad | 37000 | 1 | 216 |
2 | Saint John'S | 27000 | 1 | 6 |
3 | Abu Dhabi | 942000 | 1 | 184 |
4 | Dubai | 1978000 | 0 | 184 |
5 | Sharjah | 983000 | 0 | 184 |
id
- A unique ID for each city.name
- The name of the city.population
- The population of the city.capital
- Whether the city is a capital city: 1 if it is, 0 if it isn't.facts_id
- The ID of the country, from the facts table.
The last column is of particular interest to us, as it is a column of data that also exists in our original facts
table. This link between tables is important as it's used to combine the data in our queries. Below is a schema diagram, which shows the two tables in our database, the columns within them and how the two are linked.
The line in the schema diagram clearly shows the link between the id
column in the facts
table and the facts_id
column in the cities table.
If you'd like to download the database to follow along on your own computer you can download the dataset as a SQLite database.
Our First SQL Join
The most common way to join data using SQL is using an inner join. The syntax for an inner join is:
SELECT [column_names] FROM [table_name_one]
INNER JOIN [table_name_two] ON [join_constraint];
The inner join clause is made up of two parts:
INNER JOIN
, which tells the SQL engine the name of the table you wish to join in your query, and that you wish to use an inner join.ON
, which tells the SQL engine what columns to use to join the two tables.
Joins are usually used in a query after the FROM
clause. Let's look at a basic inner join where we combine the data from both of our tables.
SELECT * FROM facts
INNER JOIN cities ON cities.facts_id = facts.id
LIMIT 5;
Let's look at the line of the query with the join in it:
INNER JOIN cities
: This tells the SQL engine that we wish to join the cities table to our query using an inner join.ON cities.facts_id = facts.id
: This tells the SQL engine which columns to use when joining the data, following the syntax table_name.column_name.
You might presume that SELECT * FROM facts
will mean that the query returns only columns from the facts
table, however the *
wildcard when used with a join will give you all columns from both tables. Here is the result of this query:
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate | id | name | population | capital | facts_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
216 | aa | Aruba | 180 | 180 | 0 | 112162 | 1.33 | 12.56 | 8.18 | 8.92 | 1 | Oranjestad | 37000 | 1 | 216 |
6 | ac | Antigua and Barbuda | 442 | 442 | 0 | 92436 | 1.24 | 15.85 | 5.69 | 2.21 | 2 | Saint John'S | 27000 | 1 | 6 |
184 | ae | United Arab Emirates | 83600 | 83600 | 0 | 5779760 | 2.58 | 15.43 | 1.97 | 12.36 | 3 | Abu Dhabi | 942000 | 1 | 184 |
184 | ae | United Arab Emirates | 83600 | 83600 | 0 | 5779760 | 2.58 | 15.43 | 1.97 | 12.36 | 4 | Dubai | 1978000 | 0 | 184 |
184 | ae | United Arab Emirates | 83600 | 83600 | 0 | 5779760 | 2.58 | 15.43 | 1.97 | 12.36 | 5 | Sharjah | 983000 | 0 | 184 |
This query gives us all columns from both tables and every row where there is a match between the id
column from facts
and the facts_id
from cities
, limited to the first 5 rows.
Understanding SQL Inner Joins
We've now joined the two tables to give us extra information about each row in cities
. Let's take a closer look at how this inner join works.
An inner join works by including only rows from each table that have a match as specified using the ON
clause. Let's look at a diagram of how our join from the previous screen works. We have included a selection of rows which best illustrate the join:
Our inner join will include:
- Rows from the
cities
table that have acities.facts_id
that matches afacts.id
fromfacts
.
Our inner join will not include:
- Rows from the
cities
table that have acities.facts_id
that don't match anyfacts.id
from facts. - Rows from the facts table that have a facts.id that don't match any
cities.facts_id
from cities.
You can see this represented as a Venn diagram:
We already know how to use aliases to specify custom names for columns, eg:
SELECT AVG(population) AS average_population
We can also create aliases for table names, which makes queries with joins easier to both read and write. Instead of:
SELECT * FROM facts
INNER JOIN cities ON cities.facts_id = facts.id
We can write:
SELECT * FROM facts AS f
INNER JOIN cities AS c ON c.facts_id = f.id
Just like with column names, using AS
is optional. We can get the same result by writing:
SELECT * FROM facts f
INNER JOIN cities c ON c.facts_id = f.id
We can also combine aliases with wildcards - for instance, using the aliases created above, c.*
would give us all columns from the table cities
.
While our query from the previous screen included both columns from the ON
clause, we don't need to use either column from our ON
clause in our final list of columns. This is useful as it means we can show only the information we're interested in, rather than having to include the two join columns every time.
Let's use what we've learned to build on our original query. We'll:
- Join
cities
tofacts
using anINNER JOIN
. - Use aliases for table names.
- Include, in order:
- All columns from
cities
. - The name column from
facts
aliased tocountry_name
.
- All columns from
- Include only the first 5 rows.
SELECT
c.*,
f.name country_name
FROM facts f
INNER JOIN cities c ON c.facts_id = f.id
LIMIT 5;
id | name | population | capital | facts_id | country_name |
---|---|---|---|---|---|
1 | Oranjestad | 37000 | 1 | 216 | Aruba |
2 | Saint John'S | 27000 | 1 | 6 | Antigua and Barbuda |
3 | Abu Dhabi | 942000 | 1 | 184 | United Arab Emirates |
4 | Dubai | 1978000 | 0 | 184 | United Arab Emirates |
5 | Sharjah | 983000 | 0 | 184 | United Arab Emirates |
Practicing Inner Joins in SQL
Let's practice writing a query to answer a question from our database using an inner join. Say we want to produce a table of countries and their capital cities from our database using what we've learned so far. Our first step is to think about what columns we'll need in our final query. We'll need:
- The
name
column fromfacts
- The
name
column fromcities
Given that we've identified that we need data from two tables, we need to think about how to join them. The schema diagram from earlier indicated that there is only one column in each table that links them together, so we can use an inner join with those columns to join the data.
So far, thinking through our question we can already write most of our query (it's almost identical to the previous query we wrote):
SELECT
f.name,
c.name
FROM cities c
INNER JOIN facts f ON f.id = c.facts_id;
The last part of our process is to make sure we have the correct rows. From the previous two screens we know that a query like this will return all rows from cities
that have a corresponding match from facts
in the facts_id
column. We're only interested in the capital cities from the cities table, so we'll need to use a WHERE
clause on the capital
column, which has a value of 1
if the city is a capital, and 0
if it isn't:
WHERE c.capital = 1
We can now put this all together to write a query that answers our question. We'll limit it to just the first 10 rows so the amount of output is managable.
SELECT
f.name country,
c.name capital_city
FROM cities c
INNER JOIN facts f ON f.id = c.facts_id
WHERE c.capital = 1
LIMIT 10;
country | capital_city |
---|---|
Aruba | Oranjestad |
Antigua and Barbuda | Saint John'S |
United Arab Emirates | Abu Dhabi |
Afghanistan | Kabul |
Algeria | Algiers |
Azerbaijan | Baku |
Albania | Tirana |
Armenia | Yerevan |
Andorra | Andorra La Vella |
Angola | Luanda |
Left Joins in SQL
As we mentioned earlier, an inner join will not include any rows where there is not a mutual match from both tables. This means there could be information we are not seeing in our query where rows don't match.
We can use SQL queries to explore this:
SELECT COUNT(DISTINCT(name)) FROM facts;
count
261 |
SELECT COUNT(DISTINCT(facts_id)) FROM cities;
count
210 |
By running these two queries, we can see that there are some countries in the facts
table that don't have corresponding cities in the cities
table, which indicates we may have some incomplete data.
Let's look at how we can create a query to explore the missing data using a new type of join— the left join.
A left join includes all the rows that an inner join will select, plus any rows from the first (or left) table that don't have a match in the second table. We can see this represented as a Venn diagram.
Let's look at an example by replacing INNER JOIN
with LEFT JOIN
from the first query we wrote, and looking at the same selection of rows from our earlier diagram
SELECT * FROM facts
LEFT JOIN cities ON cities.facts_id = facts.id
Here we can see that for the rows where facts.id
doesn't match any values in cities.facts_id
(237, 238, 240, and 244), the rows are still included in the results. When this happens, all of the columns from the cities
table are populated with null values.
We can use these null values to filter our results to just the countries that don't exist in cities
with a WHERE
clause. When making a comparison to null in SQL, we use the IS
keyword, rather than the =
sign. If we want to select rows where a column is null we can write:
WHERE column_name IS NULL
If we want to select rows where a column name isn't null, we use:
WHERE column_name IS NOT NULL
Let's use a left join to explore the countries that don't exist in the cities
table.
SELECT
f.name country,
f.population
FROM facts f
LEFT JOIN cities c ON c.facts_id = f.id
WHERE c.name IS NULL;
country | population |
---|---|
Kosovo | 1870981 |
Monaco | 30535 |
Nauru | 9540 |
San Marino | 33020 |
Singapore | 5674472 |
Holy See (Vatican City) | 842 |
Taiwan | 23415126 |
European Union | 513949445 |
Ashmore and Cartier Islands | |
Christmas Island | 1530 |
Cocos (Keeling) Islands | 596 |
Coral Sea Islands | |
Heard Island and McDonald Islands | |
Norfolk Island | 2210 |
Hong Kong | 7141106 |
Macau | 592731 |
Clipperton Island | |
French Southern and Antarctic Lands | |
Saint Barthelemy | 7237 |
Saint Martin | 31754 |
Curacao | 148406 |
Sint Maarten | 39689 |
Cook Islands | 9838 |
Niue | 1190 |
Tokelau | 1337 |
Bouvet Island | |
Jan Mayen | |
Svalbard | 1872 |
Akrotiri | 15700 |
British Indian Ocean Territory | |
Dhekelia | 15700 |
Gibraltar | 29258 |
Guernsey | 66080 |
Jersey | 97294 |
Montserrat | 5241 |
Pitcairn Islands | 48 |
South Georgia and South Sandwich Islands | |
Navassa Island | |
Wake Island | |
United States Pacific Island Wildlife Refuges | |
Antarctica | 0 |
Gaza Strip | 1869055 |
Paracel Islands | |
Spratly Islands | |
West Bank | 2785366 |
Arctic Ocean | |
Atlantic Ocean | |
Indian Ocean | |
Pacific Ocean | |
Southern Ocean | |
World | 7256490011 |
Looking through the results of the query we wrote in the previous screen, we can see a number of different reasons that countries don't have corresponding values in cities
:
- Countries with small populations and/or no major urban areas (which are defined as having populations of over 750,000), eg San Marino, Kosovo, and Nauru.
- City-states, such as Monaco and Singapore.
- Territories that are not themselves countries, such as Hong Kong, Gibraltar, and the Cook Islands.
- Regions & Oceans that aren't countries, such as the European Union and the Pacific Ocean.
- Genuine cases of missing data, such as Taiwan.
It's important whenever you use inner joins to be mindful that you might be excluding important data, especially if you are joining based on columns that aren't linked in the database schema.
Right Joins and Outer Joins
There are two less-common join types SQLite does not support that you should be aware of. The first is a right join. A right join, as the name indicates, is exactly the opposite of a left join. While the left join includes all rows in the table before the JOIN clause, the right join includes all rows in the new table in the JOIN clause. We can see a right join in the Venn diagram below:
The following two queries, one using a left join and one using a right join, produce identical results.
SELECT f.name country, c.name city
FROM facts f
LEFT JOIN cities c ON c.facts_id = f.id
LIMIT 5;
SELECT f.name country, c.name city
FROM cities c
RIGHT JOIN facts f ON f.id = c.facts_id
LIMIT 5;
The main reason a right join would be used is when you are joining more than two tables. In these cases, using a right join is preferable because it can avoid restructuring your whole query to join one table. Outside of this, right joins are used reasonably rarely, so for simple joins it's better to use a left join than a right as it will be easier for your query to be read and understood by others.
The other join type not supported by SQLite is a full outer join. A full outer join will include all rows from the tables on both sides of the join. We can see a full outer join in the Venn diagram below:
Like right joins, full outer joins are reasonably uncommon. The standard SQL syntax for a full outer join is:
SELECT f.name country, c.name city
FROM cities c
FULL OUTER JOIN facts f ON f.id = c.facts_id
LIMIT 5;
When joining cities
and facts
with a full outer join, the result will be be the same as our left and right joins above, because there are no values in cities.facts_id
that don't exist in facts.id
.
Let's look at the Venn diagrams of each join type side by side, which should help you compare the differences of each of the four joins we've discussed so far.
Next, let's practice using joins to answer some questions about our data.
Finding the Most Populous Capital Cities
Previously, we've used column names when specifying order for our query results, like so:
SELECT
name,
migration_rate
FROM facts
ORDER BY migration_rate desc;
There is a handy shortcut we can use in our queries which lets us skip the column names, and instead use the order in which the columns appear in the SELECT
clause. In this instance, migration_rate
is the second column in our SELECT
clause so we can just use 2
instead of the column name:
SELECT
name,
migration_rate
FROM facts
ORDER BY 2 desc;
You can use this shortcut in either the ORDER BY
or GROUP BY
clauses. Be mindful that you want to ensure your queries are still readable, so typing the full column name may be better for more complex queries.
Let's use what we've learned to produce a list of the top 10 capital cities by population. Because we are not interested in countries from facts
that don't have corresponding cities in cities
, we should use an INNER JOIN
.
SELECT
c.name capital_city,
f.name country,
c.population
FROM facts f
INNER JOIN cities c ON c.facts_id = f.id
WHERE c.capital = 1
ORDER BY 3 DESC
LIMIT 10;
capital_city | country | population |
---|---|---|
Tokyo | Japan | 37217000 |
New Delhi | India | 22654000 |
Mexico City | Mexico | 20446000 |
Beijing | China | 15594000 |
Dhaka | Bangladesh | 15391000 |
Buenos Aires | Argentina | 13528000 |
Manila | Philippines | 11862000 |
Moscow | Russia | 11621000 |
Cairo | Egypt | 11169000 |
Jakarta | Indonesia | 9769000 |
Combining SQL Joins with Subqueries
Subqueries can be used to substitute parts of queries, allowing us to find the answers to more complex questions. We can also join to the result of a subquery, just like we could a table.
Here's an example of a using a join and a subquery to produce a table of countries and their capital cities, like we did earlier in the lesson.
Reading subqueries can be overwhelming at first, so we'll break down what happens in this example in several steps. The important thing to remember is that the result of any subquery is always calculated first, so we read from the inside out.
- The subquery, in the red box, is calculated first. This simple query selects all columns from
cities
, filtering rows that are marked as capital cities by having a value forcapital
of 1. - The
INNER JOIN
joins the subquery result, aliased asc
, to thefacts
table based on theON
clause. - Two columns are selected from the results of the join:
f.name
, aliased ascountry
.c.name
, aliased ascapital_city
.
- The results are limited to the first 10 rows.
Below is the output of this query:
country | capital_city |
---|---|
Aruba | Oranjestad |
Antigua and Barbuda | Saint John'S |
United Arab Emirates | Abu Dhabi |
Afghanistan | Kabul |
Algeria | Algiers |
Azerbaijan | Baku |
Albania | Tirana |
Armenia | Yerevan |
Andorra | Andorra La Vella |
Angola | Luanda |
Using this example as a model, we'll write a similar query to find non-capital cities with populations of over 10 million.
SELECT
c.name city,
f.name country,
c.population population
FROM facts f
INNER JOIN (
SELECT * FROM cities
WHERE capital = 0
AND population > 10000000
) c ON c.facts_id = f.id
ORDER BY 3 DESC;
city | country | population |
---|---|---|
New York-Newark | United States | 20352000 |
Shanghai | China | 20208000 |
Sao Paulo | Brazil | 19924000 |
Mumbai | India | 19744000 |
Marseille-Aix-en-Provence | France | 14890100 |
Kolkata | India | 14402000 |
Karachi | Pakistan | 13876000 |
Los Angeles-Long Beach-Santa Ana | United States | 13395000 |
Osaka-Kobe | Japan | 11494000 |
Istanbul | Turkey | 11253000 |
Lagos | Nigeria | 11223000 |
Guangzhou | China | 10849000 |
SQL Challenge: Complex Query with Joins and Subqueries
Let's take everything we've learned before and use it to write a more complex query. It's not uncommon to find that 'thinking in SQL' takes a bit of getting used to, so don't be discouraged if this query seems difficult to understand at first. It will get easier with practice!
When you're writing complex queries with joins and subqueries, it helps to follow this process:
- Think about what data you need in your final output
- Work out which tables you'll need to join, and whether you will need to join to a subquery.
- If you need to join to a subquery, write the subquery first.
- Then start writing your SELECT clause, followed by the join and any other clauses you will need.
- Don't be afraid to write your query in steps, running it as you go— for instance you can run your subquery as a 'stand alone' query first to make sure it looks like you want before writing the outer query.
We will be writing a query to find the countries where the urban center (city) population is more than half of the country's total population. There are multiple ways to write this query but we'll step through one approach.
We can start by writing a query to sum all the urban populations for cities in each country. We can do this without a join by grouping on the facts_id
(we'll use a limit in our example below to make the output managable):
SELECT
facts_id,
SUM(population) urban_pop
FROM cities
GROUP BY 1
LIMIT 5;
facts_id | urban_pop |
---|---|
1 | 3097000 |
10 | 172000 |
100 | 1127000 |
101 | 5000 |
102 | 546000 |
Next, we'll join the facts
table to that subquery, selecting the country name, urban population and total population (again, we've used a limit to keep things tidy):
SELECT
f.name country,
c.urban_pop,
f.population total_pop
FROM facts f
INNER JOIN (
SELECT
facts_id,
SUM(population) urban_pop
FROM cities
GROUP BY 1
) c ON c.facts_id = f.id
LIMIT 5;
country | urban_pop | total_pop |
---|---|---|
Afghanistan | 3097000 | 32564342 |
Austria | 172000 | 8665550 |
Libya | 1127000 | 6411776 |
Liechtenstein | 5000 | 37624 |
Lithuania | 546000 | 2884433 |
Lastly, we'll create a new column that divides the urban population by the total population, and use a WHERE
and ORDER BY
to filter/rank the results:
SELECT
f.name country,
c.urban_pop,
f.population total_pop
FROM facts f
INNER JOIN (
SELECT
facts_id,
SUM(population) urban_pop
FROM cities
GROUP BY 1
) c ON c.facts_id = f.id
LIMIT 5;
country | urban_pop | total_pop | urban_pct |
---|---|---|---|
Uruguay | 1672000 | 3341893 | 0.500315 |
Congo, Republic of the | 2445000 | 4755097 | 0.514185 |
Brunei | 241000 | 429646 | 0.560927 |
New Caledonia | 157000 | 271615 | 0.578024 |
Virgin Islands | 60000 | 103574 | 0.579296 |
Falkland Islands (Islas Malvinas) | 2000 | 3361 | 0.595061 |
Djibouti | 496000 | 828324 | 0.598800 |
Australia | 13789000 | 22751014 | 0.606083 |
Iceland | 206000 | 331918 | 0.620635 |
Israel | 5226000 | 8049314 | 0.649248 |
United Arab Emirates | 3903000 | 5779760 | 0.675288 |
Puerto Rico | 2475000 | 3598357 | 0.687814 |
Bahamas, The | 254000 | 324597 | 0.782509 |
Kuwait | 2406000 | 2788534 | 0.862819 |
Saint Pierre and Miquelon | 5000 | 5657 | 0.883861 |
Guam | 169000 | 161785 | 1.044596 |
Northern Mariana Islands | 56000 | 52344 | 1.069846 |
American Samoa | 64000 | 54343 | 1.177705 |
You can see that while our final query is complex, it's much easier to understand if you build it step-by-step.
SQL Joins: Next Steps
In this post dedicated to SQL joins, we learned:
- The difference between inner and left joins.
- The role of right and outer joins
- How to choose which join is appropriate for your task.
- Using joins with subqueries, aggregate functions, and other SQL techniques.
Other resources that might interest you include our Complete Guide to SQL, SQL Cheat Sheet, our article on SQL certification, our rundown of SQL interview questions for job interviews, and of course our interactive SQL courses. Click below to sign up and get started for free!