January 18, 2021

SQL Joins: Working with Databases

SQL joins don’t have to be this challenging!

When first learning SQL, it’s common to work with data in a single table. In the real world, databases generally have data in more than one table. If we want to be able to work with that data, we’ll have to combine multiple tables within a query. In this SQL joins post, we’ll learn how to use joins to select data from multiple tables.

We’ll assume that you know the fundamentals of working in SQL including filtering, sorting, aggregation, and subqueries. If you don’t, our SQL Fundamentals course teaches all of these concepts, and you can sign up and start that course for free.

The Factbook Database

We're going to be using a version of the CIA World Factbook (Factbook) database that has two tables. The first table is called facts, and each row represents a country from the Factbook. Here are the first 5 rows of the facts table:

id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
1 af Afghanistan 652230 652230 0 32564342 2.32 38.57 13.89 1.51
2 al Albania 28748 27398 1350 3029278 0.30 12.92 6.58 3.30
3 ag Algeria 2381741 2381741 0 39542166 1.84 23.67 4.31 0.92
4 an Andorra 468 468 0 85580 0.12 8.13 6.96 0.00
5 ao Angola 1246700 1246700 0 19625353 2.78 38.78 11.49 0.46

In addition to the facts table is a second table called cities which contains information on major urban areas from countries in the Factbook (for the rest of this post, we'll use the word 'cities' to mean the same as 'major urban areas'). Let's take a look at the first few rows of this table and a description of what each column represents:

id name population capital facts_id
1 Oranjestad 37000 1 216
2 Saint John'S 27000 1 6
3 Abu Dhabi 942000 1 184
4 Dubai 1978000 0 184
5 Sharjah 983000 0 184
  • id - A unique ID for each city.
  • name - The name of the city.
  • population - The population of the city.
  • capital - Whether the city is a capital city: 1 if it is, 0 if it isn't.
  • facts_id - The ID of the country, from the facts table.

The last column is of particular interest to us, as it is a column of data that also exists in our original facts table. This link between tables is important as it's used to combine the data in our queries. Below is a schema diagram, which shows the two tables in our database, the columns within them and how the two are linked.

schema diagram

The line in the schema diagram clearly shows the link between the id column in the facts table and the facts_id column in the cities table.

If you'd like to download the database to follow along on your own computer you can download the dataset as a SQLite database.

Our First SQL Join

The most common way to join data using SQL is using an inner join. The syntax for an inner join is:

SELECT [column_names] FROM [table_name_one]
INNER JOIN [table_name_two] ON [join_constraint];

The inner join clause is made up of two parts:

  • INNER JOIN, which tells the SQL engine the name of the table you wish to join in your query, and that you wish to use an inner join.
  • ON, which tells the SQL engine what columns to use to join the two tables.

Joins are usually used in a query after the FROM clause. Let's look at a basic inner join where we combine the data from both of our tables.

SELECT * FROM facts
INNER JOIN cities ON cities.facts_id = facts.id
LIMIT 5;

Let's look at the line of the query with the join in it:

  • INNER JOIN cities: This tells the SQL engine that we wish to join the cities table to our query using an inner join.
  • ON cities.facts_id = facts.id: This tells the SQL engine which columns to use when joining the data, following the syntax table_name.column_name.

You might presume that SELECT * FROM facts will mean that the query returns only columns from the facts table, however the * wildcard when used with a join will give you all columns from both tables. Here is the result of this query:

id code name area area_land area_water population population_growth birth_rate death_rate migration_rate id name population capital facts_id
216 aa Aruba 180 180 0 112162 1.33 12.56 8.18 8.92 1 Oranjestad 37000 1 216
6 ac Antigua and Barbuda 442 442 0 92436 1.24 15.85 5.69 2.21 2 Saint John'S 27000 1 6
184 ae United Arab Emirates 83600 83600 0 5779760 2.58 15.43 1.97 12.36 3 Abu Dhabi 942000 1 184
184 ae United Arab Emirates 83600 83600 0 5779760 2.58 15.43 1.97 12.36 4 Dubai 1978000 0 184
184 ae United Arab Emirates 83600 83600 0 5779760 2.58 15.43 1.97 12.36 5 Sharjah 983000 0 184

This query gives us all columns from both tables and every row where there is a match between the id column from facts and the facts_id from cities, limited to the first 5 rows.

Understanding SQL Inner Joins

We've now joined the two tables to give us extra information about each row in cities. Let's take a closer look at how this inner join works.

An inner join works by including only rows from each table that have a match as specified using the ON clause. Let's look at a diagram of how our join from the previous screen works. We have included a selection of rows which best illustrate the join:

inner join

Our inner join will include:

  • Rows from the cities table that have a cities.facts_id that matches a facts.id from facts.

Our inner join will not include:

  • Rows from the cities table that have a cities.facts_id that don't match any facts.id from facts.
  • Rows from the facts table that have a facts.id that don't match any cities.facts_id from cities.

You can see this represented as a Venn diagram:

Inner join Venn diagram

We already know how to use aliases to specify custom names for columns, eg:

SELECT AVG(population) AS average_population

We can also create aliases for table names, which makes queries with joins easier to both read and write. Instead of:

SELECT * FROM facts
INNER JOIN cities ON cities.facts_id = facts.id

We can write:

SELECT * FROM facts AS f
INNER JOIN cities AS c ON c.facts_id = f.id

Just like with column names, using AS is optional. We can get the same result by writing:

SELECT * FROM facts f
INNER JOIN cities c ON c.facts_id = f.id

We can also combine aliases with wildcards - for instance, using the aliases created above, c.* would give us all columns from the table cities.

While our query from the previous screen included both columns from the ON clause, we don't need to use either column from our ON clause in our final list of columns. This is useful as it means we can show only the information we're interested in, rather than having to include the two join columns every time.

Let's use what we've learned to build on our original query. We'll:

  • Join cities to facts using an INNER JOIN.
  • Use aliases for table names.
  • Include, in order:
    • All columns from cities.
    • The name column from facts aliased to country_name.
  • Include only the first 5 rows.
SELECT
    c.*,
    f.name country_name
FROM facts f
INNER JOIN cities c ON c.facts_id = f.id
LIMIT 5;
id name population capital facts_id country_name
1 Oranjestad 37000 1 216 Aruba
2 Saint John'S 27000 1 6 Antigua and Barbuda
3 Abu Dhabi 942000 1 184 United Arab Emirates
4 Dubai 1978000 0 184 United Arab Emirates
5 Sharjah 983000 0 184 United Arab Emirates

Practicing Inner Joins in SQL

Let's practice writing a query to answer a question from our database using an inner join. Say we want to produce a table of countries and their capital cities from our database using what we've learned so far. Our first step is to think about what columns we'll need in our final query. We'll need:

  • The name column from facts
  • The name column from cities

Given that we've identified that we need data from two tables, we need to think about how to join them. The schema diagram from earlier indicated that there is only one column in each table that links them together, so we can use an inner join with those columns to join the data.

So far, thinking through our question we can already write most of our query (it's almost identical to the previous query we wrote):

SELECT
    f.name,
    c.name
FROM cities c
INNER JOIN facts f ON f.id = c.facts_id;

The last part of our process is to make sure we have the correct rows. From the previous two screens we know that a query like this will return all rows from cities that have a corresponding match from facts in the facts_id column. We're only interested in the capital cities from the cities table, so we'll need to use a WHERE clause on the capital column, which has a value of 1 if the city is a capital, and 0 if it isn't:

WHERE c.capital = 1

We can now put this all together to write a query that answers our question. We'll limit it to just the first 10 rows so the amount of output is managable.

SELECT
    f.name country,
    c.name capital_city
FROM cities c
INNER JOIN facts f ON f.id = c.facts_id
WHERE c.capital = 1
LIMIT 10;
country capital_city
Aruba Oranjestad
Antigua and Barbuda Saint John'S
United Arab Emirates Abu Dhabi
Afghanistan Kabul
Algeria Algiers
Azerbaijan Baku
Albania Tirana
Armenia Yerevan
Andorra Andorra La Vella
Angola Luanda

Left Joins in SQL

As we mentioned earlier, an inner join will not include any rows where there is not a mutual match from both tables. This means there could be information we are not seeing in our query where rows don't match.

We can use SQL queries to explore this:

SELECT COUNT(DISTINCT(name)) FROM facts;

count

261
SELECT COUNT(DISTINCT(facts_id)) FROM cities;

count

210

By running these two queries, we can see that there are some countries in the facts table that don't have corresponding cities in the cities table, which indicates we may have some incomplete data.

Let's look at how we can create a query to explore the missing data using a new type of join— the left join.

A left join includes all the rows that an inner join will select, plus any rows from the first (or left) table that don't have a match in the second table. We can see this represented as a Venn diagram.

Venn diagram left join

Let's look at an example by replacing INNER JOIN with LEFT JOIN from the first query we wrote, and looking at the same selection of rows from our earlier diagram

SELECT * FROM facts
LEFT JOIN cities ON cities.facts_id = facts.id

left join

Here we can see that for the rows where facts.id doesn't match any values in cities.facts_id (237, 238, 240, and 244), the rows are still included in the results. When this happens, all of the columns from the cities table are populated with null values.

We can use these null values to filter our results to just the countries that don't exist in cities with a WHERE clause. When making a comparison to null in SQL, we use the IS keyword, rather than the = sign. If we want to select rows where a column is null we can write:

WHERE column_name IS NULL

If we want to select rows where a column name isn't null, we use:

WHERE column_name IS NOT NULL

Let's use a left join to explore the countries that don't exist in the cities table.

SELECT
    f.name country,
    f.population
FROM facts f
LEFT JOIN cities c ON c.facts_id = f.id
WHERE c.name IS NULL;
country population
Kosovo 1870981
Monaco 30535
Nauru 9540
San Marino 33020
Singapore 5674472
Holy See (Vatican City) 842
Taiwan 23415126
European Union 513949445
Ashmore and Cartier Islands
Christmas Island 1530
Cocos (Keeling) Islands 596
Coral Sea Islands
Heard Island and McDonald Islands
Norfolk Island 2210
Hong Kong 7141106
Macau 592731
Clipperton Island
French Southern and Antarctic Lands
Saint Barthelemy 7237
Saint Martin 31754
Curacao 148406
Sint Maarten 39689
Cook Islands 9838
Niue 1190
Tokelau 1337
Bouvet Island
Jan Mayen
Svalbard 1872
Akrotiri 15700
British Indian Ocean Territory
Dhekelia 15700
Gibraltar 29258
Guernsey 66080
Jersey 97294
Montserrat 5241
Pitcairn Islands 48
South Georgia and South Sandwich Islands
Navassa Island
Wake Island
United States Pacific Island Wildlife Refuges
Antarctica 0
Gaza Strip 1869055
Paracel Islands
Spratly Islands
West Bank 2785366
Arctic Ocean
Atlantic Ocean
Indian Ocean
Pacific Ocean
Southern Ocean
World 7256490011

Looking through the results of the query we wrote in the previous screen, we can see a number of different reasons that countries don't have corresponding values in cities:

  • Countries with small populations and/or no major urban areas (which are defined as having populations of over 750,000), eg San Marino, Kosovo, and Nauru.
  • City-states, such as Monaco and Singapore.
  • Territories that are not themselves countries, such as Hong Kong, Gibraltar, and the Cook Islands.
  • Regions & Oceans that aren't countries, such as the European Union and the Pacific Ocean.
  • Genuine cases of missing data, such as Taiwan.

It's important whenever you use inner joins to be mindful that you might be excluding important data, especially if you are joining based on columns that aren't linked in the database schema.

Right Joins and Outer Joins

There are two less-common join types SQLite does not support that you should be aware of. The first is a right join. A right join, as the name indicates, is exactly the opposite of a left join. While the left join includes all rows in the table before the JOIN clause, the right join includes all rows in the new table in the JOIN clause. We can see a right join in the Venn diagram below:

Venn diagram of a right join

The following two queries, one using a left join and one using a right join, produce identical results.

SELECT f.name country, c.name city
FROM facts f
LEFT JOIN cities c ON c.facts_id = f.id
LIMIT 5;
SELECT f.name country, c.name city
FROM cities c
RIGHT JOIN facts f ON f.id = c.facts_id
LIMIT 5;

The main reason a right join would be used is when you are joining more than two tables. In these cases, using a right join is preferable because it can avoid restructuring your whole query to join one table. Outside of this, right joins are used reasonably rarely, so for simple joins it's better to use a left join than a right as it will be easier for your query to be read and understood by others.

The other join type not supported by SQLite is a full outer join. A full outer join will include all rows from the tables on both sides of the join. We can see a full outer join in the Venn diagram below:

Venn diagram of a full outer join

Like right joins, full outer joins are reasonably uncommon. The standard SQL syntax for a full outer join is:

SELECT f.name country, c.name city
FROM cities c
FULL OUTER JOIN facts f ON f.id = c.facts_id
LIMIT 5;

When joining cities and facts with a full outer join, the result will be be the same as our left and right joins above, because there are no values in cities.facts_id that don't exist in facts.id.

Let's look at the Venn diagrams of each join type side by side, which should help you compare the differences of each of the four joins we've discussed so far.

Join Venn Diagram

Next, let's practice using joins to answer some questions about our data.

Finding the Most Populous Capital Cities

Previously, we've used column names when specifying order for our query results, like so:

SELECT
    name,
    migration_rate
FROM facts
ORDER BY migration_rate desc;

There is a handy shortcut we can use in our queries which lets us skip the column names, and instead use the order in which the columns appear in the SELECT clause. In this instance, migration_rate is the second column in our SELECT clause so we can just use 2 instead of the column name:

SELECT
    name,
    migration_rate
FROM facts
ORDER BY 2 desc;

You can use this shortcut in either the ORDER BY or GROUP BY clauses. Be mindful that you want to ensure your queries are still readable, so typing the full column name may be better for more complex queries.

Let's use what we've learned to produce a list of the top 10 capital cities by population. Because we are not interested in countries from facts that don't have corresponding cities in cities, we should use an INNER JOIN.

SELECT
    c.name capital_city,
    f.name country,
    c.population
FROM facts f
INNER JOIN cities c ON c.facts_id = f.id
WHERE c.capital = 1
ORDER BY 3 DESC
LIMIT 10;
capital_city country population
Tokyo Japan 37217000
New Delhi India 22654000
Mexico City Mexico 20446000
Beijing China 15594000
Dhaka Bangladesh 15391000
Buenos Aires Argentina 13528000
Manila Philippines 11862000
Moscow Russia 11621000
Cairo Egypt 11169000
Jakarta Indonesia 9769000

Combining SQL Joins with Subqueries

Subqueries can be used to substitute parts of queries, allowing us to find the answers to more complex questions. We can also join to the result of a subquery, just like we could a table.

Here's an example of a using a join and a subquery to produce a table of countries and their capital cities, like we did earlier in the lesson.

subqueries

Reading subqueries can be overwhelming at first, so we'll break down what happens in this example in several steps. The important thing to remember is that the result of any subquery is always calculated first, so we read from the inside out.

  • The subquery, in the red box, is calculated first. This simple query selects all columns from cities, filtering rows that are marked as capital cities by having a value for capital of 1.
  • The INNER JOIN joins the subquery result, aliased as c, to the facts table based on the ON clause.
  • Two columns are selected from the results of the join:
    • f.name, aliased as country.
    • c.name, aliased as capital_city.
  • The results are limited to the first 10 rows.

Below is the output of this query:

country capital_city
Aruba Oranjestad
Antigua and Barbuda Saint John'S
United Arab Emirates Abu Dhabi
Afghanistan Kabul
Algeria Algiers
Azerbaijan Baku
Albania Tirana
Armenia Yerevan
Andorra Andorra La Vella
Angola Luanda

Using this example as a model, we'll write a similar query to find non-capital cities with populations of over 10 million.

SELECT
    c.name city,
    f.name country,
    c.population population
FROM facts f
INNER JOIN (
            SELECT * FROM cities
            WHERE capital = 0
            AND population > 10000000
           ) c ON c.facts_id = f.id
ORDER BY 3 DESC;
city country population
New York-Newark United States 20352000
Shanghai China 20208000
Sao Paulo Brazil 19924000
Mumbai India 19744000
Marseille-Aix-en-Provence France 14890100
Kolkata India 14402000
Karachi Pakistan 13876000
Los Angeles-Long Beach-Santa Ana United States 13395000
Osaka-Kobe Japan 11494000
Istanbul Turkey 11253000
Lagos Nigeria 11223000
Guangzhou China 10849000

SQL Challenge: Complex Query with Joins and Subqueries

Let's take everything we've learned before and use it to write a more complex query. It's not uncommon to find that 'thinking in SQL' takes a bit of getting used to, so don't be discouraged if this query seems difficult to understand at first. It will get easier with practice!

When you're writing complex queries with joins and subqueries, it helps to follow this process:

  • Think about what data you need in your final output
  • Work out which tables you'll need to join, and whether you will need to join to a subquery.
    • If you need to join to a subquery, write the subquery first.
  • Then start writing your SELECT clause, followed by the join and any other clauses you will need.
  • Don't be afraid to write your query in steps, running it as you go— for instance you can run your subquery as a 'stand alone' query first to make sure it looks like you want before writing the outer query.

We will be writing a query to find the countries where the urban center (city) population is more than half of the country's total population. There are multiple ways to write this query but we'll step through one approach.

We can start by writing a query to sum all the urban populations for cities in each country. We can do this without a join by grouping on the facts_id (we'll use a limit in our example below to make the output managable):

SELECT
    facts_id,
    SUM(population) urban_pop
FROM cities
GROUP BY 1
LIMIT 5;
facts_id urban_pop
1 3097000
10 172000
100 1127000
101 5000
102 546000

Next, we'll join the facts table to that subquery, selecting the country name, urban population and total population (again, we've used a limit to keep things tidy):

SELECT
    f.name country,
    c.urban_pop,
    f.population total_pop
FROM facts f
INNER JOIN (
            SELECT
                facts_id,
                SUM(population) urban_pop
            FROM cities
            GROUP BY 1
           ) c ON c.facts_id = f.id
LIMIT 5;
country urban_pop total_pop
Afghanistan 3097000 32564342
Austria 172000 8665550
Libya 1127000 6411776
Liechtenstein 5000 37624
Lithuania 546000 2884433

Lastly, we'll create a new column that divides the urban population by the total population, and use a WHERE and ORDER BY to filter/rank the results:

SELECT
    f.name country,
    c.urban_pop,
    f.population total_pop
FROM facts f
INNER JOIN (
            SELECT
                facts_id,
                SUM(population) urban_pop
            FROM cities
            GROUP BY 1
           ) c ON c.facts_id = f.id
LIMIT 5;
country urban_pop total_pop urban_pct
Uruguay 1672000 3341893 0.500315
Congo, Republic of the 2445000 4755097 0.514185
Brunei 241000 429646 0.560927
New Caledonia 157000 271615 0.578024
Virgin Islands 60000 103574 0.579296
Falkland Islands (Islas Malvinas) 2000 3361 0.595061
Djibouti 496000 828324 0.598800
Australia 13789000 22751014 0.606083
Iceland 206000 331918 0.620635
Israel 5226000 8049314 0.649248
United Arab Emirates 3903000 5779760 0.675288
Puerto Rico 2475000 3598357 0.687814
Bahamas, The 254000 324597 0.782509
Kuwait 2406000 2788534 0.862819
Saint Pierre and Miquelon 5000 5657 0.883861
Guam 169000 161785 1.044596
Northern Mariana Islands 56000 52344 1.069846
American Samoa 64000 54343 1.177705

You can see that while our final query is complex, it's much easier to understand if you build it step-by-step.

SQL Joins: Next Steps

In this post dedicated to SQL joins, we learned:

  • The difference between inner and left joins.
  • The role of right and outer joins
  • How to choose which join is appropriate for your task.
  • Using joins with subqueries, aggregate functions, and other SQL techniques.

Other resources that might interest you include our Complete Guide to SQL, SQL Cheat Sheet, our article on SQL certification, our rundown of SQL interview questions for job interviews, and of course our interactive SQL courses. Click below to sign up and get started for free!

Celeste Grupman

About the author

Celeste Grupman

Celeste Grupman is the CEO at Dataquest She is passionate about creating affordable access to high-quality skills training for students across the globe.