Course Preview: Introduction to SQL

In our introduction to SQL mission, you'll learn the basics of SQL and explore a data set.

SQL skills are critical for any kind of data work, because they're generally necessary for querying the databases virtually all companies use to store their data. SQL is also one of the most common technical skills for companies to test during the job application process.

(Read more about why you need to learn SQL).

In this mission, which is the first lesson of our SQL Fundamentals course, you will learn the basic components of a SQL query, including commands such as:

  •  SELECT and FROM to select the data you want
  • WHERE to filter your data
  •  AND and OR to filter by multiple conditions at once
  •  ORDER BY and LIMIT to control your output and results

Sign up for a free account to try this course right now in our interactive app. You'll be writing real SQL queries right in your browser to pull, filter, and analyze data from a real database.

sql-fundamentals

Course Info:

Beginner

The median completion time for this course is 6.3 hours. View Details

This course includes five missions and one guided project.  It is the 11th course in the Data Analyst in Python path and the Data Scientist in Python path.

START LEARNING FREE

Scroll down to preview this SQL lesson!

Course Content Preview: Fundamentals of SQL and Databases

Structured Query Language, or SQL, is more than forty years old, and it is one of the most popular technologies used by data professionals, including data analysts, data scientists, and data engineers. Understanding the fundamentals of a more general-purpose language like Python or R is critical for working with data, but knowing SQL helps data professionals do more with their data. And if working with R or Python is one of your goals, SQL can help gather insights from data.

Here are a few key reasons why learning SQL will help anybody interested in working with data.

SQL is everywhere.

Almost all of the biggest names in tech use SQL — which is pronounced either “sequel” or “S.Q.L.” Companies like Facebook, Google, and Amazon have built their own high-performance database systems, but even their data teams use SQL to query data and perform data analysis. And it’s not just tech companies: companies big and small around the world use SQL.

SQL enables us to pull data from many sources.

In many real-word situations, data is distributed across many sources. SQL allows us to select specific data and transform it to fit our needs. For example, working with spreadsheets can be difficult if the data we need to answer our question is distributed across many files. SQL allows us to structure our data in a way that makes it accessible from one place.

sql-database

SQL data is structured into multiple, connected tables.

SQL is here to stay.

The Stack Overflow annual Developer Survey, which is the largest and most comprehensive survey of programmers around the world, consistently reveals that SQL is one of the most popular technologies used today.

Check out this Dataquest blog post if you'd like to learn more about why it's important to learn SQL. And if you'd like to learn more about how to learn SQL online with Dataquest, check out this blog post for some tips and tricks to learn SQL online.

So, let's learn about the language itself and how you can use it to query data.

Tired of reading? Learn by doing! 

Sign up for a free account and start writing real SQL queries, right in your browser.

Before we get started, let's learn a little about databases. 

When we work with data stored on our computers, we load the data from files like spreadsheets (and text files in several different formats). Working with files solely on our computer is fine most of the time, but we run into problems when we consider a few questions:

  • What if the data is too big to fit into a single spreadsheet file?
  • What if you share the data with team members and keep it updated?
  • What if there's sensitive information in your data that needs protection?

Thankfully, these problems already have a solution: the database. A database structures data just like a spreadsheet file by organizing data in different tables, which are comprised of rows and columns. In the example below, we see a table in a spreadsheet where each row represents a chess player and each column gives information about the player.

SQL-SpREADSHEET

A database can store much more data more securely than a spreadsheet or a text file. Unlike simply opening a spreadsheet, we actually have to "ask" for data from the database.

We primarily interact with a database using a database management system (DBMS) — a computer program to help users interact with data by giving the computer instructions through the DBMS.

We'll begin learning SQL with the DBMS SQLite. SQLite is a lightweight DBMS, and it is the most popular database in the world. 

Here, we're exploring data from the American Community Survey on job outcome statistics based on college majors that we loaded into a SQLite database.

Here are its first three rows:

indexRankMajor_codeMajorMajor_categoryTotalSample_sizeMenWomenShareWomenEmployedFull_timePart_timeFull_time_year_roundUnemployedUnemployment_rateMedianP25thP75thCollege_jobsNon_college_jobsLow_wage_jobs
012419PETROLEUM ENGINEERINGEngineering23393620572820.120564344197618492701207370.018380527110000950001250001534364193
122416MINING AND MINERAL ENGINEERINGEngineering7567679770.10185185199999999640556170388850.11724137975000550009000035025750
232415METALLURGICAL ENGINEERINGEngineering85637251310.153037383648558133340160.02409638673000500001050004561760

In this table, each row represents a major, and each column gives us some information about that major. Head to the dataset page to become familiar with what each column represents.

sql table

Below is a sample SQL instruction (usually called a query). This query selects all columns from the recent_grads table.


SELECT *
FROM recent_grads;

The output of this query would be the entirety of the recent_grads table. The process you used to visualize recent_grads breaks down into two steps:

  1. Write a SQL query that expresses the request "fetch all the data in the table."
  2. Ask the SQLite DBMS software to run the code and display the results.

This query is an example of computer code. Writing computer code is called programming.

SQL is one of many programming languages, and just like natural languages (e.g. English), code in SQL has to follow a defined structure and vocabulary. To display recent_grads, you would run the following SQL query:


SELECT * 
FROM recent_grads;

In the query above specifies the following:

  • The columns we wanted using SELECT * (the symbol * selects all the columns).
  • The table we wanted to query using FROM recent_grads.

The order of the different words in this query and the space between SELECT, *, FROM, and recent_grads are crucial features of SQL syntax. If we don't follow the syntax, the database will probably not return the information we want. 

The ; character signals the end of the query, but it isn't mandatory.

Here's a visual breakdown of the different components of the query:

sql_select_breakdown

You may have noticed that SELECT and FROM use uppercase letters. This isn't required, but it makes your code easier to read.

Since we often read code more often than we write it, it's common for coders to follow certain conventions, so that programs written by different people look the same, reducing the amount of work it takes to read them.

A couple of other elements that aren't required are the line change and indentation right before FROM. The reason why we change lines and indent this query is the same as above: stylistic conventions.

You may also notice that some words are highlighted. This happens because they are reserved words, (i.e., they are words that serve a particular purpose, so you can't or shouldn't use them for anything else).

We'll be following this SQL Style Guide in this write-up. This is a great resource for familiarizing yourself with more stylistic conventions.

Tired of reading? Learn by doing! 

Sign up for a free account and start writing real SQL queries, right in your browser.

One thing to keep in mind is that even if you write two queries, you will only see the result of the second one.

This isn't because the queries are the same. It's a quirk of the SQLite database that only the last query will display visually. If we want to see the results of multiple queries, we can run each query by itself.

This table has 173 rows and over 20 columns. For a computer, this isn't much information. For a human, however, it's difficult to make sense of this much data.

Let's look at how to preview a table without displaying it completely.

In practice, you will often need to access a database without any documentation. In this situation, you'll have to rely on the surrounding context of the database and on your own exploration.

Some tables have millions and millions of rows, so a task as simple as displaying a table can take a very long time, and if you're just trying to explore the table, it isn't really useful to see all of it.

Fortunately, SQL allows us to limit the number of rows we see by using the LIMIT clause.

Think of a clause as an optional reserved word that doesn't need to be in the code for the query to execute successfully.

Here's how you can use it to retrieve the first three rows (that we saw in a previous screen).


SELECT *
  FROM recent_grads
LIMIT 3;

In the following animation, we see all the columns being selected, and then we see only the first three rows being included.

SQL is a database-dependent language. This means that the SQL that you're learning here (for SQLite) isn't the same as SQL for other DBMSs, like PostgreSQL or Oracle.

However, the different versions of SQL (also called dialects or flavors) are more alike than they are different.

Although LIMIT functions with all SQL dialects, it sometimes needs a different reserved word. For example, in T-SQL (Microsoft's SQL flavor), we would write the query above like this:


SELECT TOP 3, *
FROM recent_grads;

Pay attention to these differences when switching between DBMSs.

Now, let's look at LIMIT.

Since we're writing some queries to display the entirety of the recent_grads table, or the limit by the number of rows, here's the result we get:

indexRankMajor_codeMajorMajor_categoryTotalSample_sizeMenWomenShareWomenEmployedFull_timePart_timeFull_time_year_roundUnemployedUnemployment_rateMedianP25thP75thCollege_jobsNon_college_jobsLow_wage_jobs
012419PETROLEUM ENGINEERINGEngineering23393620572820.120564344197618492701207370.018380527110000950001250001534364193
122416MINING AND MINERAL ENGINEERINGEngineering7567679770.101851852640556170388850.11724137975000550009000035025750
232415METALLURGICAL ENGINEERINGEngineering85637251310.153037383648558133340160.02409638673000500001050004561760
342417NAVAL ARCHITECTURE AND MARINE ENGINEERINGEngineering12581611231350.1073131967581069150692400.0501253137000043000800005291020
452405CHEMICAL ENGINEERINGEngineering3226028921239110210.341630502256942317051801669716720.061097712650005000075000183144440972

Often, we'll only want to look at data from specific columns. To return only the Major column, we need to add the specific column name in the SELECT statement part of the query (instead of using the * character to return all columns):


SELECT Major
FROM recent_grads;

This will return all of the values in the Major column. Here's an animation illustrating the process of running this query:

We can specify multiple columns this way, and the results table will preserve the order of the columns:


SELECT Major, Major_category
FROM recent_grads;

Notice that we use commas to separate column names, and after the last column name there is no comma. When SQL finds a comma, it expects a column to follow, so we need to make sure we don't include a comma after the last column. Here's a visual breakdown of this query:

SQLite accepts the columns with any choice of case; however, for answer checking purposes — and also for style and consistency —you should specify the column names.

Tired of reading? Learn by doing! 

Sign up for a free account and start writing real SQL queries, right in your browser.

Next, let's look at how to use SQL to answer a more specific question: which majors had students who were mostly women?

The SQL workflow translates the question we want to answer to the subset of data we want from the database. To determine which majors had mostly students who were women, we want the following subset:

  • Only the Major columns
  • Only the rows where ShareWomen is greater than 0.5 (corresponding to 50%)

To filter rows by specific criteria, we can use the WHERE statement. A WHERE statement commonly uses three things:

  1. The column we want the database to filter on: ShareWomen
  2. A comparison operator that specifies how we want to compare a value in a column: >=
  3. The value against which we want the database to compare each value: 0.5

Here are the comparison operators we can use:

  • Less than: <
  • Less than or equal to: <=
  • Greater than: >
  • Greater than or equal to: >=
  • Equal to: =
  • Not equal to: != or <>

To return only the values where ShareWomen is greater than or equal to 0.5, we can use the following WHERE clause:


SELECT Major
FROM recent_grads
WHERE ShareWomen >= 0.5;

Running this query will return the following results table (the first five results only):

Major
ACTUARIAL SCIENCE
COMPUTER SCIENCE
ENVIRONMENTAL ENGINEERING
NURSING
INDUSTRIAL PRODUCTION TECHNOLOGIES

Here's a breakdown of the different components:

sql where breakdown

We express the specific column we want in the SELECT part of the query and the specific rows we want in the WHERE part. Note that most database systems require that the SELECT and FROM statements come first, before WHERE or any other statements.

Here's a query that will return majors where students who were men outnumbered students who were women:


SELECT Major, ShareWomen 
FROM recent_grads
WHERE ShareWomen < 0.5;

The comparison value after the < operator must be either text or a number, depending on the field. Because ShareWomen is a numeric column, we just write the number 0.5.

For text values, we need to enclose the value in quotes. For example, if we wanted to select only the rows where the Major_category equaled Engineering, we would write the following:


SELECT Major
FROM recent_grads
WHERE Major_category = 'Engineering';

We can also use the AND operator to combine multiple filter criteria. For example, to determine which engineering majors had a majority of students who were women, we specify two filtering criteria:


SELECT Major
FROM recent_grads
WHERE Major_category = 'Engineering'
AND ShareWomen > 0.5;
Major
ENVIRONMENTAL ENGINEERING
INDUSTRIAL PRODUCTION TECHNOLOGIES

Above, we can confirm that only two majors met these criteria!

Tired of reading? Learn by doing! 

Sign up for a free account and start writing real SQL queries, right in your browser.


Boolean Conditions

We can use the AND operator to specify that our filter needs to pass two Boolean conditions. Both of the conditions had to evaluate to True for the record to appear in the result set. If we wanted to specify a filter that meets either of the conditions instead, we would use the OR operator.


SELECT [column1, column2,...]
  FROM [table1]
 WHERE [condition1]
OR [condition2];

We won't go into more detail regarding OR because we use the OR and AND operators in similar ways.

One other important feature is that we don't need to compare a column with a value, but we can also compare columns to other columns.

For example, we've been using the condition WHERE ShareWomen > 0.5. We can obtain an equivalent condition by using WHERE Men < Women.

There's a certain class of questions that we can't answer using only the techniques we learned so far. For example, if we wanted to write a query that returned all Engineering majors that either had mostly graduates who were women or an unemployment rate below 5.1%, we would need to use parentheses to express this more complex logic.

The three raw conditions we'll need are the following:

Major_category = 'Engineering'

ShareWomen >= 0.5

Unemployment_rate < 0.051

What the SQL query looks like using parentheses:


SELECT Major, Major_category, ShareWomen, Unemployment_rate
  FROM recent_grads
 WHERE (Major_category = 'Engineering')
AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051);

You may notice that we have enclosed the logic we want to evaluate together in parentheses. This is very similar to how we group mathematical calculations together in a particular order. The parentheses make it clear to the database that we want all of the rows where both of the expressions in the statements evaluate to True:

(Major_category = 'Engineering') -> True or False

(ShareWomen > 0.5 OR Unemployment_rate < 0.051) -> True or False

If we had written the WHERE statement without any parentheses, the database would guess what our intentions are, and it would actually execute the following query instead:

WHERE (Major_category = 'Engineering' AND ShareWomen > 0.5) OR (Unemployment_rate < 0.051)

Leaving the parentheses out implies that we want the calculation to happen from left to right in the order we wrote the logic, so we wouldn't get the data we want.

The results of every query we've written so far would have returned in order according to the Rank column. Earlier, we wrote a query that returned all of the columns and didn't filter rows according to any specific criteria (SELECT * FROM recent_grads LIMIT 5):

indexRankMajor_codeMajorMajor_categoryTotalSample_sizeMenWomenShareWomenEmployedFull_timePart_timeFull_time_year_roundUnemployedUnemployment_rateMedianP25thP75thCollege_jobsNon_college_jobsLow_wage_jobs
012419PETROLEUM ENGINEERINGEngineering23393620572820.120564344197618492701207370.018380527110000950001250001534364193
122416MINING AND MINERAL ENGINEERINGEngineering7567679770.101851852640556170388850.11724137975000550009000035025750
232415METALLURGICAL ENGINEERINGEngineering85637251310.153037383648558133340160.02409638673000500001050004561760
342417NAVAL ARCHITECTURE AND MARINE ENGINEERINGEngineering12581611231350.1073131967581069150692400.0501253137000043000800005291020
452405CHEMICAL ENGINEERINGEngineering3226028921239110210.341630502256942317051801669716720.061097712650005000075000183144440972

If we include the Rank column in our query, the results are ordered by the Rank column as well:

RankMajorMajor_categoryShareWomenUnemployment_rate
1PETROLEUM ENGINEERINGEngineering0.1205643440.018380527
3METALLURGICAL ENGINEERINGEngineering0.1530373830.024096386
4NAVAL ARCHITECTURE AND MARINE ENGINEERINGEngineering0.1073131960.050125313
14MATERIALS SCIENCEEngineering0.3108202850.023042836
15ENGINEERING MECHANICS PHYSICS AND SCIENCEEngineering0.1839851890.006334343
17INDUSTRIAL AND MANUFACTURING ENGINEERINGEngineering0.3434732180.042875544
24MATERIALS ENGINEERING AND MATERIALS SCIENCEEngineering0.2926070040.027788805
31ENVIRONMENTAL ENGINEERINGEngineering0.5585480090.093588575
39INDUSTRIAL PRODUCTION TECHNOLOGIESEngineering0.750472590.028308097
51ENGINEERING AND INDUSTRIAL MANAGEMENTEngineering0.1741225050.03365166

As the questions we want to answer get more complex, we want more control over the ordering of the results. We can specify the order using the ORDER BY clause. For example, we may want to understand which majors that met the criteria in the WHERE statement had the lowest unemployment rate:


SELECT Rank, Major, Major_category, ShareWomen, Unemployment_rate
FROM recent_grads
WHERE (Major_category = 'Engineering')
AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)
ORDER BY Unemployment_rate;

This will return the results in ascending order (increasing) by the Unemployment_rate column:

RankMajorMajor_categoryShareWomenUnemployment_rate
15ENGINEERING MECHANICS PHYSICS AND SCIENCEEngineering0.1839851890.006334343
1PETROLEUM ENGINEERINGEngineering0.1205643440.018380527
14MATERIALS SCIENCEEngineering0.3108202850.023042836
3METALLURGICAL ENGINEERINGEngineering0.1530373830.024096386
24MATERIALS ENGINEERING AND MATERIALS SCIENCEEngineering0.2926070040.027788805
39INDUSTRIAL PRODUCTION TECHNOLOGIESEngineering0.750472590.028308097
51ENGINEERING AND INDUSTRIAL MANAGEMENTEngineering0.1741225050.03365166
17INDUSTRIAL AND MANUFACTURING ENGINEERINGEngineering0.3434732180.042875544
4NAVAL ARCHITECTURE AND MARINE ENGINEERINGEngineering0.1073131960.050125313
31ENVIRONMENTAL ENGINEERINGEngineering0.5585480090.093588575

If we instead want the results ordered by the same column but in descending order, we can add the DESC keyword:


SELECT Rank, Major, Major_category, ShareWomen, Unemployment_rate
FROM recent_grads
WHERE (Major_category = 'Engineering')
AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)
ORDER BY Unemployment_rate DESC;

Here's the result of that query:

RankMajorMajor_categoryShareWomenUnemployment_rate
31ENVIRONMENTAL ENGINEERINGEngineering0.5585480090.093588575
4NAVAL ARCHITECTURE AND MARINE ENGINEERINGEngineering0.1073131960.050125313
17INDUSTRIAL AND MANUFACTURING ENGINEERINGEngineering0.3434732180.042875544
51ENGINEERING AND INDUSTRIAL MANAGEMENTEngineering0.1741225050.03365166
39INDUSTRIAL PRODUCTION TECHNOLOGIESEngineering0.750472590.028308097
24MATERIALS ENGINEERING AND MATERIALS SCIENCEEngineering0.2926070040.027788805
3METALLURGICAL ENGINEERINGEngineering0.1530373830.024096386
14MATERIALS SCIENCEEngineering0.3108202850.023042836
1PETROLEUM ENGINEERINGEngineering0.1205643440.018380527
15ENGINEERING MECHANICS PHYSICS AND SCIENCEEngineering0.1839851890.006334343

Tired of reading? Learn by doing! 

Sign up for a free account and start writing real SQL queries, right in your browser.

The kind of queries we learned here and that you can keep studying for the next few courses are one of four types of SQL commands we can give:

In the next course, we'll focus on data query language (DQL). DQL is the part of SQL that allows users to extract data from databases. Data engineers use the remaining types of instructions to create and maintain databases.

Here are a few things to note:

  • We rarely linked to SQLite documentation, because it's a bit challenging to understand when you're just beginning. Sites like W3 Schools and SQL ZOO are more friendly for looking up SQL commands.
  • We learned about clauses, statements, keywords, and operators in SQL. Here's a diagram describing the difference between each term:

The ability to quickly iterate on queries as you think of new questions is the appeal of SQL. The SQL workflow lets data professionals focus on asking and answering questions, instead of lower-level programming concepts. There's a clear separation of concerns between the engine that stores, organizes, and retrieves the data and the language that lets people interface with the data without worrying about the underlying mechanics.

As the scale of data has increased, engineers have maintained the SQL interface while changing the database engine. This allows people who need to ask and answer questions easily transfer their SQL experience, even as database technologies change. For example, the Presto project lets you query using SQL but use data from database systems like MySQL, from a distributed file system like HDFS, and more.

In the next mission, we'll learn how to compute summary statistics and perform reductions on the same data in SQL.

You've learned it. Now do it, right in your browser!

Sign up for a free account and start running real SQL queries, right in your browser.