Blog – Dataquest (https://www.dataquest.io): Acquire the skills you need to start and advance your data science career.

Data Analyst Skills – 8 Skills You Need to Get a Job
https://www.dataquest.io/blog/data-analyst-skills/
Sun, 18 Apr 2021 14:55:36 +0000

What are 5 real-world tasks that cover most of the skills someone needs to be hired as a data analyst?


What is a Data Analyst?

A data analyst is someone who uses technical skills to analyze data and report insights.

On a typical day, a data analyst might use SQL skills to pull data from a company database, use programming skills to analyze that data, and then use communication skills to report their results to a larger audience.

It's a fulfilling job that pays well. Being a data analyst also provides experience that can be beneficial for stepping into more advanced roles like data scientist.

How to Become a Data Analyst

  1. Learn the technical skills (SQL and some data analysis with Python or R)
  2. Learn the fundamentals of statistics
  3. Build data analysis projects that showcase your hard and soft skills

So you've decided you want to be a data analyst. Or maybe your goal is to be a data scientist, but you know many entry-level jobs are analyst roles. In either case, you're going to need to master data analyst skills to get you where you want to go.

But what are those skills? What are the things you need to know? In this article, you'll learn the eight key skills you'll need to get a job as a data analyst.

What Skills Does a Data Analyst Need?

We'll be focusing on skills and not on tools (like Python, R, SQL, Excel, Tableau, etc.). Our focus will be what you'll need to do as a data analyst, not how you do those things.

Tools — the how — will vary depending on the exact role, the company that hires you, and the industry you end up working in. You can take the data analyst skills from this article and apply them using the tools that you're learning with, or that suit the industry you're looking to break into.

The research for this article grew out of the planning for our Dataquest Data Analyst paths. To make sure we teach the right mix of skills, we did a lot of research to understand what data analysts really do.

This research included interviews with data analysts, data scientists, and recruiters/hiring managers for data roles. We also conducted a review of existing research on the topic.

1: Data Cleaning and Preparation

Research shows that data cleaning and preparation accounts for around 80% of the work of data professionals. This makes it perhaps the key skill for anyone serious about getting a job in data.

Commonly, a data analyst will need to retrieve data from one or more sources and prepare the data so it is ready for numerical and categorical analysis. Data cleaning also involves handling missing and inconsistent data that may affect your analysis.

Data cleaning isn’t always considered “sexy”, but preparing data can actually be a lot of fun when treated as a problem-solving exercise. In any case, it's where most data projects start, so it's a key skill you'll need if you're going to become a data analyst.
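
To make this concrete, here's a minimal sketch (in Python with pandas, though the same ideas apply in R or SQL) of the kind of cleanup a small, hypothetical dataset might need. The column names and values are invented for illustration:

```python
import pandas as pd

# Hypothetical raw data with the kinds of problems analysts see every day:
# missing values, inconsistent labels, and numbers stored as text.
raw = pd.DataFrame({
    "region": ["North", "north ", "South", None, "South"],
    "revenue": ["1200", "950", None, "1100", "not available"],
})

clean = raw.copy()
# Standardize inconsistent category labels.
clean["region"] = clean["region"].str.strip().str.title()
# Coerce revenue to numeric; anything unparseable becomes NaN.
clean["revenue"] = pd.to_numeric(clean["revenue"], errors="coerce")
# Decide how to handle missing data: drop rows missing a region,
# and fill missing revenue with the median as a simple placeholder.
clean = clean.dropna(subset=["region"])
clean["revenue"] = clean["revenue"].fillna(clean["revenue"].median())

print(clean)
```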

2: Data Analysis and Exploration

It might sound funny to list “data analysis” in a list of required data analyst skills. But analysis itself is a specific skill that needs to be mastered.

At its core, data analysis means taking a business question or need and turning it into a data question. Then, you'll need to transform and analyze data to extract an answer to that question.

Another form of data analysis is exploration. Data exploration means looking for interesting trends or relationships in the data that could bring value to a business.

Exploration might be guided by an original business question, but it also might be relatively unguided. By looking for patterns and blips in the data, you may stumble across an opportunity for the business to decrease costs or increase growth!
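
As a rough illustration, here's what a first pass at exploration might look like in pandas, using a made-up orders table. In practice you'd pull real data and follow up on whatever patterns stand out:

```python
import pandas as pd

# Hypothetical order data; in practice this would come from a database or CSV.
orders = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"],
    "channel": ["web", "store", "web", "store", "web", "store"],
    "revenue": [1200, 800, 1500, 700, 1800, 650],
})

# Summary statistics give a quick feel for the data.
print(orders["revenue"].describe())

# Grouping and aggregating surfaces trends worth a closer look,
# e.g. web revenue growing while store revenue shrinks.
by_channel = orders.groupby(["month", "channel"])["revenue"].sum().unstack()
print(by_channel)
```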

3: Statistical Knowledge

A strong foundation in probability and statistics is an important data analyst skill. This knowledge will help guide your analysis and exploration and help you understand the data that you're working with.

Additionally, understanding stats will help you make sure your analysis is valid and will help you avoid common fallacies and logical errors.

The exact level of statistical knowledge required will vary depending on the demands of your particular role and the data you're working with. For example, if your company relies on probabilistic analysis, you'll need a much more rigorous understanding of those areas than you would otherwise.
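
For example, a common statistical task for an analyst is checking whether a difference between two groups is real or just noise. The sketch below uses SciPy's two-sample t-test on simulated A/B test data; the numbers are invented, and the right test always depends on your data and question:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical A/B test data: order values for two versions of a checkout page.
group_a = rng.normal(loc=52.0, scale=8.0, size=200)
group_b = rng.normal(loc=54.0, scale=8.0, size=200)

# A two-sample t-test asks whether the observed difference in means
# is larger than we'd expect from random variation alone.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value (commonly < 0.05) suggests the difference is unlikely
# to be due to chance alone, but significance thresholds depend on context.
```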

4: Creating Data Visualizations

Data visualizations make trends and patterns in data easier to understand. Humans are visual creatures, and most people aren’t going to be able to get meaningful insight by looking at a giant spreadsheet of numbers. As a data analyst, you'll need to be able to create plots and charts to help communicate your data and findings visually.

This means creating clean, visually compelling charts that will help others understand the data. It also means avoiding things that are either difficult to interpret (like pie charts) or can be misleading (like manipulating axis values).

Visualizations can also be an important part of data exploration. Sometimes there are things you can see in a chart that stay hidden when you just look at the raw numbers.

Data with the same statistics can produce radically different plots (source)

It's very rare to find a data role that doesn't require data visualization, making it a key data analyst skill.
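
To give a flavor of what this looks like in practice, here's a small matplotlib sketch that plots made-up monthly signup numbers and starts the y-axis at zero to avoid the misleading-axes problem mentioned above:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly signups; a line chart makes the trend obvious
# in a way a table of numbers does not.
signups = pd.Series(
    [310, 340, 365, 420, 410, 480],
    index=["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
)

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(signups.index, signups.values, marker="o")
ax.set_title("Monthly signups")
ax.set_ylabel("New users")
ax.set_ylim(bottom=0)  # start the y-axis at zero to avoid exaggerating the trend
fig.tight_layout()
fig.savefig("signups.png")
```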

5: Creating Dashboards and/or Reports

As a data analyst, you'll need to empower others within your organization to use data to make key decisions. By building dashboards and reports, you’ll be giving others access to important data by removing technical barriers.

This might take the form of a simple chart and table with date filters, all the way up to a large dashboard containing hundreds of data points that are interactive and update automatically.

Job requirements can vary a lot from position to position, but almost every data analyst job is going to involve producing reports on your findings and/or building dashboards to showcase them.
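
Full dashboards usually involve BI tools or frameworks like Tableau, Power BI, or Shiny, but even a few lines of pandas can turn raw data into a shareable report. The sketch below, with invented sales figures, pivots the data and writes it out as an HTML table a stakeholder could open in a browser:

```python
import pandas as pd

# Hypothetical sales data that a stakeholder wants summarized regularly.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "revenue": [12000, 9500, 13400, 10100],
})

# Pivot into a readable summary table and write it out as a simple HTML report
# that non-technical colleagues can open in a browser.
summary = sales.pivot_table(index="region", columns="quarter",
                            values="revenue", aggfunc="sum")
summary.to_html("quarterly_report.html")
```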

6: Writing and Communication Skills

The ability to communicate in multiple formats is a key data analyst skill. Writing, speaking, explaining, listening: strong communication skills across all of these areas will help you succeed.

Communication is key in collaborating with your colleagues. For example, in a kickoff meeting with business stakeholders, careful listening skills are needed to understand the analyses they require. Similarly, during your project, you may need to be able to explain a complex topic to non-technical teammates.

Written communication is also incredibly important — you'll almost certainly need to write up your analysis and recommendations. 

Being clear, direct, and easily understood is a skill that will advance your career in data. It may be a “soft” skill, but don’t underestimate it — the best analytical skills in the world won’t be worth much unless you can explain what they mean and convince your colleagues to act on your findings.

7: Domain Knowledge

Domain knowledge is understanding things that are specific to the particular industry and company that you work for. For example, if you're working for a company with an online store, you might need to understand the nuances of e-commerce. In contrast, if you're analyzing data about mechanical systems, you might need to understand those systems and how they work.

Domain knowledge changes from industry to industry, so you may find yourself needing to research and learn quickly. No matter where you work, if you don't understand what you're analyzing it's going to be difficult to do it effectively, making domain knowledge a key data analyst skill.

This is certainly something that you can learn on the job, but if you know a specific industry or area you’d like to work in, building as much understanding as you can up front will make you a more attractive job applicant and a more effective employee once you do get the job.

8: Problem-Solving

As a data analyst, you're going to run up against problems, bugs, and roadblocks every day. Being able to problem-solve your way out of them is a key skill.

You might need to research a quirk of some software or coding language that you're using. Your company might have resource constraints that force you to be innovative in how you approach a problem. The data you're using might be incomplete. Or you might need to perform some “good enough” analysis to meet a looming deadline.

Whatever the circumstances, strong problem-solving skills are going to be an incredible asset for any data analyst.

Other Data Analyst Skills

The exact definition of “data analyst” varies a lot depending on whom you ask, so it's possible not all of these skills will be required for every data analyst job. 

Similarly, there may be skills some companies will require that aren't on this list. Our focus here was to find the set of skills that most data analyst roles require in order to build the very best data analyst learning paths for our students.

Data Analyst Job Breakdown

So far in this article, we've looked at critical skills for data analysts from a broad perspective. Now, let's dig a little deeper into some of the specifics.

If you're looking for a job as a data analyst, what kinds of things will you need on your resume? And how much can you expect to get paid if you get the job? Let's take a look at some of the specifics.

Data Analyst Qualifications

Generally speaking, employers will expect data analysts to have a bachelor's degree in something, and a degree in a quantitative/STEM field may help. However, a degree is not required. Data analysts are in high demand, and employers are concerned primarily with an applicant's actual skills — if you have the right skills and the projects to prove it, you can get a data analyst job without a degree.

People often ask whether some kind of data science certificate is required or helpful for getting jobs in data. The answer is no. Employers are primarily concerned with skills, and when we spoke to dozens of people who hire in this field, not a single one of them mentioned wanting to see certificates.

Certificate programs can be helpful if they teach you necessary skills, but employers aren't going to be scanning your resume looking for a data analyst certificate. Nor are they likely to care much about any certificates you've earned. They'll be looking for proof of actual skills.

Data Analyst Requirements

Above, we've talked about the skills data analysts need, and we've explained why you probably don't need any paper qualifications to become a data analyst.

What you do need, though, is proof of the skills you have. Simply listing that you know SQL and Python on your resume is not enough, even if those are the job requirements listed in the job posting. You need to prove you have those skills.

The easiest way to do this is with prior work experience, of course — if an employer can see you've already worked as a data analyst, and call up your old boss for confirmation, you're in good shape.

But if you're reading this article, you probably don't have prior experience in the field. In that case, what you need is a portfolio of data analysis projects that potential employers can peruse. Having an active Github account with relevant projects (and linking to this account from your resume) is probably the quickest and easiest way to set up a portfolio.

The projects you showcase should be your best, and they should demonstrate that you have the skills listed in this article. Use a format like a Jupyter Notebook or R Markdown document to showcase your code along with written explanations and charts that a non-technical hiring manager or recruiter can understand. (Remember, you need to be showcasing your communication skills in addition to the technical skills you used to do the analysis).

The more relevant you can make these projects to the companies where you're applying, the better your chances will be of getting a call back.

Data Analyst Salaries

According to Indeed.com as of April 6, 2021, the average data analyst in the United States earns a salary of $72,945, plus a yearly bonus of $2,500. 

Experienced data analysts at top companies can make significantly more, however. Senior data analysts at companies such as Facebook and Target reported salaries of around $130,000 as of April 2021.

Sample Data Analyst Jobs

We looked at some open "Data Analyst" jobs (as of April 2021) and pulled together lists of the technical and non-technical skills they require. As you can see, the requirements do vary a bit from company to company.

FAANG Company:

  • Excel skills
  • SQL skills
  • Data visualization
  • Communication (heavily emphasized)

Major Insurance Company:

  • SQL skills
  • Communication
  • Python experience (preferred but not required)

Major Political Organization:

  • Excel skills
  • SQL skills
  • Communication

Major Car Company:

  • SQL skills
  • Python skills
  • Web-based data visualization
  • Communication

Popular Social Media Platform:

  • Background in statistics
  • SQL skills (expert)
  • Python or R (familiar)
  • Communication

Major University System:

  • SQL skills
  • Scripting ability in some language (SQL, Python, PowerShell, etc.)
  • Communication

That's just a tiny sample of what's available. And of course, we've simplified these job postings, boiling them down to just the most essential listed skills.

Even though there are differences, it's clear the same skills are required for many of these jobs:

  • Data analysis and data visualization (although the specific tools required vary from job to job). Two other skills we listed above, data cleaning and building dashboards, also fall under these headings.
  • Communication — every job posting mentioned this.
  • Statistics — most of the postings mentioned preferring someone with at least some knowledge of statistics, although many didn't list it as a requirement.
  • Problem solving or analytical thinking skills were also mentioned frequently.

Becoming a Data Analyst, Step by Step

If you're serious about becoming a data analyst and you're starting from scratch, don't worry! You can do this. It helps to take a step-by-step approach.

  1. Learn the basics of programming in R or Python. At Dataquest, our data analyst learning paths in Python and R are designed to help you start from scratch, even if you've never written a line of code before.
  2. Start building projects. As early as possible, start putting together data projects. These early projects will help you solidify the skills you're learning and keep you motivated. You should keep building projects of increasing difficulty and complexity as you work through the later steps here (Dataquest's learning paths have built-in guided projects to help with this).
  3. Learn SQL (and other technical skills). Different data analyst jobs will have different specific requirements, but almost any analyst job will require some SQL skills. We've written a bit about why SQL skills are critical, so don't skip that, but there are other technical skills that can make your life easier, too. A typical workflow pairs SQL with Python or R, as in the sketch after this list. At Dataquest, our data analyst learning paths will take you through all of these skills in a logical sequence, so each skill builds on the previous one and you don't have to worry about what to learn next.
  4. Share your work and engage with the community. This will help you learn, collaborate, and start building a "personal brand" as a data analyst. Sharing your work can feel intimidating, but you never know what kinds of job offers can come from the right person happening to come across a cool project you've shared.
  5. Push your boundaries. Once you've mastered the basics, be sure you keep pushing with your projects so that you're learning new skills. Don't fall into the trap of doing similar projects over and over again because you're comfortable doing them. Try to include at least one thing you've never done before in each new project, or go back to old projects and try to improve them or add complexity.
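
As a taste of the SQL-plus-programming workflow from step 3, here's a self-contained sketch that builds a tiny SQLite database in memory, queries it with SQL, and hands the result to pandas. At a real job, the connection and table names would point at your company's database:

```python
import sqlite3
import pandas as pd

# Build a tiny in-memory database so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Ana", 120.0), ("Ben", 80.0), ("Ana", 60.0)],
)

# The day-to-day pattern: pull an aggregated result set with SQL,
# then continue the analysis in pandas.
query = """
    SELECT customer, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer
    ORDER BY total_spent DESC
"""
top_customers = pd.read_sql_query(query, conn)
print(top_customers)
```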

You Can Become a Data Analyst!

In this article, we’ve covered what you need to learn to become a data analyst. If you want to learn the how, and build the technical skill set you need to successfully get a data analyst job, check out our interactive online data analysis courses.

 Without ever leaving your web browser, you’ll write real code to analyze real-world data as you learn interactively using our proven approach.

Learn, then code

At Dataquest, our vision is to become the world's first option for learning data skills. In order to achieve that, we craft our curriculum to teach students the skills they need to get jobs in data.

Specifically, the data analyst skills we’ve covered in this article are the basis for our two data analyst learning paths: Data Analyst in Python and Data Analyst in R.

You can start both paths for free and start your journey to being a data analyst today!

Learn R the Right Way in 5 Steps
https://www.dataquest.io/blog/learn-r-for-data-science/
Wed, 14 Apr 2021 08:00:00 +0000

R is an increasingly popular language for data analysis and data science. Here's how you can learn R and be sure it sticks so you can get the career you want.

R is an increasingly popular programming language, particularly in the world of data analysis and data science. You may have even heard people say that it's easy to learn R! But easy is relative. Learning R can be a frustrating challenge if you’re not sure how to approach it.

If you’ve struggled to learn R or another programming language in the past, you’re definitely not alone. And it’s not a failure on your part, or some inherent problem with the language.

Usually, it’s the result of a mismatch between what’s motivating you to learn and how you’re actually learning.

This mismatch causes big problems when you’re learning any programming language, because it takes you straight to a place we like to call the cliff of boring.

What is the cliff of boring? It’s the mountain of boring coding syntax and dry practice problems you’re generally asked to work through before you can get to the good stuff — the stuff you actually want to do.

The cliff of boring is a metaphor, but it really can feel like you're looking at this sometimes.

Nobody signs up to learn a programming language because they love syntax. Yet many learning resources, from textbooks to online courses, are written with the idea that students need to master all of the key areas of R syntax before they can do any real work with it.

This is the process that causes new learners to drop off in droves:

  1. You get excited about learning a programming language because you want to do something with it.
  2. You try to start learning and are immediately led to this huge wall of complicated, boring stuff.
  3. You struggle through some of the boring stuff with no idea how it relates to the thing you actually want to do.

Is it any wonder that many people quit when this is the default learning experience?

Don't misunderstand me — there’s no way around learning syntax, in R or any other programming language.

But there is a way to avoid the cliff of boring.

It’s a shame that so many students drop off at the cliff, because R is absolutely worth learning! In fact, R has some big advantages over other languages for anyone who’s interested in learning data science:

  • The R tidyverse ecosystem makes all sorts of everyday data science tasks very straightforward.
  • Data visualization in R can be both simple and very powerful.
  • R was built to perform statistical computing.
  • The online R community is one of the friendliest and most inclusive of all programming communities.
  • The RStudio integrated development environment (IDE) is a powerful tool for programming with R because all of your code, results, and visualizations are together in one place. With RStudio Cloud, you can use RStudio from your web browser without installing anything locally.

And of course, learning R can be great for your career. Data science is a fast-growing field with high average salaries (check out how much your salary could increase).

And tons of companies and organizations use R for data science work! Here's a very short sample of some of the companies using R (from Hired.com as of April 2021):

  • SpaceX
  • Google
  • Starbucks
  • Fitbit
  • Kraft Heinz
  • Hulu
  • Amazon
  • iRobot
  • Ubisoft
  • Allstate
  • Twitch
  • AT&T
  • Salesforce
  • Pfizer
  • General Motors
  • Northrop Grumman
  • Ralph Lauren
  • Goldman Sachs

This list is just the tip of the iceberg — thousands and thousands of companies all across the globe hire people with R skills, and R is very in demand in academia and government, as well. Even from this short list, it's clear that someone with R skills could work in almost any industry they wanted.

Big tech, finance, video games, big pharma, insurance, fashion — every industry needs people who can work with data, and that means that every industry has use for R programming skills.

So how can you get them?

Step 1. Find Your Motivation for Learning R

Before you crack a textbook, sign up for a learning platform, or click play on your first tutorial video, spend some time to really think about why you want to learn R, and what you’d like to do with it.

  • What data are you interested in working with? 
  • What projects would you enjoy building? 
  • What questions do you want to answer?

Find something that motivates you in the process. This will help you define your end goal, and it will help you get to that end goal without boredom.

Try to go deeper than “becoming a data scientist.” There are all kinds of data scientists who work on a huge variety of problems and projects. Are you interested in analyzing language? Predicting the stock market? Digging deep into sports statistics? What’s the thing you want to do with your new skills that’s going to keep you motivated as you work to learn R?

Pick one or two things that interest you and that you’re willing to stick with. Gear your learning towards them and build projects with your interests in mind.

Figuring out what motivates you will help you figure out an end goal, and a path that gets you there without boredom. You don’t have to figure out an exact project, just a general area you’re interested in as you prepare to learn R.

Pick an area you’re interested in, such as:

  • Data Science / Data Analysis
  • Data visualization
  • Predictive modeling / machine learning
  • Statistics
  • Reproducible reports
  • Dashboard reports

Create three-dimensional data visualizations in R with rayshader

Step 2. Learn the Basic Syntax

Unfortunately, there’s no way to completely avoid this step. Syntax in a programming language is even more important than syntax in a human language. If someone says “I’m the store going to,” their English-language syntax is wrong, but you can probably still understand what they mean. Unfortunately, computers are far less forgiving when they interpret your code.

However, learning syntax is boring, so your goal must be to spend as little time as possible doing syntax learning. Instead, learn as much of the syntax as you can while working on real-world problems that interest you so that there’s something to keep you motivated even though the syntax itself isn’t all that exciting.

Here are some resources for learning the basics of R:

  • Codecademy — does a good job of teaching basic syntax.
  • Dataquest: Introduction to R Programming — We built Dataquest to help data science students avoid the cliff of boring by integrating real-world data and real data science problems right off the bat. We think learning the syntax in the context of working on real problems makes it more interesting, and our interactive platform challenges you to really apply what you’re learning, checking your work as you go.
  • R for Data Science — One of the most useful resources for learning R and tidyverse tools. Available in print from O’Reilly or for free online.
  • R Style Guide — This shouldn’t be your primary learning resource but it can be a helpful reference.
  • RStudio Education - RStudio is the most popular integrated development environment (IDE) for programming with R. Their education page for beginners contains useful resources including tutorials, books, and webinars.
  • RStudio Cloud Primers - Start coding in R without installing any software with cloud-based tutorials from RStudio.

The quicker you can get to working on projects, the faster you will learn R. You can always refer to a variety of resources for learning and double-checking syntax if you get stuck later. But your goal should be to spend a couple of weeks on this phase, at most.

The RStudio Cheatsheets are also great reference guides for R syntax.

Step 3. Work on Structured Projects

Once you’ve got enough syntax under your belt, you’re ready to move on to structured projects more independently. Projects are a great way to learn, because they let you apply what you’ve already learned while generally also challenging you to learn new things and solve problems as you go. Plus, building projects will help you put together a portfolio you can show to future employers later down the line.

You probably don’t want to dive into totally unique projects just yet. You’ll get stuck a lot, and the process could be frustrating. Instead look for structured projects until you can build up a bit more experience and raise your comfort level.

If you choose to learn R with Dataquest, this is built right into our curriculum — nearly every one of our data science courses ends with a guided project that challenges you to synthesize and apply what you’re learning. These projects provide some structure, so you’re not totally on your own, but they’re more open-ended than regular course content to allow you to experiment, synthesize your skills in new ways, and make mistakes.

If you’re not studying with Dataquest, there are plenty of other structured projects out there for you to work on. Let’s look at some good resources for projects in each area:

Data science / Data analysis

  • Dataquest — Teaches you R and data science interactively. You analyze a series of interesting datasets ranging from CIA documents to WNBA player stats.
  • R for Data Science - by Hadley Wickham and Garrett Grolemund is an excellent R resource with motivating and challenging exercises. 
  • TidyTuesday - A semi-structured, weekly social data project in R where budding R practitioners clean, wrangle, tidy, and plot a new dataset posted every Tuesday. Results are shared on Twitter using the hashtag #tidytuesday.

Data visualization

  • ggplot2 - One of the most popular tools for data visualization in R is the ggplot2 package. The Data visualisation chapter from R for Data Science is a great place to learn the basics of data visualization with ggplot2. The chapter on Graphics for communication is a great resource for making graphics look more professional.
  • rayshader - build two-dimensional and three-dimensional maps in R with the rayshader package. You can also transform graphics developed with ggplot2 into 3D with rayshader.

Predictive modeling / machine learning

Statistics

Reproducible reports

Dashboard reports

  • Shiny Dashboard Tutorials - make dashboards in R with Shiny using these tutorials from RStudio.
  • Shiny Gallery - check out this gallery from RStudio for some Shiny dashboard inspiration and examples.

Step 4. Build Projects on Your Own

Once you’ve finished some structured projects, you’re probably ready to move on to the next stage of learning R: doing your own unique data science projects. It’s hard to know how much you’ve really learned until you step out and try to do something by yourself. Working on unique projects that interest you will give you a great idea not only of how far you’ve come but also of what you might want to learn next.

And although you’ll be building your own project, you won’t be working alone. You’ll still be referring to resources for help and learning new techniques and approaches as you work. With R in particular, you may find that there’s a package dedicated to helping with the exact sort of project you’re working on, so taking on a new project sometimes also means you’re learning a new R package.

What do you do if you get stuck? Do what the pros do, and ask for help! Here are some great resources for finding help with your R projects:

  • StackOverflow — Whatever your question is, it has probably been asked here before, and if it hasn’t, you can ask it yourself. You can find questions tagged with R here.
  • Google — Believe it or not, this is probably the most commonly-used tool of every experienced programmer. When you encounter an error that you don’t understand, a quick Google search of the error message will often point you towards the answer.
  • Twitter — It may be surprising, but Twitter is an excellent resource for getting help on R-related issues. It's also a great source of R-related news and updates from the world's leading R practitioners. The R community on Twitter is centralized around the #rstats hashtag.
  • Dataquest’s Learning Community — With a free student account you can join our learning community and ask technical questions that your fellow students or Dataquest’s data scientists can answer.

What sorts of projects should you build? As with the structured projects, these projects should be guided by the answers you came up with in step 1. Work on projects and problems that interest you. If you’re interested in climate change, for example, find some climate data to work with and start digging around for insights.

It’s best to start small rather than trying to take on a gigantic project that will never get finished. If what interests you most is a huge project, try to break it down into smaller pieces and tackle them one at a time.

Here are some ideas for projects that you can consider:

  • Expand on one of the structured projects you built before to add new features or deeper analysis.
  • Go to meetups or hook up with other R coders online and join a project that’s already underway.
  • Find an open-source package to contribute to (R has tons of great open source packages!)
  • Find an interesting project someone else made with R on Github and try to extend or expand on it. Or, find a project someone else made in another language and try to recreate it using R.
  • Read the news and look for interesting stories that might have available data you could dig into for a project.
  • Check out our list of free data sets for data science projects and see what available data inspires you to start building!

Here are some more project ideas in the topic areas that we've discussed:

Data science / Data analysis

  • A script to automate data entry.
  • A tool to scrape data from the web.

Data Visualization

  • A map that visualizes election polling by state, or region.
  • A collection of plots that depict the real-estate sale or rental trends in your area.

Predictive modeling / machine learning

  • An algorithm that predicts the weather where you live.
  • A tool that predicts the stock market.
  • An algorithm that automatically summarizes news articles.

Statistics

  • A model that predicts the cost of Uber trips in your area.

Reproducible reports

  • An R Markdown report of Covid-19 trends in your area that can be updated when new data becomes available.
  • A summary report of performance data for your favorite sports team.

Dashboard reports

  • A map of the live locations of buses in your area.
  • A stock market summary.
  • A Covid-19 tracker, like this one.
  • A summary of your personal spending habits.

Think of the projects like a series of steps — each one should set the bar a little higher, and be a little more challenging than the one before.

Step 5. Ramp Up the Difficulty

Working on projects is great, but if you want to learn R then you need to ensure that you keep learning. You can do a lot with just data visualization, for example, but that doesn’t mean you should build 20 projects in a row that only use your data visualization skills. Each project should be a little tougher and a little more complex than the previous one. Each project should challenge you to learn something you didn’t know before.

If you’re not sure exactly how to do that, here are some questions you can ask yourself to apply more complexity and difficulty to any project you’re considering:

  • Can you teach a novice how to make this project by (for example) writing a tutorial? Trying to teach something to someone else will quickly show you how well you really understand it, and it can be surprisingly challenging!
  • Can you scale up your project so that it can handle more data? A lot more data?
  • Can you improve its performance? Could it run faster?
  • Can you improve the visualization? Can you make it clearer? Can you make it interactive?
  • Can you make it predictive?

Never Stop Learning R

Learning a programming language is kind of like learning a second spoken language — you will reach a point of comfort and fluency, but you’ll never really be done learning. Even experienced data scientists who’ve been working with R for years are still learning new things, because the language itself is evolving, and new packages make new things possible all the time.

It’s important to stay curious and keep learning, but don’t forget to look back and appreciate how far you’ve come from time to time, too.

Learning R is definitely a challenge even if you take this approach. But if you can find the right motivation and keep yourself engaged with cool projects, I think anybody can reach a high level of proficiency.

We hope this guide is useful to you on your journey. If you have any other resources to suggest, please let us know!

And if you’re looking for a learning platform that integrates these lessons directly into the curriculum, you’re in luck, because we built one. Our Data Analyst in R path is an interactive course sequence that’s designed to take anyone from total beginner to job-qualified in R and SQL.

And all of our lessons are designed to keep you engaged by challenging you to solve data science problems using real-world data.

Common R Questions:


Is it hard to learn R?

Learning R can certainly be challenging, and you're likely to have frustrating moments. Staying motivated to keep learning is one of the biggest challenges.

However, if you take the step-by-step approach we've outlined here, you should find that it's easy to power through frustrating moments, because you'll be working on projects that genuinely interest you.

Can you learn R for free?

There are lots of free R learning resources out there — here at Dataquest, we have a bunch of free R tutorials and our interactive data science learning platform, which teaches R, is free to sign up for and includes many free missions.

The internet is full of free R learning resources! The downside to learning for free is that to learn what you want, you'll probably need to patch together a bunch of different free resources. You'll spend extra time researching what you need to learn next, and then finding free resources that teach it. Platforms that cost money may offer better teaching methods (like the interactive, in-browser coding Dataquest offers), and they also save you the time of having to find and build your own curriculum.

Can you learn R from scratch (with no coding experience)?

Yes. At Dataquest, we've had many learners start with no coding experience and go on to get jobs as data analysts, data scientists, and data engineers. R is a great language for programming beginners to learn, and you don't need any prior experience with code to pick it up. 

Nowadays, R is easier to learn than ever thanks to the tidyverse collection of packages. The tidyverse is a collection of powerful tools for accessing, cleaning, manipulating, analyzing, and visualizing data with R. This Dataquest tutorial provides a great introduction to the tidyverse.

How long does it take to learn R?

Learning a programming language is a bit like learning a spoken language — you're never really done, because programming languages evolve and there's always more to learn! However, you can get to a point of being able to write simple-but-functional R code pretty quickly.

How long it takes to get to job-ready depends on your goals, the job you're looking for, and how much time you can dedicate to study. But for some context, Dataquest learners we surveyed in 2020 reported reaching their learning goals in less than a year — many in less than six months — with less than ten hours of study per week.

Do you need an R certification to find work?

We've written about certificates in depth, but the short answer is: probably not. Different companies and industries have different standards, but in data science, certificates don't carry much weight. Employers care about the skills you have — being able to show them a GitHub full of great R code is much more important than being able to show them a certificate.

Is R a good language to learn in 2021?

Yes. R is a popular and flexible language that's used professionally in a wide variety of contexts. We teach R for data analysis and machine learning, for example, but if you wanted to apply your R skills in another area, R is used in finance, academia, and business, just to name a few.

Moreover, R data skills can be really useful even if you have no aspiration to become a full-time data scientist or programmer. Having some data analysis skills with R can be useful for a wide variety of jobs — if you work with spreadsheets, chances are there are things you could be doing faster and better with a little R knowledge. 

How much money do R programmers make?

This is difficult to answer, because most people with R skills work in research or data science, and they have other technical skills like SQL, too. ZipRecruiter lists the average R developer salary as $130,000 in the US (as of April 2021).

The average salary for a data scientist is pretty similar — $121,000 according to Indeed.com as of April 2021.

Should I learn base R or tidyverse first?

This is a popular debate topic in the R community. Here at Dataquest, we teach a mix of base R and tidyverse methods in our Introduction to Data Analysis in R course. We are big fans of the tidyverse because it is powerful, intuitive, and fun to use.

But to have a complete understanding of tidyverse tools, you'll need to understand some base R syntax and have an understanding of data types in R. For these reasons, we find it most effective to teach a mix of base R and tidyverse methods in our introductory R courses.

I needed a resource for beginners; something to walk me through the basics with clear, detailed instructions. That is exactly what I got in Dataquest’s Introduction to R course.

Because of Dataquest, I started graduate school with a strong foundation in R, which I use every day while working with data.

Ryan Quinn - Doctoral Student at Boston University

11 Real World Applications for Python Skills
https://www.dataquest.io/blog/real-world-python-use-cases/
Mon, 12 Apr 2021 19:18:54 +0000

Python is one of the most frequently-recommended programming languages. You’ve probably heard people say that’s because it’s relatively easy to learn — and that’s true! But is Python actually useful? What are some of the real-world applications for Python skills once you’ve got them?

In this post, we’ll look at some of the most common use-cases for Python. We’ll also look at a few situations where Python probably isn’t the best choice.

That said, it’s important to keep in mind that Python is an incredibly versatile language. People use it for all kinds of things. The broad real-world use cases we’ll cover here are really just the tip of the iceberg!

Who uses Python today?

The short answer: millions of developers, along with a lot of other folks. A 2019 estimate put the number of Python developers at 8.2 million. StackOverflow’s 2020 developer survey ranks Python as one of the most popular and widely-used languages among developers. And as of April 2021, Indeed.com is listing nearly 100,000 open jobs that require Python.

Of course, there are also quite a lot of people who use Python that wouldn’t be captured by these sorts of statistics and surveys. Python isn’t just used by developers! It’s used by marketers, researchers, data scientists, kids, hobbyists, IT professionals, and all sorts of other people. You don’t have to make your entire living writing Python to get some real benefits from learning it!

At a professional level, though, Python is very useful. For example:

What companies use Python?

Here’s just a short list of a few of the companies that use Python:

  1. Google and subsidiaries like Youtube use Python for a wide variety of things. In fact, Youtube was built using mostly Python!
  2. Industrial Light and Magic, the company behind the special effects of Star Wars and hundreds of other films, has been using Python for years for its CGI and lighting work.
  3. Facebook and subsidiaries like Instagram use Python for various elements of their infrastructure. Instagram is built entirely using Python and its Django framework.
  4. iRobot, the folks who make the Roomba vacuum, use Python to develop the software for their robots.
  5. NASA and associated institutions like the Jet Propulsion Lab use Python for research and scientific purposes.
  6. Netflix uses Python for server-side data analysis and for a wide variety of back-end apps that help keep the massive streaming service online.
  7. Reddit runs on Python and its web.py framework.
  8. IBM, Intel, and a variety of other hardware companies use Python for hardware testing.
  9. Chase, Goldman Sachs, and many other financial firms use Python for financial analysis and market forecasting.
  10. Quora is yet another huge social media platform that’s built using lots of Python.

And that’s just the tip of the iceberg! In fact, these days most large companies are probably using Python at some level. A good way to check is to search a job site like LinkedIn or Indeed for company name + python. Often, you’ll find that companies are looking for people with Python skills.

So, everybody’s using Python. What are they using Python to do? Let’s dive into some real-world applications for Python.

How is Python used in the real world?

1. Data Analytics

As companies across every industry collect more and more data, they need people who can make sense of it. Often, that means hiring data analysts with Python skills.

Python is popular for data analysis work because of powerful libraries like numpy and pandas, which make data cleaning and analysis tasks relatively straightforward, even when working with massive datasets. There are also Python libraries that support a wide variety of other data analytics tasks, from scraping the web with Beautiful Soup to visualizing data with Matplotlib.

Software tools like Jupyter Notebook make it easy for data analysts to create easy-to-repeat analyses, or add text and visualizations that make their work understandable even to people without coding skills.

Example use case: An ecommerce website wants to understand its users better. A data analyst at the company could use Python to analyze the company’s sales, highlight predictable trends, and uncover areas for improvement.
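
A stripped-down version of that analysis might look like the sketch below, which aggregates a handful of invented orders into monthly revenue and flags the weakest month. Real work would, of course, involve far more data and context:

```python
import pandas as pd

# Hypothetical ecommerce orders; real data would come from the company's database.
orders = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2021-01-05", "2021-01-20", "2021-02-03", "2021-02-25", "2021-03-14"]
    ),
    "amount": [35.0, 120.0, 59.0, 80.0, 42.5],
})

# Monthly revenue is a simple, trackable trend.
monthly = orders.set_index("order_date")["amount"].resample("M").sum()
print(monthly)

# Flag the weakest month as a possible area for improvement.
print("Weakest month:", monthly.idxmin().strftime("%B %Y"))
```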

2. Data Science/”AI”

Python is also incredibly popular for more advanced data work in the realm of machine learning. Powerful libraries like scikit-learn and TensorFlow make implementing popular machine learning algorithms very straightforward, and more specialized libraries exist to help with a wide variety of specific machine learning tasks from image recognition to content generation.

Almost anything you see being discussed as “AI” in the news is some sort of machine learning implementation. And an awful lot of that machine learning is being done with Python.

Example use case: A video streaming platform wants to increase user engagement and stickiness. A data science team could use Python to build a predictive model that recommends videos to users based on factors such as their watch history, viewership habits, what videos other users with similar habits watched, etc.
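
As a rough illustration of the workflow (not of any particular company's system), here's a minimal scikit-learn sketch: it generates synthetic features standing in for watch-history signals, trains a random forest classifier, and checks accuracy on held-out data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for "did the user watch the recommended video?" data;
# a real project would use engineered features from watch history.
X, y = make_classification(n_samples=1000, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```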

3. Web Development

As evidenced in the list of companies above, Python is a very popular language for web app development. Many of the websites you use every day were built using Python and popular Python web frameworks such as Django and Flask. Although the pages themselves are rendered with HTML and CSS, Python underlies these visual elements on many sites, driving functionality, managing databases and user accounts, and much more.

Example use case: A company needs to build a new version of its website with specific features. A web developer could build the new site with Python and Django, using the flexibility and power they offer to support any specific or custom features the company needs.
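
For a sense of why Python web frameworks are popular, here's about the smallest possible Flask app, exposing a single hypothetical health-check endpoint. A real site built with Flask or Django would layer templates, databases, and authentication on top of this:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# A minimal endpoint; a production site would add templates, a database,
# user accounts, and so on (often via a fuller framework like Django).
@app.route("/api/health")
def health():
    return jsonify(status="ok")

if __name__ == "__main__":
    app.run(debug=True)  # serves on http://127.0.0.1:5000 by default
```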

4. Game Development

Python is used in the development of indie video games, thanks to the existence of convenient libraries such as PyGame. (Noticing a pattern? Whatever your use case is, there’s probably already a few Python libraries out there designed to help with it).

Python isn’t used as frequently in the development of higher-budget games – if your goal is to build a photorealistic 3D world, Python’s relatively slow speed and high memory usage mean it isn’t the ideal language for the job. Python is sometimes used to build the systems that underlie these games, though. Games including Battlefield 2, Eve Online, The Sims 3, Civilization IV, and World of Tanks use Python, although none of them was coded entirely in Python.

Example use case: A small team wants to build a creative indie side-scrolling game. The devs could choose to work with Python to take advantage of the convenience of PyGame, and the relative ease of learning how to do new things in Python.

5. Software Development

Python is widely used in software development, across a wide variety of real-world applications. The line between software development and web development is a bit blurry these days, since almost all software is built to work on the web even when there’s also a desktop app. Dropbox is a good example of a modern software development company that does both, and Python was used to build Dropbox’s desktop app. Similarly, Spotify has both web and desktop apps, and Python was used to build a number of the background services that make them work.

Of course, Python is also used at many companies to develop internal software tools.

Example use case: A company plans to build a new email client. The developers choose to use Python because they know they’ll be able to create web and desktop clients using Python and its relevant libraries.

6. Data Engineering

Many of the Python libraries that make it a great option for data analysts and data scientists also make Python an important language for data engineers. Data engineers use Python for tasks such as building pipelines, combining datasets, cleaning data, working with APIs, automating various data processes, etc.

Example use case: A company has a lot of data, but it’s stored in various formats and databases, making it time-consuming for analysts to find and work with. A data engineer could use their Python skills to build a pipeline that automates collection from the various sources, joins and cleans the data, and makes it easier for analysts to access and filter.

7. Robotics

Python is a popular language in the field of robotics, both among hobbyists and professionals. On the hobbyist end of the spectrum, Python is frequently used together with the Raspberry Pi hardware platform, which allows for flexible and affordable experimentation. In business, Python is one of the languages commonly used for robotic process automation (RPA), and it’s been used to do things like code industrial robot arms that can work in tandem with each other.

Example use case: A company orders a number of robotic arms for a manufacturing facility. Engineers could use Python to program their behavior, taking advantage of the language’s high-level readability to make it easier for everyone to understand what the arms are meant to be doing.

8. Automation

Python is great for automating repetitive tasks, and there are almost endless real world use cases for Python automation. For example, Python is a popular tool in DevOps because it makes automating systems and processes efficient and transparent. But outside the realm of software development, it’s also widely used to automate everything from complex systems to simple, personal processes like filling in a spreadsheet or responding to emails.

Example use case: A company reports its sales in monthly Excel spreadsheets from each region that have to be manually combined to build company-wide quarterly reports. Rather than do this time-intensive task by hand, an employee writes a Python script that combines all of the spreadsheets and produces each quarterly report automatically.
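
A bare-bones version of that script might look like the sketch below. The folder layout and column names (quarter, region, revenue) are assumptions for the example, and reading Excel files with pandas also requires the openpyxl package:

```python
import glob
import pandas as pd

# Hypothetical layout: one Excel file per region, all with the same columns.
frames = [pd.read_excel(path) for path in glob.glob("reports/*_sales.xlsx")]
combined = pd.concat(frames, ignore_index=True)

# Roll the monthly regional data up into one quarterly, company-wide report.
quarterly = combined.groupby(["quarter", "region"], as_index=False)["revenue"].sum()
quarterly.to_excel("quarterly_report.xlsx", index=False)
```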

9. Hardware interfacing and control

Python’s ability to control hardware goes beyond robotics — in fact, it is used in all sorts of real world hardware-control applications. For example, this convenient Python library makes using Python for a variety of industrial control applications possible.

Example use case: An engineer at a company needs to write software to control a complex HVAC system. They could write code in Python that can send commands to and receive data from the system’s sensors and hardware controllers.

10. Education and training

Because it’s a very high-level, “readable” language that also has a variety of practical uses, Python is a very popular first language for people who want to learn programming. The wide variety of tutorials, videos, interactive courses, and other educational materials available for Python makes it arguably the easiest programming language to learn.

Example use case: A company wants its data analysis team to be able to move beyond the limitations of Excel and SQL. They choose to get the team training in Python, knowing that they will have a wide variety of learning resources to choose from.

11. Personal convenience

In this article, we’ve been mostly focused on business use cases for Python. But many of Python’s commercial use cases are also applicable on a personal level, too. Python can be used to analyze your own data, to automate boring or repetitive elements of your job, or even to create art!

Example use case: Here’s a very personal example — when I wanted to stop myself from sitting for long hours at a stretch, I used my beginner Python skills to write a little script that would pop up alerts at whatever interval I wanted, play a sound of my choosing, and prompt me to do a little exercise based on parameters that I could tweak.
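
A stripped-down take on that idea, without the sound and pop-up parts (which would need extra, platform-specific libraries), could be as simple as:

```python
import time
from datetime import datetime

INTERVAL_MINUTES = 50  # tweak to taste

# A bare-bones "stand up and stretch" reminder that prints a message
# at a fixed interval until you stop the script.
while True:
    time.sleep(INTERVAL_MINUTES * 60)
    print(f"[{datetime.now():%H:%M}] Time to stand up and move for a minute!")
```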

What is Python not good for?

Python is a great and versatile language, but it’s not the best solution for everything. Here are a couple examples of areas where Python might not be the best choice or the most common choice for real-world commercial applications.

Mobile app development

While you certainly can develop mobile apps in Python, you’ll need to make use of third-party layers to make them work across Android and iOS phones. These extra layers can make Python apps less efficient, which means Python isn’t always the best choice for mobile app development (although depending on your specific app’s requirements, it might be fine).

If you are interested in developing mobile apps with Python, a variety of options exist. One of the most popular is the Kivy framework.

Of course, you can always use the web development power of Python together with a framework like Django to make web apps that work well in mobile browsers, too.

Things that require high speed or high memory usage

In part because Python is a high-level language, it’s not always the fastest or most efficient option. For many use cases, this distinction won’t matter — you’ll never notice the extra tenth of a millisecond you might gain from using C++. But if, for example, you’re working on a high-speed 3D-rendered video game, Python’s speed and memory constraints will probably be too limiting.

Similarly, if you’re doing something like writing an operating system, Python isn’t a great choice because its inefficiencies will be layered on top of each other as users run programs within the main program that is the OS.

When high speed and memory performance is critical, Python probably isn’t the best option. However, in many cases — including all of the use cases described above — the minor sacrifices we make in speed and efficiency by using Python are far outweighed by the conveniences it offers.

Where can you learn Python skills?

  • Youtube. There are thousands of free Python tutorials on Youtube, covering almost every conceivable use case.
  • Dataquest. Interactive courses are a great option that makes it easier to get started, since you don’t have to figure out how to install and run Python locally.
  • Udemy. If you learn well from video lectures, there are hundreds of Python courses to learn from here.
  • Coursera. University-branded video lecture courses that cover a number of different Python topics are available.
  • EdX. University-branded Python video lecture courses are available here too.
  • Books. Many Python books, including popular ones like this, are available for free if you don’t mind reading them on a device.
  • Classes and bootcamps. There are many in-person learning options for Python, too, although these tend to be the most expensive way to learn.

Data Engineer, Data Analyst, Data Scientist — What’s the Difference?
https://www.dataquest.io/blog/data-analyst-data-scientist-data-engineer/
Mon, 12 Apr 2021 07:00:00 +0000

In the fast-growing field of data, the "big three" job roles are data engineer, data analyst, and data scientist. Figure out which is the best fit for you.

Data engineer, data analyst, and data scientist — these are job titles you'll often hear mentioned together when people are talking about the fast-growing field of data science.

There are plenty of other job titles in data science and data analytics too. But here, we're going to talk about:

  1. The "big three" roles (data analyst, data scientist, and data engineer)
  2. How they differ from each other
  3. Which role is best for you

Although precisely how these roles are defined can vary from company to company, there are big differences between what you might be doing each day as a data analyst, data scientist, or data engineer.

We're going to dig into each of these specific roles in more depth.

What is a Data Analyst?

Data analysts deliver value to their companies by taking data, using it to answer questions, and communicating the results to help make business decisions.

Common tasks done by data analysts include data cleaning, performing analysis and creating data visualizations.

Depending on the industry, the data analyst could go by a different title (e.g. Business Analyst, Business Intelligence Analyst, Operations Analyst, Database Analyst). Regardless of title, the data analyst is a generalist who can fit into many roles and teams to help others make better data-driven decisions.

What do data analysts do?

The data analyst has the potential to turn a traditional business into a data-driven one.  Their core responsibility is to help others track progress and optimize their focus.

How can a marketer use analytics data to help launch their next campaign? How can a sales representative better identify which demographics to target? How can a CEO better understand the underlying reasons behind recent company growth? These are all questions that the data analyst provides the answer to by performing analysis and presenting the results.  

While data analyst positions are often "entry-level" jobs in the wider field of data, not all analysts are junior level. As effective communicators with mastery over technical tools, data analysts are critical for companies that have segregated technical and business teams.

An effective data analyst will take the guesswork out of business decisions and help the entire organization thrive. The data analyst must be an effective bridge between different teams by analyzing new data, combining different reports, and translating the outcomes. In turn, this is what allows the organization to maintain an accurate pulse check on its growth. 

The nature of the skills required will depend on the company's specific needs, but these are some common tasks: 

  • Cleaning and organizing raw data. 
  • Using descriptive statistics to get a big-picture view of their data. 
  • Analyzing interesting trends found in the data. 
  • Creating visualizations and dashboards to help the company interpret and make decisions with the data. 
  • Presenting the results of a technical analysis to business clients or internal teams. 

The data analyst brings significant value to both the technical and non-technical sides of an organization. Whether running exploratory analyses or explaining executive dashboards, the analyst fosters a greater connection between teams. 

How much money do data analysts make?

As the most entry-level of the "big three" data roles, data analysts typically earn less than data scientists or data engineers. According to Indeed.com as of April 6, 2021, the average data analyst in the United States earns a salary of $72,945, plus a yearly bonus of $2,500.

Experienced data analysts at top companies can make significantly more, however. Senior data analysts at companies such as Facebook and Target reported salaries of around $130,000 as of April 2021.

Data roles, including data analyst roles, also sometimes come with stock options and other non-salary-based compensation.

Sound interesting to you? Start learning on our Data Analyst career paths:

What is a Data Scientist?

A data scientist is a specialist who applies their expertise in statistics and building machine learning models to make predictions and answer key business questions.

A data scientist still needs to be able to clean, analyze, and visualize data, just like a data analyst. However, a data scientist will have more depth and expertise in these skills, and will also be able to train and optimize machine learning models.

What do data scientists do?

The data scientist is an individual who can provide immense value by tackling more open-ended questions and leveraging their knowledge of advanced statistics and algorithms. If the analyst focuses on understanding data from the past and present perspectives, then the scientist focuses on producing reliable predictions for the future.

The data scientist will uncover hidden insights by applying both supervised (e.g. classification, regression) and unsupervised (e.g. clustering, neural networks, anomaly detection) learning methods in their machine learning models. They are essentially training mathematical models that will allow them to better identify patterns and derive accurate predictions.

The following are examples of work performed by data scientists:

  • Evaluating statistical models to determine the validity of analyses.
  • Using machine learning to build better predictive algorithms.
  • Testing and continuously improving the accuracy of machine learning models.
  • Building data visualizations to summarize the conclusion of an advanced analysis.

Data scientists bring an entirely new approach and perspective to understanding data. While an analyst may be able to describe trends and translate those results into business terms, the scientist will raise new questions and be able to build models to make predictions based on new data.

How much money do data scientists make?

Data science salaries can vary quite a lot, since the role itself varies from company to company. According to Indeed.com as of April 6, 2021, the average data scientist in the United States earns a salary of $121,050.

Experienced data scientists at top companies can make significantly more. Senior data scientists at companies such as Twitter reported salaries of around $178,000 as of April 2021.

Data scientists who focus on building machine learning skills can also look at machine learning engineer roles, which command an average yearly salary of $149,924 in the United States as of April 2021.

Sound good to you? Start learning on our Data Scientist career path:

What is a Data Engineer?

Data engineers build and optimize the systems that allow data scientists and analysts to perform their work.

Every company depends on its data to be accurate and accessible to individuals who need to work with it. The data engineer ensures that any data is properly received, transformed, stored, and made accessible to other users.

What do data engineers do?

The data engineer establishes the foundation that the data analysts and scientists build upon. Data engineers are responsible for constructing data pipelines and often have to use complex tools and techniques to handle data at scale. Unlike the previous two career paths, data engineering leans a lot more toward a software development skill set.

At larger organizations, data engineers can have different focuses such as leveraging data tools, maintaining databases, and creating and managing data pipelines. Whatever the focus may be, a good data engineer allows a data scientist or analyst to focus on solving analytical problems, rather than having to move data from source to source.

The data engineer’s mindset is often more focused on building and optimization. The following are examples of tasks that a data engineer might be working on:

  • Building APIs for data consumption.
  • Integrating external or new datasets into existing data pipelines.
  • Applying feature transformations for machine learning models on new data.
  • Continuously monitoring and testing the system to ensure optimized performance.

How much money do data engineers make?

Data engineers are incredibly in demand at the moment, and as a result they command the highest average salary of the three roles. According to Indeed.com as of April 7, 2021, the average data engineer in the United States earns a salary of $130,287, with an additional yearly bonus of $5,000. 

Experienced data engineers at top companies can make much more. For example, senior data engineers at Netflix report salaries of more than $300,000 per year as of April 2021.

Start learning on the Data Engineer career path:

Quiz: Which Role is Best For You?

Below, we've created a quick, four-question quiz that will help give you an idea of which role might be the best fit:

Hopefully this quiz has given you an idea of where you might want to start your journey in the data science industry.

If you didn't get the answer you were hoping for, don't worry — it's just a quick quiz, and there's a lot of overlap between the skills and tasks required for all three job roles!

The real answer to the question of data analyst vs. data scientist vs. data engineer is something that only you can answer. After all, it's your career!

Your Data-Driven Career Path

Now that we’ve explored these three data-driven careers, the question remains — where do you fit in? You've already taken our quiz, but let's take a more in-depth look at how you can really decide what's best for you.

The key is to understand that these are three fundamentally different ways to work with data.

The data engineer is working on the backend, continuously improving data pipelines to ensure that the data the organization relies upon is accurate and available. They will leverage all sorts of different tools to ensure the data is processed correctly and that the right data is available to anyone who needs it.

A good data engineer saves a lot of time and effort for the rest of the organization.

The data analyst may then extract a new data set using a custom API that the engineer built and begin identifying interesting trends in that data and running analyses on anomalies. The analyst will summarize and present their results in a clear way that allows non-technical teammates to understand what the analysis means.

Finally, the data scientist will likely build upon the analyst’s initial findings and research to derive deeper insights. Whether by training machine learning models or by running advanced statistical analyses, the data scientist is going to provide a brand new perspective into not just what has happened in the past, but what may be possible for the near future.

Regardless of your specific path, curiosity is a natural prerequisite of all three of these careers. The ability to use data to ask better questions and run more precise experiments is the entire purpose of a data-driven career. Furthermore, the data science field is constantly evolving and thus, there is a great need to continuously learn more.

At Dataquest, we have educational paths available to those who are interested in pursuing data engineer, data analyst, or data scientist roles in this fast-growing sector. Sign up and start learning more about these positions for free! 

And to all the current and future data analysts, scientists, and engineers out there — good luck and keep learning! 

Have an idea which job you're most interested in?

Click the button below to check out the full learning path for each role, and start learning today!

The post Data Engineer, Data Analyst, Data Scientist — What’s the Difference? appeared first on Dataquest.

]]>
Python Practice: Free Ways To Improve Your Python Skills https://www.dataquest.io/blog/python-practice/ Tue, 06 Apr 2021 16:37:19 +0000 https://www.dataquest.io/?p=28519 Getting good Python practice can help solidify your coding skills. Here are some of the best resources for practicing Python:

The post Python Practice: Free Ways To Improve Your Python Skills appeared first on Dataquest.

]]>
getting python practice in isn't always easy! image of astronaut shooting arrows at a target with python on it

Whether you’re just getting started on your learning journey or you’re looking to brush up before a job interview, getting the right kind of Python practice can make a big difference.

Studies on learning have repeatedly shown that people learn best by doing. But where and how can you get your Python practice in?

Free interactive Python practice:

Click on any of these links to sign up for a free account and dive into an interactive practice session where you’ll be writing real code!

What’s listed above is actually just the tip of the iceberg; we have many additional free practice problems and free interactive lessons as well.

Python project ideas

  • 45 Python Project Ideas for Beginners — These are great ideas for beginner projects, but many could also be easily converted into more advanced projects if you’re a more experienced Python developer looking for some practice.
  • 63 Free Python Tutorials — These free Python tutorials for data science run the gamut from beginner to very advanced, and all of them make fun, easy-to-expand projects for some guided Python practice.

The web is full of thousands of other Python tutorials too. As long as you've got a solid foundation in the Python basics, you can find great practice through many of them, although their quality and accuracy levels can vary depending on the author.

Frequently Asked Questions

Where can I practice Python programming?

  1. Dataquest.io has dozens of free interactive practice questions, as well as free interactive lessons, project ideas, tutorials, and more.
  2. HackerRank is a great site for practice that’s also interactive.
  3. CodingGame is a fun platform for practice that supports Python.
  4. Edabit has Python challenges that can be good for practicing or self-testing.

You can also practice Python using all of the interactive lessons listed above.

How can I practice Python at home?

  1. Install Python on your machine. You can download it directly here, or download a program like Anaconda Individual Edition that makes the process easier. Or you can find an interactive online platform like Dataquest and write code in your browser without having to install anything.
  2. Find a good Python project or some practice problems to work on.
  3. Make detailed plans. Scheduling your practice sessions will make you more likely to follow through.
  4. Join an online community. It’s always great to get help from a real person. Reddit has great Python communities, and Dataquest’s community is great if you’re learning Python data skills.

Can I learn Python in 30 days?

In 30 days, you can definitely learn enough Python to be able to build some cool things. You won’t be able to master Python that quickly, but you could learn to complete a specific project or do things like automate some aspects of your job.

Read more about how long it takes to learn Python.

Can I practice Python on mobile?

Yes, there are lots of apps that allow you to practice Python on both iOS and Android. However, this should not be your primary form of practice if you aspire to use Python in your career — it’s good to practice installing and working with Python on desktops and laptops since that’s how most professional programming work is done.

How quickly can you learn Python?

You can learn the fundamentals of Python in a weekend. If you’re diligent, you can learn enough to complete small projects and genuinely impact your work within a month or so. Mastering Python takes much longer, but you don’t need to become a master to get things done!

Read more about how long it takes to learn Python.

Learn Python the right way!

Our free guide to learning Python has helped thousands and thousands of learners, and it works with any learning platform you choose!

The post Python Practice: Free Ways To Improve Your Python Skills appeared first on Dataquest.

]]>
You Need Data Skills to Future-Proof Your Career https://www.dataquest.io/blog/data-skills-to-future-proof-your-career/ Mon, 05 Apr 2021 17:34:14 +0000 https://www.dataquest.io/?p=28360 No matter what industry you're in, you need data skills to future-proof your career.  You might be thinking: Vik is the CEO of a company that teaches data science - of course he'd say that! But stick with me for a few more paragraphs, I'll walk you through how data was key to all of the […]

The post You Need Data Skills to Future-Proof Your Career appeared first on Dataquest.

]]>
data skills can help with a wide variety of careers

No matter what industry you're in, you need data skills to future-proof your career

You might be thinking: Vik is the CEO of a company that teaches data science - of course he'd say that! But stick with me for a few more paragraphs, and I'll walk you through how data was key to all of the jobs I've had.

I worked in quite a few different roles before I started Dataquest. I was a loader at UPS, a logistics supervisor at Pepsi, a US diplomat, a data science consultant, and a machine learning engineer at edX.

(If that sounds like a strange career path, you can read more of my story here: I Barely Graduated College, And That's Okay.)

Looking back, every single one of my jobs had a data component. To excel, whether I was loading boxes into trucks or creating essay grading algorithms, I needed to be able to use data.

And as time has gone on, these roles have become more and more data-centric. Someone walking into those jobs today would have even more opportunities to incorporate data into their roles than I did.

Using Data at UPS

It might surprise you to learn that a loader at UPS needs some data skills. To explain why, I'll need to discuss a bit of how UPS works.

You're probably familiar with the iconic brown UPS delivery truck. These trucks pick up packages on a route, then go back to a hub to unload.

There, the packages are sorted and loaded into tractor trailers to go to their destination hub.

But a package going from Chicago to San Francisco doesn't go straight there. It goes through intermediary hubs before it gets to the destination.

My part in this was loading packages into tractor trailers at a hub. To get the right packages onto the right truck, I would scan packages as I loaded. This enabled UPS to track how productive I was. It also enabled UPS to track packages across their hubs.

I used the scan data to track how many packages I and other people on my team loaded each night. I optimized which trucks people on my team worked in based on the data. I also used the data to make decisions about when to ask for more hires.

Data also enables UPS to optimize their routes, staff their hubs, and anticipate future demand (especially seasonal demand).

For example: you may have heard that UPS famously doesn't have its delivery trucks turn left. This is due to an algorithm to optimize routes. UPS also hires for peak season every year by using data to anticipate demand.

As UPS moves to automate its hubs, data skills are more and more important for employees to have. It's now possible to track exactly how many packages are flowing through a hub at any given time, and where they're going.

Even a small improvement in the efficiency of a hub can make UPS millions of dollars. Data skills are actually getting more important to UPS over time, and are a core part of their strategy.

Using Data at Pepsi

At Pepsi, I worked in a factory that made soda. This job required using even more data than I had used at UPS.

After the soda was made, we'd either store it in our warehouse or ship it to other warehouses. From the warehouse, we'd load the soda onto route trucks or tractor trailers to be delivered to customers (supermarkets, convenience stores, etc).

We didn't want to make too much or too little of any type of soda, so we had to forecast customer demand. We also had to know how much soda was in our warehouse, so we didn't end up with too much inventory.

My job was to figure out how much soda was in our warehouse, and then feed that into our production schedule.

Amazingly, the way we figured out how much soda was in our warehouse was to count it several times a day. We knew how much soda we were making, and how much was flowing out, but when I joined, there was no way to combine that data to make better estimates. I improved our estimates by combining these signals.

This type of data work is becoming increasingly important to Pepsi. Having too much or too little of any type of soda can cost millions, and data can improve profits. It's no surprise that Pepsi is investing heavily in data training across the company.

Using Data as a Diplomat

My job as a US diplomat at the State Department was where I interacted with data the least. But data was still a component of my job.

I interviewed applicants for immigrant and non-immigrant visas. Given how many people had been pre-qualified for interviews, I knew how many people I and other diplomats would have to interview each day.

We also tracked how many people each person interviewed daily, and our visa approval rates. Approval rates are important because you don't want to reject qualified applicants, or approve unqualified ones. This helped us optimize how we worked, although we admittedly didn't use data as much as we could have.

In my experience, government is the sector where data usage is the least sophisticated. But that's changing. The State Department is appointing a Chief Data Officer. Thousands of diplomats are now being trained in how to use data effectively.

The reason for this is that data can help diplomats be more effective — it can help them approve the right people for visas. I can even imagine a future where visa approval is an automated process, without a human in the loop.

Data can also help to produce more nuanced reporting on the economics and politics of individual countries. One example of this is the humanitarian data the State Department already publishes.

Data skills are just starting to be important at the State Department, but they are poised to become a major part of diplomacy in the near future.

You need data skills

My own career experiences illustrate that data is important whether you're in shipping, manufacturing, or government. And that's just the tip of the iceberg.

Data is transforming roles in almost every industry, including healthcare, finance, and travel. Companies like UPS and Pepsi are using data as a competitive advantage. In the next few years, these companies will need more and more people with data skills.

There are many more open data roles than people with the right skills. So companies will need to build data skills internally, not through hiring. The people who succeed and get promoted at data-savvy companies will be people who can understand and work with data effectively.

There's never been a better time to learn data skills. I made the transition right after my time at the State Department, and data skills have taken my career to places I never imagined. If you're ready to take the next step, Dataquest is a great place to start.

The post You Need Data Skills to Future-Proof Your Career appeared first on Dataquest.

]]>
Tutorial: Web Scraping with Python Using Beautiful Soup https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/ Tue, 30 Mar 2021 18:11:34 +0000 https://www.dataquest.io/?p=28252 Learn how to scrape the web with Python! The internet is an absolutely massive source of data — data that we can access using web scraping and Python! In fact, web scraping is often the only way we can access data. There is a lot of information out there that isn't available in convenient CSV exports […]

The post Tutorial: Web Scraping with Python Using Beautiful Soup appeared first on Dataquest.

]]>
web scraping with python and beautiful soup

Learn how to scrape the web with Python!

The internet is an absolutely massive source of data — data that we can access using web scraping and Python!

In fact, web scraping is often the only way we can access data. There is a lot of information out there that isn't available in convenient CSV exports or easy-to-connect APIs. And websites themselves are often valuable sources of data — consider, for example, the kinds of analysis you could do if you could download every post on a web forum.

To access those sorts of on-page datasets, we'll have to use web scraping. 

Don’t worry if you’re still a total beginner!

In this tutorial we’re going to cover how to do web scraping with Python from scratch, starting with some answers to frequently-asked questions.

Then, we’ll work through an actual web scraping project, focusing on weather data.

web scraping weather data with python

We'll work together to scrape weather data from the web to support a weather app.

But before we start writing any Python, we've got to cover the basics! If you’re already familiar with the concept of web scraping, feel free to scroll past these questions and jump right into the tutorial!

The Fundamentals of Web Scraping:


What is Web Scraping in Python?

Some websites offer data sets that are downloadable in CSV format, or accessible via an Application Programming Interface (API). But many websites with useful data don’t offer these convenient options.

Consider, for example, the National Weather Service’s website. It contains up-to-date weather forecasts for every location in the US, but that weather data isn’t accessible as a CSV or via API. It has to be viewed on the NWS site:

nws

If we wanted to analyze this data, or download it for use in some other app, we wouldn’t want to painstakingly copy-paste everything. Web scraping is a technique that lets us use programming to do the heavy lifting. We’ll write some code that looks at the NWS site, grabs just the data we want to work with, and outputs it in the format we need.

In this tutorial, we’ll show you how to perform web scraping using Python 3 and the Beautiful Soup library. We’ll be scraping weather forecasts from the National Weather Service, and then analyzing them using the Pandas library.

But to be clear, lots of programming languages can be used to scrape the web! We also teach web scraping in R, for example. For this tutorial, though, we'll be sticking with Python.


How Does Web Scraping Work?

When we scrape the web, we write code that sends a request to the server that’s hosting the page we specified. The server will return the source code — HTML, mostly — for the page (or pages) we requested.

So far, we're essentially doing the same thing a web browser does — sending a server request with a specific URL and asking the server to return the code for that page.

But unlike a web browser, our web scraping code won't interpret the page's source code and display the page visually. Instead, we'll write some custom code that filters through the page's source code looking for specific elements we’ve specified, and extracting whatever content we’ve instructed it to extract.

For example, if we wanted to get all of the data from inside a table that was displayed on a web page, our code would be written to go through these steps in sequence:

  1. Request the content (source code) of a specific URL from the server
  2. Download the content that is returned
  3. Identify the elements of the page that are part of the table we want
  4. Extract and (if necessary) reformat those elements into a dataset we can analyze or use in whatever way we require.

If that all sounds very complicated, don't worry! Python and Beautiful Soup have built-in features designed to make this relatively straightforward.
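
To make those four steps concrete, here's a minimal sketch using the requests and Beautiful Soup libraries we'll work with below. The URL is a placeholder, and it assumes the page contains a table element:

import requests
from bs4 import BeautifulSoup

# Steps 1 and 2: request the page and download its content (placeholder URL).
page = requests.get("https://example.com/page-with-a-table")
soup = BeautifulSoup(page.content, "html.parser")

# Step 3: identify the element we want (here, the first table on the page).
table = soup.find("table")

# Step 4: extract the cell text from each row into a list of lists.
rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
    rows.append(cells)

print(rows)

Don't worry if parts of this aren't clear yet; the rest of the tutorial walks through each of these pieces in detail.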

One thing that’s important to note: from a server’s perspective, requesting a page via web scraping is the same as loading it in a web browser. When we use code to submit these requests, we might be “loading” pages much faster than a regular user, and thus quickly eating up the website owner’s server resources.


Why Use Python for Web Scraping?

As previously mentioned, it’s possible to do web scraping with many programming languages.

However, one of the most popular approaches is to use Python and the Beautiful Soup library, as we'll do in this tutorial.

Learning to do this with Python will mean that there are lots of tutorials, how-to videos, and bits of example code out there to help you deepen your knowledge once you’ve mastered the Beautiful Soup basics.


Is Web Scraping Legal?

Unfortunately, there’s not a cut-and-dry answer here. Some websites explicitly allow web scraping. Others explicitly forbid it. Many websites don’t offer any clear guidance one way or the other.

Before scraping any website, we should look for a terms and conditions page to see if there are explicit rules about scraping. If there are, we should follow them. If there are not, then it becomes more of a judgement call.

Remember, though, that web scraping consumes server resources for the host website. If we’re just scraping one page once, that isn’t going to cause a problem. But if our code is scraping 1,000 pages once every ten minutes, that could quickly get expensive for the website owner.

Thus, in addition to following any and all explicit rules about web scraping posted on the site, it’s also a good idea to follow these best practices:

Web Scraping Best Practices:

  • Never scrape more frequently than you need to.
  • Consider caching the content you scrape so that it’s only downloaded once.
  • Build pauses into your code using functions like time.sleep() to keep from overwhelming servers with too many requests too quickly (see the sketch just below this list).
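
As a rough illustration of that last point, here's a minimal, hypothetical scraping loop. The URLs are placeholders, and the length of the pause should depend on the site you're scraping:

import time
import requests

# Placeholder list of pages to scrape; swap in real URLs for your own project.
urls = [
    "https://example.com/page-1",
    "https://example.com/page-2",
]

for url in urls:
    response = requests.get(url)
    # ... parse response.content with Beautiful Soup here ...
    time.sleep(10)  # pause 10 seconds between requests to go easy on the server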

In our case for this tutorial, the NWS’s data is public domain and its terms do not forbid web scraping, so we’re in the clear to proceed.

Learn to scrape the web with Python, right in your browser!

Our interactive APIs and Web Scraping in Python skill path will help you learn the skills you need to unlock new worlds of data with Python.

dataquest-learn-data-science-online

(No credit card required!)

The Components of a Web Page

Before we start writing code, we need to understand a little bit about the structure of a web page. We'll use the site's structure to write code that gets us the data we want to scrape, so understanding that structure is an important first step for any web scraping project.

When we visit a web page, our web browser makes a request to a web server. This request is called a GET request, since we’re getting files from the server. The server then sends back files that tell our browser how to render the page for us. These files will typically include:

  • HTML — the main content of the page.
  • CSS — used to add styling to make the page look nicer.
  • JS — Javascript files add interactivity to web pages.
  • Images — image formats, such as JPG and PNG, allow web pages to show pictures.

After our browser receives all the files, it renders the page and displays it to us.

There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping. When we perform web scraping, we’re interested in the main content of the web page, so we look primarily at the HTML.

HTML

HyperText Markup Language (HTML) is the language that web pages are created in. HTML isn’t a programming language, like Python, though. It’s a markup language that tells a browser how to display content. 

HTML has many functions that are similar to what you might find in a word processor like Microsoft Word — it can make text bold, create paragraphs, and so on.

If you're already familiar with HTML, feel free to jump to the next section of this tutorial. Otherwise, let’s take a quick tour through HTML so we know enough to scrape effectively.

HTML consists of elements called tags. The most basic tag is the <html> tag. This tag tells the web browser that everything inside of it is HTML. We can make a simple HTML document just using this tag:


<html>
</html>

We haven’t added any content to our page yet, so if we viewed our HTML document in a web browser, we wouldn’t see anything:

Right inside an html tag, we can put two other tags: the head tag, and the body tag.

The main content of the web page goes into the body tag. The head tag contains data about the title of the page, and other information that generally isn’t useful in web scraping:


<html>
<head>
</head>
<body>
</body>
</html>

We still haven’t added any content to our page (that goes inside the body tag), so if we open this HTML file in a browser, we still won’t see anything:

You may have noticed above that we put the head and body tags inside the html tag. In HTML, tags are nested, and can go inside other tags.

We’ll now add our first content to the page, inside a p tag. The p tag defines a paragraph, and any text inside the tag is shown as a separate paragraph:


<html>
<head>
</head>
<body>
<p>
Here's a paragraph of text!
</p>
<p>
Here's a second paragraph of text!
</p>
</body>
</html>

Rendered in a browser, that HTML file will look like this:

Here’s a paragraph of text!

Here’s a second paragraph of text!

Tags have commonly used names that depend on their position in relation to other tags:

  • child — a child is a tag inside another tag. So the two p tags above are both children of the body tag.
  • parent — a parent is the tag another tag is inside. Above, the html tag is the parent of the body tag.
  • sibling — a sibling is a tag that is nested inside the same parent as another tag. For example, head and body are siblings, since they’re both inside html. Both p tags are siblings, since they’re both inside body.

We can also add properties to HTML tags that change their behavior. Below, we'll add some extra text and hyperlinks using the a tag.


<html>
<head>
</head>
<body>
<p>
Here's a paragraph of text!
<a href="https://www.dataquest.io">Learn Data Science Online</a>
</p>
<p>
Here's a second paragraph of text!
<a href="https://www.python.org">Python</a> </p>
</body></html>

Here’s how this will look:

Here’s a paragraph of text! Learn Data Science Online

Here’s a second paragraph of text! Python

In the above example, we added two a tags. a tags are links, and tell the browser to render a link to another web page. The href property of the tag determines where the link goes.

a and p are extremely common html tags. Here are a few others:

  • div — indicates a division, or area, of the page.
  • b — bolds any text inside.
  • i — italicizes any text inside.
  • table — creates a table.
  • form — creates an input form.

For a full list of tags, look here.

Before we move into actual web scraping, let’s learn about the class and id properties. These special properties give HTML elements names, and make them easier to interact with when we’re scraping.

One element can have multiple classes, and a class can be shared between elements. Each element can only have one id, and an id can only be used once on a page. Classes and ids are optional, and not all elements will have them.

We can add classes and ids to our example:


<html>
<head>
</head>
<body>
<p class="bold-paragraph">
Here's a paragraph of text!
<a href="https://www.dataquest.io" id="learn-link">Learn Data Science Online</a>
</p>
<p class="bold-paragraph extra-large">
Here's a second paragraph of text!
<a href="https://www.python.org" class="extra-large">Python</a>
</p>
</body>
</html>

Here’s how this will look:

Here’s a paragraph of text! Learn Data Science Online

Here’s a second paragraph of text! Python

As you can see, adding classes and ids doesn’t change how the tags are rendered at all.

The requests library

Now that we understand the structure of a web page, it's time to get into the fun part: scraping the content we want!

The first thing we’ll need to do to scrape a web page is to download the page. We can download pages using the Python requests library.

The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us. There are several different types of requests we can make using requests, of which GET is just one. If you want to learn more, check out our API tutorial.

Let’s try downloading a simple sample website, https://dataquestio.github.io/web-scraping-pages/simple.html.

We’ll need to first import the requests library, and then download the page using the requests.get method:

import requests
page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
page
<Response [200]>

After running our request, we get a Response object. This object has a status_code property, which indicates if the page was downloaded successfully:

page.status_code
200

A status_code of 200 means that the page downloaded successfully. We won’t fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.

We can print out the HTML content of the page using the content property:

page.content
<!DOCTYPE html>
<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

Parsing a page with BeautifulSoup

As you can see above, we now have downloaded an HTML document.

We can use the BeautifulSoup library to parse this document, and extract the text from the p tag.

We first have to import the library, and create an instance of the BeautifulSoup class to parse our document:

from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

We can now print out the HTML content of the page, formatted nicely, using the prettify method on the BeautifulSoup object.

print(soup.prettify())
<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>

This step isn't strictly necessary, and we won't always bother with it, but it can be helpful to look at prettified HTML because it makes the structure of the page, and where tags are nested, easier to see.

As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the children property of soup.

Note that children returns a list generator, so we need to call the list function on it:

list(soup.children)
['html', '\n', <html> <head> <title>A simple example page</title> </head> <body> <p>Here is some simple content for this page.</p> </body> </html>]

The above tells us that there are two tags at the top level of the page — the initial <!DOCTYPE html> tag, and the <html> tag. There is a newline character (\n) in the list as well. Let’s see what the type of each element in the list is:

[type(item) for item in list(soup.children)]
[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

As we can see, all of the items are BeautifulSoup objects:

  • The first is a Doctype object, which contains information about the type of the document.
  • The second is a NavigableString, which represents text found in the HTML document.
  • The final item is a Tag object, which contains other nested tags.

The most important object type, and the one we’ll deal with most often, is the Tag object.

The Tag object allows us to navigate through an HTML document, and extract other tags and text. You can learn more about the various BeautifulSoup objects here.

We can now select the html tag and its children by taking the third item in the list:

html = list(soup.children)[2]

Each item in the list returned by the children property is also a BeautifulSoup object, so we can also use the children property on html.

Now, we can find the children inside the html tag:

list(html.children)
['\n', <head> <title>A simple example page</title> </head>, '\n', <body> <p>Here is some simple content for this page.</p> </body>, '\n']

As we can see above, there are two tags here, head, and body. We want to extract the text inside the p tag, so we’ll dive into the body:

body = list(html.children)[3]

Now, we can get the p tag by finding the children of the body tag:

list(body.children)
['\n', <p>Here is some simple content for this page.</p>, '\n']

We can now isolate the p tag:

p = list(body.children)[1]

Once we’ve isolated the tag, we can use the get_text method to extract all of the text inside the tag:

p.get_text()
'Here is some simple content for this page.'

Finding all instances of a tag at once

What we did above was useful for figuring out how to navigate a page, but it took a lot of commands to do something fairly simple.

If we want to find all instances of a particular tag, we can instead use the find_all method, which will find every instance of that tag on the page.

soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')
[<p>Here is some simple content for this page.</p>]

Note that find_all returns a list, so we’ll have to loop through it, or use list indexing, to extract text:

soup.find_all('p')[0].get_text()
'Here is some simple content for this page.'

If you instead only want to find the first instance of a tag, you can use the find method, which will return a single BeautifulSoup object:

soup.find('p')
<p>Here is some simple content for this page.</p>

Searching for tags by class and id

We introduced classes and ids earlier, but it probably wasn’t clear why they were useful.

Classes and ids are used by CSS to determine which HTML elements to apply certain styles to. But when we're scraping, we can also use them to specify the elements we want to scrape.

To illustrate this principle, we’ll work with the following page:

<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <div>
            <p class="inner-text first-item" id="first">
                First paragraph.
            </p>
            <p class="inner-text">
                Second paragraph.
            </p>
        </div>
            <p class="outer-text first-item" id="second">
                <b>
                First outer paragraph.
                </b>
            </p>
            <p class="outer-text">
                <b>
                Second outer paragraph.
                </b>
            </p>
    </body>
</html>

We can access the above document at the URL https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html.

Let’s first download the page and create a BeautifulSoup object:

page = requests.get("https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup
<html>
<head>
<title>A simple example page
</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
First paragraph.
</p><p class="inner-text">
Second paragraph.
</p></div>
<p class="outer-text first-item" id="second"><b>
First outer paragraph.
</b></p><p class="outer-text"><b>
Second outer paragraph.
</b>
</p>
</body>
</html>

Now, we can use the find_all method to search for items by class or by id. In the below example, we’ll search for any p tag that has the class outer-text:

soup.find_all('p', class_='outer-text')
[<p class="outer-text first-item" id="second"> <b> First outer paragraph. </b> </p>, <p class="outer-text"> <b> Second outer paragraph. </b> </p>]

In the below example, we’ll look for any tag that has the class outer-text:

soup.find_all(class_="outer-text")
<p class="outer-text first-item" id="second">
<b>
First outer paragraph.
</b>
</p>, <p class="outer-text">
<b>
Second outer paragraph.
</b>
</p>]

We can also search for elements by id:

soup.find_all(id="first")
[<p class="inner-text first-item" id="first">
First paragraph.
</p>]

Using CSS Selectors

We can also search for items using CSS selectors. These selectors are how the CSS language allows developers to specify HTML tags to style. Here are some examples:

  • p a — finds all a tags inside of a p tag.
  • body p a — finds all a tags inside of a p tag inside of a body tag.
  • html body — finds all body tags inside of an html tag.
  • p.outer-text — finds all p tags with a class of outer-text.
  • p#first — finds all p tags with an id of first.
  • body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.

You can learn more about CSS selectors here.

BeautifulSoup objects support searching a page via CSS selectors using the select method. We can use CSS selectors to find all the p tags in our page that are inside of a div like this:

soup.select("div p")
[<p class="inner-text first-item" id="first">
First paragraph.
</p>, <p class="inner-text">
Second paragraph.
</p>]

Note that the select method above returns a list of BeautifulSoup objects, just like find and find_all.

Downloading weather data

We now know enough to proceed with extracting information about the local weather from the National Weather Service website!

The first step is to find the page we want to scrape. We’ll extract weather information about downtown San Francisco from this page.

an image of the site we will use for our python web scraping

Specifically, let's extract data about the extended forecast.

As we can see from the image, the page has information about the extended forecast for the next week, including time of day, temperature, and a brief description of the conditions.

Exploring page structure with Chrome DevTools

The first thing we’ll need to do is inspect the page using Chrome Devtools. If you’re using another browser, Firefox and Safari have equivalents.

You can start the developer tools in Chrome by clicking View -> Developer -> Developer Tools. You should end up with a panel at the bottom of the browser like what you see below. Make sure the Elements panel is highlighted:

Chrome Developer Tools

The elements panel will show you all the HTML tags on the page, and let you navigate through them. It’s a really handy feature!

By right clicking on the page near where it says “Extended Forecast”, then clicking “Inspect”, we’ll open up the tag that contains the text “Extended Forecast” in the elements panel:

The extended forecast text

We can then scroll up in the elements panel to find the “outermost” element that contains all of the text that corresponds to the extended forecasts. In this case, it’s a div tag with the id seven-day-forecast:

The div that contains the extended forecast items.

If we click around on the console, and explore the div, we’ll discover that each forecast item (like “Tonight”, “Thursday”, and “Thursday Night”) is contained in a div with the class tombstone-container.

Time to Start Scraping!

We now know enough to download the page and start parsing it. In the below code, we will:

  • Download the web page containing the forecast.
  • Create a BeautifulSoup class to parse the page.
  • Find the div with id seven-day-forecast, and assign to seven_day
  • Inside seven_day, find each individual forecast item.
  • Extract and print the first forecast item.
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())
<div class="tombstone-container">
	<p class="period-name">
		Tonight
		<br>
		<br/>
		</br>
	</p>
	<p>
		<img alt="Tonight: Mostly clear, with a low around 49. West northwest wind 12 to 17 mph decreasing to 6 to 11 mph after midnight. Winds could gust as high as 23 mph. " class="forecast-icon" src="newimages/medium/nfew.png" title="Tonight: Mostly clear, with a low around 49. West northwest wind 12 to 17 mph decreasing to 6 to 11 mph after midnight. Winds could gust as high as 23 mph. "/>
	</p>
	<p class="short-desc">
		Mostly Clear
	</p>
	<p class="temp temp-low">
		Low: 49 °F
	</p>
</div>

Extracting information from the page

As we can see, inside the forecast item tonight is all the information we want. There are four pieces of information we can extract:

  • The name of the forecast item — in this case, Tonight.
  • The description of the conditions — this is stored in the title property of img.
  • A short description of the conditions — in this case, Mostly Clear.
  • The temperature low — in this case, 49 degrees.

We’ll extract the name of the forecast item, the short description, and the temperature first, since they’re all similar:

period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)
Tonight
Mostly Clear
Low: 49 °F

Now, we can extract the title attribute from the img tag. To do this, we just treat the BeautifulSoup object like a dictionary, and pass in the attribute we want as a key:

img = tonight.find("img")
desc = img['title']
print(desc)
Tonight: Mostly clear, with a low around 49. West northwest wind 12 to 17 mph decreasing to 6 to 11 mph after midnight. Winds could gust as high as 23 mph.

Extracting all the information from the page

Now that we know how to extract each individual piece of information, we can combine our knowledge with CSS selectors and list comprehensions to extract everything at once.

In the below code, we will:

  • Select all items with the class period-name inside an item with the class tombstone-container in seven_day.
  • Use a list comprehension to call the get_text method on each BeautifulSoup object.
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods
['Tonight',
'Thursday',
'ThursdayNight',
'Friday',
'FridayNight',
'Saturday',
'SaturdayNight',
'Sunday',
'SundayNight']

As we can see above, our technique gets us each of the period names, in order.

We can apply the same technique to get the other three fields:

short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(short_descs)
print(temps)
print(descs)
['Mostly Clear', 'Sunny', 'Mostly Clear', 'Sunny', 'Slight ChanceRain', 'Rain Likely', 'Rain Likely', 'Rain Likely', 'Chance Rain']
['Low: 49 °F', 'High: 63 °F', 'Low: 50 °F', 'High: 67 °F', 'Low: 57 °F', 'High: 64 °F', 'Low: 57 °F', 'High: 64 °F', 'Low: 55 °F']
['Tonight: Mostly clear, with a low around 49. West northwest wind 12 to 17 mph decreasing to 6 to 11 mph after midnight. Winds could gust as high as 23 mph. ', 'Thursday: Sunny, with a high near 63. North wind 3 to 5 mph. ', 'Thursday Night: Mostly clear, with a low around 50. Light and variable wind becoming east southeast 5 to 8 mph after midnight. ', 'Friday: Sunny, with a high near 67. Southeast wind around 9 mph. ', 'Friday Night: A 20 percent chance of rain after 11pm. Partly cloudy, with a low around 57. South southeast wind 13 to 15 mph, with gusts as high as 20 mph. New precipitation amounts of less than a tenth of an inch possible. ', 'Saturday: Rain likely. Cloudy, with a high near 64. Chance of precipitation is 70%. New precipitation amounts between a quarter and half of an inch possible. ', 'Saturday Night: Rain likely. Cloudy, with a low around 57. Chance of precipitation is 60%.', 'Sunday: Rain likely. Cloudy, with a high near 64.', 'Sunday Night: A chance of rain. Mostly cloudy, with a low around 55.']

Combining our data into a Pandas Dataframe

We can now combine the data into a Pandas DataFrame and analyze it. A DataFrame is an object that can store tabular data, making data analysis easy. If you want to learn more about Pandas, check out our free to start course here.

In order to do this, we’ll call the DataFrame class, and pass in each list of items that we have. We pass them in as part of a dictionary.

Each dictionary key will become a column in the DataFrame, and each list will become the values in the column:

import pandas as pd
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc":descs
})
weather
desc period short_desc temp
0 Tonight: Mostly clear, with a low around 49. W… Tonight Mostly Clear Low: 49 °F
1 Thursday: Sunny, with a high near 63. North wi… Thursday Sunny High: 63 °F
2 Thursday Night: Mostly clear, with a low aroun… ThursdayNight Mostly Clear Low: 50 °F
3 Friday: Sunny, with a high near 67. Southeast … Friday Sunny High: 67 °F
4 Friday Night: A 20 percent chance of rain afte… FridayNight Slight ChanceRain Low: 57 °F
5 Saturday: Rain likely. Cloudy, with a high ne… Saturday Rain Likely High: 64 °F
6 Saturday Night: Rain likely. Cloudy, with a l… SaturdayNight Rain Likely Low: 57 °F
7 Sunday: Rain likely. Cloudy, with a high near… Sunday Rain Likely High: 64 °F
8 Sunday Night: A chance of rain. Mostly cloudy… SundayNight Chance Rain Low: 55 °F

We can now do some analysis on the data. For example, we can use a regular expression and the Series.str.extract method to pull out the numeric temperature values:

temp_nums = weather["temp"].str.extract("(?P<temp_num>\d+)", expand=False)
weather["temp_num"] = temp_nums.astype('int')
temp_nums
0 49
1 63
2 50
3 67
4 57
5 64
6 57
7 64
8 55
Name: temp_num, dtype: object

We could then find the mean of all the high and low temperatures:

weather["temp_num"].mean()
58.444444444444443

We could also only select the rows that happen at night:

is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
is_night
0 True
1 False
2 True
3 False
4 True
5 False
6 True
7 False
8 True
Name: temp, dtype: bool
weather[is_night]
desc period short_desc temp temp_num is_night
0 Tonight: Mostly clear, with a low around 49. W… Tonight Mostly Clear Low: 49 °F 49 True
2 Thursday Night: Mostly clear, with a low aroun… ThursdayNight Mostly Clear Low: 50 °F 50 True
4 Friday Night: A 20 percent chance of rain afte… FridayNight Slight ChanceRain Low: 57 °F 57 True
6 Saturday Night: Rain likely. Cloudy, with a l… SaturdayNight Rain Likely Low: 57 °F 57 True
8 Sunday Night: A chance of rain. Mostly cloudy… SundayNight Chance Rain Low: 55 °F 55 True

Next Steps For This Web Scraping Project

If you've made it this far, congratulations! You should now have a good understanding of how to scrape web pages and extract data. Of course, there's still a lot more to learn!

If you want to go further, a good next step would be to pick a site and try some web scraping on your own. Some good examples of data to scrape are:

  • News articles
  • Sports scores
  • Weather forecasts
  • Stock prices
  • Online retailer prices

You may also want to keep scraping the National Weather Service, and see what other data you can extract from the page, or about your own city.
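
For example, the detailed descriptions we already collected in descs mention wind speeds. Here's a rough, illustrative sketch that pulls the first wind speed (in mph) out of each description; the regular expression is an assumption about how the NWS phrases wind speeds, so treat it as a starting point rather than a complete solution:

import re

# Grab the first number that appears before "mph" in each description.
# Descriptions without a wind speed get None.
wind_speeds = []
for d in descs:
    match = re.search(r"(\d+)(?:\s+to\s+\d+)?\s+mph", d)
    wind_speeds.append(int(match.group(1)) if match else None)

print(wind_speeds)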

Alternatively, if you want to take your web scraping skills to the next level, you can check out our interactive course, which covers both the basics of web scraping and using Python to connect to APIs. With those two skills under your belt, you'll be able to collect lots of unique and interesting datasets from sites all over the web!

Learn to scrape the web with Python, right in your browser!

Our interactive APIs and Web Scraping in Python skill path will help you learn the skills you need to unlock new worlds of data with Python.

dataquest-learn-data-science-online

(No credit card required!)

The post Tutorial: Web Scraping with Python Using Beautiful Soup appeared first on Dataquest.

]]>
Data Analytics Certification: Do You Need a Certificate to Get a Job as a Data Analyst? https://www.dataquest.io/blog/data-analytics-certification/ Thu, 18 Mar 2021 16:03:53 +0000 https://www.dataquest.io/?p=28157 If you’re interested in becoming a data analyst, or even just interested in adding some data skills to your resume, you’ve probably wondered: do I need some kind of data analytics certification?Finding the real answer to this question is tricky. There are a million data analytics certificate programs out there, and they all have a […]

The post Data Analytics Certification: Do You Need a Certificate to Get a Job as a Data Analyst? appeared first on Dataquest.

]]>
data-analytics-certified

If you’re interested in becoming a data analyst, or even just interested in adding some data skills to your resume, you’ve probably wondered: do I need some kind of data analytics certification?

Finding the real answer to this question is tricky. There are a million data analytics certificate programs out there, and they all have a financial incentive to say that you need their certificate. Heck, here at Dataquest, we have a Data Analyst career path that awards you a certificate!

But here’s the honest truth: no, you do NOT need a certification to get a job as a data analyst.

Now, that doesn’t mean that data analytics certification programs aren’t valuable. But it does mean that you need to think about your investment carefully, because the certificate itself — the actual piece of paper and/or LinkedIn flair — is likely worth nothing.

We’re going to talk about different certification programs and how to assess them. But first, we probably need to explain why the certification itself won’t help you.

Employers don’t care about data analytics certificates. Here’s why.

When I was researching Dataquest’s data science career guide, I spent a lot of time talking to people in the industry about what makes a good entry-level candidate for roles like Data Analyst.

In fact, I have almost 200 pages of interview transcripts with senior data scientists, hiring managers, recruiters, etc. that are all focused on that specific subject: what makes a candidate stand out when applying for entry-level data analyst positions?

You know what word never appears once in those 200 pages? Certification.

(The word certificate doesn’t appear, either.)

The reason for this is pretty straightforward. From an employer perspective, certificates aren’t a good predictor of how effective someone will be at actually doing the job.

This is particularly true in the realm of data analytics, because very few certificate programs actually require much real data work.

MOOC platform courses, for example, typically consist of a series of video lectures, punctuated with multiple-choice and fill-in-the-blank quizzes. They may or may not have a “capstone” project at the end.

Best case scenario, seeing that certification on a resume means that the recipient has completed one data analysis project. That’s not enough to be meaningful to an employer who knows that your effectiveness on the job will be measured by successful end-to-end data analysis project completion, not by your ability to score well on multiple-choice quizzes.

For example, here's a sample question from a real IBM/Coursera MOOC on data skills. Imagine this from an employer's perspective — does being able to answer this kind of question prove an applicant knows how to actually use this algorithm?

MOOC data quiz question

Probably not.

While some certification programs are more rigorous than others, there are simply too many certifications out there for employers to bother worrying about.

When a hiring manager looks at your resume, you have about seven seconds to get their attention. They’re not going to waste their time doing research to figure out whether the certification program you chose is any good.

It’s worth mentioning that brand doesn’t matter here, either. A university degree on your resume will impress a recruiter. But a university certification? Employers are well aware that’s a very different thing. Often, university-branded certificate programs (both online and off) aren’t even operated by the university. They’re run by for-profit companies who license the university’s brand and video lecture recordings.

What do employers want to see on a data analyst’s resume?

We’ve written a lengthy guide to data science and data analysis resumes, but the most important lesson is this: employers need to see proof that you can do the work.

Nobody will pay you to do something that you’ve never done before.

The best way to prove you can do the work is relevant work experience, but if you’re looking for your first job in the field, you won’t have that. That’s OK! You can prove you’ve done the work another way: showcasing your data analysis projects.

We’ve got an in-depth guide to data analysis project portfolios too, so I won’t repeat all those lessons here. But long story short: the more relevant your projects are to the job you’re applying for, the better your chances will be.

For entry-level positions, that’s what employers are looking for in the seven seconds they spend scanning your resume. They want to see projects using the skills required for their role, doing the kinds of analyses needed for their role. Seeing that you’ve already done the kind of work they’re hiring for is far, far more important to most hiring managers than any certification.

You'll get about 7 seconds of an employer's attention on your resume. Use them wisely.

Are data analytics certifications useless? No!

None of this means that certification programs are useless, of course. It just means you need to assess them with the knowledge that the certificate brand you choose probably isn’t going to help you get a job. What will help you get a job are the skills you learn over the course of the program.

Also, it’s important to note that while certificates likely won’t help your job candidacy, they’re also not going to hurt your chances. Most employers will simply ignore them — so we recommend listing them only near the end of your resume — and almost no one will see them as sufficient proof that you can do the job. Some recruiters do, however, see certificates as a sign that a job candidate is actively looking to learn and improve their skill set.

Since many other applicants will have certificates too, this isn’t likely to set you apart from other candidates. Having highly relevant projects is your best chance at doing that. But listing a certificate or two to show you’re serious about learning and growing is never a bad idea.

How to assess certificate programs

The single most important thing you can get from any certification program is the skills you learn, and that should be your most important consideration. Important questions to ask include:

  • How does this program teach? Does it use video lectures? Interactive coding lessons? In-person classes? Everybody learns differently, so you probably know what works best for you, but the science suggests that generally speaking, the more hands-on the teaching method, the better.
  • What does this program teach? Does it cover the most important data analyst skills in enough depth? SQL is one area where many certificate programs skimp because it’s not exciting, but it’s the single most important skill for anyone interested in data to learn. If you don’t already have statistics knowledge, finding a program that covers basic statistics is also important.

Other important factors to consider in your decision include:

  • Cost. Certification programs can range from a few hundred dollars to tens of thousands! What kind of return can you expect on your investment?
  • Time requirements. Some certificate programs, like Dataquest’s, are self-serve — you can begin whenever you want, and study as fast or as slow as you want. Others are cohort-based and time-sensitive — you might only be able to join a class at certain times of the year, or only be able to join live classes at specific times of day.
  • Prerequisites. Some programs require specific degrees, or prior experience and/or coursework.
  • Third-party reviews. Any data analytics certification program with a half-decent marketing team can write a landing page full of happy learner quotes. But what do real learners have to say about the program? Third-party review sites like Switchup, G2, and Course Report are all good places to do some research.

When in doubt, try it out! Many platforms offer free trials, or free courses. For example, you can sign up for a free account with Dataquest and complete any of our 60+ free lessons to get a feel for the different types of content and the teaching style.

If a platform or certification program doesn't give you any opportunity to sample their product, that could be a bit of a red flag. Since many platforms and programs do allow you to "try before you buy," it hardly makes sense to spend hundreds or thousands of dollars on a learning product before you're sure their teaching style works for you!

One thing you definitely need to consider before choosing a certification program: what's your budget? Costs can vary widely.

Analytics certifications compared:

There are an absolutely huge number of data analyst certificates out there. Below, we’ll compare a few of the most popular types of certification programs, so that you have a better idea of how each option stacks up.


Dataquest

Cost: $294 (on sale) for a full year of access.

Type: Online, self-serve.

Platform: Hands-on browser-based coding interface

Topics covered: Python, SQL, statistics, command line/shell, Git

Prerequisites: None.

Time constraints: None. (Most students meet their goals in less than a year of part-time study).

Switchup.org Review Average: 4.85 out of 5


General Assembly Data Analytics

Cost: $3,950 or higher (loan options available)

Type: Online or in-person bootcamp.

Platform: In-person or online virtual classroom

Topics covered: SQL, Excel, Tableau

Prerequisites: None.

Time constraints: Must join a specific session, must attend courses at specific times. (However, new sessions start frequently so you won’t have to wait long to join).

Switchup.org Review Average: 4.28 out of 5


Thinkful Data Analytics Immersion

Cost: $12,250 or higher (loan options available)

Type: Online.

Platform: Online virtual classroom.

Topics covered: Python, SQL, Machine Learning

Prerequisites: None.

Time constraints: Full-time for four months, or part-time (20-30 hours per week) for six months.

Switchup.org Review Average: 4.65 out of 5


Springboard Data Analytics Track

Cost: $5,500 or higher (loan options available)

Type: Online.

Platform: Online virtual classroom.

Topics covered: Python, SQL

Prerequisites: None, although you do have to apply and be accepted.

Time constraints: Must wait for the next cohort to begin, then the program length is six months.

Switchup.org Review Average: 4.67 out of 5


Of course, there are many other options, but these are just examples. As you can see, there are significant differences between these programs. The most obvious one is cost — the costs here range from less than $300 to over $12,000! — but there are other meaningful differences, too.

For example, user reviews: despite being the most affordable option, Dataquest also has the highest average review score.

Time constraints also vary dramatically, from programs like Dataquest or General Assembly that you can start immediately or very soon after making your decision, to programs like Springboard that require an application process and waiting for a cohort to start.

Most important, probably: what topics are actually covered? All of the programs cover SQL — that’s a good sign! General Assembly’s program, which covers just SQL, Excel, and Tableau, may be aimed at less technically demanding analyst roles. On the other hand, the Thinkful program covers machine learning, which isn’t typically required for data analyst roles. Dataquest appears to be the only one of these options with substantive coverage of probability and statistics.

This is not to say that any of these programs is “best.” Obviously, you’re reading this article on the Dataquest site, and we’re very proud of our platform, but we also value honesty, and there’s no way any single platform is going to be the best option for everyone.

What about test-based certifications?

There are a number of certification programs, like Certified Analytics Professional (CAP) or Cloudera's CCA Data Analyst, that offer no education at all. These are tests you can take (if you're willing to pay a few hundred dollars), and you'll receive a certification if you pass.

Are these a good investment? Generally not. There are specific jobs that may favor these certifications, but few require them. And there's no real evidence that employers are interested in them. As previously mentioned, none of the data analytics employers, recruiters, and hiring managers we spoke with mentioned certifications.

A more quantitative analysis confirms this theory. As of this writing, there are about 39,000 open data jobs listed on Indeed.com in the United States. Of these, fewer than 100 require a CAP certification, and fewer than 20 mention wanting to see CCA.

Put another way: that's fewer than 120 of roughly 39,000 listings, so, estimating conservatively, about 99.7% of all data jobs don't require these certifications.

In fact, only 15% of the data jobs on Indeed include the word "certification" at all. Many of the certifications listed are software-specific certifications related to a company's specific tech stack. And some of that 15% consists of job listings that simply read "Certifications: None."

Any way you look at it, the demand for generic data analytics certifications is not high. The vast majority of data jobs do not require or even mention these kinds of certificates — if you do need a certificate for a job, it's likely to be something software specific, such as an AWS certification for a company that does a lot of cloud-based data processing.


So what’s the best data analytics certification option for you? That’s going to come down to a personal decision based on factors like:

  • What is your budget?
  • How much free time do you have to study?
  • Which data analyst skills, if any, do you already have?
  • What is your desired timeline?

Whatever decision you make, though, now you’ll be making it with your eyes open. Now that you know the name on the certificate doesn’t really matter when it comes to getting a job, you’ll be free to focus more on what does matter: learning the right skills and building great projects to prove your skills to potential employers.

SQL Operators: 6 Different Types (w/ Examples) https://www.dataquest.io/blog/sql-operators/ Tue, 16 Mar 2021 18:31:43 +0000 https://www.dataquest.io/?p=27691 We have previously covered why you need to learn SQL to get a data job in 2021, as well as publishing a full list of SQL commands to help you get started. Next, we’re going to be looking at SQL operators.We’re going to cover what exactly SQL operators are, before providing a comprehensive list of […]


We have previously covered why you need to learn SQL to get a data job in 2021, as well as publishing a full list of SQL commands to help you get started. Next, we’re going to be looking at SQL operators.

We’re going to cover what exactly SQL operators are, before providing a comprehensive list of the different types with full examples for each.

If you're trying to learn SQL and reading these types of articles makes you want to bang your head against the wall, you're not alone. 

As with any new skill, people prefer to learn in different ways. That's why we created our interactive SQL courses. Regardless of where you are in your SQL journey, we've got a course for you.

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why watch video lectures when you can learn by doing?

What are SQL operators?

An SQL operator is a special word or character used to perform tasks. These tasks can be anything from complex comparisons, to basic arithmetic operations. Think of an SQL operator as similar to how the different buttons on a calculator function.

SQL operators are primarily used within the WHERE clause of an SQL statement. This is the part of the statement that is used to filter data by a specific condition or conditions.

There are six types of SQL operators that we are going to cover: Arithmetic, Bitwise, Comparison, Compound, Logical and String.

Arithmetic operators

Arithmetic operators are used for mathematical operations on numerical data, such as adding or subtracting.

+ (Addition)

The + symbol adds two numbers together.

SELECT 10 + 10;

- (Subtraction)

The - symbol subtracts one number from another.
SELECT 10 - 10;

* (Multiplication)

The * symbol multiplies two numbers together.
SELECT 10 * 10;

/ (Division)

The / symbol divides one number by another.
SELECT 10 / 10;

% (Remainder/Modulus)

The % symbol (sometimes referred to as Modulus) returns the remainder of one number divided by another.
SELECT 10 % 10;

Bitwise operators

A bitwise operator performs bit manipulation between two expressions of the integer data type. Bitwise operators convert the integers into binary bits and then perform the AND (&), OR (|), exclusive OR (^), or NOT (~) operation on each individual bit, before finally converting the binary result back into an integer.

Just a quick reminder: a binary number in computing is a number made up of 0s and 1s.

& (Bitwise AND)

The & symbol (Bitwise AND) compares each individual bit in a value with its corresponding bit in the other value. In the following example, we are using just single bits. Because the value of @BitOne is different to @BitTwo, a 0 is returned.
DECLARE @BitOne BIT = 1
DECLARE @BitTwo BIT = 0
SELECT @BitOne & @BitTwo;

But what if we make the value of both the same? In this instance, it would return a 1.

DECLARE @BitOne BIT = 1
DECLARE @BitTwo BIT = 1
SELECT @BitOne & @BitTwo;

Obviously this is just for variables that are type BIT. What would happen if we started using numbers instead? Take the example below:

DECLARE @BitOne INT = 230
DECLARE @BitTwo INT = 210
SELECT @BitOne & @BitTwo;

The answer returned here would be 194.

You might be thinking, “How on earth is it 194?!” and that’s perfectly understandable. To explain why, we first need to convert the two numbers into their binary form:

@BitOne (230) - 11100110
@BitTwo (210) - 11010010

Now, we have to go through each bit and compare (so the 1st bit in @BitOne and the 1st bit in @BitTwo). If both numbers are 1, we record a 1. If one or both are 0, then we record a 0:

@BitOne (230) - 11100110
@BitTwo (210) - 11010010
Result        - 11000010

The binary we are left with is 11000010, which is equal to a decimal value of 194.

Confused yet? Don’t worry! Bitwise operators can be confusing to understand, but they’re rarely used in practice.

&= (Bitwise AND Assignment)

The &= symbol (Bitwise AND Assignment) does the same as the Bitwise AND (&) operator but then sets the value of a variable to the result that is returned.
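
For example, a minimal T-SQL sketch reusing the 230 and 210 values from earlier:

DECLARE @BitOne INT = 230
SET @BitOne &= 210        -- equivalent to @BitOne = @BitOne & 210
SELECT @BitOne;           -- returns 194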

| (Bitwise OR)

The | symbol (Bitwise OR) performs a bitwise logical OR operation between two values. Let’s revisit our example from before:

DECLARE @BitOne INT = 230
DECLARE @BitTwo INT = 210
SELECT @BitOne | @BitTwo;

In this instance, we have to go through each bit again and compare, but this time if EITHER number is a 1, then we record a 1. If both are 0, then we record a 0:

@BitOne (230) - 11100110
@BitTwo (210) - 11010010
Result        - 11110110

The binary we are left with is 11110110, which equals a numeric value of 246.

|= (Bitwise OR Assignment)

The |= symbol (Bitwise OR Assignment) does the same as the Bitwise OR (|) operator but then sets the value of a variable to the result that is returned.
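
A quick sketch, again with the same two values:

DECLARE @BitOne INT = 230
SET @BitOne |= 210        -- equivalent to @BitOne = @BitOne | 210
SELECT @BitOne;           -- returns 246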

^ (Bitwise exclusive OR)

The ^ symbol (Bitwise exclusive OR) performs a bitwise logical exclusive OR operation between two values.
DECLARE @BitOne INT = 230
DECLARE @BitTwo INT = 210
SELECT @BitOne ^ @BitTwo;

In this example, we compare each bit and return 1 if one, but NOT both bits are equal to 1.

@BitOne (230) - 11100110
@BitTwo (210) - 11010010
Result        - 00110100

The binary we are left with is 00110100, which equals a numeric value of 52.

^= (Bitwise exclusive OR Assignment)

The ^= symbol (Bitwise exclusive OR Assignment) does the same as the Bitwise exclusive OR (^) operator but then sets the value of a variable to the result that is returned.
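
And one more minimal sketch with the same values:

DECLARE @BitOne INT = 230
SET @BitOne ^= 210        -- equivalent to @BitOne = @BitOne ^ 210
SELECT @BitOne;           -- returns 52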

Comparison operators

A comparison operator is used to compare two values and test whether they are the same.

= (Equal to)

The = symbol is used to filter results that equal a certain value. In the below example, this query will return all customers that have an age of 20.

SELECT * FROM customers
WHERE age = 20;

!= (Not equal to)

The != symbol is used to filter results that do not equal a certain value. In the below example, this query will return all customers that don't have an age of 20.
SELECT * FROM customers
WHERE age != 20;

> (Greater than)

The > symbol is used to filter results where a column’s value is greater than the queried value. In the below example, this query will return all customers that have an age above 20.

SELECT * FROM customers
WHERE age > 20;

!> (Not greater than)

The !> symbol is used to filter results where a column’s value is not greater than the queried value. In the below example, this query will return all customers that do not have an age above 20.

SELECT * FROM customers
WHERE age !> 20;

< (Less than)

The < symbol is used to filter results where a column’s value is less than the queried value. In the below example, this query will return all customers that have an age below 20.

SELECT * FROM customers
WHERE age < 20;

!< (Not less than)

The !< symbol is used to filter results where a column’s value is not less than the queried value. In the below example, this query will return all customers that do not have an age below 20.
SELECT * FROM customers
WHERE age !< 20;

>= (Greater than or equal to)

The >= symbol is used to filter results where a column’s value is greater than or equal to the queried value. In the below example, this query will return all customers that have an age equal to or above 20.

SELECT * FROM customers
WHERE age >= 20;

<= (Less than or equal to)

The <= symbol is used to filter results where a column’s value is less than or equal to the queried value. In the below example, this query will return all customers that have an age equal to or below 20.
SELECT * FROM customers
WHERE age <= 20;

<> (Not equal to)

The <> symbol performs the exact same operation as the != symbol and is used to filter results that do not equal a certain value. You can use either, but <> is the SQL-92 standard.
SELECT * FROM customers
WHERE age <> 20;

Compound operators

Compound operators perform an operation on a variable and then set the result of the variable to the result of the operation. Think of it as doing a = a (+,-,*,etc) b.

+= (Add equals)

The += operator will add a value to the original value and store the result in the original value. The below example sets a value of 10, then adds 5 to the value and prints the result (15).

DECLARE @addValue int = 10
SET @addValue += 5
PRINT CAST(@addvalue AS VARCHAR);

This can also be used on strings. The below example will concatenate two strings together and print “dataquest”.

DECLARE @addString VARCHAR(50) = 'data'
SET @addString += 'quest'
PRINT @addString;

-= (Subtract equals)

The -= operator will subtract a value from the original value and store the result in the original value. The below example sets a value of 10, then subtracts 5 from the value and prints the result (5).

DECLARE @addValue int = 10
SET @addValue -= 5
PRINT CAST(@addvalue AS VARCHAR);

*= (Multiply equals)

The *= operator will multiply a value by the original value and store the result in the original value. The below example sets a value of 10, then multiplies it by 5 and prints the result (50).
DECLARE @addValue int = 10
SET @addValue *= 5
PRINT CAST(@addvalue AS VARCHAR);

/= (Divide equals)

The /= operator will divide a value by the original value and store the result in the original value. The below example sets a value of 10, then divides it by 5 and prints the result (2).
DECLARE @addValue int = 10
SET @addValue /= 5
PRINT CAST(@addvalue AS VARCHAR);

%= (Modulo equals)

The %= operator will divide a value by the original value and store the remainder in the original value. The below example sets a value of 10, then divides it by 5 and prints the remainder (0).
DECLARE @addValue int = 10
SET @addValue %= 5
PRINT CAST(@addvalue AS VARCHAR);

Logical operators

Logical operators are those that return true or false, such as the AND operator, which returns true when both expressions are met.

ALL

The ALL operator returns TRUE if all of the subquery values meet the specified condition. In the below example, we are filtering all users who have an age that is greater than the highest age of users in London.
SELECT first_name, last_name, age, location
FROM users
WHERE age > ALL (SELECT age FROM users WHERE location = 'London');

ANY/SOME

The ANY operator returns TRUE if any of the subquery values meet the specified condition. In the below example, we are filtering all products which have any record in the orders table. The SOME operator achieves the same result.
SELECT product_name
FROM products
WHERE product_id = ANY (SELECT product_id FROM orders);

AND

The AND operator returns TRUE if all of the conditions separated by AND are true. In the below example, we are filtering users that have an age of 20 and a location of London.

SELECT *
FROM users
WHERE age = 20 AND location = 'London';

BETWEEN

The BETWEEN operator filters your query to only return results that fit a specified range.

SELECT *
FROM users
WHERE age BETWEEN 20 AND 30;

EXISTS

The EXISTS operator is used to filter data by looking for the presence of any record in a subquery.

SELECT name
FROM customers
WHERE EXISTS
(SELECT order_id FROM orders WHERE customer_id = 1);

IN

The IN operator lets you specify multiple values in a WHERE clause.

SELECT *
FROM users
WHERE first_name IN ('Bob', 'Fred', 'Harry');

LIKE

The LIKE operator searches for a specified pattern in a column. (For more information on how/why the % is used here, see the section on the wildcard character operator).

SELECT *
FROM users
WHERE first_name LIKE '%Bob%';

NOT

The NOT operator returns results if the condition or conditions are not true.
SELECT *
FROM users
WHERE first_name NOT IN ('Bob', 'Fred', 'Harry');

OR 

The OR operator returns TRUE if any of the conditions separated by OR are true. In the below example, we are filtering users that have an age of 20 or a location of London.

SELECT *
FROM users
WHERE age = 20 OR location = 'London';

IS NULL

The IS NULL operator is used to filter results with a value of NULL.
SELECT *
FROM users
WHERE age IS NULL;

String operators

String operators are primarily used for string concatenation (combining two or more strings together) and string pattern matching.

+ (String concatenation)

The + operator can be used to combine two or more strings together. The below example would output ‘dataquest’.
SELECT 'data' + 'quest';

+= (String concatenation assignment)

The += is used to combine two or more strings and store the result in the original variable. The below example sets a variable of ‘data’, then adds ‘quest’ to it, giving the original variable a value of ‘dataquest’.

DECLARE @strVar VARCHAR(50)
SET @strVar = 'data'
SET @strVar += 'quest'
PRINT @strVar;

% (Wildcard)

The % symbol - sometimes referred to as the wildcard character - is used to match any string of zero or more characters. The wildcard can be used as either a prefix or a suffix. In the below example, the query would return any user with a first name that starts with ‘dan’.

SELECT *
FROM users
WHERE first_name LIKE 'dan%';

[] (Character(s) matches)

The [] is used to match any character within the specific range or set that is specified between the square brackets. In the below example, we are searching for any users that have a first name that begins with a d and a second character that is somewhere in the range c to r.

SELECT *
FROM users
WHERE first_name LIKE 'd[c-r]%';

[^] (Character(s) not to match)

The [^] is used to match any character that is not within the specific range or set that is specified between the square brackets. In the below example, we are searching for any users that have a first name that begins with a d and a second character that is not a.

SELECT *
FROM users
WHERE first_name LIKE 'd[^a]%';

_ (Wildcard match one character)

The _ symbol - sometimes referred to as the underscore character - is used to match any single character in a string comparison operation. In the below example, we are searching for any users that have a first name that begins with a d and has a third character that is n. The second character can be any letter.

SELECT *
FROM users
WHERE first_name LIKE 'd_n%';

More helpful SQL resources:

Try the best SQL learning resource of all: interactive SQL courses you can take right in your browser. Sign up for a FREE account and start learning!

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why watch video lectures when you can learn by doing?

Why Jorge Prefers Dataquest Over DataCamp for Learning Data Analysis https://www.dataquest.io/blog/dataquest-datacamp-learn-data-analysis/ Mon, 15 Mar 2021 17:38:33 +0000 https://www.dataquest.io/?p=17304 When Jorge Varade decided he wanted to learn data analysis, he tried both DataCamp and Dataquest, and found he strongly preferred the latter. Here's why.


A lot of data science learners are interested in the question of DataCamp versus Dataquest. Jorge Varade, who's spent time on both platforms, has a strong opinion about which one is best. 

When Jorge finished his degree in business administration, he knew he wanted to get involved with analytics. An internship at Ralph Lauren in sales and marketing analytics taught him the basics of Excel and some more advanced techniques like pivot tables, but he was hungry for more. “I was really interested in data analysis,” he says, “but I wanted to do something more related to, well, Python.”

That’s how he ended up at Belgrave Valley, a London-based bootcamp that uses Dataquest to teach data science programming skills to students.

DataCamp vs. Dataquest

He came into bootcamp without any real programming experience. He had tried messing around a little bit on his own, he says, looking at videos and trying out a few different platforms, including DataCamp. But he hadn’t really made any headway. “I was a beginner,” he says.

When he got to Belgrave Valley and started using Dataquest, that changed quickly. “What I’ve learned in Python and SQL comes from Belgrave Valley and Dataquest,” he says.

For Jorge, what made the difference was how Dataquest forced him to think and apply what he was learning at each step. That proved to be a strong contrast with the other platforms he’d tried:

I wanted to try DataCamp to see how it was. I’ve done courses from the two platforms, DataCamp and Dataquest, and in my opinion I think Dataquest is much better because it makes you make an effort. On DataCamp, the code you have to write is almost already written for you, so you don’t learn too much. Dataquest makes you use your head and apply the things that you’re learning. I prefer the way [Dataquest] teaches, I think you do it really well.

“Also the projects were really good,” he says. “I really like the type of projects you have.”

DQ vs. DC by the numbers

Jorge's point — that Dataquest provides a more meaningful learning experience by asking you to write the code — is something that we hear a lot from Dataquest learners. But we certainly don't expect you to take our word for it!

Instead, let's look at how the two sites stack up on third-party review sites to see whether most data science learners agree with Jorge:

Here's how the scores compare on three third-party review sites (Dataquest vs. DataCamp):

  • 4.85 out of 5 vs. 4.62 out of 5
  • 4.7 out of 5 vs. 4.3 out of 5
  • 4.94 out of 5 vs. 4.05 out of 5

These numbers are current as of March 16, 2021, and the message is pretty clear — third-party reviewers consistently rate Dataquest above DataCamp. Jorge is not alone.

Getting a Job in Data

Since finishing his Dataquest courses at the bootcamp program, Jorge has been working as a data analyst — first on a two-week contract for a bank, and then in a full-time role analyzing auto marketing data at Mediacom. After a few months in that role, he moved into another data analyst role — this time at HelloFresh.

But he’s not resting on his laurels. He finished the Data Analyst in Python path, he says, and now he’s switching over to the Data Scientist path so he can keep adding to his skill set. “I will continue [subscribing to] Dataquest,” he says, “because I really like how you explain the courses and the content.”

For other Dataquest students aiming to get jobs as data analysts, Jorge recommends spending as much time as possible studying, and really immersing yourself in your learning. “Really focus on how Python works, how SQL works, and how the data analysis world works,” he says. “If you really like data analysis, then spend time on it.”

Want to follow in Jorge’s footsteps? Click below to get started with a free account — in less than five minutes from now, you'll be writing your first code, and on your way to becoming a data analyst!

How Long Does It Take to Learn SQL? https://www.dataquest.io/blog/how-long-learn-sql/ Wed, 10 Mar 2021 15:35:47 +0000 https://www.dataquest.io/?p=27961 How long does it take to learn SQL? It depends on your goals and your background, so we've broken down a variety of scenarios for you.


How long does it really take to learn SQL?

SQL is a critical skill for pretty much anyone who works with data or databases. And while learning a new programming language is never a walk in the park, we’ve got some good news — it typically doesn’t take too long to get the basics of SQL down. But how long does it take to really learn SQL?

The answer really depends on you, your goals, and your background. So rather than try to give you a one-size-fits-all answer, we've gamed out a number of different scenarios. Let’s dive into the details.

What is SQL?

To understand what it takes to learn SQL, it’s important to understand what SQL actually is. In the previous section, I called it a programming language, but it would be more accurate to say that SQL is a query language.

A query language is a type of programming language that’s built for one thing: interacting with databases. When you’ve learned SQL, you’ll use it to do things such as:

  • Get the specific data you want from databases
  • Join elements from different data tables in your database together (see the example after this list)
  • Perform calculations, analysis, and filter data to answer questions
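
For example, joining two tables might look something like the sketch below (the orders and customers tables and their columns are purely hypothetical):

SELECT customers.name, orders.order_id, orders.total
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id;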

If you’ve got data that’s stored in a SQL-based database—and you probably do, since most companies use some form of SQL-based database management system—SQL is the tool that’ll help you quickly select and work with the specific data you need.

What this means is that SQL isn’t a full-on programming language in the way that Python (for example) is. SQL isn’t going to be the language you use to do something like code a video game, or build a mobile app. It’s really only useful for tasks that involve working with data in databases.

But that’s a good thing! Because SQL is specifically focused on working with data, there’s less for you to learn, and most of the SQL educational materials you encounter will be focused on using SQL for common data tasks.

Why Learn SQL?

We’ve written a whole article on why you should learn SQL, with up-to-date jobs information from 2021. The full article is definitely worth reading, but in case you don’t want to click through, here’s are a few quick reasons:

  • Almost every company uses some kind of SQL-based database to store data. MySQL, Oracle, Microsoft SQL Server, etc. — all of these are SQL-based database management systems, and that means that SQL skills will be needed to work efficiently with databases at almost any company.
  • SQL enables you to work more efficiently and transparently than Excel. SQL enables you to work with huge datasets quickly, and because it’s a written language, everything that you do is transparent and easy to understand, adapt, and repeat. No hidden cell formulas to go looking for, and no more complicated VLOOKUP nightmares!
  • SQL skills are in demand. This is particularly true in the realm of data science, but even jobs in unrelated fields like marketing are increasingly asking for SQL skills, as analyzing and acting on data becomes an increasingly important part of many jobs.

How Much Time Will it Really Take to Learn SQL?

The answer to this question depends on both your background and your goals for learning SQL.

So, rather than giving you a one-size-fits-all answer, let’s break this down into a few different scenarios. Each scenario assumes you're starting from scratch with SQL, and are looking to learn up through and including the specified skill level.

Feel free to jump to whichever of these subheadings describes you best:

(Note: all of the time estimates here assume you already have a full-time job and, like most adults, are limited to just a few hours a week of study time. If you can devote more time each week to studying, you’ll progress even faster).

No programming experience, and want to learn through basic SQL

Maybe your job isn’t technical, but you’re interested in learning a bit more from your company’s data, or running a few specific queries regularly to understand more about the impact of the work you’re doing. If you’ve never written code before, but you’d like to learn enough SQL to run a quick query to answer questions every now and then, this section is for you.

The fundamentals of SQL really won’t take very long to learn. Our first SQL course, for example, takes most people about an hour to complete.

Because you don’t have prior experience with programming languages, you’ll probably want to set aside a little extra time to wrap your head around everything. And you’ll definitely want to set aside some extra time for practice.

Even so, you should expect to be able to learn the fundamentals of SQL — how to query specific data tables from your database, how to select specific columns from those tables, how to do basic math with SQL, and how to limit the output your queries return — in the space of a few hours, or a weekend at most.
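
As a rough sketch of that level, a query using those fundamentals might look something like this (the products table and its columns are made up; the row-limiting keyword also varies by database, with LIMIT working in MySQL, PostgreSQL, and SQLite, while SQL Server uses TOP):

SELECT name, price, price * 0.9 AS sale_price  -- pick specific columns and do some basic math
FROM products                                  -- the table we're querying
LIMIT 10;                                      -- cap the output at 10 rows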

No programming experience, and want to learn through intermediate SQL

If you don’t have prior coding experience, but you’re expecting to use SQL pretty regularly, and take on some more complicated tasks like joining different tables together to create new tables for analysis, this section describes you.

How much time this takes will vary a bit from person to person, but you should expect it to take anywhere from a single weekend to a few weeks (we’re assuming that you have a full-time job already and are only able to study occasionally, during your free time).

If you’re studying with Dataquest, this section would map onto our first two or three SQL courses, depending on how much you need to learn for your specific use case. You can probably complete all three courses (not counting the guided projects) in around five or six hours, but you should definitely set aside extra time for practice and to work through the projects to cement your learning.

No programming experience, and want to learn through advanced SQL

If you don’t have coding experience but you’re looking to land a role that’s heavily reliant on SQL skills, like a data analyst job or perhaps even a data engineering job, this section is for you.

You’ll want to learn everything from the basics through advanced queries, but you’ll probably also want to learn skills like creating databases using PostgreSQL.

Depending on how deep you need to go, this is likely to take anywhere from a month to several months, because you’ll be learning advanced queries, but you’ll also need to cover topics like building and optimizing databases, database security, etc.
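
To give a sense of the database-building side, here's a minimal PostgreSQL-style sketch (the table, columns, and index are made up for illustration):

CREATE TABLE users (
    user_id    SERIAL PRIMARY KEY,
    email      TEXT NOT NULL UNIQUE,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_users_created_at ON users (created_at);  -- speeds up date-based queries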

Note that if you’re looking for a job like data engineer, SQL skills are not the only thing you’ll need to learn, so the amount of time it takes you to get to job-ready will be quite a bit longer than the time it takes to learn SQL. Some data analyst jobs will also have additional technical requirements, like some knowledge of Python programming, although there are analyst jobs that only require SQL.

Prior programming experience, and want to learn through basic SQL

If you’ve already got some experience with programming languages and you just want to learn enough to query your company’s database for the right tables — maybe you’re planning to pull that data into Python or R for analysis — this section is for you.

The basics of SQL likely will take you just an hour or two to learn. You’ll probably find it to be refreshingly straightforward compared to other programming languages, as SQL is quite readable.

Prior programming experience, and want to learn through intermediate SQL

If you’ve already got some coding experience, but you anticipate using SQL fairly regularly to do things like join data tables on different columns and filter for the specific data you need, this section is for you.

Precisely how long it takes will depend on how far you want to go with SQL, but you will probably be able to comfortably work through the material in our first two or three SQL courses in a week. Completing the guided projects may extend that time a little further, but you’ll probably be able to start querying your company’s database and using your new SQL skills in meaningful ways within just a few hours of beginning your study.

Prior programming experience, and want to learn through advanced SQL

If you’ve already got some coding experience but you’re looking to move into a full-time role that’s going to require a lot of SQL work, this section is for you.

You’ll want to learn all of the querying skills from the previous section, but you may also need to learn more about creating databases, optimizing them, and ensuring that they are secure. This means you’ll need to spend additional time learning about things like PostgreSQL, and think about the SQL skills you need from a data engineer’s perspective.

This will probably take you a month or two, although it’s worth noting that these kinds of roles will generally also require other technical skills that’ll take additional time to learn if you don’t already know them.

Ready to get started?

SQL can unlock a whole new world of data that makes all of your work more efficient and more impactful. Sign up for a free Dataquest account and you can try out our SQL Fundamentals course and see just how quickly you can make real progress learning SQL!

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why watch video lectures when you can learn by doing?

SQL vs T-SQL: Understanding the Differences https://www.dataquest.io/blog/sql-vs-t-sql/ Thu, 04 Mar 2021 17:18:12 +0000 https://www.dataquest.io/?p=27854 SQL or T-SQL — which one do you need to learn? SQL and T-SQL both see heavy use in the database and data science industries. But what exactly are they? These two query languages are very similar, both in name and in what they can do, so the distinction between them can be difficult to […]


SQL or T-SQL — which one do you need to learn?

SQL and T-SQL both see heavy use in the database and data science industries. But what exactly are they? These two query languages are very similar, both in name and in what they can do, so the distinction between them can be difficult to understand.

In this post, we're going to:

  • Define what Standard SQL and T-SQL are
  • Investigate the differences between them
  • Provide examples of each
  • Summarize which you should be learning and why

If blog articles aren’t your preferred method of learning, consider Dataquest’s SQL courses. Many of our students prefer the structure that comes with it, and the fact that they can apply what they learn with real code.

What is Standard SQL?

Standard SQL, usually referred to simply as "SQL," is a type of programming language called a query language. Query languages are used for communicating with a database.

SQL is used for adding, retrieving, or updating data stored in a database, and it works across many different types of databases. That means that if you learn the basics of SQL, you will be in a good position for a career in data.

Databases and the data stored within them are a core part of how many companies operate. An easy example is a retailer that might store order or customer information in a database. SQL is a programming language that allows the company to work with that data.
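
For instance, an analyst at that retailer might count orders per customer with a query along these lines (the table and column names are hypothetical):

SELECT customer_id, COUNT(*) AS total_orders
FROM orders
GROUP BY customer_id
ORDER BY total_orders DESC;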

What is T-SQL?

T-SQL, which stands for Transact-SQL and is sometimes referred to as TSQL, is an extension of the SQL language used primarily within Microsoft SQL Server. This means that it provides all the functionality of SQL but with some added extras.

You can think of it a bit like a SQL dialect — it's very similar to regular SQL, but it has a few extras and differences that make it unique.

Despite the clear and rigid specifications of standard SQL, it does allow for database companies to add their own extensions to set them apart from other products. T-SQL is an example of this for Microsoft SQL Server databases — T-SQL is central to the software and runs most operations within it. 

Most major database vendors offer their own SQL language extensions for their own products, and T-SQL is one of the most widely used examples of these (because Microsoft SQL Server is popular).

Put simply: when you are writing queries within Microsoft SQL Server, you are effectively using T-SQL. All applications that communicate with SQL Server, regardless of the application's user interface, do so by sending T-SQL statements to the server.

However, in addition to SQL Server, other database management systems (DBMS) also support T-SQL. Another Microsoft product, Microsoft Azure SQL Database, supports most features of T-SQL.

T-SQL has been designed to make working with those databases that support it easier and more efficient.

What is the difference between SQL and T-SQL?

Now we have covered the basics of both, let's take a look at the main differences:

Difference #1

The obvious difference is in what they are designed for: SQL is a query language used for manipulating data stored in a database. T-SQL is also a query language, but it's an extension of SQL that is primarily used in Microsoft SQL Server databases and software.

Difference #2

SQL is open-source. T-SQL is developed and owned by Microsoft.

Difference #3

SQL statements are executed one at a time, also known as "non-procedural." T-SQL executes statements in a "procedural" way, meaning that the code will be processed as a block, logically and in a structured order.

There are advantages and disadvantages to each approach, but from a learner perspective, this difference isn't too important. You'll be able to get and work with the data you want in either language, it's just that the way you go about doing that will vary a bit depending on which language you're using and the specifics of your query.
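
To illustrate the procedural style, here's a minimal T-SQL sketch (the users table is hypothetical): variables, control flow, and multiple statements are processed together as one batch.

DECLARE @avg_age INT;

SELECT @avg_age = AVG(age) FROM users;

IF @avg_age > 30
    PRINT 'Average age is above 30'
ELSE
    PRINT 'Average age is 30 or below';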

Difference #4

On top of these more general differences, SQL and T-SQL also have some slightly different command key words. T-SQL also features functions that are not part of regular SQL.

An example of this is how we select the top X number of rows. In standard SQL, we would use the LIMIT keyword. In T-SQL, we use the TOP keyword.

Both of these commands do the same thing, as we can see in the examples below. Both queries will return the top ten rows in the users table ordered by the age column.

SQL Example

SELECT *
FROM users
ORDER BY age
LIMIT 10;

T-SQL Example

SELECT TOP 10 *
FROM users
ORDER BY age;

Difference #5

Finally, and as referenced before, T-SQL offers functionality that does not appear in regular SQL. One example is the ISNULL function. This will replace NULL values coming from a specific column. The below would return an age of “0” for any rows that have a value of NULL in the age column.

SELECT ISNULL(age, 0)
FROM users;

(There are ways of doing this in standard SQL too, of course, but the commands are slightly different.)
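
For comparison, one common way to get a similar result in standard SQL is the COALESCE function, which returns the first non-NULL value in its argument list:

SELECT COALESCE(age, 0)
FROM users;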

These are just a couple of code differences to give you an idea of how the two compare, but of course, there are many more. You can learn more about SQL commands with our extensive guide. And of course, Microsoft has documentation for working with T-SQL.

Which is better to learn?

If you want to work with databases in any way, or if you're seeking a data job, learning SQL is a necessity.

As T-SQL is an extension of SQL, you will need to learn the basics of SQL before starting. If you learn T-SQL first, you will end up picking up knowledge of standard SQL anyway.

With most things, which you choose to learn should depend on what you are trying to achieve. If you are going to be working with Microsoft SQL server, then it is worth learning more about T-SQL. If you are a beginner looking to get started in using databases, then begin with learning about SQL.

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why watch video lectures when you can learn by doing?

How to Learn Python (Step-by-Step) in 2021 https://www.dataquest.io/blog/learn-python-the-right-way/ Wed, 24 Feb 2021 09:16:00 +0000 https://dq.t79ae38x-liquidwebsites.com/?p=7016 Learn Python the right way, avoid the "cliff of boring," and give yourself the best chance to actually learn to code by following these steps.


What's the best way to learn Python? It doesn't have to feel like scaling a cliff!

Python is an important programming language to know — it's widely used in fields like data science, web development, software engineering, game development, and automation. But what's the best way to learn Python? That can be difficult and painful to figure out. I know that from experience.

Covid-19 Update: Has the Best Way to Learn Python Changed?

Nope! The Covid-19 pandemic has certainly disrupted in-person Python instructional opportunities like bootcamps, university programs, etc. But the best way to learn Python hasn't changed.

As you'll discover in this article, the right way to learn Python involves working on personal projects so that you truly care about what you're doing and are motivated to continue. This is rarely possible with in-person instruction — you and the rest of your Python class will get assigned the same generic practice problems because otherwise it's too difficult for the teacher to grade!

Learning Python on your own presents its own challenges, of course, but in our experience the single most important factor in success or failure is your personal level of motivation. And since most people aren't personally passionate about Python syntax, the way to maintain your motivation and passion for learning over the long haul is to work on projects that mean something to you.

And of course, you can still work from and learn from others remotely. The Dataquest community is an active, inclusive space for Python learners to share, work together, and learn from each other.

There are also many other ways to learn with others or from others without being in the same physical space! Finding a mentor online and having Google Meet or Zoom sessions can be very helpful when you're in the later stages of your learning and starting to think about careers.

 

One of the things that I found most frustrating when I was learning Python was how generic all the learning resources were. I wanted to learn how to make websites using Python, but it seemed like every learning resource wanted me to spend two long, boring, months on Python syntax before I could even think about doing what interested me.

This mismatch made learning Python quite intimidating for me. I put it off for months. I got a couple of lessons into the Codecademy tutorials, then stopped. I looked at Python code, but it was foreign and confusing:

from django.http import HttpResponse
def index(request):
    return HttpResponse("Hello, world. You're at the polls index.")

The above code is from the tutorial for Django, a popular Python website development framework. Experienced programmers will often throw snippets like the above at you. “It’s easy!”, they’ll promise.

But even a few seemingly simple lines of code can be incredibly confusing. For instance, why are some lines indented? What’s django.http? Why are some things in parentheses?

Understanding how everything fits together when you don’t know much Python can be very hard.

The problem is that you need to understand the building blocks of the Python language to build anything interesting. The above code snippet creates a view, which is one of the key building blocks of a website using the popular MVC architecture. If you don’t know how to write the code to create a view, it isn’t really possible to make a dynamic website.

Most tutorials assume that you need to learn all of Python syntax before you can start doing anything interesting. This is what leads to months spent just on syntax, when what you really want to be doing is analyzing data, or building a website, or creating an autonomous drone.

All that time spent on syntax rather than what you want to be doing causes your motivation to ebb away, and often ends with you calling the whole thing off.

I like to think of this as the “cliff of boring”. You need to be able to climb the “cliff of boring” to make it to the “land of interesting stuff you work on” (better name pending).

But you don't have to spend months on that cliff.


Learning Python syntax doesn't have to feel like this.

After facing the “cliff of boring” a few times and walking away, I found a process that worked better for me. In fact, I think this is the best way to learn Python.

What worked was blending learning the basics with building interesting things. I spent as little time as possible learning the basics, then immediately dove into creating things that interested me.

In this blog post, I’ll walk you through step by step how to replicate this process, regardless of why you want to learn Python.

Step 1: Figure Out What Motivates You to Learn Python

Before you start diving into learning Python online, it’s worth asking yourself why you want to learn it. This is because it’s going to be a long and sometimes painful journey. Without enough motivation, you probably won’t make it through. For example, I slept through high school and college programming classes when I had to memorize syntax and I wasn’t motivated. On the other hand, when I needed to use Python to build a website to automatically score essays, I stayed up nights to finish it.

Figuring out what motivates you will help you figure out an end goal, and a path that gets you there without boredom. You don’t have to figure out an exact project, just a general area you’re interested in as you prepare to learn Python.

Pick an area you’re interested in, such as:

  • Data science / Machine learning
  • Mobile apps
  • Websites
  • Games
  • Data processing and analysis
  • Hardware / Sensors / Robots
  • Scripts to automate your work

Yes, you can make robots using Python! From the Raspberry Pi Cookbook.

Figure out one or two areas that interest you, and you’re willing to stick with. You’ll be gearing your learning towards them, and eventually will be building projects in them.

Step 2: Learn the Basic Syntax

Unfortunately, this step can’t be skipped. You have to learn the very basics of Python syntax before you dive deeper into your chosen area. You want to spend the minimum amount of time on this, as it isn’t very motivating. 

Here are some good resources to help you learn the basics:

I can’t emphasize enough that you should only spend the minimum amount of time possible on basic syntax. The quicker you can get to working on projects, the faster you will learn. You can always refer back to the syntax when you get stuck later. You should ideally only spend a couple of weeks on this phase, and definitely no more than a month.

Also, a quick note: learn Python 3, not Python 2. Unfortunately a lot of "learn Python" resources online still teach Python 2, but you should definitely learn Python 3. Python 2 is no longer supported, so bugs and security holes will not be fixed!

Step 3: Make Structured Projects

Once you’ve learned the basic syntax, it’s possible to start making projects on your own. Projects are a great way to learn, because they let you apply your knowledge. Unless you apply your knowledge, it will be hard to retain it. Projects will push your capabilities, help you learn new things, and help you build a portfolio to show to potential employers.

However, very freeform projects at this point will be painful — you’ll get stuck a lot, and need to refer to documentation. Because of this, it’s usually better to make more structured projects until you feel comfortable enough to make projects completely on your own. Many learning resources offer structured projects, and these projects let you build interesting things in the areas you care about while still preventing you from getting stuck.

Let’s look at some good resources for structured projects in each area:

Data science / Machine learning

  • Dataquest — Teaches you Python and data science interactively. You analyze a series of interesting datasets ranging from CIA documents to NBA player stats. You eventually build complex algorithms, including neural networks and decision trees.
  • Python for Data Analysis — written by the author of a major Python data analysis library, it’s a good introduction to analyzing data in Python.
  • Scikit-learn documentation — Scikit-learn is the main Python machine learning library. It has some great documentation and tutorials.
  • CS109 — this is a Harvard class that teaches Python for data science. They have some of their projects and other materials online.

Mobile Apps

  • Kivy guide — Kivy is a tool that lets you make mobile apps with Python. They have a guide on how to get started.

Websites

  • Flask tutorial — Flask is a popular web framework for Python. This is the introductory tutorial.
  • Bottle tutorial — Bottle is another web framework for Python. This is how to get started with it.
  • How To Tango With Django — A guide to using Django, a complex Python web framework.

Games

An example of a game you can make with Pygame. This is Barbie Seahorse Adventures 1.0, by Phil Hassey.

Hardware / Sensors / Robots

Scripts to Automate Your Work

Once you’ve done a few structured projects in your own area, you should be able to move into working on your own projects. But, before you do, it’s important to spend some time learning how to solve problems.

Step 4: Work on Python Projects on Your Own

Once you’ve completed some structured projects, it’s time to work on projects on your own to continue to learn Python better. You’ll still be consulting resources and learning concepts, but you’ll be working on what you want to work on. Before you dive into working on your own projects, you should feel comfortable debugging errors and problems with your programs. Here are some resources you should be familiar with:

  • StackOverflow — a community question and answer site where people discuss programming issues. You can find Python-specific questions here.
  • Google — the most commonly used tool of every experienced programmer. Very useful when trying to resolve errors. Here’s an example.
  • Python documentation — a good place to find reference material on Python.

Once you have a solid handle on debugging issues, you can start working on your own projects. You should work on things that interest you. For example, I worked on tools to trade stocks automatically very soon after I learned programming.

Here are some tips for finding interesting projects:

  • Extend the projects you were working on previously, and add more functionality.
  • Check out our list of Python projects for beginners.
  • Go to Python meetups in your area, and find people who are working on interesting projects.
  • Find open source packages to contribute to.
  • See if any local nonprofits are looking for volunteer developers.
  • Find projects other people have made, and see if you can extend or adapt them. Github is a good place to find these.
  • Browse through other people’s blog posts to find interesting project ideas.
  • Think of tools that would make your every day life easier, and build them.

Remember to start very small. It’s often useful to start with things that are very simple so you can gain confidence. It’s better to start a small project that you finish than a huge project that never gets done. At Dataquest, we have guided projects that give you small, data-science-related tasks that you can build on.

It’s also useful to find other people to work with for more motivation.

If you really can’t think of any good project ideas, here are some in each area we’ve discussed:

Data Science / Machine Learning Project Ideas

  • A map that visualizes election polling by state.
  • An algorithm that predicts the weather where you live.
  • A tool that predicts the stock market.
  • An algorithm that automatically summarizes news articles.

You could make a more interactive version of this map. From RealClearPolitics.

Mobile App Project Ideas

  • An app to track how far you walk every day.
  • An app that sends you weather notifications.
  • A realtime location-based chat.

Website Project Ideas

  • A site that helps you plan your weekly meals.
  • A site that allows users to review video games.
  • A notetaking platform.

Python Game Project Ideas

  • A location-based mobile game, where you capture territory.
  • A game where you write code to solve puzzles.

Hardware / Sensors / Robots Project Ideas

  • Sensors that monitor your home temperature and let you monitor your house remotely.
  • A smarter alarm clock.
  • A self-driving robot that detects obstacles.

Work Automation Project Ideas

  • A script to automate data entry.
  • A tool to scrape data from the web.

My first project on my own was adapting my automated essay scoring algorithm from R to Python. It didn’t end up looking pretty, but it gave me a sense of accomplishment, and started me on the road to building my skills.

The key is to pick something and do it. If you get too hung up on picking the perfect project, there’s a risk that you’ll never make one.

Step 5: Keep working on harder projects

Keep increasing the difficulty and scope of your projects. If you’re completely comfortable with what you’re building, it means it’s time to try something harder.

You can choose a new project that pushes you a little beyond your current abilities, or revisit an old one and make it more ambitious.

Here are some ideas for when that time comes:

  • Try teaching a novice how to build a project you made.
  • Can you scale up your tool? Can it work with more data, or can it handle more traffic?
  • Can you make your program run faster?
  • Can you make your tool useful for more people?
  • How would you commercialize what you’ve made?

Going forward

At the end of the day, Python is evolving all the time. There are only a few people who can legitimately claim to completely understand the language, and they created it.

You’ll need to be constantly learning and working on projects. If you do this right, you’ll find yourself looking back on your code from 6 months ago and thinking about how terrible it is. If you get to this point, you’re on the right track. Working only on things that interest you means that you’ll never get burned out or bored.

Python is a really fun and rewarding language to learn, and I think anyone can get to a high level of proficiency in it if they find the right motivation.

I hope this guide has been useful on your journey. If you have any other resources to suggest, please let us know!

Find out more about how you can learn Python and add this skill to your portfolio by visiting Dataquest.

Common Python Questions:


Is it hard to learn Python?

Learning Python can certainly be challenging, and you're likely to have frustrating moments. Staying motivated to keep learning is one of the biggest challenges.

However, if you take the step-by-step approach I've outlined here, you should find that it's easy to power through frustrating moments, because you'll be working on projects that genuinely interest you.

Can you learn Python for free?

There are lots of free Python learning resources out there — just here at Dataquest, we have dozens of free Python tutorials and our interactive data science learning platform, which teaches Python, is free to sign up for and includes many free missions. The internet is full of free Python learning resources!

The downside to learning for free is that to learn what you want, you'll probably need to patch together a bunch of different free resources. You'll spend extra time researching what you need to learn next, and then finding free resources that teach it. Platforms that cost money may offer better teaching methods (like the interactive, in-browser coding Dataquest offers), and they also save you the time of having to find and build your own curriculum.

Can you learn Python from scratch (with no coding experience)?

Yes. At Dataquest, we've had many learners start with no coding experience and go on to get jobs as data analysts, data scientists, and data engineers. Python is a great language for programming beginners to learn, and you don't need any prior experience with code to pick it up. 

How long does it take to learn Python?

Learning a programming language is a bit like learning a spoken language — you're never really done, because programming languages evolve and there's always more to learn! However, you can get to a point of being able to write simple-but-functional Python code pretty quickly.

How long it takes to get to job-ready depends on your goals, the job you're looking for, and how much time you can dedicate to study. But for some context, Dataquest learners we surveyed in 2020 reported reaching their learning goals in less than a year — many in less than six months — with less than ten hours of study per week.

How can I learn Python faster?

Unfortunately, there aren't really any secret shortcuts! The best thing you can do is find a platform that teaches Python (or build a curriculum for yourself) specifically for the skill you want to learn (for example, Python for game dev, or Python for data science).

This should ensure that you're not wasting any time learning things you won't actually need for your day-to-day Python work. But make no mistake, whatever you want to do with Python, it'll take some time to learn!

Do you need a Python certification to find work?

We've written about Python certificates in depth, but the short answer is: probably not. Different companies and industries have different standards, but in data science, certificates don't carry much weight. Employers care about the skills you have — being able to show them a GitHub full of great Python code is much more important than being able to show them a certificate.

Should you learn Python 2 or 3?

We've written about Python 2 or Python 3 as well, but the short answer is this: learn Python 3. A few years ago, this was still a topic of debate, and some extreme predictions even claimed that Python 3 would "kill Python." That hasn't happened, and today, Python 3 is everywhere.

Is Python a good language to learn in 2021?

Yes. Python is a popular and flexible language that's used professionally in a wide variety of contexts.

We teach Python for data science and machine learning, for example, but if you wanted to apply your Python skills in another area, Python is used in finance, web development, software engineering, game development, etc.

If you're working with data, Python is the most in-demand programming language you could learn. Here's data from open job postings on Indeed.com in February of 2021:

  • Python is the most important skill listed in data scientist job postings
  • Python is the most important skill listed in data engineer job postings
  • Python is the second most important skill listed in data analyst job postings

As you can see, Python is a critical skill, and it's listed above every other technical skill in data scientist and data engineering job postings. It ranks second, behind only SQL, in data analyst job postings. Many jobs in all three areas will require both Python and SQL skills, but SQL is a query language. In terms of programming skills, Python is most in-demand.

(Incidentally, we're sometimes asked why Dataquest doesn't teach Julia for data science. The charts above probably answer that question — our curriculum is very focused on real-world skills, and we choose what courses to make based on an analysis of data job postings so that we can be sure the skills you learn at Dataquest are helpful in the real world.)

Moreover, Python data skills can be really useful even if you have no aspiration to become a full-time data scientist or programmer. Having some data analysis skills with Python can be useful for a wide variety of jobs — if you work with spreadsheets, chances are there are things you could be doing faster and better with a little Python. 

The post How to Learn Python (Step-by-Step) in 2021 appeared first on Dataquest.

]]>
The Best Way to Learn SQL (According to Seasoned Devs) https://www.dataquest.io/blog/best-way-to-learn-sql/ Thu, 18 Feb 2021 00:49:47 +0000 https://www.dataquest.io/?p=27610 What's the best way to learn SQL? With all of the resources available, learning SQL the “right way” can be difficult. Finding the best way to learn SQL is tricky because everyone learns things differently. But, after training tens of thousands of students — seeing what works and what doesn’t — we’ve come up with […]

The post The Best Way to Learn SQL (According to Seasoned Devs) appeared first on Dataquest.

]]>
what's the best way to learn SQL?

What's the best way to learn SQL?

With all of the resources available, learning SQL the “right way” can be difficult. Finding the best way to learn SQL is tricky because everyone learns things differently. But, after training tens of thousands of students — seeing what works and what doesn’t — we’ve come up with a few easy steps that anyone can follow. Here’s the best way to learn SQL:

Step 1: Determine why you want to learn SQL

Before you dive into a SQL course, it’s important to be sure you have a good answer for the question “why should I learn SQL?”

That’s because while SQL isn’t too difficult to learn, no learning journey is ever totally smooth sailing. You will likely face moments of frustration and confusion. If you don’t have a good reason to learn SQL, it’s going to be very easy for you to quit in those moments.

There’s no one answer to that question, but here are some of the most common reasons people want to learn SQL:

  • You’re feeling bottlenecked by Excel and sick of VLOOKUP
  • You want to be able to access your company’s data easily, on-demand
  • You want to be able to work with bigger datasets quickly
  • You want to get a job as a data analyst, data scientist, or data engineer (and you know that SQL is the single most important skill for those jobs)
  • You want to create transparent, repeatable data processes to reduce repetitive tasks

Of course, those are just a few broad reasons. You need to find a reason that speaks to you. It may be something very specific, like a particular question you’d like to answer about your customers, or a particular dashboard you’d like to build.

(Can you build a dashboard with SQL? Sort of. We’ll get to that later!)

Step 2: Learn the basic syntax

This typically isn’t people’s favorite part of learning a programming language (or in this case, a query language). But it can’t be avoided. There’s just no way you can get to a functional level of SQL without being able to look at something like this and know what’s going on:

SELECT c.name capital_city, f.name country
FROM facts f
INNER JOIN (
        SELECT * FROM cities
                WHERE capital = 1
                ) c ON c.facts_id = f.id
LIMIT 10;

Thankfully, learning this may be easier than you think. While that probably looks complicated and confusing at first glance, SQL’s syntax is actually pretty straightforward. And the list of SQL commands — the all-caps words like SELECT in the code above — that you’ll actually use on a regular basis is short.

The key to being successful with this step is to power through it as quickly as you can. Set aside a few hours to work through Dataquest’s first SQL course all at once. Or pick another learning resource and set aside enough time to get through the basics.

What’s most important here is that you don’t drag this out. You want to get to the point of being able to actually do things with SQL as quickly as possible, because being able to dig into real problems and find the answers is a powerful motivator. That’s what’s really going to keep you motivated and learning, so we want to get you to that point as fast as possible.

Step 3: Start working on guided projects

As soon as you’ve learned the basics, it’s time to start diving into actual projects using SQL.

If you’re learning with us at Dataquest, this is built into the curriculum, with interactive guided projects that challenge you to use your new SQL skills to query and analyze real databases for answers.

If you’re not learning with Dataquest, we suggest the next best thing: guided projects and tutorials.

You need to find something that will give you a bit of structure and guidance, because at this stage, the process of trying to build a full SQL project from scratch would probably be frustrating. You want something that you can try to do on your own, but that also offers some guidance you can look to when you get lost or aren’t sure what to do next.

For example, here’s a tutorial on joins in SQL. This would be great practice, but try to work through it on your own, checking the code snippets only to make sure you’re right after you’ve written your own queries.

Remember, the goal here is to work on guided projects with increasing independence. If you’re simply copy-pasting code from a tutorial, you won’t be learning much, so be sure you’ve given it your best effort before you check the answer.

Step 4: Familiarize yourself with helpful SQL resources

Once you’ve worked through some guided projects, it’s time to step out on your own. The good news: you can work with exactly the data you want, to answer exactly the questions you want. How motivating!

The bad news: there’s no answer key you can check! So before you start your first project, it’s helpful to bookmark a few useful SQL resources. Remember, there’s no shame whatsoever in Googling for answers — even the most seasoned SQL developers and users do this frequently!

Useful SQL resources:

  • Learning SQL 2nd Edition (PDF) — This O’Reilly book on the basics of SQL is available for free in PDF format, and makes a good reference.
  • StackOverflow SQL questions — Chances are, any SQL question you’ll have has already been answered here. But if it hasn’t, create an account and ask it for yourself!
  • Github — If SQL is your first foray into the world of programming, you may not have an account here. If that’s the case, set one up and start learning how to use it! Github is great for sharing your own SQL projects with the world (and potential employers), and it’s also an amazing resource for looking at other people’s code.
  • /r/SQL — Reddit has a SQL community that’s large, active, and (mostly) happy to answer questions.
  • The Dataquest community — Our community is active, friendly, and ready to help you with all your SQL questions. Best of all, it’s open to everyone — you don’t have to be a Dataquest subscriber to get help there.

Step 5: Build your own SQL projects

Now that you know some good places to look for help when you get stuck, it’s time to start working on your own SQL projects.

This is where the answer you came up with in Step 1 really starts to matter. Knowing why you want to learn SQL will probably help you answer the question: what projects should I work on?

The short answer? Work on projects you care about. If you’re learning SQL because you’re sick of Excel slowing you down at work, then your first project should probably be figuring out how to do those work tasks more efficiently with SQL.

If you’re learning SQL because you want a particular job, you should work on SQL projects that are as close as possible to what you’ll actually be doing when you get the job. For example, if your passion is crunching data to help decrease carbon emissions and make things more energy-efficient, then you’ll probably want to work on projects that relate to that goal.

We should note: this step can be a little challenging if you’re not working at a company, or if you don’t want to use company data for your projects. Finding a SQL database that’s freely available to all that contains exactly the kind of data you want to work with can be difficult, depending on your goals.

But never fear! While it takes a little bit of extra effort, it is possible to convert any downloadable data you find in CSV format (or something similar) into a SQL database format such as a SQLite table. There are even sites that can make the conversion process pretty easy.
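For example, here's a minimal sketch of one common approach, using Python's pandas library and the built-in sqlite3 module (the names my_data.csv, my_project.db, and my_table are just placeholders for this illustration):

import sqlite3
import pandas as pd

# Read the CSV into a DataFrame, then write it out as a table in a SQLite database.
# "my_data.csv", "my_project.db", and "my_table" are placeholder names.
df = pd.read_csv("my_data.csv")

conn = sqlite3.connect("my_project.db")  # creates the database file if it doesn't exist
df.to_sql("my_table", conn, if_exists="replace", index=False)

# From here on, the data can be queried with ordinary SQL.
print(pd.read_sql_query("SELECT * FROM my_table LIMIT 5;", conn))
conn.close()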

Whatever data you want to work with, with a little digging, you should be able to find a way to work with it using SQL.

And don’t forget: share your SQL projects on your Github when you’re finished with them. Go back and update them when you learn something new!

Step 6: Make more advanced projects

The final step is essentially a continuation of Step 5, and you can repeat this for as long as you’d like. The key to continued learning here is that you have to ramp up the challenge.

Once you’ve learned how to build the SQL project that initially motivated you — maybe you’ve written a query that replaces your old Excel workflow — it can be tempting to keep doing projects along those same lines.

Doing the same thing over and over is good for retention, but it’ll stunt your growth. It’s best to try to ensure that with each new project, you’re learning or trying at least one new thing — something you don’t already know how to do.

This could mean you’re taking on an entirely new project, or it could mean revisiting an old project to give it new complexity.

It could also mean taking on challenges you may not have thought about previously, such as:

  • Can you integrate your SQL skills with a tool like Mode to produce a dashboard?
  • Can you teach someone else how to query your company database using SQL?

At this point, you’ll have the skills to do more or less anything you want with SQL — not because you know how to do everything, but because your project-building process has taught you how to find the answers to anything you don’t know.

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why watch video lectures when you can learn by doing?

Frequently-asked questions about SQL:

Is SQL difficult to learn?

That’s a very personal question — what’s very easy to one person may seem very difficult to the next, and vice versa. However, most people find SQL pretty easy to learn, especially when compared to full-on programming languages like Python or R.

That’s because unlike a “full” programming language, SQL is a query language. It’s built specifically for interacting with relational database management systems such as Microsoft SQL Server, Oracle, SQLite, MySQL, etc. For that reason, there’s not as much to learn, and some of the more complex concepts that exist by necessity in more holistic programming languages aren’t a factor in SQL.

That said, the fact that most people find SQL relatively easy to learn does not mean that you will, or that you should feel ashamed if you find it challenging! Particularly if this is your first foray into the world of programming, you should be ready for a challenge.

(But don’t worry. No matter what your background is, you can totally learn SQL. Our community is here to help you anytime you need it!)

SQL or Python: which is better to learn?

The answer to this question really depends on your goals. They’re very different things.

SQL is a query language. It’s really only useful for interacting with, filtering, and lightly analyzing data from databases. It offers a lot of power for working with data in those contexts, but it can’t do all the things a full programming language like Python can do.

Python is a programming language. That makes it a bit more complex to learn, but it also means it can do a lot more. You can analyze data in Python, but you can also use it to build machine learning models. Or make video games. Or program a robot. Or design art.

If you work with data often — if you’re opening spreadsheets every day and you know what VLOOKUP is — there’s a good chance you’d benefit from learning both languages.

At Dataquest, we teach both Python and SQL as part of our Data Analyst and Data Scientist career paths. Both skills are required for full-time data jobs (although R can be substituted for Python, learning SQL is non-negotiable).

Can you learn SQL on your own?

Yes. We’ve seen thousands of students do exactly that, working through our interactive SQL courses on their own time, at their own pace.

Even if you’re not using Dataquest, it is absolutely possible to learn SQL on your own. Having a supportive community you can turn to for help certainly can make things easier, though!

How fast can you learn SQL?

The short answer is: pretty fast.

The longer answer is you can learn the basics — enough to be functional — quite quickly. Even with part-time study (for example, working on SQL in the evenings after a full-time job), many learners who’ve never coded before can reach their goals and be able to complete independent SQL projects in just a few months.

If you have some programming experience and/or you’re willing to spend a bit more time each day studying, you can learn enough SQL to accomplish your goals even faster than that!

But with that said, learning SQL is like learning any language — there’s never really an end point. Even the pros who are using SQL at work every day still learn new things now and then. Learning any programming language should be considered a lifelong journey, not something that starts and ends over the span of a few months.

Master SQL by writing SQL!

  • Build SQL projects
  • No setup required
  • In-demand skills

Why watch video lectures when you can learn by doing?

The post The Best Way to Learn SQL (According to Seasoned Devs) appeared first on Dataquest.

]]>
SQL Commands: The Complete List (w/ Examples) https://www.dataquest.io/blog/sql-commands/ Wed, 17 Feb 2021 15:47:05 +0000 https://www.dataquest.io/?p=27592 What can you do with SQL? Here's a reference guide to the most commonly-used SQL commands, with code examples.

The post SQL Commands: The Complete List (w/ Examples) appeared first on Dataquest.

]]>
shouting sql commands at your database with a megaphone

To get a data job in 2021, you are going to need to learn SQL. As with any language and especially when you are a beginner, it can be useful to have a list of common SQL commands and operators in one place to refer to whenever you need it — we’d like to be that place for you!

Below is a comprehensive list of SQL commands, organized by the top-level command each falls under (e.g. SELECT TOP is listed within the SELECT category). 

If you’re on a journey to learn SQL and you’ve been frustrated by the lack of structure or the dull curriculum composed of Google searches, then you may like Dataquest’s interactive SQL courses. Whether you’re a beginner trying to get job-ready or a seasoned developer looking to stay sharp, there’s a SQL course for you. 

List of SQL Commands

SELECT

SELECT is probably the most commonly-used SQL statement. You'll use it pretty much every time you query data with SQL. It allows you to define what data you want your query to return.

For example, in the code below, we’re selecting a column called name from a table called customers.

SELECT name
FROM customers;

SELECT *

SELECT used with an asterisk (*) will return all of the columns in the table we're querying.

SELECT * FROM customers;

SELECT DISTINCT

SELECT DISTINCT only returns data that is distinct — in other words, if there are duplicate records, it will return only one copy of each.

The code below would return only rows with a unique name from the customers table.

SELECT DISTINCT name
FROM customers;

SELECT INTO

SELECT INTO copies the specified data from one table into a new table.

SELECT * INTO customers_backup
FROM customers;

SELECT TOP

SELECT TOP returns only the top x rows, or the top x percent of rows, from a table. (TOP is SQL Server syntax; most other databases use LIMIT for the same purpose.)

The code below would return the top 50 results from the customers table:

SELECT TOP 50 * FROM customers;

The code below would return the top 50 percent of the customers table:

SELECT TOP 50 PERCENT * FROM customers;

AS

AS renames a column or table with an alias that we can choose. For example, in the code below, we’re renaming the name column as first_name:

SELECT name AS first_name
FROM customers;

FROM

FROM specifies the table we're pulling our data from:

SELECT name
FROM customers;

WHERE

WHERE filters your query to only return results that match a set condition. We can use this together with conditional operators like =, >, <, >=, <=, etc.

SELECT name
FROM customers
WHERE name = 'Bob';

AND

AND combines two or more conditions in a single query. All of the conditions must be met for the result to be returned.

SELECT name
FROM customers
WHERE name = 'Bob' AND age = 55;

OR

OR combines two or more conditions in a single query. Only one of the conditions must be met for a result to be returned.

SELECT name
FROM customers
WHERE name = 'Bob' OR age = 55;

BETWEEN

BETWEEN filters your query to return only results that fit a specified range.

SELECT name
FROM customers
WHERE age BETWEEN 45 AND 55;

LIKE

LIKE searches for a specified pattern in a column. In the example code below, any row with a name that includes the characters Bob would be returned.

SELECT name
FROM customers
WHERE name LIKE '%Bob%';

Other operators for LIKE:

  • %x — will select all values that end with x
  • %x% — will select all values that include x
  • x% — will select all values that begin with x
  • x%y — will select all values that begin with x and end with y
  • _x% — will select all values that have x as the second character
  • x_% — will select all values that begin with x and are at least two characters long. You can add additional _ characters to extend the length requirement, i.e. x___%

IN

IN allows us to specify multiple values we want to select for when using the WHERE command.

SELECT name
FROM customers
WHERE name IN ('Bob', 'Fred', 'Harry');

IS NULL

IS NULL will return only rows with a NULL value.

SELECT name
FROM customers
WHERE name IS NULL;

IS NOT NULL

IS NOT NULL does the opposite — it will return only rows without a NULL value.

SELECT name
FROM customers
WHERE name IS NOT NULL;

CREATE

CREATE can be used to set up a database, table, index or view.


CREATE DATABASE

CREATE DATABASE creates a new database, assuming the user running the command has the correct admin rights.

CREATE DATABASE dataquestDB;

CREATE TABLE

CREATE TABLE creates a new table inside a database. The terms int and varchar(255) in this example specify the datatypes of the columns we're creating.

CREATE TABLE customers (
    customer_id int,
    name varchar(255),
    age int
);

CREATE INDEX

CREATE INDEX generates an index for a table. Indexes are used to retrieve data from a database faster.

CREATE INDEX idx_name
ON customers (name);

CREATE VIEW

CREATE VIEW creates a virtual table based on the result set of an SQL statement. A view is like a regular table (and can be queried like one), but it is not saved as a permanent table in the database.

CREATE VIEW [Bob Customers] AS
SELECT name, age
FROM customers
WHERE name = 'Bob';

DROP

DROP statements can be used to delete entire databases, tables or indexes.

It goes without saying that the DROP command should only be used where absolutely necessary.


DROP DATABASE

DROP DATABASE deletes the entire database, including all of its tables, indexes, etc., as well as all the data within it.

Again, this is a command we want to be very, very careful about using!

DROP DATABASE dataquestDB;

DROP TABLE

DROP TABLE deletes a table as well as the data within it.

DROP TABLE customers;

DROP INDEX

DROP INDEX deletes an index within a database.

DROP INDEX idx_name;

UPDATE

The UPDATE statement is used to update data in a table. For example, the code below would update the age of any customer named Bob in the customers table to 56.

UPDATE customers
SET age = 56
WHERE name = 'Bob';

DELETE

DELETE removes rows from a table. Used on its own it deletes every row; combined with a WHERE clause it deletes only the rows that meet a specific condition.

DELETE FROM customers
WHERE name = 'Bob';

ALTER TABLE

ALTER TABLE allows you to add or remove columns from a table. In the code snippets below, we’ll add and then remove a column for surname. The text varchar(255) specifies the datatype of the column.

ALTER TABLE customers
ADD surname varchar(255);
ALTER TABLE customers
DROP COLUMN surname;

We hope this is a helpful resource,

but it's not the best way to learn SQL

Learn SQL by actually doing it!

Interactive lessons allow you to write, run, and check queries right in your browser window.

AGGREGATE FUNCTIONS (COUNT/SUM/AVG/MIN/MAX)

An aggregate function performs a calculation on a set of values and returns a single result.

COUNT

COUNT returns the number of rows that match the specified criteria. In the code below, we’re using *, so the total row count for customers would be returned.

SELECT COUNT(*)
FROM customers;

SUM

SUM returns the total sum of a numeric column.

SELECT SUM(age)
FROM customers;

AVG

AVG returns the average value of a numeric column.

SELECT AVG(age)
FROM customers;

MIN

MIN returns the minimum value of a numeric column.

SELECT MIN(age)
FROM customers;

MAX

MAX returns the maximum value of a numeric column.

SELECT MAX(age)
FROM customers;

GROUP BY

The GROUP BY statement groups rows with the same values into summary rows. The statement is often used with aggregate functions. For example, the code below will display the average age for each name that appears in our customers table.

SELECT name, AVG(age)
FROM customers
GROUP BY name;

HAVING

HAVING filters results in much the same way as the WHERE clause. The difference is that HAVING is applied after aggregation, so it can filter on aggregate functions, which WHERE cannot.

The below example would return the number of rows for each name, but only for names with more than 2 records.

SELECT COUNT(customer_id), name
FROM customers
GROUP BY name
HAVING COUNT(customer_id) > 2;

ORDER BY 

ORDER BY sets the order of the returned results. The order will be ascending by default.

SELECT name
FROM customers
ORDER BY age;

DESC

DESC will return the results in descending order.

SELECT name
FROM customers
ORDER BY age DESC;

OFFSET

The OFFSET statement works with ORDER BY and specifies the number of rows to skip before starting to return rows from the query.

SELECT name
FROM customers
ORDER BY age
OFFSET 10 ROWS;

FETCH

FETCH specifies the number of rows to return after the OFFSET clause has been processed. The OFFSET clause is mandatory, while the FETCH clause is optional.

SELECT name
FROM customers
ORDER BY age
OFFSET 10 ROWS
FETCH NEXT 10 ROWS ONLY;

JOINS (INNER, LEFT, RIGHT, FULL)

A JOIN clause is used to combine rows from two or more tables. The four types of JOIN are INNER, LEFT, RIGHT and FULL.


INNER JOIN

INNER JOIN selects records that have matching values in both tables.

SELECT name
FROM customers
INNER JOIN orders
ON customers.customer_id = orders.customer_id;

LEFT JOIN

LEFT JOIN returns all records from the left table, along with any matching records from the right table. In the below example the left table is customers.

SELECT name
FROM customers
LEFT JOIN orders
ON customers.customer_id = orders.customer_id;

RIGHT JOIN

RIGHT JOIN returns all records from the right table, along with any matching records from the left table. In the below example the right table is orders.

SELECT name
FROM customers
RIGHT JOIN orders
ON customers.customer_id = orders.customer_id;

FULL JOIN

FULL JOIN (also written FULL OUTER JOIN) returns all records from both tables, matching rows together where possible. Think of it as the “OR” JOIN compared with the “AND” JOIN (INNER JOIN).

SELECT name
FROM customers
FULL OUTER JOIN orders
ON customers.customer_id = orders.customer_id;

EXISTS

EXISTS is used to test for the existence of any record in a subquery. The example below returns the names of customers who have placed at least one order.

SELECT name
FROM customers
WHERE EXISTS
(SELECT 1 FROM orders WHERE orders.customer_id = customers.customer_id);

GRANT

GRANT gives a particular user access to database objects such as tables, views or the database itself. The below example would give SELECT and UPDATE access on the customers table to a user named “usr_bob”.

GRANT SELECT, UPDATE ON customers TO usr_bob;

REVOKE

REVOKE removes a user's permissions for a particular database object.

REVOKE SELECT, UPDATE ON customers FROM usr_bob;

SAVEPOINT

SAVEPOINT allows you to identify a point in a transaction to which you can later roll back. Similar to creating a backup.

SAVEPOINT SAVEPOINT_NAME;

COMMIT

COMMIT is for saving every transaction to the database. A COMMIT statement will release any existing savepoints that may be in use and once the statement is issued, you cannot roll back the transaction.

DELETE FROM customers
WHERE name = 'Bob';
COMMIT;

ROLLBACK

ROLLBACK is used to undo transactions which are not saved to the database. This can only be used to undo transactions since the last COMMIT or ROLLBACK command was issued. You can also rollback to a SAVEPOINT that has been created before.

ROLLBACK TO SAVEPOINT_NAME;

TRUNCATE

TRUNCATE TABLE removes all data entries from a table in a database, but keeps the table and structure in place. Similar to DELETE.

TRUNCATE TABLE customers;

UNION

UNION combines multiple result-sets using two or more SELECT statements and eliminates duplicate rows.

SELECT name FROM customers
UNION
SELECT name FROM orders;

UNION ALL

UNION ALL combines multiple result-sets using two or more SELECT statements and keeps duplicate rows.

SELECT name FROM customers
UNION ALL
SELECT name FROM orders;

We hope this page serves as a helpful quick-reference guide to SQL commands. But if you really want to build your SQL skills, copy-pasting code won't cut it. Check out our interactive SQL courses and start learning by doing!

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why watch video lectures when you can learn by doing?

The post SQL Commands: The Complete List (w/ Examples) appeared first on Dataquest.

]]>
SQL vs MySQL: A Simple Guide to the Differences https://www.dataquest.io/blog/sql-vs-mysql/ Thu, 11 Feb 2021 23:25:51 +0000 https://www.dataquest.io/?p=27560 SQL and MySQL are important tools for working with data. But what are they, exactly, and how are they different? Let's clear that up.

The post SQL vs MySQL: A Simple Guide to the Differences appeared first on Dataquest.

]]>
sql or mysql, what's the difference?

SQL and MySQL are two of the most popular data management tools in the world. But for a beginner, or even someone with more experience, the difference between the two can be confusing. 

In this post, we’re going to define what SQL and MySQL are, investigate the differences between them and dive into some of the alternative products out there.

Must-Know SQL-Related Key Terms

Before we get started, let us explain a couple of the key terms that we will be using throughout. Feel free to skip this section if you want to get straight into the article, or come back to it later for a refresher.

Database

A database is a set of data stored in a computer and it is usually structured in a way that makes the data easily accessible. 

Relational Database Management System

A relational database is a type of database that allows us to identify and access data in relation to another piece of data in the database. It stores data in rows and columns in a series of tables to make processing and querying efficient.

A simple example of a relational database: imagine a small business, Company X, that takes orders from customers. It sets up two tables in its database:

  • Customer_information_table (which has fields for customer_id, address, phone_number, etc…)
  • Customer_orders_table (which has fields for customer_id, product, quantity, etc…)

The two tables have a relationship (they share the customer_id field). That's what makes this a relational database. 

In Company X’s warehouse, they process orders by going through records in the Customer orders table. But they can also use the customer_id in that orders table to grab more information about a customer from the Customer information table.

Not only is this a more efficient way of storing data but it means that if you need to update a customer's information you can do so in one place (the Customer information table), rather than having to update multiple tables with redundant information.
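As a rough sketch (the table names, column names, and sample rows here are purely illustrative, not a prescribed schema), Company X's two related tables might look something like this in SQLite, run from Python:

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the example

# Customer details are stored once; each order refers back to them via customer_id.
conn.executescript("""
CREATE TABLE customer_information (
    customer_id  INTEGER PRIMARY KEY,
    address      TEXT,
    phone_number TEXT
);
CREATE TABLE customer_orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer_information(customer_id),
    product     TEXT,
    quantity    INTEGER
);
""")

conn.execute("INSERT INTO customer_information VALUES (1, '123 Main St', '555-0100');")
conn.executemany("INSERT INTO customer_orders VALUES (?, ?, ?, ?);",
                 [(1, 1, 'widget', 3), (2, 1, 'gadget', 1)])

# The shared customer_id column is the relationship: both orders pull the same
# customer details, even though those details are stored only once.
for row in conn.execute("""
    SELECT o.order_id, o.product, c.address, c.phone_number
      FROM customer_orders o
      JOIN customer_information c ON o.customer_id = c.customer_id;
"""):
    print(row)
conn.close()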

Our article on SQL basics goes into more detail on relational databases. Most modern databases are set up this way because they are easier to manage, more flexible, and more scalable.

Relational databases are sometimes referred to as RDBMS — Relational Database Management Systems.

Storage Engine

A storage engine is a piece of software that a database management system uses to create, read and update data from a database.

Open Source

Open source simply means software in which the original source code is made freely available to all and may be redistributed and modified.

What is SQL?

SQL stands for Structured Query Language and is pronounced “S.Q.L.” or “Sequel”. It is a special kind of programming language that is used for communicating with a database.

If you want to add, retrieve, or update data in a database you can use SQL to do that. 

This is important because most companies store their data in databases. There are many types of databases and most of them speak SQL. We will discuss two of these in this article (MySQL and SQL Server), but there are many others such as PostgreSQL, IBM Db2, and Amazon Aurora, just to name a few.

Learning the basics of SQL will likely serve you well with whichever database you or your company uses.

Fun Fact: SQL became the official standard of the American National Standards Institute (ANSI) in 1986, and of the International Organization for Standardization (ISO) in 1987. Although it has been around for decades, it's still widely used and very in-demand today!

What is MySQL?

MySQL is an open source Relational Database Management System (RDBMS) owned by Oracle.

It is an extremely popular tool for several reasons. Firstly, its open source status means it is completely free to use. Experienced developers can even dive right in and change its source code to suit their needs, if they wish.

Even though MySQL is free to use, Oracle does offer premier support services which you can buy through a commercial license.

MySQL is also heavily supported and users can run the software on a variety of platforms and operating systems including Windows, Linux, UNIX and more.

MySQL provides various storage engines for its tables: MyISAM, InnoDB, Merge, MEMORY (HEAP), ARCHIVE, CSV and FEDERATED. 

For example, the CSV engine will store the data in a CSV file format. This could be used to migrate data into alternative, non-SQL applications such as spreadsheet software.

Each of these storage engines has its own advantages and disadvantages. Prior to creating your database, it is important to understand each and choose the most appropriate one for your tables to maximize the performance of the database.

We’ve barely scratched the surface of what MySQL can offer. However, it should be enough to understand the differences between SQL and MySQL.

Fun Fact: MySQL owes its name to one of the founders - Michael "Monty" Widenius - who named it after his daughter My.

What is the difference between SQL and MySQL?

In a nutshell, SQL is a language for querying databases and MySQL is an open source database product. 

SQL is used for accessing, updating and maintaining data in a database and MySQL is an RDBMS that allows users to keep the data that exists in a database organized.

SQL does not change (much), as it is a language. MySQL updates frequently as it is a piece of software.

In layman's terms, SQL could be seen as a bank teller and MySQL could be seen as the bank. You need the bank teller (SQL) to communicate with the bank (MySQL) and you need the bank to manage the money (the data). They work in tandem but they are completely different.

What is SQL Server?

Like MySQL, SQL Server is a relational database management system. However, unlike MySQL, SQL Server is not open source. It is owned by Microsoft and there are several editions available, depending on the users’ needs and budget.

One of these editions is called SQL Server Express and is free to download and distribute. It comprises a database engine specifically targeted at embedded and smaller-scale applications.

A common question for those new to the field is “are SQL and SQL Server the same thing?”. In a word: no. The difference between the two is similar to the difference we laid out between SQL and MySQL. SQL is a language for querying databases and SQL Server is a system for managing relational databases.

In terms of MySQL vs SQL Server, there’s no right answer for every organization.

If you’re a startup company strapped for cash, you’re likely to opt for MySQL.

If you’re a large company looking to run high volumes of activity on a database, then you might lean towards SQL Server.

When it comes down to it, each of these systems has its own advantages and disadvantages.

Why should I use SQL?

If you want a job in data then you’re going to need to learn SQL. It’s well supported, it’s the most commonly used language in data science and it’s constantly in high demand.

Check out our article on SQL certification for more details on why it’s such an important skill to learn.

Why should I use MySQL?

You should use MySQL if you are looking to set up a database that is cheap (or free!), secure and reliable. You can download the software and be up-and-running in a matter of minutes.

You will then need to learn the SQL language to start using it effectively.

Conclusion

As we can see, it's difficult to actually compare SQL and MySQL. While they are related (and have similar names), they do completely different things and can be used individually or in tandem depending on what you are trying to achieve.

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why passively watch video lectures when you can learn by doing?

The post SQL vs MySQL: A Simple Guide to the Differences appeared first on Dataquest.

]]>
SQL Interview Questions — Real Questions to Prep for Your Job Interview https://www.dataquest.io/blog/sql-interview-questions/ Mon, 01 Feb 2021 22:46:42 +0000 https://www.dataquest.io/?p=27411 A lot of the SQL interview questions you'll find on the web are generic: "What is SQL?" You'll never be asked that. We've got real questions to help you prep.

The post SQL Interview Questions — Real Questions to Prep for Your Job Interview appeared first on Dataquest.

]]>
sql-interview-questions

If you're looking for a job in data, chances are you're going to have to answer some SQL interview questions, or complete some kind of SQL test.

That's because SQL skills are required for most data jobs. We dug into the data in depth in this post about why you should learn SQL, but the short version is this: more than half of all data analyst, data scientist, and data engineer jobs in 2021 list SQL as a requirement.

The importance of SQL is especially stark for data analyst roles:

Skills listed in data analyst job posts, SQL is the most in-demand skill

SQL is far and away the most in-demand skill for Data Analyst roles. Data: Indeed.com, 1/29/2021.

Preparing for SQL questions in a job interview

We've written an extensive guide on job interviews in data science. You should be aware that SQL will almost certainly play a role in your interview process, especially if you're looking at Data Analyst roles.

Every company does things differently, but here are a few of the more common ways companies test SQL skills:

  • In-person (or video) interview where you're asked SQL questions or given SQL problems to solve.
  • Take-home SQL task or tasks.
  • In-person (or video) live coding session where you're asked to use SQL skills to answer questions in real time.
  • Whiteboard coding session where you're asked to demonstrate your SQL skills by sketching out queries on a whiteboard.

If you're not comfortable writing SQL queries already, there's no time like the present to sign up for a free account and dive into our interactive SQL courses. But let's say you're already a SQL master. You're still going to want some practice!

And that's where you may encounter a problem.

Online practice question lists are (mostly) terrible

If you Google "SQL Interview questions," you're going to find a bunch of articles that list questions like these (these are all real questions pulled from top-ranking articles):

  • What is SQL?
  • What is a database?
  • What are tables?
  • What is a join?

You get the idea. And we suppose it's possible that you'll be asked "what is SQL" in a job interview. But it's definitely not likely.

Much more likely: the SQL interview questions you'll face will be asking you to solve real problems with SQL, or asking you to answer trickier questions that test your working knowledge.

We've compiled some of these questions below, and provided expandable answers so that you can test yourself, and then check to make sure you're right.

Test yourself with real SQL interview questions:

Question 1

Given the table below, write a SQL query that retrieves the personal data about alumni who scored above 16 on their calculus exam.

alumni

student_id  name     surname  birth_date  faculty
347         Daniela  Lopes    1991-04-26  Medical School
348         Robert   Fischer  1991-03-09  Mathematics

evaluation

student_id  class_id  exam_date   grade
347         74        2015-06-19  16
347         87        2015-06-06  20
348         74        2015-06-19  13

curricula

class_id  class_name  professor_id  semester
74        algebra     435           2015_summer
87        calculus    532           2015_summer
46        statistics  625           2015_winter

Click to reveal answer

There are several possible answers. Here’s one:

SELECT a.name, a.surname, a.birth_date, a.faculty
  FROM alumni AS a
 INNER JOIN evaluation AS e
       ON a.student_id=e.student_id
 INNER JOIN curricula AS c
       ON e.class_id = c.class_id
 WHERE c.class_name = 'calculus' AND e.grade>16;

Question 2

We’ll work with the beverages table. Its first rows are given below.

id  name        launch_year  fruit_pct  contributed_by
1   Bruzz       2007         45         Sam Malone
2   Delightful  2008         41         Sam Malone
3   Nice        2015         42         Sam Malone

Write a query to extract only beverages where fruit_pct is between 35 and 40 (including both ends).

Click to reveal answer

There are several possible answers. Here’s one:

SELECT *
  FROM beverages
 WHERE fruit_pct BETWEEN 35 AND 40;

Question 3

We’ll work with the beverages table again. Its first rows are given below.

id  name        launch_year  fruit_pct  contributed_by
1   Bruzz       2007         45         Sam Malone
2   Delightful  2008         41         Sam Malone
3   Nice        2015         42         Sam Malone

Write a query to extract only beverages whose contributor has just one name.

Click to reveal answer

There are several possible answers. Here’s one:

SELECT *
  FROM beverages
 WHERE contributed_by NOT LIKE '% %';

Question 4

We’ll work with the beverages table again. Its first rows are given below.

id  name        launch_year  fruit_pct  contributed_by
1   Bruzz       2007         45         Sam Malone
2   Delightful  2008         41         Sam Malone
3   Nice        2015         42         Sam Malone

Write a query that finds the average fruit_pct by contributor and displays it in ascending order.

Click to reveal answer

There are several possible answers. Here’s one:

SELECT contributed_by, AVG(fruit_pct) AS mean_fruit
  FROM beverages
 GROUP BY contributed_by
 ORDER BY mean_fruit;

Question 5

Take a look at the query given below:

SELECT column, AGG_FUNC(column_or_expression)
  FROM a_table
 INNER JOIN some_table
       ON a_table.column = some_table.column
 WHERE a_condition
 GROUP BY column
HAVING some_condition
 ORDER BY column
 LIMIT 5;

In what order does SQL run the clauses? Select the correct option from the list of choices below:

  1. SELECT, FROM, WHERE, GROUP BY
  2. FROM, WHERE, HAVING, SELECT, LIMIT
  3. SELECT, FROM, INNER JOIN, GROUP BY
  4. FROM, SELECT, LIMIT, WHERE

Click to reveal answer

The correct option is 2. It goes like this:

  1. The SQL engine fetches the data from the tables (FROM and INNER JOIN)
  2. Filters it (WHERE)
  3. Aggregates the data (GROUP BY)
  4. Filters the aggregated data (HAVING)
  5. Selects the columns and expressions to display (SELECT)
  6. Orders the remaining data (ORDER BY)
  7. Limits the results (LIMIT)

Question 6

What is the purpose of an index in a database table?

Click to reveal answer

The purpose of an index in a database table is to improve the speed of looking through that table's data. The standard analogy is that it's (usually) much faster to look up something in a book by looking at its index than by flipping every page until we find what we want.
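To see this in action, here's a small sketch using Python's built-in sqlite3 module (the table, sample data, and index name are made up for the illustration). SQLite's EXPLAIN QUERY PLAN output shows the lookup switching from a full table scan to an index search once the index exists.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT, age INTEGER);")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?);",
                 [(i, "name_" + str(i), 20 + i % 50) for i in range(1000)])

# Without an index, SQLite scans the whole table to find a matching name.
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM customers WHERE name = 'name_42';").fetchall())

# With an index on name, the same lookup can jump straight to the matching rows.
conn.execute("CREATE INDEX idx_name ON customers (name);")
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM customers WHERE name = 'name_42';").fetchall())
conn.close()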

Question 7

What rows of my_table does the following query yield? Give a descriptive answer.

SELECT *
  FROM my_table
 WHERE 1 = 1.0;

Click to reveal answer

It returns the whole table because 1=1.0 always evaluates to true.

 

Question 8

What rows of my_table does the following query yield? Give a descriptive answer.

SELECT *
  FROM my_table
 WHERE NULL = NULL;

Click to reveal answer

It returns no rows because, by definition, NULL does not equal itself.
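If you want to see this for yourself, here's a quick sketch using Python's sqlite3 module (my_table and its contents are invented for the demo). The NULL = NULL comparison never matches a row, while IS NULL does.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (val INTEGER);")
conn.executemany("INSERT INTO my_table VALUES (?);", [(1,), (None,)])

# NULL = NULL evaluates to NULL (unknown), never true, so no rows are returned.
print(conn.execute("SELECT * FROM my_table WHERE NULL = NULL;").fetchall())  # []

# IS NULL is the correct way to test for missing values.
print(conn.execute("SELECT * FROM my_table WHERE val IS NULL;").fetchall())  # [(None,)]
conn.close()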

 

More resources for SQL interview prep

We'll be adding new questions to that list over time, but in the interim, here are some more helpful resources for review during your SQL interview question prep:

Of course, don't forget to bookmark this post, because we'll be adding more SQL interview questions for you to quiz yourself with over time!

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why passively watch video lectures when you can learn by doing?

SQL questions written by Bruno Cunha.

The post SQL Interview Questions — Real Questions to Prep for Your Job Interview appeared first on Dataquest.

]]>
SQL Basics — Hands-On Beginner SQL Tutorial Analyzing Bike-Sharing https://www.dataquest.io/blog/sql-basics/ Mon, 01 Feb 2021 09:00:00 +0000 https://dq.t79ae38x-liquidwebsites.com/2017/05/09/sql-basics/ Learn the SQL basics and go hands-on querying databases as you analyze bike rental data in this free beginner SQL tutorial.

The post SQL Basics — Hands-On Beginner SQL Tutorial Analyzing Bike-Sharing appeared first on Dataquest.

]]>
learn the basics of sql while analyzing bike sharing data

Although learning anything new can be intimidating, mastering the SQL basics is actually not as difficult as you might think.

In this tutorial, we're going to dig into SQL basics from the perspective of a total beginner to get you up and running with this crucial skill.

Let's start by answering a few questions:

Who can benefit from learning SQL basics?

SQL, pronounced "sequel" (or S-Q-L, if you prefer), is a critical tool for data analysts, data scientists, and a wide variety of professionals in other roles, including marketing, finance, HR, sales, and much more.

SQL is the most important language for getting a job in data, but that's just the tip of the iceberg — since most companies store their data in SQL-based databases, almost anyone who works with company data or spreadsheets can benefit from learning SQL.

What is SQL?

SQL stands for Structured Query Language. A query language is a kind of programming language that's designed to facilitate retrieving specific information from databases, and that's exactly what SQL does. To put it simply, SQL is the language of databases.

That matters because most companies store their data in databases. And while there are many types of databases (MySQL, PostgreSQL, Microsoft SQL Server), most of them speak SQL. Once you've got SQL basics under your belt, you'll be able to work with any of them.

Even if you're planning to do your analysis with another language like Python, at most companies, chances are you'll need to use SQL to retrieve the data you need from the company's database. As of this writing, there are more than 75,000 open SQL jobs listed on Indeed in the US alone.

So let's get started with learning SQL!

(If you'd prefer to learn interactively, writing and running SQL queries from your browser, you should sign up for free and check out our SQL Fundamentals course!)

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why watch video lectures when you can learn by doing?

In this tutorial we'll be working with a dataset from the bike-sharing service Hubway, which includes data on over 1.5 million trips made with the service.

hubway bike sharing

We'll start by looking a little bit at databases, what they are and why we use them, before starting to write some queries of our own in SQL.

If you'd like to follow along you can download the hubway.db file here (130 MB).

SQL Basics: Relational Databases

A relational database is a database that stores related information across multiple tables and allows you to query information in more than one table at the same time.

It's easier to understand how this works by thinking through an example. Imagine you're a business and you want to keep track of your sales information. You could set up a spreadsheet in Excel with all of the information you want to keep track of as separate columns: Order number, date, amount due, shipment tracking number, customer name, customer address, and customer phone number.

spreadsheet example sql tutorial

This setup would work fine for tracking the information you need to begin with, but as you start to get repeat orders from the same customer, you'll find that their name, address, and phone number get stored in multiple rows of your spreadsheet.

As your business grows and the number of orders you're tracking increases, this redundant data will take up unnecessary space and generally decrease the efficiency of your sales tracking system. You might also run into issues with data integrity. There's no guarantee, for example, that every field will be populated with the correct data type or that the name and address will be entered exactly the same way every time.

sql basics tutorial tables example

With a relational database, like the one in the above diagram, you avoid all of these issues. You could set up two tables, one for orders and one for customers. The 'customers' table would include a unique ID number for each customer, along with the name, address and phone number we were already tracking. The 'orders' table would include your order number, date, amount due, tracking number and, instead of a separate field for each item of customer data, it would have a column for the customer ID.

This enables us to pull up all of the customer info for any given order, but we only have to store it once in our database rather than listing it out again for every single order.
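
To make that structure concrete, here's a rough sketch of how the two tables might be defined in SQL. The exact column names and types here are illustrative, not taken from a real database:

CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,   -- unique ID for each customer
    name TEXT,
    address TEXT,
    phone TEXT
);

CREATE TABLE orders (
    order_number INTEGER PRIMARY KEY,
    order_date TEXT,
    amount_due REAL,
    tracking_number TEXT,
    customer_id INTEGER REFERENCES customers(customer_id)   -- links each order to a customer
);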

Our Data Set

Let's start by taking a look at our database. The database has two tables, trips and stations. To begin with, we'll just look at the trips table. It contains the following columns:

  • id — A unique integer that serves as a reference for each trip
  • duration — The duration of the trip, measured in seconds
  • start_date — The date and time the trip began
  • start_station — An integer that corresponds to the id column in the stations table for the station the trip started at
  • end_date — The date and time the trip ended
  • end_station — The 'id' of the station the trip ended at
  • bike_number — Hubway's unique identifier for the bike used on the trip
  • sub_type — The subscription type of the user. "Registered" for users with a membership, "Casual" for users without a membership
  • zip_code — The zip code of the user (only available for registered members)
  • birth_date — The birth year of the user (only available for registered members)
  • gender — The gender of the user (only available for registered members)

Our Analysis

With this information and the SQL commands we'll learn shortly, here are some questions that we'll try to answer over the course of this post:

  • What was the duration of the longest trip?
  • How many trips were taken by 'registered' users?
  • What was the average trip duration?
  • Do registered or casual users take longer trips?
  • Which bike was used for the most trips?
  • What is the average duration of trips by users over the age of 30?

The SQL commands we'll use to answer these questions are:

  • SELECT
  • WHERE
  • LIMIT
  • ORDER BY
  • GROUP BY
  • AND
  • OR
  • MIN
  • MAX
  • AVG
  • SUM
  • COUNT

Installation and Setup

For the purposes of this tutorial, we will be using SQLite, a lightweight database engine that we can access through Python's built-in sqlite3 module. SQLite has shipped with Python since version 2.5, so if you have Python installed you'll almost certainly have SQLite as well. Python and the sqlite3 module can easily be installed and set up with Anaconda if you don't already have them.

Using Python to run our SQL code allows us to import the results into a Pandas dataframe, which makes it easier to display our results in an easy-to-read format. It also means we can perform further analysis and visualization on the data we pull from the database, although that will be beyond the scope of this tutorial.

Alternatively, if we don't want to use or install Python, we can run SQLite3 from the command line. Simply download the "precompiled binaries" from the SQLite3 web page and use the following code to open the database:

~$ sqlite3 hubway.db
SQLite version 3.14.0 2016-07-26 15:17:14
Enter ".help" for usage hints.
sqlite>

From here we can just type in the query we want to run and we will see the data returned in our terminal window.

An alternative to using the terminal is to connect to the SQLite database via Python. This would allow us to use a Jupyter notebook, so that we could see the results of our queries in a neatly formatted table.

To do this, we'll define a function that takes our query (stored as a string) as an input and shows the result as a formatted dataframe:

import sqlite3
import pandas as pd

# Connect to the SQLite database file
db = sqlite3.connect('hubway.db')

# Run a query (passed in as a string) and return the results as a pandas dataframe
def run_query(query):
    return pd.read_sql_query(query, db)

Of course, we don't have to use Python with SQL. If you're an R programmer already, our SQL Fundamentals for R Users course would be a great place to start.

SELECT

The first command we'll work with is SELECT. SELECT will be the foundation of almost every query we write - it tells the database which columns we want to see. We can either specify columns by name (separated by commas) or use the wildcard * to return every column in the table.

In addition to the columns we want to retrieve, we also have to tell the database which table to get them from. To do this we use the keyword FROM followed by the name of the table. For example, if we wanted to see the start_date and bike_number for every trip in the trips table, we could use the following query:

SELECT start_date, bike_number FROM trips;

In this example, we started with the SELECT command so that the database knows we want it to find us some data. Then we told the database we were interested in the start_date and bike_number columns. Finally we used FROM to let the database know that the columns we want to see are part of the trips table.

One important thing to be aware of when writing SQL queries is that we'll want to end every query with a semicolon (;). Not every SQL database actually requires this, but some do, so it's best to form this habit.

LIMIT

The next command we need to know before we start to run queries on our Hubway database is LIMIT. LIMIT simply tells the database how many rows you want it to return.

The SELECT query we looked at in the previous section would return the requested information for every row in the trips table, but sometimes that could mean a lot of data. We might not want all of it. If, instead, we wanted to see the start_date and bike_number for the first five trips in the database, we could add LIMIT to our query as follows:

SELECT start_date, bike_number FROM trips LIMIT 5;

We simply added the LIMIT command and then a number representing the number of rows we want to be returned. In this instance we used 5, but you can replace that with any number to get the appropriate amount of data for the project you're working on.

We will use LIMIT a lot in our queries on the Hubway database in this tutorial — the trips table contains over 1.5 million rows of data and we certainly don't need to display all of them!

Let's run our first query on the Hubway database. First we will store our query as a string and then use the function we defined earlier to run it on the database. Take a look at the following example:

query = 'SELECT * FROM trips LIMIT 5;'
run_query(query)
id duration start_date start_station end_date end_station bike_number sub_type zip_code birth_date gender
0 1 9 2011-07-28 10:12:00 23 2011-07-28 10:12:00 23 B00468 Registered '97217 1976.0 Male
1 2 220 2011-07-28 10:21:00 23 2011-07-28 10:25:00 23 B00554 Registered '02215 1966.0 Male
2 3 56 2011-07-28 10:33:00 23 2011-07-28 10:34:00 23 B00456 Registered '02108 1943.0 Male
3 4 64 2011-07-28 10:35:00 23 2011-07-28 10:36:00 23 B00554 Registered '02116 1981.0 Female
4 5 12 2011-07-28 10:37:00 23 2011-07-28 10:37:00 23 B00554 Registered '97214 1983.0 Female

This query uses * as a wildcard instead of specifying columns to return. This means the SELECT command has given us every column in the trips table. We also used the LIMIT function to restrict the output to the first five rows of the table.

You will often see people capitalize the command keywords in their queries (a convention we'll follow throughout this tutorial), but this is mostly a matter of preference. Capitalization makes the code easier to read, but it doesn't actually affect the code's function in any way. If you prefer to write your queries with lowercase commands, they will still execute correctly.

Our previous example returned every column in the trips table. If we were only interested in the duration and start_date columns, we could replace the wildcard with the column names as follows:

query = 'SELECT duration, start_date FROM trips LIMIT 5'
run_query(query)
duration start_date
0 9 2011-07-28 10:12:00
1 220 2011-07-28 10:21:00
2 56 2011-07-28 10:33:00
3 64 2011-07-28 10:35:00
4 12 2011-07-28 10:37:00

ORDER BY

The final command we need to know before we can answer the first of our questions is ORDER BY. This command allows us to sort the database on a given column.

To use it, we simply specify the name of the column we would like to sort on. By default, ORDER BY sorts in ascending order. If we want to state the sort order explicitly, we can add the keyword ASC for ascending order or DESC for descending order.

For example, if we wanted to sort the trips table from the shortest duration to the longest we could add the following line to our query:

ORDER BY duration ASC
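
Here's a quick sketch (not a query we'll run below, just an illustration) of how that clause fits into a complete query:

SELECT duration, start_date
FROM trips
ORDER BY duration ASC
LIMIT 5;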

With the SELECT, LIMIT and ORDER BY commands in our repertoire, we can now attempt to answer our first question: What was the duration of the longest trip?

To answer this question, it's helpful to break it down into sections and identify which commands we will need to address each part.

First we need to pull the information from the duration column of the trips table. Then, to find which trip is the longest, we can sort the duration column in descending order. Here's how we might work this through to come up with a query that will get the information we're looking for:

  • Use SELECT to retrieve the duration column FROM the trips table
  • Use ORDER BY to sort the duration column and use the DESC keyword to specify that you want to sort in descending order
  • Use LIMIT to restrict the output to 1 row

Using these commands in this way will return the single row with the longest duration, which will provide us the answer to our question.

One more thing to note — as your queries add more commands and get more complicated, you may find it easier to read if you separate them onto multiple lines. This, like capitalization, is a matter of personal preference. It doesn't affect how the code runs (the system just reads the code from the beginning until it reaches the semicolon), but it can make your queries clearer and easier to follow. In Python, we can separate a string onto multiple lines by using triple quote marks.

Let's go ahead and run this query and find out how long the longest trip lasted.

query = '''
SELECT duration FROM trips
ORDER BY duration DESC
LIMIT 1;
'''
run_query(query)
duration
0 9999

Now we know that the longest trip lasted 9999 seconds, or a little over 166 minutes. With a maximum value of 9999, however, we don't know whether this is really the length of the longest trip or if the database was only set up to allow a four digit number.

If it's true that particularly long trips are being cut short by the database, then we might expect to see a lot of trips at 9999 seconds where they reach the limit. Let's try running the same query as before, but adjust the LIMIT to return the 10 highest durations to see if that's the case:

query = '''
SELECT duration FROM trips
ORDER BY duration DESC
LIMIT 10;
'''
run_query(query)
duration
0 9999
1 9998
2 9998
3 9997
4 9996
5 9996
6 9995
7 9995
8 9994
9 9994

What we see here is that there aren't a whole bunch of trips at 9999, so it doesn't look like we're cutting off the top end of our durations, but it's still difficult to tell whether that's the real length of the trip or just the maximum allowed value.

Hubway charges additional fees for rides over 30 minutes (somebody keeping a bike for 9999 seconds would have to pay an extra $25 in fees) so it's plausible that they decided 4 digits would be sufficient to track the majority of rides.

WHERE

The previous commands are great for pulling out sorted information for particular columns, but what if there is a specific subset of the data we want to look at? That's where WHERE comes in. The WHERE command allows us to use a logical operator to specify which rows should be returned. For example, you could use the following command to return every trip taken with bike B00400:

WHERE bike_number = "B00400"

You'll also notice that we use quote marks in this query. That's because the bike_number is stored as a string. If the column contained numeric data types the quote marks would not be necessary.
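
As a quick illustration (this particular query isn't part of our analysis), here's how that WHERE clause slots into a complete query:

SELECT duration, start_date
FROM trips
WHERE bike_number = "B00400"
LIMIT 5;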

Let's write a query that uses WHERE to return every column in the trips table for each row with a duration longer than 9990 seconds:

query = '''
SELECT * FROM trips
WHERE duration > 9990;
'''
run_query(query)
id duration start_date start_station end_date end_station bike_number sub_type zip_code birth_date gender
0 4768 9994 2011-08-03 17:16:00 22 2011-08-03 20:03:00 24 B00002 Casual
1 8448 9991 2011-08-06 13:02:00 52 2011-08-06 15:48:00 24 B00174 Casual
2 11341 9998 2011-08-09 10:42:00 40 2011-08-09 13:29:00 42 B00513 Casual
3 24455 9995 2011-08-20 12:20:00 52 2011-08-20 15:07:00 17 B00552 Casual
4 55771 9994 2011-09-14 15:44:00 40 2011-09-14 18:30:00 40 B00139 Casual
5 81191 9993 2011-10-03 11:30:00 22 2011-10-03 14:16:00 36 B00474 Casual
6 89335 9997 2011-10-09 02:30:00 60 2011-10-09 05:17:00 45 B00047 Casual
7 124500 9992 2011-11-09 09:08:00 22 2011-11-09 11:55:00 40 B00387 Casual
8 133967 9996 2011-11-19 13:48:00 4 2011-11-19 16:35:00 58 B00238 Casual
9 147451 9996 2012-03-23 14:48:00 35 2012-03-23 17:35:00 33 B00550 Casual
10 315737 9995 2012-07-03 18:28:00 12 2012-07-03 21:15:00 12 B00250 Registered '02120 1964 Male
11 319597 9994 2012-07-05 11:49:00 52 2012-07-05 14:35:00 55 B00237 Casual
12 416523 9998 2012-08-15 12:11:00 54 2012-08-15 14:58:00 80 B00188 Casual
13 541247 9999 2012-09-26 18:34:00 54 2012-09-26 21:21:00 54 T01078 Casual

As we can see, this query returned 14 different trips, each with a duration longer than 9990 seconds. Something that stands out about this query is that all but one of the results has a sub_type of "Casual". Perhaps this is an indication that "Registered" users are more aware of the extra fees for long trips. Maybe Hubway could do a better job of conveying their pricing structure to Casual users to help them avoid overage charges.

We can already see how even a beginner-level command of SQL can help us answer business questions and find insights in our data.

Returning to WHERE, we can also combine multiple logical tests in our WHERE clause using AND or OR. If, for example, in our previous query we had only wanted to return the trips with a duration over 9990 seconds that also had a sub_type of Registered, we could use AND to specify both conditions.

Here's another personal preference recommendation: use parentheses to separate each logical test, as demonstrated in the code block below. This isn't strictly required for the code to function, but parentheses make your queries easier to understand as you increase the complexity.

Let's run that query now. We already know it should only return one result, so it should be easy to check that we've got it right:

query = '''
SELECT * FROM trips
WHERE (duration >= 9990) AND (sub_type = "Registered")
ORDER BY duration DESC;
'''
run_query(query)
id duration start_date start_station end_date end_station bike_number sub_type zip_code birth_date gender
0 315737 9995 2012-07-03 18:28:00 12 2012-07-03 21:15:00 12 B00250 Registered '02120 1964.0 Male
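
We haven't used OR on its own yet, so here's a short sketch showing how it could return trips taken with either of two bikes. The bike numbers here are just examples:

SELECT duration, bike_number
FROM trips
WHERE (bike_number = "B00400") OR (bike_number = "B00550")
LIMIT 5;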

The next question we set out at the beginning of the post is "How many trips were taken by 'registered' users?" To answer it, we could run the same query as above and modify the WHERE expression to return all of the rows where sub_type is equal to 'Registered' and then count them up.

However, SQL actually has a built-in command to do that counting for us, COUNT.

COUNT allows us to shift the calculation to the database and save us the trouble of writing additional scripts to count up results. To use it, we simply include COUNT(column_name) instead of (or in addition to) the columns you want to SELECT, like this:

SELECT COUNT(id)
FROM trips

In this instance, it doesn't matter which column we choose to count because every column should have data for each row in our query. But sometimes a query might have missing (or "null") values for some rows. If we're not sure whether a column contains null values we can run our COUNT on the id column — the id column is never null, so we can be sure our count won't have missed anything.

We can also use COUNT(1) or COUNT(*) to count up every row in our query. It's worth noting that sometimes we might actually want to run COUNT on a column with null values. For example, we might want to know how many rows in our database have missing values for a column.
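
Here's a small sketch that illustrates the difference, using the gender column (which we know is missing for casual users):

SELECT COUNT(*) AS "All Trips",             -- counts every row
       COUNT(gender) AS "Trips With Gender" -- counts only rows where gender is not null
FROM trips;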

Let's take a look at a query to answer our question. We can use SELECT COUNT(*) to count up the total number of rows returned and WHERE sub_type = "Registered" to make sure we only count up the trips taken by Registered users.

query = '''
SELECT COUNT(*)
FROM trips
WHERE sub_type = "Registered";
'''
run_query(query)
COUNT(*)
0 1105192

This query worked, and has returned the answer to our question. But the column heading isn't particularly descriptive. If someone else were to look at this table, they wouldn't be able to understand what it meant. If we want to make our results more readable, we can use AS to give our output an alias (or nickname). Let's re-run the previous query but give our column heading an alias of Total Trips by Registered Users:

query = '''
SELECT COUNT(*) AS "Total Trips by Registered Users"
FROM trips
WHERE sub_type = "Registered";
'''
run_query(query)
Total Trips by Registered Users
0 1105192

Aggregate Functions

COUNT is not the only mathematical trick SQL has up its sleeves. We can also use SUM, AVG, MIN and MAX to return the sum, average, minimum and maximum of a column respectively. These, along with COUNT, are known as aggregate functions.
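
Here's a quick sketch (not one of our analysis questions) showing how several aggregate functions can be combined in a single query:

SELECT MIN(duration) AS "Shortest Trip",
       MAX(duration) AS "Longest Trip",
       SUM(duration) AS "Total Seconds Ridden"
FROM trips;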

So to answer our third question, "What was the average trip duration?", we can use the AVG function on the duration column (and, once again, use AS to give our output column a more descriptive name):

query = '''
SELECT AVG(duration) AS "Average Duration"
FROM trips;
'''
run_query(query)
Average Duration
0 912.409682

It turns out that the average trip duration is 912 seconds, which is about 15 minutes. This makes some sense, since we know that Hubway charges extra fees for trips over 30 minutes. The service is designed for riders to take short, one-way trips.

What about our next question, do registered or casual users take longer trips? We already know one way to answer this question — we could run two SELECT AVG(duration) FROM trips queries with WHERE clauses that restrict one to "Registered" and one to "Casual" users.
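
For reference, that two-query approach would look roughly like this (a sketch, not something we'll run here):

SELECT AVG(duration) AS "Average Duration (Registered)"
FROM trips
WHERE sub_type = "Registered";

SELECT AVG(duration) AS "Average Duration (Casual)"
FROM trips
WHERE sub_type = "Casual";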

Let's do it a different way, though. SQL also includes a way to answer this question in a single query, using the GROUP BY command.

GROUP BY

GROUP BY separates rows into groups based on the contents of a particular column and allows us to perform aggregate functions on each group.

To get a better idea of how this works, let's take a look at the gender column. Each row can have one of three possible values in the gender column, "Male", "Female" or Null (missing; we don't have gender data for casual users).

When we use GROUP BY, the database will separate out each of the rows into a different group based on the value in the gender column, in much the same way that we might separate a deck of cards into different suits. We can imagine making two piles, one of all the males, one of all the females.

Once we have our two separate piles, the database will perform any aggregate functions in our query on each of them in turn. If we used COUNT, for example, the query would count up the number of rows in each pile and return the value for each separately.
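
Sticking with the gender example, here's a minimal sketch of what that looks like in practice. Note that rows with a missing (null) gender would form a group of their own:

SELECT gender, COUNT(*) AS "Number of Trips"
FROM trips
GROUP BY gender;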

Let's walk through exactly how to write a query to answer our question of whether registered or casual users take longer trips.

  • As with each of our queries so far, we'll start with SELECT to tell the database which information we want to see. In this instance, we'll want sub_type and AVG(duration).
  • We'll also include GROUP BY sub_type to separate out our data by subscription type and calculate the averages of registered and casual users separately.

Here's what the code looks like when we put it all together:

query = '''
SELECT sub_type, AVG(duration) AS "Average Duration"
FROM trips
GROUP BY sub_type;
'''
run_query(query)
sub_type Average Duration
0 Casual 1519.643897
1 Registered 657.026067

That's quite a difference! On average, registered users take trips that last around 11 minutes whereas casual users are spending almost 25 minutes per ride. Registered users are likely taking shorter, more frequent trips, possibly as part of their commute to work. Casual users, on the other hand, are spending around twice as long per trip.

It's possible that casual users tend to come from demographics (tourists, for example) that are more inclined to take longer trips to make sure they get around and see all the sights. Once we've discovered this difference in the data, there are many ways the company might be able to investigate it to better understand what's causing it.

For the purposes of this tutorial, however, let's move on. Our next question was: which bike was used for the most trips? We can answer this using a very similar query. Take a look at the following example and see if you can figure out what each line is doing — we'll go through it step by step afterwards so you can check that you got it right:

query = '''
SELECT bike_number as "Bike Number", COUNT(*) AS "Number of Trips"
FROM trips
GROUP BY bike_number
ORDER BY COUNT(*) DESC
LIMIT 1;
'''
run_query(query)
Bike Number Number of Trips
0 B00490 2120

As you can see from the output, bike B00490 took the most trips. Let's run through how we got there:

  • The first line is a SELECT clause to tell the database we want to see the bike_number column and a count of every row. It also uses AS to tell the database to display each column with a more useful name.
  • The second line uses FROM to specify that the data we're looking for is in the trips table.
  • The third line is where things start to get a little tricky. We use GROUP BY to tell the COUNT function on line 1 to count up each value for bike_number separately.
  • On line four we have an ORDER BY clause to sort the table in descending order and make sure our most-used bike is at the top.
  • Finally we use LIMIT to restrict the output to the first row, which we know will be the bike that was used in the highest number of trips because of how we sorted the data on line four.

Arithmetic Operators

Our final question is a little more tricky than the others. We want to know the average duration of trips by registered members over the age of 30.

We could just figure out in our heads the year in which 30-year-olds were born and then plug it in, but a more elegant solution is to use arithmetic operations directly within our query. SQL allows us to use +, -, * and / to perform an arithmetic operation on an entire column at once.

query = '''
SELECT AVG(duration) FROM trips
WHERE (2017 - birth_date) > 30;
'''
run_query(query)
AVG(duration)
0 923.014685
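
Only registered users have a birth_date in this dataset, so the result above is effectively limited to registered members already (2017 minus a null birth_date is null, so those rows fail the comparison). If we wanted to make that restriction explicit, a sketch like this would also work:

SELECT AVG(duration) AS "Average Duration"
FROM trips
WHERE (2017 - birth_date) > 30
  AND (sub_type = "Registered");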

JOIN

So far we've been looking at queries that only pull data from the trips table. However, one of the reasons SQL is so powerful is that it allows us to pull data from multiple tables in the same query.

Our bike-sharing database contains a second table, stations. The stations table contains information about every station in the Hubway network and includes an id column that is referenced by the trips table.

Before we start to work through some real examples from this database, though, let's look back at the hypothetical order tracking database from earlier. In that database we had two tables, orders and customers, and they were connected by the customer_id column.

Let's say we wanted to write a query that returned the order_number and name for every order in the database. If they were both stored in the same table, we could use the following query:

SELECT order_number, name
FROM orders;

Unfortunately, the order_number column and the name column are stored in two different tables, so we have to add a few extra steps. Let's take a moment to think through the additional things the database will need to know before it can return the information we want:

  • Which table is the order_number column in?
  • Which table is the name column in?
  • How is the information in the orders table connected to the information in the customers table?

To answer the first two of these questions, we can include the table names for each column in our SELECT command. The way we do this is simply to write the table name and the column name separated by a period (.). For example, instead of SELECT order_number, name we would write SELECT orders.order_number, customers.name. Adding the table names here helps the database find the columns we're looking for by telling it which table to look in for each.

To tell the database how the orders and customers tables are connected, we use JOIN and ON. JOIN specifies which tables should be connected and ON specifies which columns in each table are related.

We're going to use an inner join, which means that rows will only be returned where there is a match in the columns specified in ON. In this example, we will want to use JOIN on whichever table we didn't include in the FROM command. So we can either use FROM orders INNER JOIN customers or FROM customers INNER JOIN orders.

As we discussed earlier, these tables are connected on the customer_id column in each table. Therefore, we will want to use ON to tell the database that these two columns refer to the same thing like this:

ON orders.customer_id = customers.customer_id

Once again we use the . to make sure the database knows which table each of these columns is in. So when we put all of this together, we get a query that looks like this:

SELECT orders.order_number, customers.name
FROM orders
INNER JOIN customers
ON orders.customer_id = customers.customer_id

This query will return the order number of every order in the database along with the customer name that is associated with each.

Returning to our Hubway database, we can now write some queries to see JOIN in action.

Before we get started, we should take a look at the rest of the columns in the stations table. Here's a query to show us the first 5 rows so we can see what the stations table looks like:

query = '''
SELECT * FROM stations
LIMIT 5;
'''
run_query(query)
id station municipality lat lng
0 3 Colleges of the Fenway Boston 42.340021 -71.100812
1 4 Tremont St. at Berkeley St. Boston 42.345392 -71.069616
2 5 Northeastern U / North Parking Lot Boston 42.341814 -71.090179
3 6 Cambridge St. at Joy St. Boston 42.361284999999995 -71.06514
4 7 Fan Pier Boston 42.353412 -71.044624

  • id — A unique identifier for each station (corresponds to the start_station and end_station columns in the trips table)
  • station — The station name
  • municipality — The municipality that the station is in (Boston, Brookline, Cambridge or Somerville)
  • lat — The latitude of the station
  • lng — The longitude of the station

Like before, we'll try to answer some questions with this data:

  • Which station is the most frequent starting point?
  • Which stations are most frequently used for round trips?
  • How many trips start and end in different municipalities?

Let's start with the first of these: which station is the most frequent starting point? We'll work through it step by step:

  • First we want to use SELECT to return the station column from the stations table and the COUNT of the number of rows.
  • Next we specify the tables we want to JOIN and tell the database to connect them ON the start_station column in the trips table and the id column in the stations table.
  • Then we get into the meat of our query - we GROUP BY the station column in the stations table so that our COUNT will count up the number of trips for each station separately
  • Finally, we can ORDER BY our COUNT and LIMIT the output to a manageable number of results.

Here's the full query:
query = '''
SELECT stations.station AS "Station", COUNT(*) AS "Count"
FROM trips INNER JOIN stations
ON trips.start_station = stations.id
GROUP BY stations.station
ORDER BY COUNT(*) DESC
LIMIT 5;
'''
run_query(query)
Station Count
0 South Station - 700 Atlantic Ave. 56123
1 Boston Public Library - 700 Boylston St. 41994
2 Charles Circle - Charles St. at Cambridge St. 35984
3 Beacon St / Mass Ave 35275
4 MIT at Mass Ave / Amherst St 33644

If you're familiar with Boston, you'll understand why these are the most popular stations. South Station is one of the main commuter rail stations in the city, Charles Street runs along the river close to some nice scenic routes, and Boylston and Beacon streets are right downtown near a number of office buildings.

The next question we'll look at is which stations are most frequently used for round trips? We can use much the same query as before. We will SELECT the same output columns and JOIN the tables in the same way, but this time we'll add a WHERE clause to restrict our COUNT to trips where the start_station is the same as the end_station.

query = '''
SELECT stations.station AS "Station", COUNT(*) AS "Count"
FROM trips INNER JOIN stations
ON trips.start_station = stations.id
WHERE trips.start_station = trips.end_station
GROUP BY stations.station
ORDER BY COUNT(*) DESC
LIMIT 5;
'''
run_query(query)
Station Count
0 The Esplanade - Beacon St. at Arlington St. 3064
1 Charles Circle - Charles St. at Cambridge St. 2739
2 Boston Public Library - 700 Boylston St. 2548
3 Boylston St. at Arlington St. 2163
4 Beacon St / Mass Ave 2144

As we can see, a number of these stations are the same as the previous question but the amounts are much lower. The busiest stations are still the busiest stations, but the lower numbers overall suggest that people are typically using Hubway bikes to get from point A to point B rather than cycling around for a while before returning to where they started.

There is one significant difference here — the Esplanade, which was not one of the overall busiest stations in our first query, appears to be the busiest for round trips. Why? Well, a picture is worth a thousand words. This certainly looks like a nice spot for a bike ride:

esplanade

On to the next question: how many trips start and end in different municipalities? This one takes things a step further. To answer it, we need to JOIN the trips table to the stations table twice: once ON the start_station column and then ON the end_station column.

In order to do this, we have to create an alias for the stations table so that we are able to differentiate between data that relates to the start_station and data that relates to the end_station. We can do this in exactly the same way we've been creating aliases for individual columns to make them display with a more intuitive name, using AS.

For example we can use the following code to JOIN the stations table to the trips table using an alias of 'start'. We can then combine 'start' with our column names using . to refer to data that comes from this specific JOIN (rather than the second JOIN we will do ON the end_station column):

INNER JOIN stations AS start ON trips.start_station = start.id

Here's what the final query will look like when we run it. Note that we've used <> to represent "is not equal to", but != would also work.

query = '''
SELECT COUNT(trips.id) AS "Count"
FROM trips INNER JOIN stations AS start
ON trips.start_station = start.id
INNER JOIN stations AS end
ON trips.end_station = end.id
WHERE start.municipality <> end.municipality;
'''
run_query(query)
Count
0 309748

This shows that about 300,000 out of 1.5 million trips (or 20%) ended in a different municipality than they started — further evidence that people mostly use Hubway bicycles for relatively short journeys rather than longer trips between towns.

If you've made it this far, congratulations! You've begun to master the basics of SQL. We have covered a number of important commands, SELECT, LIMIT, WHERE, ORDER BY, GROUP BY and JOIN, as well as aggregate and arithmetic functions. These will give you a strong foundation to build on as you continue your SQL journey.

You've mastered the SQL basics. Now what?

After finishing this beginner SQL tutorial, you should be able to pick up a database you find interesting and write queries to pull out information. A good first step might be to continue working with the Hubway database to see what else you can find out. Here are some other questions you might want to try and answer:

  • How many trips incurred additional fees (lasted longer than 30 minutes)? (See the sketch after this list for one way to start.)
  • Which bike was used for the longest total time?
  • Did registered or casual users take more round trips?
  • Which municipality had the longest average duration?
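
As a nudge in the right direction for the first of those questions, here's a minimal sketch. It assumes the extra fees kick in after 1,800 seconds (30 minutes):

SELECT COUNT(*) AS "Trips Over 30 Minutes"
FROM trips
WHERE duration > 1800;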

If you would like to take things a step further, check out our interactive SQL courses, which cover everything you'll need to know from beginning to advanced-level SQL for data analyst and data scientist jobs.

You also might want to read our post about exporting the data from your SQL queries into Pandas or check out our SQL Cheat Sheet and our article on SQL certification.

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why watch video lectures when you can learn by doing?

The post SQL Basics — Hands-On Beginner SQL Tutorial Analyzing Bike-Sharing appeared first on Dataquest.

]]>
Want a Job in Data? Learn SQL. https://www.dataquest.io/blog/why-sql-is-the-most-important-language-to-learn/ Fri, 29 Jan 2021 10:00:00 +0000 https://dq.t79ae38x-liquidwebsites.com/2018/01/17/why-sql-is-the-most-important-language-to-learn/ Learning SQL might not be as "sexy" as learning Python or R, but it's a fundamental skill for almost every data scientist and data analyst job. Here's why.

The post Want a Job in Data? Learn SQL. appeared first on Dataquest.

]]>
why-learn-sql

Why do you need to learn SQL?

1. SQL is used everywhere.
2. It’s in high demand because so many companies use it.
3. SQL is still the most popular language for data work in 2021.

COVID-19 Update: Is this career-related advice still relevant?

Yes! While the COVID-19 pandemic has dramatically increased the number of data professionals working from home, it hasn't changed the way companies store their data — which is still mostly in SQL-based database systems.


SQL is old. There, I said it.

I first heard about SQL in 1997. I was in high school, and as part of a computing class we were working with databases in Microsoft Access. The computers we used were outdated, and the class was boring. Even then, it seemed that SQL was ancient.

SQL dates back almost 50 years to 1970 when Edgar Codd, a computer scientist working for IBM, wrote a paper describing a new system for organizing data in databases. By the end of the decade, several prototypes of Codd’s system had been built, and a query language — the Structured Query Language (SQL) — was born to interact with these databases.

In the years since, it has been widely adopted. For decades, learning SQL — which can be pronounced either “sequel” or “S.Q.L.”, by the way — has been a rite of passage for programmers who need to work with databases.

But why should someone who wants to get a job in data spend time learning this ‘ancient’ language in 2021?

Why not spend all your time mastering Python/R, or focusing on ‘sexier’ data skills, like Deep Learning, Scala, and Spark?

While knowing the fundamentals of a more general-purpose language like Python or R is critical, ignoring SQL will make it much harder to get a job in data. Here are three key reasons why:

1. SQL is everywhere

Almost all of the biggest names in tech use SQL. Uber, Netflix, Airbnb — the list goes on. Even within companies like Facebook, Google, and Amazon, which have built their own high-performance database systems, data teams use SQL to query data and perform analysis.

Companies using SQL
Image: StackShare.io

And it’s not just tech companies: companies big and small use SQL. A quick job search on LinkedIn, for example, will show you that more companies are looking for SQL skills than are looking for Python or R skills. SQL may be old, but it’s ubiquitous.

Data Scientist and former Dataquest student Vicknesh got his first job as a Data Analyst. He quickly found himself using SQL daily: “SQL is so pervasive, it permeates everything here. It’s like the SQL syntax persists through time and space. Everything uses SQL or a derivative of SQL.”

vicknesh-quote-using-sql-job

2. SQL is in demand

If you want to get a job in data, your focus should be the skills that employers want.

To demonstrate the importance of SQL specifically in data-related jobs, in early 2021 I analyzed more than 32,000 data jobs advertised on Indeed, looking at key skills mentioned in job ads with ‘data’ in the title. 

skills required for jobs in data, as listed on linkedin. sql is the most in-demand.

SQL is the most in-demand technical skill for data jobs. (Data: Indeed.com, 1/29/2021)

As we can see, SQL is the most in-demand skill among all jobs in data, appearing in 42.7% of all job postings. 

Interestingly, the proportion of data jobs listing SQL actually seems to be increasing! When I performed this same analysis in 2017, SQL was also the most in-demand skill, but it was listed in 35.7% of ads. 

If you're looking for your first job in data, it turns out knowing SQL is even more critical.

Most entry-level jobs in data are Data Analyst roles, so I took a look at job ads with ‘data analyst’ in the title, and those numbers are even more conclusive:

Skills listed in data analyst job posts, SQL is the most in-demand skill

SQL is easily the most in-demand skill for Data Analyst roles. (Data: Indeed.com, 1/29/2021)

For data analyst roles, SQL is again the most in-demand skill, listed in 57.4% of all data analyst jobs. SQL appears in 1.5 times as many "data analyst" job postings as Python, and nearly 2.5 times as many job postings as R.

There's no doubt that if you're looking for a role as a data analyst, learning SQL should be at the top of your to-do list.

In fact, even if you're interested in more advanced roles, SQL skills are critical.

I performed the same analysis on "Data Scientist" and "Data Engineer" job postings, and while SQL isn't the top skill for either of those jobs, it's still listed in 58.2% of data scientist job postings, and 56.4% of data engineer job postings.

skills listed for data scientists and data engineers

SQL is listed in more than half of all DS (left) and DE (right) job roles (Data: Indeed.com, 1/29/2021)

That means that even if you're a Python master already, you're going to miss out on 3 out of 5 data science and data engineer job openings unless you've got SQL skills on your resume, too.

Long story short: yes, you need to learn SQL, for any role in the data science industry. (You do not need a SQL certification, though!)

It will not only make you more qualified for these jobs, it will set you apart from other candidates who’ve only focused on the “sexy” stuff like machine learning in Python.

3. SQL is still the top language for data work

SQL is more popular among data scientists and data engineers than even Python or R. In fact, it's one of the most-used languages in the entire tech industry!

In the chart below, the "most used" technologies from StackOverflow’s 2020 developer survey, we can see that SQL eclipses even Python in terms of popularity. In fact, it's the third-most-popular language among all developers:

sql use among all developers

Source: StackOverflow 2020 Developer Survey

But we're concerned specifically with jobs within the field of data science, so let's filter things down a little further. If we dig into the raw data from the 2020 survey, we find that SQL is even more important and widely-used in the context of data jobs.

In the complete dataset, which StackOverflow has released here, we can see that among developers who work with data (including data scientists, data analysts, database administrators, data engineers, etc.), more than 70% use SQL — more than any other language.

What Languages Do People with Jobs in Data Use

Data Source: StackOverflow 2020 Survey

And if we filter down still further, into just data scientists and analysts, we can see that SQL is still the most popular technology. 65% of data scientists and data analysts said they used SQL, compared to 64% for Python, and 28% for R.

What Languages Do People with Data Scientist_Analyst Jobs Use

Data Source: StackOverflow 2020 Survey

In other words: SQL is the most-used language in data science, according to the 10,000+ data professionals who responded to StackOverflow's 2020 survey.

Despite lots of hype around NoSQL, Hadoop, and other technologies, SQL remains the most popular language for data work, and one of the most popular languages for developers of all stripes.

So, what’s the best way to learn SQL?

Now that we know why we should learn SQL, the obvious question is: how?

There are literally thousands of SQL courses online, but most of them don’t prepare you for using SQL in the real world. The best way to illustrate this is to look at the queries they teach you to write:

The way most courses teach SQL

The queries above demonstrate the complexity of the SQL taught at the end of SQL courses by three of the more popular online learning sites. The problem is that real-world SQL doesn’t look like that. Real-world SQL looks like this:

What SQL actually looks like

When you’re answering business questions with data, you often write SQL queries that need to combine data from lots of tables, wrangling it into its final form.

The end result is students finding themselves unprepared to get the jobs they want, just like this recent post from a data science forum:

complex_sql_slack-2

What we’re doing about it

Here at Dataquest, we believe that SQL competency is one of the key skills for anyone who wants to get a job in data.

We’re not suggesting you learn SQL instead of Python and/or R, but instead thoroughly learn SQL as your second language — becoming familiar with writing queries at a high level.

We understand that learning SQL is incredibly important for data science, and that’s why we offer a number of interactive SQL courses in our Data Analyst and Data Scientist paths. Our Data Engineering path also includes a couple of unique SQL courses.

We've also put together a downloadable SQL Cheat Sheet as a useful reference for the SQL basics.

Our interactive courses are written with the goal of equipping our students with the skills they need at the level they’ll need. You won’t spend time watching videos — instead, you’ll be writing your first queries in minutes, and be on your way to mastering the most important data skill.

While we start from zero, our courses go beyond the basics so you can become a SQL master. As an example, the ‘real-life’ SQL image above is taken from our SQL Intermediate course.

You can sign up and complete the first mission in each course for free, and we encourage you to try them out and let us know what you think.

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why passively watch video lectures when you can learn by doing?

We Love SQL!

I hope I’ve persuaded you that mastering SQL is key to starting your career in data. While it’s easy to be distracted by the latest and greatest new language or framework, learning SQL will pay dividends on your path to break into the data industry.

It might just be the most important language you learn.

The post Want a Job in Data? Learn SQL. appeared first on Dataquest.

]]>
SQL Cheat Sheet — SQL Reference Guide for Data Analysis https://www.dataquest.io/blog/sql-cheat-sheet/ Wed, 20 Jan 2021 21:35:26 +0000 https://www.dataquest.io/?p=27055 Whether you’re learning SQL through one of our interactive SQL courses or by some other means, it can be really helpful to have a SQL cheat sheet.Bookmark this article, or download and print the PDF, and keep it handy for quick reference the next time you’re writing an SQL query!Our SQL cheat sheet goes a […]

The post SQL Cheat Sheet — SQL Reference Guide for Data Analysis appeared first on Dataquest.

]]>

Whether you’re learning SQL through one of our interactive SQL courses or by some other means, it can be really helpful to have a SQL cheat sheet.

Bookmark this article, or download and print the PDF, and keep it handy for quick reference the next time you’re writing an SQL query!

Our SQL cheat sheet goes a bit more in-depth than this handwritten one!

Need to brush up on your SQL before you're ready for the cheat sheet? Check out our interactive online SQL Fundamentals course, read about why you should learn SQL, or do some research about SQL certifications and whether you'll need one.

SQL Basics

SQL stands for Structured Query Language. It is a system for querying — requesting, filtering, and outputting — data from relational databases.

Developed in the 1970s, SQL was originally called SEQUEL. For this reason, today it is sometimes pronounced “Sequel” and sometimes pronounced “S.Q.L.” Either pronunciation is acceptable.

Although there are many “flavors” of SQL, SQL in some form can be used for querying data from most relational database systems, including MySQL, SQLite, Oracle, Microsoft SQL Server, PostgreSQL, IBM DB2, Microsoft Azure SQL Database, Apache Hive, and others.

SQL Cheat Sheet: Fundamentals

Performing calculations with SQL

Performing a single calculation:
SELECT 1320+17;

Performing multiple calculations:
SELECT 1320+17, 1340-3, 7*191, 8022/6;

Performing calculations with multiple numbers:
SELECT 1*2*3, 1+2+3;

Renaming results:
SELECT 2*3 AS mult, 1+2+3 AS nice_sum;


Selecting tables, columns, and rows:

Remember: The order of clauses matters in SQL. SQL uses the following order of precedence: FROM, SELECT, LIMIT.

Display the whole table:

SELECT *
  FROM table_name;

Select specific columns from a table:

SELECT column_name_1, column_name_2
  FROM table_name;

Display the first 10 rows on a table:

SELECT *
  FROM table_name
  LIMIT 10;

Adding comments to your SQL queries

Adding single-line comments:

-- First comment
SELECT column_1, column_2, column_3 -- Second comment
  FROM table_name; -- Third comment

Adding block comments:

/*
This comment
spans over
multiple lines
 */
SELECT column_1, column_2, column_3
  FROM table_name;

SQL Intermediate: Joins & Complex Queries

Many of these examples use table and column names from the real SQL databases that learners work with in our interactive SQL courses. For more information, sign up for a free account and try one out!


Joining data in SQL:

Joining tables with INNER JOIN:

SELECT column_name_1, column_name_2 FROM table_name_1
INNER JOIN table_name_2 ON table_name_1.column_name_1 = table_name_2.column_name_1;

Joining tables using a LEFT JOIN:

SELECT * FROM facts
LEFT JOIN cities ON cities.facts_id = facts.id;

Joining tables using a RIGHT JOIN:

SELECT f.name country, c.name city
FROM cities c
RIGHT JOIN facts f ON f.id = c.facts_id;

Joining tables using a FULL OUTER JOIN:

SELECT f.name country, c.name city
FROM cities c
FULL OUTER JOIN facts f ON f.id = c.facts_id;

Sorting a column without specifying a column name:

SELECT name, migration_rate FROM facts
ORDER BY 2 desc; -- 2 refers to migration_rate column

Using a join within a subquery, with a limit:

SELECT c.name capital_city, f.name country
FROM facts f
INNER JOIN (
    SELECT * FROM cities
    WHERE capital = 1
    ) c ON c.facts_id = f.id
LIMIT 10;

Joining data from more than two tables:

SELECT [column_names] FROM [table_name_one]
    [join_type] JOIN [table_name_two] ON [join_constraint]
    [join_type] JOIN [table_name_three] ON [join_constraint]
    ...
    [join_type] JOIN [table_name_n] ON [join_constraint];

Other common SQL operations:

Combining columns into a single column:

SELECT
		album_id,
		artist_id,
		"album id is " || album_id col_1,
		"artist id is " || artist_id col2,
		album_id || artist_id col3
FROM album LIMIT 3;

Matching part of a string:

SELECT
	first_name,
	last_name,
	phone
FROM customer
WHERE first_name LIKE "%Jen%";

Using if/then logic in SQL with CASE:

CASE
	WHEN [comparison_1] THEN [value_1]
	WHEN [comparison_2] THEN [value_2]
	ELSE [value_3]
	END
AS [new_column_name]
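
As a concrete illustration of the template above, here's a minimal sketch using the customer table from the earlier examples. The "Domestic"/"International" labels are made up for the sake of the example:

SELECT
    first_name,
    last_name,
    CASE
        WHEN country = "USA" THEN "Domestic"
        ELSE "International"
        END AS customer_type
FROM customer
LIMIT 5;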

Using the WITH clause:

WITH track_info AS
(
    SELECT
        t.name,
        ar.name artist,
        al.title album_name
    FROM track t
    INNER JOIN album al ON al.album_id = t.album_id
    INNER JOIN artist ar ON ar.artist_id = al.artist_id
)
SELECT * FROM track_info
WHERE album_name = "Jagged Little Pill";

Creating a view:

CREATE VIEW chinook.customer_2 AS
SELECT * FROM chinook.customer;

Dropping a view:

DROP VIEW chinook.customer_2;

Selecting rows that occur in one or more SELECT statements:

[select_statement_one]
UNION
[select_statement_two];

Selecting rows that occur in both SELECT statements:

SELECT * from customer_usa
INTERSECT
SELECT * from customer_gt_90_dollars;

Selecting rows that occur in the first SELECT statement but not the second SELECT statement:

SELECT * from customer_usa
EXCEPT
SELECT * from customer_gt_90_dollars;

Chaining WITH statements:

WITH
usa AS
	(
	SELECT * FROM customer
	WHERE country = "USA"
	),
last_name_g AS
	(
	SELECT * FROM usa
	WHERE last_name LIKE "G%"
	),
state_ca AS
	(
	SELECT * FROM last_name_g
	WHERE state = "CA"
	)
SELECT
	first_name,
	last_name,
	country,
	state
FROM state_ca

Important Concepts and Resources:

Reserved words

Reserved words are words that cannot be used as identifiers (such as variable names or function names) in a programming language, because they have a specific meaning in the language itself. Here is a list of reserved words in SQL.

Download the SQL Cheat Sheet PDF

Click on the button below to download the cheat sheet (PDF, 3 MB, color).

Looking for more than just a quick reference? Dataquest's interactive SQL courses will help you get hands-on with SQL as you learn to build the complex queries you'll need to write for real-world data work.

Click the button below to sign up for a free account and start learning right now!

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why watch video lectures when you can learn by doing?

The post SQL Cheat Sheet — SQL Reference Guide for Data Analysis appeared first on Dataquest.

]]>
Do You Need a SQL Certification to Get a Data Job in 2021? https://www.dataquest.io/blog/sql-certification/ Tue, 19 Jan 2021 23:17:27 +0000 https://www.dataquest.io/?p=27022 If you want to work in data, do you need a SQL certification? That’s a question that can be difficult to answer, especially with different organizations pushing to get you to spend money on their certificate programs. Table Of Contents (click to expand) 1Do you need to learn SQL? Yes.2Do you need a SQL certificate? […]

The post Do You Need a SQL Certification to Get a Data Job in 2021? appeared first on Dataquest.

]]>

If you want to work in data, do you need a SQL certification? That’s a question that can be difficult to answer, especially with different organizations pushing to get you to spend money on their certificate programs.

sql-certification

Will getting a SQL certification actually help you get a job?

Do you need to learn SQL? Yes.

Before we dive into the certification question, it’s worth asking: do you need to learn SQL at all to get a job in data?

Short answer: yes, you do. SQL is an absolutely critical skill for working with data. It may be a decades-old language, but it’s still as relevant as ever.

In fact, we recently crunched some numbers from 2020 and found that SQL is still the most commonly-used language in data science, ahead of even Python or R!

So you need to learn SQL. That begs another question: do you need a SQL certificate of some kind? Is having a certification going to be helpful when it’s time to apply for jobs?

Do you need a SQL certificate? It depends.

If you aspire to work as a data analyst or data scientist, the answer is no, you do not need a SQL certificate.

You certainly need SQL skills for these jobs, but certification won’t be required. In fact, it probably won’t even help.

Case in point: I recently interviewed data science hiring managers, recruiters, and other professionals for a data science career guide, asking them about the skills and qualifications they wanted to see in great data analyst and data scientist job candidates. The transcripts of these interviews, put together, cover nearly 200 pages!

Throughout those 200 pages, the term “SQL” is mentioned a lot, because that’s a skill that most hiring managers want to see. But “certification” and “certificate”? Those words don’t appear in the transcripts at all. Not a single person I spoke to thought certificates were important enough to even mention them in the context of data analyst and data scientist jobs.

In other words: the people who hire data analysts and data scientists typically don’t care about certifications. Having a SQL certificate on your resume isn’t likely to impact their decision making.

employer who is skeptical

You can put a SQL certificate on your resume, but you can't make employers care about it.

You may wonder: why not? The short answer is that there’s no “standard” certification for the SQL that’s required for these roles. And there are so many different online and offline SQL certification options that employers can’t really assess whether or not they’re meaningful. It’s easier for employers to simply look at an applicant’s project portfolio — that’s a much more tangible, trustworthy representation of their SQL skills than a certification.

(Many data science employers also incorporate a SQL skills test or SQL interview questions into their hiring process so that they get an even clearer picture of your SQL skills before making a hiring decision.)

If you aspire to work in something closer to database administration, or you’re looking at a very specific company or industry, it gets a little bit blurrier. There are many “flavors” of SQL tied to different database systems and tools. There may be official certifications associated with the specific flavor of SQL a company uses that are valuable, or even mandatory.

For example, if you’re applying for a database job at a company that uses Microsoft’s SQL Server, getting one of Microsoft’s Azure Database Administrator certificates could be helpful. If you’re applying for a job at a company that uses Oracle, getting an Oracle Database SQL certification may be required.

But again, in data analysis and data science roles, these kinds of certifications are rarely required. The different sub-flavors of SQL rarely differ too much from “base” SQL, and employers generally won’t be concerned about whether you’ve mastered the minutiae of a particular brand’s proprietary tweaks.

They just want to see proof that you’ve got the fundamental SQL skills required to access and filter the data you need. Certifications don’t really prove that you have a particular skill, so the best way to demonstrate your SQL knowledge on a job application is to include projects that show off your SQL skill, not to list certifications.

Is there an “official” SQL certificate?

In a word, no. While there are official certifications for some of the proprietary sub-flavors of SQL and SQL-based database technologies, there’s no official certification for SQL itself.

There are, of course, a wide variety of unofficial SQL certifications with varying levels of quality, rigor, and price. But because there’s no official certification, or even a widely-accepted standard, none of these certifications are particularly useful because employers don’t trust them as proof of SQL skill.

Will getting a certificate help with the job hunt?

Yes and no. As mentioned above, having the right certification can definitely help for database jobs.

For data analyst and data scientist jobs, the certification itself typically is not helpful. Employers aren’t looking for a certificate on your resume, and they’re not likely to care whether or not they see one.

I spoke to a wide variety of employers about what makes a great data science resume, and over nearly 200 pages of interview transcripts, not one of them mentioned wanting to see certifications.

They do want to see proof of SQL skills, though, and this means that SQL certifications can be very useful if they’re teaching you the things you need to know. In that sense, getting a certification can help you with the job hunt — but it’s important to remember that the value you’re getting is in the skills you’re learning.

this is a stock photo that is not related to sql specifically

The code in this image is not SQL, but it's a pretty cool-looking stock photo, right?

Is getting a SQL certification worth it for data science?

It depends on whether the certification program is teaching you valuable skills or just giving you a bullet point for your LinkedIn. The former can be worth it; the latter is definitely not.

Price is also an important consideration. Even if you have the money to spend thousands on a SQL certification, there’s no good reason to pay that much when you can learn SQL interactively and get certified for a much lower price on platforms like Dataquest.

Of course, the best way to determine if something is worth it is always to try it for yourself. At Dataquest, you can sign up for a free account and dive right into learning SQL. That way, you’ll know you’re making the right decision when you decide to invest in learning SQL skills with us.

How can you learn SQL?

Learning SQL on your own can be challenging, because to actually practice anything you’re learning, you’ll need to find data, set up a local SQL database, and figure out how to connect to it. That creates a lot of up-front work, especially if you don’t have any prior experience with SQL!

At Dataquest, we have interactive online SQL courses that allow you to write and run real queries right in your browser. There’s no need to download anything or set anything up locally. Just sign up (it’s free) and you’ll be writing and running your first query in less than five minutes.

And unlike some of the other platforms out there, we don’t just drop you into the deep end after covering the simple queries — we’ll walk you through the more complex SQL queries that are a part of everyday data science work in the real world.

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why watch video lectures when you can learn by doing?

If you already know some SQL, you might also find this SQL cheat sheet we put together helpful!

What SQL certificate is best?

As we’ve mentioned above, there’s a good chance you don’t need a SQL certificate. But if you do feel you need one, or you’d just like to have one, here are some of the best certifications and the specific reasons you might want to have them.

  • Dataquest’s SQL courses. These are great options for learning SQL for data science and data analysis. They’ll take you hands-on with real SQL databases, and show you how to write queries to pull, filter, and analyze the data you need. All of our SQL courses offer certifications that you can add to your LinkedIn after you’ve completed them. They also include guided projects that you can complete and add to your GitHub and resume!
  • MTA: Database Fundamentals. This is a Microsoft certification that covers some of the fundamentals of SQL for database administration. It is focused on Microsoft’s SQL Server product, but many of the skills it covers will be relevant to other SQL-based relational database systems.
  • Microsoft’s Azure Database Administrator certificate. This is a great option if you’re applying to database administrator jobs at companies that use Microsoft SQL Server. Note that if you’re looking into this, you’ll see mentions of Microsoft’s MCSA certification, too. However, that certification is older, and is being retired in January 2021. The Azure certification linked above is the newest and most relevant certification related to Microsoft SQL Server.
  • Oracle Database SQL certification. This could be a good certification for anyone who’s interested in database jobs at companies that use Oracle.
  • Koenig SQL certifications. Koenig offers a variety of SQL-related certification programs, although they tend to be quite pricey (over US$1,000 for most programs). Most of these certifications are specific to particular database technologies (like Microsoft SQL Server) rather than being aimed at building general SQL knowledge, and would probably be best for those who know they’ll need training in a specific type of database for a job as a database administrator.

Should I get a university, edX, or Coursera certification in SQL?

One option that can seem appealing if you’re interested in a more general SQL certification is getting certified through a university-affiliated program, either online or in person. For example, there’s a Stanford program at edX, programs affiliated with UC Davis and the University of Michigan at Coursera, etc.

These programs can seem to offer some of the cachet of a university degree without the expense or the time commitment. Unfortunately, hiring managers don’t see them that way.

stanford university campus

This is Stanford University. Unfortunately, getting a Stanford certificate from edX will not trick employers into thinking you went here.

Employers know that a Stanford certificate and a Stanford degree, for example, are very different things. Even if the certificate program uses video lectures from real courses, employers know that certificate programs rarely include rigorous testing or project assessment. Often, they don’t even do anything to verify student identities!

Most online university certificate programs follow a basic formula:

  • You watch video lectures to learn the material.
  • You take multiple-choice or fill-in-the-blank quizzes to test your knowledge.
  • If you complete any kind of hands-on project, it is ungraded, or graded by your peers (other learners in your cohort).

This format is immensely popular because it is the most economical way for universities to monetize their course material.

All they have to do is record some lectures, write a few quizzes, and then hundreds of thousands of students can move through the courses with no additional effort or expense required.

Employers know that these certificate programs are not rigorous. Often it’s possible to complete an online programming certification without ever having written or run a line of code!

What employers want to see on a resume is proof of your SQL skills, and they know that this type of certificate doesn’t prove anything.

So, university certificates aren’t going to impress anyone on your resume! But can they be a valuable resource for actually learning SQL?

Theoretically, yes. But they have some major shortcomings.

First and most important: they generally don’t include any hands-on practice. You can certainly try to set up practice on your own, but if your course isn’t requiring you to do this, it’s easy to put it off or forget.

Going hands-on and actually writing and running SQL queries is imperative, though. So is working with real data. The best way to learn to do these critical professional tasks is by doing them, not by watching a professor talk about them.

That’s why at Dataquest, we have an interactive online platform that lets you write and run real SQL queries on real data right from your browser window. As you’re learning new SQL concepts, you’ll be immediately applying them in a real-world setting.

dataquest sql learning platform looks like this

This is how we teach SQL at Dataquest

And after each course, you’ll be asked to synthesize your new learnings into a longer-form guided project that you can customize and put on your resume and GitHub!

Second, because of the peer-graded projects and lecture-based format of university certificate courses, they tend to come with time constraints. You’ll have to wait for a set date for a course to open before you can start. You’ll have to dedicate a set amount of time to watching each lecture, and you’ll have to finish by a certain date to get some of the benefits.

This isn’t true of interactive learning platforms like Dataquest, which you can start at any time. Because our courses are not lecture-based, or even video-based at all, your study sessions can be completely flexible. You can learn and apply something new in as little as five minutes!

But don’t just take our word for it! You can sign up for a free account and within a few minutes, you’ll have written your first SQL query.

Why not give it a try?

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why watch video lectures when you can learn by doing?

The post Do You Need a SQL Certification to Get a Data Job in 2021? appeared first on Dataquest.

SQL Joins Tutorial: Working with Databases https://www.dataquest.io/blog/sql-joins-tutorial/ Mon, 18 Jan 2021 22:47:06 +0000 https://www.dataquest.io/?p=21484 Learn how to master joins in the SQL joins tutorial. Learn to use inner, left, right, and outer joins while analyzing CIA factbook data.

The post SQL Joins Tutorial: Working with Databases appeared first on Dataquest.

sql-joins-tutorial

SQL joins don’t have to be this challenging!

When first learning SQL, it’s common to work with data in a single table. In the real world, databases generally have data in more than one table. If we want to be able to work with that data, we’ll have to combine multiple tables within a query. In this SQL joins tutorial, we’ll learn how to use joins to select data from multiple tables.

We’ll assume that you know the fundamentals of working in SQL, including filtering, sorting, aggregation, and subqueries. If you don’t, our SQL Fundamentals course teaches all of these concepts, and you can sign up and start that course for free.

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why passively watch video lectures when you can learn by doing?

The Factbook Database

We’re going to be using a version of the CIA World Factbook (Factbook) database that has two tables. The first table is called facts, and each row represents a country from the Factbook. Here are the first 5 rows of the facts table:

id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
1 af Afghanistan 652230 652230 0 32564342 2.32 38.57 13.89 1.51
2 al Albania 28748 27398 1350 3029278 0.30 12.92 6.58 3.30
3 ag Algeria 2381741 2381741 0 39542166 1.84 23.67 4.31 0.92
4 an Andorra 468 468 0 85580 0.12 8.13 6.96 0.00
5 ao Angola 1246700 1246700 0 19625353 2.78 38.78 11.49 0.46

In addition to the facts table is a second table called cities which contains information on major urban areas from countries in the Factbook (for the rest of this tutorial, we’ll use the word ‘cities’ to mean the same as ‘major urban areas’). Let’s take a look at the first few rows of this table and a description of what each column represents:

id name population capital facts_id
1 Oranjestad 37000 1 216
2 Saint John’S 27000 1 6
3 Abu Dhabi 942000 1 184
4 Dubai 1978000 0 184
5 Sharjah 983000 0 184
  • id – A unique ID for each city.
  • name – The name of the city.
  • population – The population of the city.
  • capital – Whether the city is a capital city: 1 if it is, 0 if it isn’t.
  • facts_id – The ID of the country, from the facts table.

The last column is of particular interest to us, as it is a column of data that also exists in our original facts table. This link between tables is important as it’s used to combine the data in our queries. Below is a schema diagram, which shows the two tables in our database, the columns within them and how the two are linked.

schema diagram

The line in the schema diagram clearly shows the link between the id column in the facts table and the facts_id column in the cities table.

If you’d like to download the database to follow along on your own computer, you can download the dataset as a SQLite database.

Our First SQL Join

The most common way to join data using SQL is using an inner join. The syntax for an inner join is:

SELECT [column_names] FROM [table_name_one]
INNER JOIN [table_name_two] ON [join_constraint];

The inner join clause is made up of two parts:

  • INNER JOIN, which tells the SQL engine the name of the table you wish to join in your query, and that you wish to use an inner join.
  • ON, which tells the SQL engine what columns to use to join the two tables.

Joins are usually used in a query after the FROM clause. Let’s look at a basic inner join where we combine the data from both of our tables.

SELECT * FROM facts
INNER JOIN cities ON cities.facts_id = facts.id
LIMIT 5;

Let’s look at the line of the query with the join in it:

  • INNER JOIN cities: This tells the SQL engine that we wish to join the cities table to our query using an inner join.
  • ON cities.facts_id = facts.id: This tells the SQL engine which columns to use when joining the data, following the syntax table_name.column_name.

You might presume that SELECT * FROM facts means the query will return only columns from the facts table. However, when the * wildcard is used with a join, it gives you all columns from both tables. Here is the result of this query:

id code name area area_land area_water population population_growth birth_rate death_rate migration_rate id name population capital facts_id
216 aa Aruba 180 180 0 112162 1.33 12.56 8.18 8.92 1 Oranjestad 37000 1 216
6 ac Antigua and Barbuda 442 442 0 92436 1.24 15.85 5.69 2.21 2 Saint John’S 27000 1 6
184 ae United Arab Emirates 83600 83600 0 5779760 2.58 15.43 1.97 12.36 3 Abu Dhabi 942000 1 184
184 ae United Arab Emirates 83600 83600 0 5779760 2.58 15.43 1.97 12.36 4 Dubai 1978000 0 184
184 ae United Arab Emirates 83600 83600 0 5779760 2.58 15.43 1.97 12.36 5 Sharjah 983000 0 184

This query gives us all columns from both tables and every row where there is a match between the id column from facts and the facts_id from cities, limited to the first 5 rows.

Understanding SQL Inner Joins

We’ve now joined the two tables to give us extra information about each row in cities. Let’s take a closer look at how this inner join works.

An inner join works by including only rows from each table that have a match as specified using the ON clause. Let’s look at a diagram of how the join we wrote above works. We have included a selection of rows that best illustrate the join:

inner join

Our inner join will include:

  • Rows from the cities table that have a cities.facts_id that matches a facts.id from facts.

Our inner join will not include:

  • Rows from the cities table that have a cities.facts_id that don’t match any facts.id from facts.
  • Rows from the facts table that have a facts.id that don’t match any cities.facts_id from cities.

You can see this represented as a Venn diagram:

Inner join Venn diagram

We already know how to use aliases to specify custom names for columns, e.g.:

SELECT AVG(population) AS average_population

We can also create aliases for table names, which makes queries with joins easier to both read and write. Instead of:

SELECT * FROM facts
INNER JOIN cities ON cities.facts_id = facts.id

We can write:

SELECT * FROM facts AS f
INNER JOIN cities AS c ON c.facts_id = f.id

Just like with column names, using AS is optional. We can get the same result by writing:

SELECT * FROM facts f
INNER JOIN cities c ON c.facts_id = f.id

We can also combine aliases with wildcards – for instance, using the aliases created above, c.* would give us all columns from the table cities.

While our earlier query included both columns from the ON clause, we don’t need to use either column from our ON clause in our final list of columns. This is useful as it means we can show only the information we’re interested in, rather than having to include the two join columns every time.

Let’s use what we’ve learned to build on our original query. We’ll:

  • Join cities to facts using an INNER JOIN.
  • Use aliases for table names.
  • Include, in order:

    • All columns from cities.
    • The name column from facts aliased to country_name.
  • Include only the first 5 rows.
SELECT
    c.*,
    f.name country_name
FROM facts f
INNER JOIN cities c ON c.facts_id = f.id
LIMIT 5;

id name population capital facts_id country_name
1 Oranjestad 37000 1 216 Aruba
2 Saint John’S 27000 1 6 Antigua and Barbuda
3 Abu Dhabi 942000 1 184 United Arab Emirates
4 Dubai 1978000 0 184 United Arab Emirates
5 Sharjah 983000 0 184 United Arab Emirates

Practicing Inner Joins in SQL

Let’s practice writing a query to answer a question from our database using an inner join. Say we want to produce a table of countries and their capital cities from our database using what we’ve learned so far. Our first step is to think about what columns we’ll need in our final query. We’ll need:

  • The name column from facts
  • The name column from cities

Given that we’ve identified that we need data from two tables, we need to think about how to join them. The schema diagram from earlier indicated that there is only one column in each table that links them together, so we can use an inner join with those columns to join the data.

So far, thinking through our question we can already write most of our query (it’s almost identical to the previous query we wrote):

SELECT
    f.name,
    c.name
FROM cities c
INNER JOIN facts f ON f.id = c.facts_id;

The last part of our process is to make sure we have the correct rows. From the previous sections, we know that a query like this will return all rows from cities that have a corresponding match from facts in the facts_id column. We’re only interested in the capital cities from the cities table, so we’ll need to use a WHERE clause on the capital column, which has a value of 1 if the city is a capital, and 0 if it isn’t:

WHERE c.capital = 1

We can now put this all together to write a query that answers our question. We’ll limit it to just the first 10 rows so the amount of output is manageable.

SELECT
    f.name country,
    c.name capital_city
FROM cities c
INNER JOIN facts f ON f.id = c.facts_id
WHERE c.capital = 1
LIMIT 10;

country capital_city
Aruba Oranjestad
Antigua and Barbuda Saint John’S
United Arab Emirates Abu Dhabi
Afghanistan Kabul
Algeria Algiers
Azerbaijan Baku
Albania Tirana
Armenia Yerevan
Andorra Andorra La Vella
Angola Luanda

Left Joins in SQL

As we mentioned earlier, an inner join will not include any rows where there is not a mutual match from both tables. This means there could be information we are not seeing in our query where rows don’t match.

We can use SQL queries to explore this:

SELECT COUNT(DISTINCT(name)) FROM facts;

count
261
SELECT COUNT(DISTINCT(facts_id)) FROM cities;

count
210

By running these two queries, we can see that there are some countries in the facts table that don’t have corresponding cities in the cities table, which indicates we may have some incomplete data.

Let’s look at how we can create a query to explore the missing data using a new type of join: the left join.

A left join includes all the rows that an inner join will select, plus any rows from the first (or left) table that don’t have a match in the second table. We can see this represented as a Venn diagram.

Venn diagram left join

Let’s look at an example by replacing INNER JOIN with LEFT JOIN in the first query we wrote, and looking at the same selection of rows from our earlier diagram:

SELECT * FROM facts
LEFT JOIN cities ON cities.facts_id = facts.id

left join

Here we can see that for the rows where facts.id doesn’t match any values in cities.facts_id (237, 238, 240, and 244), the rows are still included in the results. When this happens, all of the columns from the cities table are populated with null values.

We can use these null values to filter our results to just the countries that don’t exist in cities with a WHERE clause. When making a comparison to null in SQL, we use the IS keyword, rather than the = sign. If we want to select rows where a column is null we can write:

WHERE column_name IS NULL

If we want to select rows where a column name isn’t null, we use:

WHERE column_name IS NOT NULL
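As a quick aside (this query isn't part of the original walkthrough), the IS NOT NULL version would keep only the countries that do have at least one match in cities:

SELECT DISTINCT f.name country
FROM facts f
LEFT JOIN cities c ON c.facts_id = f.id
-- keep only countries with a matching city; DISTINCT removes the
-- duplicate country rows that would otherwise appear once per city
WHERE c.name IS NOT NULL;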

Let’s use a left join to explore the countries that don’t exist in the cities table.

SELECT
    f.name country,
    f.population
FROM facts f
LEFT JOIN cities c ON c.facts_id = f.id
WHERE c.name IS NULL;

country population
Kosovo 1870981
Monaco 30535
Nauru 9540
San Marino 33020
Singapore 5674472
Holy See (Vatican City) 842
Taiwan 23415126
European Union 513949445
Ashmore and Cartier Islands
Christmas Island 1530
Cocos (Keeling) Islands 596
Coral Sea Islands
Heard Island and McDonald Islands
Norfolk Island 2210
Hong Kong 7141106
Macau 592731
Clipperton Island
French Southern and Antarctic Lands
Saint Barthelemy 7237
Saint Martin 31754
Curacao 148406
Sint Maarten 39689
Cook Islands 9838
Niue 1190
Tokelau 1337
Bouvet Island
Jan Mayen
Svalbard 1872
Akrotiri 15700
British Indian Ocean Territory
Dhekelia 15700
Gibraltar 29258
Guernsey 66080
Jersey 97294
Montserrat 5241
Pitcairn Islands 48
South Georgia and South Sandwich Islands
Navassa Island
Wake Island
United States Pacific Island Wildlife Refuges
Antarctica 0
Gaza Strip 1869055
Paracel Islands
Spratly Islands
West Bank 2785366
Arctic Ocean
Atlantic Ocean
Indian Ocean
Pacific Ocean
Southern Ocean
World 7256490011

Looking through the results of the query we just wrote, we can see a number of different reasons that countries don’t have corresponding values in cities:

  • Countries with small populations and/or no major urban areas (which are defined as having populations of over 750,000), e.g., San Marino, Kosovo, and Nauru.
  • City-states, such as Monaco and Singapore.
  • Territories that are not themselves countries, such as Hong Kong, Gibraltar, and the Cook Islands.
  • Regions and oceans that aren’t countries, such as the European Union and the Pacific Ocean.
  • Genuine cases of missing data, such as Taiwan.

It’s important whenever you use inner joins to be mindful that you might be excluding important data, especially if you are joining based on columns that aren’t linked in the database schema.

Right Joins and Outer Joins

There are two less-common join types that SQLite does not support but that you should be aware of. The first is a right join. A right join, as the name indicates, is exactly the opposite of a left join. While the left join includes all rows from the table before the JOIN clause, the right join includes all rows from the table named in the JOIN clause. We can see a right join in the Venn diagram below:

Venn diagram of a right join

The following two queries, one using a left join and one using a right join, produce identical results.

SELECT f.name country, c.name city
FROM facts f
LEFT JOIN cities c ON c.facts_id = f.id
LIMIT 5;

SELECT f.name country, c.name city
FROM cities c
RIGHT JOIN facts f ON f.id = c.facts_id
LIMIT 5;

The main reason to use a right join is when you are joining more than two tables. In these cases, a right join can be preferable because it avoids restructuring your whole query just to join one more table. Outside of this, right joins are used fairly rarely, so for simple joins it’s better to use a left join than a right join, as that will make your query easier for others to read and understand.

The other join type not supported by SQLite is a full outer join. A full outer join will include all rows from the tables on both sides of the join. We can see a full outer join in the Venn diagram below:

Venn diagram of a full outer join

Like right joins, full outer joins are reasonably uncommon. The standard SQL syntax for a full outer join is:

SELECT f.name country, c.name city
FROM cities c
FULL OUTER JOIN facts f ON f.id = c.facts_id
LIMIT 5;

When joining cities and facts with a full outer join, the result will be the same as our left and right joins above, because there are no values in cities.facts_id that don’t exist in facts.id.
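If you do need full-outer-join behavior in a version of SQLite that lacks FULL OUTER JOIN, one common workaround (a sketch, not part of the original tutorial) is to combine a left join with a UNION ALL of the reversed left join, keeping only the rows the first half missed:

-- all rows from facts, matched to cities where possible
SELECT f.name country, c.name city
FROM facts f
LEFT JOIN cities c ON c.facts_id = f.id
UNION ALL
-- add any cities rows that have no matching facts row
SELECT f.name country, c.name city
FROM cities c
LEFT JOIN facts f ON f.id = c.facts_id
WHERE f.id IS NULL;

In this database the second SELECT returns no rows, since every cities.facts_id matches a facts.id, so the combined result is the same as the plain left join.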

Let’s look at the Venn diagrams of each join type side by side, which should help you compare the differences of each of the four joins we’ve discussed so far.

Join Venn Diagram

Next, let’s practice using joins to answer some questions about our data.

Finding the Most Populous Capital Cities

Previously, we’ve used column names when specifying order for our query results, like so:

SELECT
    name,
    migration_rate
FROM facts
ORDER BY migration_rate desc;

There is a handy shortcut we can use in our queries which lets us skip the column names, and instead use the order in which the columns appear in the SELECT clause. In this instance, migration_rate is the second column in our SELECT clause, so we can just use 2 instead of the column name:

SELECT
    name,
    migration_rate
FROM facts
ORDER BY 2 desc;

You can use this shortcut in either the ORDER BY or GROUP BY clauses. Be mindful that you want to ensure your queries are still readable, so typing the full column name may be better for more complex queries.

Let’s use what we’ve learned to produce a list of the top 10 capital cities by population. Because we are not interested in countries from facts that don’t have corresponding cities in cities, we should use an INNER JOIN.

SELECT
    c.name capital_city,
    f.name country,
    c.population
FROM facts f
INNER JOIN cities c ON c.facts_id = f.id
WHERE c.capital = 1
ORDER BY 3 DESC
LIMIT 10;

capital_city country population
Tokyo Japan 37217000
New Delhi India 22654000
Mexico City Mexico 20446000
Beijing China 15594000
Dhaka Bangladesh 15391000
Buenos Aires Argentina 13528000
Manila Philippines 11862000
Moscow Russia 11621000
Cairo Egypt 11169000
Jakarta Indonesia 9769000

Combining SQL Joins with Subqueries

Subqueries can be used to substitute parts of queries, allowing us to find the answers to more complex questions. We can also join to the result of a subquery, just like we could a table.

Here’s an example of using a join and a subquery to produce a table of countries and their capital cities, like we did earlier in this tutorial.

subqueries

Reading subqueries can be overwhelming at first, so we’ll break down what happens in this example in several steps. The important thing to remember is that the result of any subquery is always calculated first, so we read from the inside out.

  • The subquery, in the red box, is calculated first. This simple query selects all columns from cities, filtering rows that are marked as capital cities by having a value for capital of 1.
  • The INNER JOIN joins the subquery result, aliased as c, to the facts table based on the ON clause.
  • Two columns are selected from the results of the join:

    • f.name, aliased as country.
    • c.name, aliased as capital_city.
  • The results are limited to the first 10 rows.
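The query itself appears only as an image above, so here is a text version reconstructed from the steps just described (the exact formatting is a guess, but the logic follows the description):

SELECT
    f.name country,
    c.name capital_city
FROM facts f
INNER JOIN (
            -- the subquery: only the rows flagged as capital cities
            SELECT * FROM cities
            WHERE capital = 1
           ) c ON c.facts_id = f.id
LIMIT 10;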

Below is the output of this query:

country capital_city
Aruba Oranjestad
Antigua and Barbuda Saint John’S
United Arab Emirates Abu Dhabi
Afghanistan Kabul
Algeria Algiers
Azerbaijan Baku
Albania Tirana
Armenia Yerevan
Andorra Andorra La Vella
Angola Luanda

Using this example as a model, we’ll write a similar query to find non-capital cities with populations of over 10 million.

SELECT
    c.name city,
    f.name country,
    c.population population
FROM facts f
INNER JOIN (
            SELECT * FROM cities
            WHERE capital = 0
            AND population > 10000000
           ) c ON c.facts_id = f.id
ORDER BY 3 DESC;

city country population
New York-Newark United States 20352000
Shanghai China 20208000
Sao Paulo Brazil 19924000
Mumbai India 19744000
Marseille-Aix-en-Provence France 14890100
Kolkata India 14402000
Karachi Pakistan 13876000
Los Angeles-Long Beach-Santa Ana United States 13395000
Osaka-Kobe Japan 11494000
Istanbul Turkey 11253000
Lagos Nigeria 11223000
Guangzhou China 10849000

SQL Challenge: Complex Query with Joins and Subqueries

Let’s take everything we’ve learned before and use it to write a more complex query. It’s not uncommon to find that ‘thinking in SQL’ takes a bit of getting used to, so don’t be discouraged if this query seems difficult to understand at first. It will get easier with practice!

When you’re writing complex queries with joins and subqueries, it helps to follow this process:

  • Think about what data you need in your final output
  • Work out which tables you’ll need to join, and whether you will need to join to a subquery.

    • If you need to join to a subquery, write the subquery first.
  • Then start writing your SELECT clause, followed by the join and any other clauses you will need.
  • Don’t be afraid to write your query in steps, running it as you go. For instance, you can run your subquery as a standalone query first to make sure it looks like you want before writing the outer query.

We will be writing a query to find the countries where the urban center (city) population is more than half of the country’s total population. There are multiple ways to write this query but we’ll step through one approach.

We can start by writing a query to sum all the urban populations for cities in each country. We can do this without a join by grouping on the facts_id (we’ll use a limit in our example below to keep the output manageable):

SELECT
    facts_id,
    SUM(population) urban_pop
FROM cities
GROUP BY 1
LIMIT 5;

facts_id urban_pop
1 3097000
10 172000
100 1127000
101 5000
102 546000

Next, we’ll join the facts table to that subquery, selecting the country name, urban population and total population (again, we’ve used a limit to keep things tidy):

SELECT
    f.name country,
    c.urban_pop,
    f.population total_pop
FROM facts f
INNER JOIN (
            SELECT
                facts_id,
                SUM(population) urban_pop
            FROM cities
            GROUP BY 1
           ) c ON c.facts_id = f.id
LIMIT 5;

country urban_pop total_pop
Afghanistan 3097000 32564342
Austria 172000 8665550
Libya 1127000 6411776
Liechtenstein 5000 37624
Lithuania 546000 2884433

Lastly, we’ll create a new column that divides the urban population by the total population, and use a WHERE and ORDER BY to filter/rank the results:

SELECT
    f.name country,
    c.urban_pop,
    f.population total_pop,
    (c.urban_pop / CAST(f.population AS FLOAT)) urban_pct
FROM facts f
INNER JOIN (
            SELECT
                facts_id,
                SUM(population) urban_pop
            FROM cities
            GROUP BY 1
           ) c ON c.facts_id = f.id
WHERE (c.urban_pop / CAST(f.population AS FLOAT)) > 0.5
ORDER BY urban_pct ASC;

country urban_pop total_pop urban_pct
Uruguay 1672000 3341893 0.500315
Congo, Republic of the 2445000 4755097 0.514185
Brunei 241000 429646 0.560927
New Caledonia 157000 271615 0.578024
Virgin Islands 60000 103574 0.579296
Falkland Islands (Islas Malvinas) 2000 3361 0.595061
Djibouti 496000 828324 0.598800
Australia 13789000 22751014 0.606083
Iceland 206000 331918 0.620635
Israel 5226000 8049314 0.649248
United Arab Emirates 3903000 5779760 0.675288
Puerto Rico 2475000 3598357 0.687814
Bahamas, The 254000 324597 0.782509
Kuwait 2406000 2788534 0.862819
Saint Pierre and Miquelon 5000 5657 0.883861
Guam 169000 161785 1.044596
Northern Mariana Islands 56000 52344 1.069846
American Samoa 64000 54343 1.177705

You can see that while our final query is complex, it’s much easier to understand if you build it step-by-step.

SQL Joins Tutorial: Next Steps

In this SQL joins tutorial, we learned:

  • The difference between inner and left joins.
  • The role of right and outer joins.
  • How to choose which join is appropriate for your task.
  • Using joins with subqueries, aggregate functions, and other SQL techniques.

Other resources that might interest you include our SQL cheat sheet, our article on SQL certification, our rundown of SQL interview questions for job interviews, and of course our interactive SQL courses. Click below to sign up and get started for free!

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why passively watch video lectures when you can learn by doing?

The post SQL Joins Tutorial: Working with Databases appeared first on Dataquest.

45 Fun (and Unique) Python Project Ideas for Easy Learning https://www.dataquest.io/blog/python-projects-for-beginners/ Wed, 13 Jan 2021 09:01:12 +0000 https://www.dataquest.io/?p=20545 Building projects is an extremely successful way to learn, but building Python projects for beginners can be difficult. Learn how to build with success!

The post 45 Fun (and Unique) Python Project Ideas for Easy Learning appeared first on Dataquest.

python tutorials for data science

If I could give my former self one piece of advice when I was struggling to learn Python as a beginner, it would be this: create more Python projects.

Learning Python can be difficult. You can spend time reading a textbook or watching videos, but then struggle to actually put what you've learned into practice. Or you might spend a ton of time learning syntax and get bored or lose motivation. (That happened to me. A lot).

How can you increase your chances of success? By building Python projects. That way you're learning by actually doing what you want to do!

Python Projects: Why Are They So Important?

Building projects helped me bring together everything I was learning. Once I started building projects, I immediately felt like I was making more progress. 

Project-based learning is also the philosophy behind our teaching method at Dataquest, where we teach data science skills using Python. Why? Because time and time again, we’ve seen that it works!

Working on things that you care about helps you stick with your studies, even when the going gets tough.

But it can be difficult to build Python projects for beginners. Where do you start? What makes a good project? What do you do when you get stuck? In this article, we’re going to talk about:

  • What you need to do before you build your first project.
  • What makes a successful project.
  • Strategies to use when you get stuck.
  • Examples of how to select the perfect project.

Why Building Projects Is the Best Way to Learn

First, let's take a look at why a project-based learning approach is so effective.

Motivation: Have the Momentum to Keep Going

First, building Python projects helps you learn more effectively because you can choose a project or topic that interests you. 

This helps you stay motivated, which is important in preventing you from giving up when things get tough.

Efficiency: Only Learn What You Need To

The second reason a project-based approach works is that there's no gap between learning the skill and putting it into practice. You won't waste time learning irrelevant things, because you’ll be actively trying to learn the specific things you need to build your project. 

This also means you will get where you want to go a lot faster. If you’re trying to learn Python for data science by building data science projects, for example, you won’t be wasting time learning Python concepts that might be important for robotics programming but aren’t relevant to your data science goals.

Problem-Solving: Learn the Key Programming Skill

Problem-solving is a key skill when working with Python (or any other programming language). When you're building a project, you're going to have to come up with ways of approaching problems and solving them using code. 

Building projects thus forces you to practice what is perhaps the most important skill in programming. And the more practice you can give your brain in solving problems with code, the faster your skills will develop.

Portfolio: Use Your Projects to Help You Get a Job

The fourth and final reason that building Python projects works for beginners is that you can get a head-start on getting your first job (if that's your goal). 

When employers are looking to hire entry-level candidates, they want to see that you have the key skills they need. A great way of achieving this is having a portfolio of relevant projects that demonstrate your skills. 

If you’re looking for your first job in the field, employers are going to want to see tangible proof of your Python skills. In other words, they’re going to want to see what projects you’ve built. 

If you're interested, you can read more about building a portfolio in our Data Science Career Guide (which, while aimed specifically at people looking to get into data, has advice that's equally valuable if your goal is another application of Python!).

Before You Build Your First Python Project

If you have some programming experience, you might be able to dive straight into building a project. For most people, however, you'll need to take a little time to learn some of the basics of Python first. The idea here is to spend a small amount of time to learn these basics so you have what you need to dive into projects.

There are a few resources that you can use at this stage:

  • Learning with Dataquest
  • Codecademy - One of the best-known sites for learning the basics

Once you have learned some of the basics, it's normal to feel a bit overwhelmed. You are learning something totally new, after all. Even though you might not feel ready to start building a project, you probably are.

As a first step, you might like to try building a structured or guided project. Structured projects are important because they allow you to build something without having to start from scratch, which can be difficult if you're a beginner.

At Dataquest, we include guided projects in every course which are designed to help bridge the gap between learning from a course and being able to build a project on your own. An alternative path would be following along with Python tutorial blog posts that you can find on either the Dataquest site or on thousands of other sites online.

What Makes a Great Python Project for Beginners?

Now that it's time to build your Python project, you need to decide what to build! Choosing what to build is extremely important — it will impact whether your project will be successful or not. So what makes for a great Python project for beginners?

Choose a Topic You're Interested In

The first and most important factor is choosing a topic that interests you. If you're interested in what you're building, you'll have more motivation. Motivation is important because it's the momentum that carries you through when you hit roadblocks (more on that later!).

Some people might be motivated by sports, others by a project that relates to social good. Others might be motivated by something to do with finance or the stock market. You might be obsessed with movies or a favorite TV series. Whatever that "thing" is for you, that's what your project should be about.

Think about your goals

The second factor to consider is what your overall goal is in learning Python. If you want to get into web development, then a project that builds a small web app is ideal. If you want to get into data science, then a project that analyzes a dataset is a good choice. By aligning your project with your goals, you'll be taking yourself closer to your eventual goal, rather than going on a "detour".

Start Small

The last factor is not being too ambitious. It's natural to come up with a grand plan, e.g. "I want to build a website that allows people to build custom shot charts using NBA data." This project idea sounds like it is based around a motivating topic (presuming you like basketball) and intersects with a goal (learning to make websites).

The difficulty with this project choice is that it's too big. In order to execute it, a beginner will need to learn the basics of building an online application, how to store and retrieve a large amount of data, how to create shot-chart visualizations, and how to display them to a user upon request.

It's much better to start with an extremely small and simple version of your project and then add more functionality later. If you don't, it will take a long time before you get any sense of accomplishment from finishing and you might even give up. By starting small and expanding, you're much more likely to have success.

Matryoshka Dolls

Start from a small project and build it up over time

A better version of this project might be to create a simple web app that will show a single NBA statistic for a small selection of players. Once you've built that, you can choose to expand it out by adding more players, more statistics, or any other extra piece of complexity that might appeal to you.

Building Your Python Project: Roadblocks and Difficulties

You've learned the basics of Python, completed a guided project, selected the perfect topic for your first solo project, and you're ready to get started. After about half an hour, you run into a problem: there's something you don't know how to do!

I promise you that this will happen, and it's not a nice feeling. No one likes getting stuck. That said, what you're being presented with is an opportunity. These moments — roadblocks — are where the learning actually happens. The key is knowing how to research to get yourself around the roadblock and keep working.

The good news is that most of the time, someone has been in the same situation — with the same roadblock — as you are in right now. What you need to be able to do is find the resources left behind by those people. Enter: Google (or your favorite alternative search engine).

How to Search for Help

The key to being able to find help is constructing a search for information about a general version of the thing you want to do. 

Say you have a Python dictionary where the keys of the dictionaries are NBA player names and the values are how many games they've played. You're trying to find out which player has the most games.

Searching for “how to find out which NBA player has the most games in Python dictionary” probably isn’t going to be helpful, though. You need to construct a general form for your question, which in this case might be: "Find which key of a Python dictionary has the maximum value."

In fact, that exact Google search seems to bring us to a Stack Overflow question with answers that look helpful!
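To make that concrete, here's roughly what such an answer looks like in practice. This is just a small sketch; the player names and game counts below are made up for illustration:

# a dictionary mapping player names to games played (made-up numbers)
games_played = {"Player A": 1210, "Player B": 1430, "Player C": 990}

# max() compares the keys by their values when we pass key=games_played.get
most_games = max(games_played, key=games_played.get)
print(most_games)  # Player B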

Finding these general question forms can be tricky at first, but this is an important skill that almost every programmer uses daily, so don't be afraid to dive in there and get some practice. If you still can't find help, you might need to break your problem down into smaller chunks and search for each 'chunk' individually.

You'll find that most of your searches for help will end up on one of three places:

  • An online tutorial that explains the thing you want to do.
  • Stack Overflow (an online programming Q&A site) thread of someone in a similar situation.
  • The documentation for Python or the Python library you're using.

If you still aren't finding the answers, you should post your question on a place like Stack Overflow or the Dataquest community, where others might be able to answer your question. You may be surprised by how quickly other programmers will jump in to help out a beginner!

Python Project Examples

Now let’s look through a few fictional examples of people with interests and goals, and see how they can choose a Python project that suits their needs.

Data-Focused Danielle

Danielle wants to break into the data science space, and she's identified that an entry-level job in data is going to be an analyst-type role.

She loves Star Trek, so she's decided that an ideal project would be to analyze some data related to Star Trek episodes. 

In order to start small and build up, she's going to find a data set and summarize data about episodes (she'll probably use this list of places to find free data sets for projects to get started). 

Once she's done that, she plans to expand her project by creating some visualizations.

Fun Python project ideas for building data skills:

  • Find out How Much Money You've Spent on Amazon — Dig into your own spending habits with this beginner-level tutorial!
  • Analyze Your Own Netflix Data — Another beginner-to-intermediate tutorial that gets you working with your own personal data set.
  • Analyze Your Personal Facebook Posting Habits — Are you spending too much time posting on Facebook? The numbers don't lie, and you can find your numbers in this beginner-to-intermediate Python data project.
  • Analyze Survey Data — This walk-through will show you how to get Python set up and how to filter survey data from any data set you can find (or just use the sample data linked in the article).
  • Dataquest's Guided Projects — These guided projects walk you through building real-world data projects of increasing complexity, with suggestions for how each project can be expanded.
  • Analyze Everything — Grab a free data set that interests you and start poking around! If you get stuck or aren't sure where to start, our Python courses are here to help you, and you can try them for free!

Gamer Greg

Greg wants to learn Python in order to build games for fun and loves puzzles. 

Greg has decided that he's going to learn Python by building games using the Pygame library. He'll start by building a structured project using some Pygame tutorials, then go on to create a simple version of Rock, Paper, Scissors before gradually increasing the complexity of his projects.

Video Game

Building a video game using Python

Cool Python projects for game devs:

  • Rock, Paper, Scissors — Start your Python learning journey with a simple but fun game that everybody knows.
  • Build a Text Adventure Game — This is a classic Python beginner project (it also pops up in this book) that'll teach you a lot of basic game setup concepts that'll be useful for more advanced games in the future.
  • Guessing Game — This is another beginner-level project that'll help you learn and practice the basics.
  • Mad Libs — Learn how to make interactive Python Mad Libs!
  • Hangman — Another childhood classic that you can make in Python to stretch your skills.
  • Snake — This is a bit more complex, but it's a classic (and surprisingly fun) game to make and play.

Website Wanda

Wanda wants to get a job building websites using Python, and she loves fitness and exercising. She's going to start by following a tutorial for the Flask web framework, and then try to build a very basic website that she can use to log each time that she exercises.

Once she's built this simple version, she plans to expand and add new features one by one.

Simple Python projects for beginner web devs:

  • URL shortener — This free video course will show you how to build your own URL shortener like Bit.ly using Python and Django.
  • Build a Simple Web Page with Django — This is a very in-depth, from-scratch tutorial for building a website with Python and Django that even has cartoon illustrations!

App Dev Aaron

Aaron wants to learn Python so that he can build apps for mobile devices and the web. 

Easy Python projects for aspiring developers:

Additional Python Project Ideas

Still haven't found a project idea that appeals to you? Here are a whole bunch more, separated out (roughly) by experience level.

These aren't tutorials, they're ideas that you'll have to dig into and research on your own, but that's part of the fun! And it's part of the natural process of learning to code, and even working as a programmer. The pros Google for answers all the time — so don't be afraid to dive in and get your hands dirty!

Python Project Ideas: Beginner Level

  • Create a "Code" Generator that takes text as input and replaces each letter with another letter, and outputs the "encoded" message.
  • Build a "countdown calculator." Write some code that can take two dates as input, and calculate the amount of time between them. This will be a great way to familiarize yourself with Python's datetime module.
  • Write a Sorting Method. Given a list, can you write some code that sorts it alphabetically, or numerically? Yes, Python has this functionality built-in, but see if you can do it without using sort()!
  • Build an Interactive Quiz. Which Avenger are you? Build a personality or recommendation quiz that asks users some questions, stores their answers, and then performs some kind of calculation to give the user a personalized end result based on their answers.
  • Tic-Tac-Toe by Text. Build a Tic-Tac-Toe game that's playable like a text adventure. Can you make it print a text-based representation of the board after each move?
  • Make a Temperature/Measurement Converter. Write a script that can convert Fahrenheit to Celsius and back, or inches to centimeters and back, etc. How far can you take it?
  • Build a counter app. Take your first steps into the world of UI by building a very simple app that counts up by one each time a user clicks a button.
  • Build a number guessing game. Think of this as a bit like a text adventure, but with numbers. How far can you take it?
  • Build an alarm clock. This is borderline beginner/intermediate, but it's worth trying to build an alarm clock for yourself. Can you create different alarms? A snooze function?

Python Project Ideas: Intermediate Level

  • Build an Upgraded Code Generator. Starting with the project mentioned in the beginner section, see what you can do to make it more sophisticated. Can you make it generate different kinds of codes? Can you create a "decoder" app that reads encoded messages if the user inputs a secret key? Can you create a more sophisticated code that goes beyond simple letter replacement?
  • Make your Tic-Tac-Toe Game clickable. Building off the beginner project, now make a version of Tic-Tac-Toe that has an actual UI, and that you play by clicking on open squares. Challenge: can you write a simple "AI" opponent for a human player to play against?
  • Scrape some data to analyze. This could really be anything, from any website you like. The web is full of interesting data, and if you learn a little about web-scraping, you can collect some really unique datasets.
  • Build a Clock Website. How close can you get it to real-time? Can you implement different time zone selectors, and add in the "countdown calculator" functionality to calculate lengths of time?
  • Automate some of your job. This will vary, but many jobs have some kind of repetitive process that can be automated!
  • Automate your personal habits. Do you want to remember to stand up once every hour during work? How about writing some code that generates unique workout plans for you based on your goals and preferences? There are a variety of simple apps you can build for yourself to automate or enhance different aspects of your life.
  • Create a simple web browser. Build a simple UI that allows for URL entry and that can load webpages. PyWt will be helpful here! Can you add a "back" button, bookmarks, and other cool features?
  • Write a notes app. Create an app that helps people write and store notes. Can you think of some interesting and unique features to add?
  • Build a Typing Tester. This should show the user some text, then challenge them to type it, timing how long it takes them to finish and scoring them on their accuracy.
  • Create a "site updated" notification system. Ever get annoyed with having to refresh a website to see if an out-of-stock product has been relisted, or to see if any new news has been posted? Write a Python script that automatically checks a given URL for updates and informs you instantly when it identifies one. (Be careful not to overload the servers of whatever site you're checking, though — keep the time interval between each check reasonable.)
  • Recreate your favorite board game in Python. There are tons of options here, from something simple like Checkers all the way up to Risk or even more modern and advanced games like Ticket to Ride or Settlers of Catan. How close can you get to the real thing?
  • Build a Wikipedia Explorer. Build an app that displays a random Wikipedia page. The challenge here is in the details: can you add user-selected categories? Can you try a different "rabbit hole" version of the app where each article is randomly selected from the articles linked in the previous article? This might seem simple, but it can actually take some real web-scraping chops.

Python Project Ideas: Advanced Level

  • Build a Stock Market Prediction App. For this one, you'll need a source of stock market data and some machine learning chops, but tons of people have tried this, so there's a lot of source code out there to work from. 
  • Build a Chatbot. The challenge here isn't so much making the chatbot as making it good. Can you, for example, implement some Natural Language Processing techniques to make it sound more natural and spontaneous?
  • Program a robot. This requires some hardware (which isn't usually free) but there are lots of affordable options out there, and tons of learning resources too. Definitely look into Raspberry Pi if you're not already thinking along those lines.
  • Build an Image Recognition App. Starting with handwriting recognition is a good idea — Dataquest even has a guided project to help with that! — but once you've got that down, you can take it much bigger.
  • Make a Price Prediction Model. Pick an industry or product you're interested in, and build a machine learning model that predicts price changes.
  • Create your own Sentiment Analysis Model. Sure, there are plenty of pre-built ones out there, but can you collect a large corpus of text data and build one of your own? (Or, less challenging: optimize an existing sentiment analysis model for the particular text you're analyzing.)
  • Create an interactive map. This will require a mix of data skills and UI creation skills. Your map can display whatever you'd like — bird migrations, traffic data, crime reports — but it should be interactive in some way. How far can you take it?

 

Next Steps

Each of the examples in the previous section followed the advice on choosing a great Python project for beginners:

  • Think about what you're interested in and choose a project that overlaps with your interests to help with motivation.
  • Think about your goals in learning Python, and make sure your project moves you toward those goals.
  • Start small. Once you've built a small project you can either expand it or build another one.

Now you're ready to get started. If you haven't learned the basics of Python yet, I recommend diving in with Dataquest's Python Fundamentals course.

If you already know the basics, there’s no reason to hesitate! Now is the time to dive in and find your perfect Python project.

(If you're stuck for ideas, this article contains lots of ideas as well as some resources for structured projects.)

The post 45 Fun (and Unique) Python Project Ideas for Easy Learning appeared first on Dataquest.

SQL Tutorial: Selecting Ungrouped Columns Without Aggregate Functions https://www.dataquest.io/blog/sql-tutorial-selecting-ungrouped-columns-without-aggregate-functions/ Tue, 12 Jan 2021 18:12:51 +0000 https://www.dataquest.io/?p=25115 When is a SQL query that returns the correct answer actually wrong? In this tutorial, we're going to take a close look at a very common mistake. It's one that will actually return the right answer, but it's still a mistake that's important to avoid.That probably sounds rather mysterious, so let's dive right in. We'll […]

The post SQL Tutorial: Selecting Ungrouped Columns Without Aggregate Functions appeared first on Dataquest.

sql-columns

When is a SQL query that returns the correct answer actually wrong? In this tutorial, we're going to take a close look at a very common mistake. It's one that will actually return the right answer, but it's still a mistake that's important to avoid.

That probably sounds rather mysterious, so let's dive right in. We'll illustrate the SQL mistake you might not even know you're making, and highlight how to approach the problem correctly.

The Problem: Right Answer, But Wrong SQL Query

At Dataquest, one of our favorite databases to teach SQL with is Chinook — a database of the records of a fictitious online music store. In one of the courses that uses it, learners are challenged to find the customer from each country who has spent the most money.

They often end up creating the following CTE. It contains a row per customer with their name, country, and total amount spent:

country customer_name total_purchased
Argentina Diego Gutiérrez 39.6
Australia Mark Taylor 81.18
Austria Astrid Gruber 69.3
Belgium Daan Peeters 60.39
Brazil Luís Gonçalves 108.9
Canada François Tremblay 99.99
Chile Luis Rojas 97.02
Czech Republic František Wichterlová 144.54
Denmark Kara Nielsen 37.62
Finland Terhi Hämäläinen 79.2
France Wyatt Girard 99.99
Germany Fynn Zimmermann 94.05
Hungary Ladislav Kovács 78.21
India Manoj Pareek 111.87
Ireland Hugh O’Reilly 114.84
Italy Lucas Mancini 50.49
Netherlands Johannes Van der Berg 65.34
Norway Bjørn Hansen 72.27
Poland Stanisław Wójcik 76.23
Portugal João Fernandes 102.96
Spain Enrique Muñoz 98.01
Sweden Joakim Johansson 75.24
USA Jack Smith 98.01
United Kingdom Phil Hughes 98.01

We’ll call this CTE customer_country_purchases.
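That CTE isn't reproduced in full here, but it might look something like the sketch below. Treat this purely as an illustration: the exact table and column names depend on the version of the Chinook database you're using, so consider them assumptions rather than the course's actual code.

WITH customer_country_purchases AS (
    SELECT
        c.country,
        c.first_name || ' ' || c.last_name AS customer_name,  -- assumed column names
        SUM(i.total) AS total_purchased
    FROM customer c                                            -- assumed table names
    INNER JOIN invoice i ON i.customer_id = c.customer_id
    GROUP BY 1, 2
)
SELECT * FROM customer_country_purchases;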

Generally, they're getting to that output — which is correct — using this query:

SELECT country, customer_name,
       MAX(total_purchased)
  FROM customer_country_purchases
 GROUP BY country;

In English: select the country, the maximum amount spent for that country, and include the customer’s name.

This is a very natural try, and it yields correct output! However, as you may have expected from my wording, there’s more to this solution than meets the eye.

What's Wrong With That?

The goal of this post is to clarify what is objectionable about the approach above. To make it easier to visualize what’s going on, we’ll drop Chinook, and work with a smaller table. We’ll be using the elite_agent table.

(This is also a fictional database; think of it as a table of secret agents by city, gender, and age).

id city gender age
1 Lisbon M 21
2 Chicago F 20
3 New York F 20
4 Chicago M 27
5 Lisbon F 27
6 Lisbon M 19
7 Lisbon F 23
8 Chicago F 24
9 Chicago M 21

If you wish to experiment with it, here's a SQLite database with this table.
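If you'd rather create the table yourself instead of downloading the database, here's a minimal sketch that recreates it in SQLite from the rows shown above (the column types are assumptions):

-- recreate the elite_agent table from the rows shown above
CREATE TABLE elite_agent (
    id INTEGER PRIMARY KEY,
    city TEXT,
    gender TEXT,
    age INTEGER
);

INSERT INTO elite_agent (id, city, gender, age) VALUES
    (1, 'Lisbon', 'M', 21),
    (2, 'Chicago', 'F', 20),
    (3, 'New York', 'F', 20),
    (4, 'Chicago', 'M', 27),
    (5, 'Lisbon', 'F', 27),
    (6, 'Lisbon', 'M', 19),
    (7, 'Lisbon', 'F', 23),
    (8, 'Chicago', 'F', 24),
    (9, 'Chicago', 'M', 21);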

If we compare this table to the Chinook table we were using, we can see that they're very similar in terms of how we'll handle the data in each column:

  • country in the Chinook database is similar to city
  • name is similar to gender
  • total_purchased is similar to age

Given that, we can structure a query for this new table that's essentially the same as the problematic query we were looking at with Chinook:

SELECT city, gender,
       MAX(age) AS max_age
  FROM elite_agent
 GROUP BY city;

Code-wise, the queries are equivalent.

So what’s so wrong with them? Let’s start answering this question.

Presumably, this query’s goal is to determine the age of the oldest agent in each city. If we didn’t want the gender column, we would run the query below.

SELECT city,
       MAX(age) AS max_age
  FROM elite_agent
 GROUP BY city;

Here is the output using the SQLite engine:

city max_age
Chicago 27
Lisbon 27
New York 20

Because we grouped by city, each row represents a city. We also included the maximum age for each group.

If we include the gender, we'll reproduce the first query we saw for this table — the one that is incorrect.

Why