SQL Fundamentals

The pandas workflow is a common favorite among data analysts and data scientists. The pandas workflow works well when: the data fits in memory (a few gigabytes but not terabytes); the data is relatively static (doesn’t need to be loaded into memory every minute because the data has changed); only a single person is accessing the data (shared access to memory is difficult); security isn’t important (security is critical... »
Srini Kadamati on tutorials

Loading Data into Postgres using Python and CSVs

An introduction to Postgres with Python. Data storage is one of the most integral parts of a data system, if not the most integral. You will find hundreds of articles online detailing how to write insane SQL analysis queries, how to run complex machine learning algorithms on petabytes of training data, and how to build statistical models on thousands of rows in a database. The only problem is: no one mentions how you get the data stored in... »
Spiro Sideris on tutorials

Explore Happiness Data Using Python Pivot Tables

One of the biggest challenges when facing a new data set is knowing where to start and what to focus on. Being able to quickly summarize hundreds of rows and columns can save you a lot of time and frustration. A simple tool you can use to achieve this is a pivot table, which helps you slice, filter, and group data at the speed of inquiry and represent the information in a visually appealing way.... »
Michal Weizman on python

How to Generate FiveThirtyEight Graphs in Python

If you read data science articles, you may have already stumbled upon FiveThirtyEight’s content. Naturally, you were impressed by their awesome visualizations. You wanted to make your own awesome visualizations and so asked Quora and Reddit how to do it. You received some answers, but they were rather vague. You still can’t get the graphs done yourself. In this post, we’ll help you. Using Python’s matplotlib and pandas, we’ll see that it’s rather easy to... »
Alexandru Olteanu on tutorials

Machine Learning Fundamentals: Predicting Airbnb Prices

Machine learning is easily one of the biggest buzzwords in tech right now. Over the past three years, Google searches for “machine learning” have increased by over 350%. But understanding machine learning can be difficult: you either use pre-built packages that act like ‘black boxes’ where you pass in data and magic comes out the other end, or you have to deal with high-level maths and linear algebra. This tutorial is designed to... »
Josh Devlin on tutorials

Python Cheat Sheet for Data Science: Intermediate

The tough thing about learning data is remembering all the syntax. While at Dataquest we advocate getting used to consulting the Python documentation, sometimes it’s nice to have a handy reference, so we’ve put together this cheat sheet to help you out! This cheat sheet is the companion to our Python Basics Data Science Cheat Sheet. If you’re interested in learning Python, we have a free Python Programming:... »
Josh Devlin on resources and guides

Using pandas with large data

Tips for reducing memory usage by up to 90%. When working with pandas on small data (under 100 megabytes), performance is rarely a problem. When we move to larger data (100 megabytes to multiple gigabytes), performance issues can make run times much longer, and cause code to fail entirely due to insufficient memory. While tools like Spark can handle large data sets (100 gigabytes to multiple terabytes), taking full advantage of their capabilities usually requires... »
Josh Devlin on tutorial

Python Cheat Sheet for Data Science: Basics

It’s common when first learning Python for Data Science to have trouble remembering all the syntax that you need. While at Dataquest we advocate getting used to consulting the Python documentation, sometimes it’s nice to have a handy reference, so we’ve put together this cheat sheet to help you out! This cheat sheet is the companion to our Python Intermediate Data Science Cheat Sheet. If you’re interested in... »
Josh Devlin on resources and guides

Should I learn Python 2 or 3?

Image Credit: DigitalOcean. One of the biggest sources of confusion and misinformation for people wanting to learn Python is which version they should learn. Should I learn Python 2.x or Python 3.x? Indeed, this is one of the questions we are asked most often at Dataquest, where we teach Python as part of our Data Science curriculum. This post gives some context behind the question, explains the perspective, and tells you which version you should... »
Josh Devlin on python, data, and science

Understanding SettingWithCopyWarning in pandas

SettingWithCopyWarning is one of the most common hurdles people run into when learning pandas. A quick web search will reveal scores of Stack Overflow questions, GitHub issues and forum posts from programmers trying to wrap their heads around what this warning means in their particular situation. It’s no surprise that many struggle with this; there are so many ways to index pandas data structures, each with its own particular nuance, and even pandas itself does... »
Benjamin Pryke on python

Web Scraping with Python and BeautifulSoup

To source data for data science projects, you’ll often rely on SQL and NoSQL databases, APIs, or ready-made CSV data sets. The problem is that you can’t always find a data set on your topic, databases are not kept current, and APIs are either expensive or have usage limits. If the data you’re looking for is on a web page, however, then the solution to all these problems is web scraping. In this tutorial we’ll... »
Alexandru Olteanu on tutorials and python

Getting Started with Kaggle: House Prices Competition

Founded in 2010, Kaggle is a Data Science platform where users can share, collaborate, and compete. One key feature of Kaggle is “Competitions”, which offers users the ability to practice on real world data and to test their skills with, and against, an international community. This guide will teach you how to approach and enter a Kaggle competition, including exploring the data, creating and engineering features, building models, and submitting predictions. We’ll use Python 3... »
Adam Massachi on tutorials, python, and kaggle

How to become a data scientist

Data science is one of the most buzzed about fields right now, and data scientists are in extreme demand. And with good reason – data scientists are doing everything from creating self-driving cars to automatically captioning images. Given all the interesting applications, it makes sense that data science is a very sought-after career. Data science is applied in many fields, including in developing self-driving cars. If you’re reading this post, I’m assuming that you’d like... »
Vik Paruchuri on resources and guides

NumPy Cheat Sheet - Python for Data Science

NumPy is the library that gives Python its ability to work with data at speed. Originally launched in 1995 as ‘Numeric’, NumPy is the foundation on which many important Python data science libraries are built, including Pandas, SciPy and scikit-learn. It’s common when first learning NumPy to have trouble remembering all the functions and methods that you need, and while at Dataquest we advocate getting used to consulting... »
Josh Devlin on resources and guides

Turbocharge Your Data Acquisition using the data.world Python Library

When working with data, a key part of your workflow is finding and importing data sets. Being able to quickly locate data, understand it and combine it with other sources can be difficult. One tool to help with this is data.world, where you can search for, copy, analyze, and download data sets. In addition, you can upload your data to data.world and use it to collaborate with others. In this tutorial, we’re going to show... »
Josh Devlin on python, tutorials, and project

Building An Analytics Data Pipeline In Python

If you’ve ever wanted to work with streaming data, or data that changes quickly, you may be familiar with the concept of a data pipeline. Data pipelines allow you to transform data from one representation to another through a series of steps. Data pipelines are a key part of data engineering, which we teach in our new Data Engineer Path. A common use case for a data pipeline is figuring out information about the visitors to... »
Vik Paruchuri on python and tutorials

Pandas Cheat Sheet - Python for Data Science

Pandas is arguably the most important Python package for data science. Not only does it give you lots of methods and functions that make working with data easier, but it has been optimized for speed, which gives you a significant advantage compared with working with numeric data using Python’s built-in functions. It’s common when first learning pandas to have trouble remembering all the functions and methods that you... »
Josh Devlin on resources and guides

1 tip for effective data visualization in Python

Yes, you read correctly – this post will only give you 1 tip. I know most posts like this have 5 or more tips. I once saw a post with 15 tips, but I may have been daydreaming at the time. You’re probably wondering what makes this 1 tip so special. “Vik”, you may ask, “I’ve been reading posts that have 7 tips all day. Why should I spend the time and effort to read... »
Vik Paruchuri on tutorials

What is Data Engineering?

This is the first in a series of posts on Data Engineering. If you like this and want to know when the next post in the series is released, you can subscribe at the bottom of the page. From helping cars drive themselves to helping Facebook tag you in photos, data science has attracted a lot of buzz recently. Data scientists have become extremely sought after, and for good reason – a skilled data scientist... »
Vik Paruchuri on careers

How to present your data science portfolio on Github

This is the fifth and final post in a series of posts on how to build a Data Science Portfolio. In the previous posts in our portfolio series, we talked about how to build a storytelling project, how to create a data science blog, how to create a machine learning project, and how to construct a portfolio. In this post, we’ll discuss how to present and share your portfolio. You’ll learn how to showcase your... »
Vik Paruchuri on tutorials, python, portfolio, and project

The Six Elements of the Perfect Data Science Learning Tool

When I launched Dataquest a little under two years ago, one of the first things I did was write a blog post about why. At the time, if you wanted to become a data scientist, you were confronted with dozens of courses on sites like edX or Coursera with no easy path to getting a job. I saw many promising students give up on learning data science because they got stuck in a loop of... »
Vik Paruchuri on motivation and making-of

Preparing and Cleaning Data for Machine Learning

Cleaning and preparing data is a critical first step in any machine learning project. In this blog post, Dataquest student Daniel Osei takes us through examining a dataset, selecting columns for features, exploring the data visually and then encoding the features for machine learning. After first reading about Machine Learning on Quora in 2015, Daniel became excited at the prospect of an area that could combine his love of Mathematics and Programming. After reading this... »
Josh Devlin on tutorials

How to get a data science job

You’ve done it. You just spent months learning how to analyze data and make predictions. You’re now able to go from raw data to well structured insights in a matter of hours. After all that effort, you feel like it’s time to take the next step, and get your first data science job. Unfortunately for you, this is where the process starts to get much harder. There’s no clear path to go from having data... »
Vik Paruchuri on careers

Pandas Tutorial: Data analysis with Python: Part 2

We covered a lot of ground in Part 1 of our pandas tutorial. We went from the basics of pandas DataFrames to indexing and computations. If you’re still not confident with Pandas, you might want to check out the Dataquest pandas Course. In this tutorial, we’ll dive into one of the most powerful aspects of pandas – its grouping and aggregation functionality. With this functionality, it’s dead simple to compute group summary statistics, discover patterns,... »
Vik Paruchuri on tutorials and python

Python Web Scraping Tutorial using BeautifulSoup

When performing data science tasks, it’s common to want to use data found on the internet. You’ll usually be able to access this data in CSV format, or via an Application Programming Interface (API). However, there are times when the data you want can only be accessed as part of a web page. In cases like this, you’ll want to use a technique called web scraping to get the data from the web page into a... »
Vik Paruchuri on tutorials and python

Pandas Tutorial: Data analysis with Python: Part 1

Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages, and makes importing and analyzing data much easier. Pandas builds on packages like NumPy and matplotlib to give you a single, convenient place to do most of your data analysis and visualization work. In this introduction, we’ll use Pandas to analyze data on video game reviews from IGN, a popular... »
Vik Paruchuri on tutorials and python

NumPy Tutorial: Data analysis with Python

Don’t miss our FREE NumPy cheat sheet at the bottom of this post. NumPy is a commonly used Python data analysis package. By using NumPy, you can speed up your workflow, and interface with other packages in the Python ecosystem, like scikit-learn, that use NumPy under the hood. NumPy was originally developed in the mid-2000s, and arose from an even older package called Numeric. This longevity means that almost every data analysis or machine... »
Vik Paruchuri on tutorials, python, and numpy

28 Jupyter Notebook tips, tricks and shortcuts

This post is based on a post that originally appeared on Alex Rogozhnikov’s blog, ‘Brilliantly Wrong’. We have expanded the post and will continue to do so over time. If you have a suggestion, please let us know in the comments. Thanks to Alex for graciously letting us republish his work here. Jupyter notebook, formerly known as the IPython notebook, is a flexible tool that helps you create readable analyses, as you... »
Josh Devlin on resources and guides

Working with SQLite Databases using Python and Pandas

SQLite is a database engine that makes it simple to store and work with relational data. Much like the csv format, SQLite stores data in a single file that can be easily shared with others. Most programming languages and environments have good support for working with SQLite databases. Python is no exception, and a library to access SQLite databases, called sqlite3, has been included with Python since version 2.5. In this post, we’ll walk through... »
Vik Paruchuri on tutorials, python, sqlite, and sql
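
As a rough illustration of the sqlite3 library mentioned in the excerpt above, here is a minimal sketch using only the Python standard library. The `reviews` table, its columns, and the rows are hypothetical, and the database lives in memory rather than in a single shared file.

```python
import sqlite3

# Assumption: an in-memory database stands in for a .db file on disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, invented for this sketch.
conn.execute("CREATE TABLE reviews (title TEXT, score REAL)")
conn.executemany(
    "INSERT INTO reviews (title, score) VALUES (?, ?)",
    [("Game A", 9.0), ("Game B", 7.5), ("Game C", 8.2)],
)
conn.commit()

# Parameterized query: the ? placeholder keeps values out of the SQL string.
rows = conn.execute(
    "SELECT title, score FROM reviews WHERE score > ? ORDER BY score DESC",
    (8.0,),
).fetchall()

conn.close()
print(rows)
```

Passing a file path instead of `":memory:"` to `sqlite3.connect` would store the same data in a single file on disk.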

Learn Python the right way in 5 steps

Python is an amazingly versatile programming language. You can use it to build websites, machine learning algorithms, and even autonomous drones. A huge percentage of programmers in the world use Python, and for good reason. It gives you the power to create almost anything. But – and this is a big but – you have to learn it first. Learning any programming language can be intimidating. I personally think that Python is better to learn... »
Vik Paruchuri on tutorials and python

18 places to find data sets for data science projects

This is the fifth post in a series of posts on how to build a Data Science Portfolio. You can find links to the others in this series at the bottom of the post. If you’ve ever worked on a personal data science project, you’ve probably spent a lot of time browsing the internet looking for interesting data sets to analyze. It can be fun to sift through dozens of data sets to find the... »
Vik Paruchuri on tutorials, python, portfolio, and project

Working with streaming data: Using the Twitter API to capture tweets

If you’ve done any data science or data analysis work, you’ve probably read in a csv file or connected to a database and queried rows. A typical data analysis workflow involves retrieving stored data, loading it into an analysis tool, and then exploring it. This works well when you’re dealing with historical data such as analyzing what products a customer at your online store is most likely to purchase, or whether people’s diets changed in... »
Vik Paruchuri on tutorials, python, and data

The key to building a data science portfolio that will get you a job

This is the fourth post in a series of posts on how to build a Data Science Portfolio. You can find links to the other posts in this series at the bottom of the post. In the past few posts in this series, we’ve talked about how to build a data science project that tells a story, how to build an end-to-end machine learning project, and how to set up a data science blog.... »
Vik Paruchuri on tutorials, python, portfolio, and project

How I built a Slack bot to help me find an apartment in San Francisco

I moved from Boston to the Bay Area a few months ago. Priya (my girlfriend) and I heard all sorts of horror stories about the rental market. The fact that searching for “How to find an apartment in San Francisco” on Google yields dozens of pages of advice is a good indicator that apartment hunting is a painful process. Boston is cold, but finding an apartment in SF is scary. We read that landlords hold... »
Vik Paruchuri on tutorials, python, portfolio, and project

Building a data science portfolio: Machine learning project

This is the third in a series of posts on how to build a Data Science Portfolio. You can find links to the other posts in this series at the bottom of the post. Data science companies are increasingly looking at portfolios when making hiring decisions. One of the reasons for this is that a portfolio is the best way to judge someone’s real-world skills. The good news for you is that a portfolio is... »
Vik Paruchuri on tutorials, python, data, pandas, portfolio, and scikit

Building a data science portfolio: Making a data science blog

This is the second in a series of posts on how to build a Data Science Portfolio. You can find links to the other posts in this series at the bottom of the post. Blogging can be a fantastic way to demonstrate your skills, learn topics in more depth, and build an audience. There are quite a few examples of data science and programming blogs that have helped their authors land jobs or make important... »
Vik Paruchuri on tutorials, python, matplotlib, blog, data, pandas, and portfolio

Building a data science portfolio: Storytelling with data

This is the first in a series of posts on how to build a Data Science Portfolio. You can find links to the other posts in this series at the bottom of the post. Data science companies are increasingly looking at portfolios when making hiring decisions. One of the reasons for this is that a portfolio is the best way to judge someone’s real-world skills. The good news for you is that a portfolio is... »
Vik Paruchuri on tutorials, python, matplotlib, folium, data, pandas, and portfolio

Matplotlib tutorial: Plotting tweets mentioning Trump, Clinton & Sanders

Analyzing Tweets with Pandas and Matplotlib. Python has a variety of visualization libraries, including seaborn, networkx, and vispy. Most Python visualization libraries are based wholly or partially on matplotlib, which often makes it the first resort for making simple plots, and the last resort for making plots too complex to create in other libraries. In this matplotlib tutorial, we’ll cover the basics of the library, and walk through making some intermediate visualizations. We’ll be working... »
Vik Paruchuri on tutorials, python, matplotlib, and data

How to get into the top 15 of a Kaggle competition using Python

Kaggle competitions are a fantastic way to learn data science and build your portfolio. I personally used Kaggle to learn many data science concepts. I started out with Kaggle a few months after learning programming, and later won several competitions. Doing well in a Kaggle competition requires more than just knowing machine learning algorithms. It requires the right mindset, the willingness to learn, and a lot of data exploration. Many of these aspects aren’t typically... »
Vik Paruchuri on tutorials, python, data, science, kaggle, and expedia

Python & JSON: Working with large datasets using Pandas

Working with large JSON datasets can be a pain, particularly when they are too large to fit into memory. In cases like this, a combination of command line tools and Python can make for an efficient way to explore and analyze the data. In this post, we’ll look at how to leverage tools like Pandas to explore and map out police activity in Montgomery County, Maryland. We’ll start with a look at the JSON data,... »
Vik Paruchuri on tutorials, python, data, science, and pandas

Python for data science: Getting started

Python is becoming an increasingly popular language for data science, and with good reason. It’s easy to learn, has powerful data science libraries, and integrates well with databases and tools like Hadoop and Spark. With Python, we can perform the full lifecycle of data science projects, including reading data in, analyzing data, visualizing data, and making predictions with machine learning. In this post, we’ll walk through getting started with Python for data science. If you... »
Vik Paruchuri on tutorials, python, data, science, and pandas

DigitalOcean & Docker for Data Science

Creating a cloud-based data science environment for faster analysis. There are times when working on data science problems with your local machine just doesn’t cut it anymore. Maybe your computer is old, and can’t work with larger datasets. Or maybe you want to be able to access your work from anywhere, and collaborate with others. Or maybe you have an analysis that will take a long time to run, and you don’t want to tie... »
Vik Paruchuri on docker, data, python, introduction, and intro

Docker: Data Science Environment with Jupyter

Configuring a data science environment can be a pain. Dealing with inconsistent package versions, having to dive through obscure error messages, and having to wait hours for packages to compile can be frustrating. This makes it hard to get started with data science in the first place, and is a completely arbitrary barrier to entry. The past few years have seen the rise of technologies that help with this by creating isolated environments. We’ll be... »
Vik Paruchuri on docker, data, python, introduction, and intro

Python data visualization: Comparing 7 tools

The Python scientific stack is fairly mature, and there are libraries for a variety of use cases, including machine learning and data analysis. Data visualization is an important part of being able to explore data and communicate results, but has lagged a bit behind other tools such as R in the past. Luckily, many new Python data visualization libraries have been created in the past few years to close the gap. matplotlib has emerged as... »
Vik Paruchuri on data-visualization, data, python, introduction, and intro

Data Scientist Interview: Benjamin Root

Overview: At Dataquest, we strive to help our users get a better sense of how data science works in industry as part of the data science educational process. We’ve started a series where we interview experienced data scientists. We highlight their stories, advice they have for budding data scientists, and the kinds of problems they’ve worked on. This is our second post in this series and is an interview with data scientist and engineer Benjamin... »
Srini Kadamati on interview, professional, matplotlib, python, and dataviz

PySpark: How to install and Integrate with the Jupyter Notebook

At Dataquest, we’ve released an interactive course on Spark, with a focus on PySpark. We explore the fundamentals of Map-Reduce and how to utilize PySpark to clean, transform, and munge data. In this post, we’ll dive into how to install PySpark locally on your own computer and how to integrate it into the Jupyter Notebook workflow. Some familiarity with the command line will be necessary to complete the installation. Overview: At a high level, these... »
Srini Kadamati on tutorials

Machine learning with Python: A Tutorial

Machine learning is a field that uses algorithms to learn from data and make predictions. Practically, this means that we can feed data into an algorithm, and use it to make predictions about what might happen in the future. This has a vast range of applications, from self-driving cars to stock price prediction. Not only is machine learning interesting, it’s also starting to be widely used, making it an extremely practical skill to learn. In... »
Vik Paruchuri on tutorials

Python vs R: head to head data analysis

Which is better for data analysis? There have been dozens of articles written comparing Python and R from a subjective standpoint. We’ll add our own views at some point, but this article aims to look at the languages more objectively. We’ll analyze a dataset side by side in Python and R, and show what code is needed in both languages to achieve the same result. This will let us understand the strengths and weaknesses of... »
Vik Paruchuri on comparison

Python API tutorial - An Introduction to using APIs

Application Programming Interfaces, or APIs, are commonly used to retrieve data from remote websites. Sites like Reddit, Twitter, and Facebook all offer certain data through their APIs. To use an API, you make a request to a remote web server, and retrieve the data you need. But why use an API instead of a static dataset you can download? APIs are useful in the following cases: The data is changing quickly. An example of this... »
Vik Paruchuri on tutorials

Data Cleaning with Python - MoMA's Artwork Collection

Art is a messy business. Over centuries, artists have created everything from simple paintings to complex sculptures, and art historians have been cataloging everything they can along the way. The Museum of Modern Art, or MoMA for short, is considered one of the most influential museums in the world and recently released a dataset of all the artworks they’ve cataloged in their collection. This dataset contains basic information on metadata for each artwork and is... »
Srini Kadamati on tutorials

K nearest neighbors in python: A tutorial

In this post, we’ll be using the K-nearest neighbors algorithm to predict how many points NBA players scored in the 2013-2014 season. Along the way, we’ll learn about Euclidean distance and figure out which NBA players are the most similar to LeBron James. If you want to follow along, you can grab the dataset in CSV format here. A look at the data: Before we dive into the algorithm, let’s take a look at our... »
Vik Paruchuri on tutorials
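
To make the idea of Euclidean distance between players concrete, here is a minimal sketch in plain Python. It is not the post's own code: the player names, the two-number stat vectors, and the helper names `euclidean_distance` and `nearest` are all hypothetical, chosen only for illustration.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(name, players, k=1):
    """Return the k other players whose stat vectors are closest to `name`."""
    target = players[name]
    others = [p for p in players if p != name]
    others.sort(key=lambda p: euclidean_distance(players[p], target))
    return others[:k]


# Hypothetical stat vectors (made-up numbers, not real NBA data).
players = {
    "player_a": [10.0, 5.0],
    "player_b": [11.0, 5.0],
    "player_c": [30.0, 2.0],
}

print(euclidean_distance([0, 0], [3, 4]))
print(nearest("player_a", players, k=1))
```

In practice the columns would usually be normalized first, so that a stat with a large numeric range does not dominate the distance.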

How to actually learn data science

It’s an exciting time for data science. The field is new, but growing quickly. There’s huge demand for data scientists – average compensation in SF is well north of 100 thousand dollars a year. Where there’s money, there are also people trying to earn it. The data science skills gap means that many people are learning or trying to learn data science. The first step to learning data science is usually asking “how do I... »
Vik Paruchuri on guides

Natural Language Processing with Python Tutorial

Predicting Hacker News upvotes using headlines. Python has some powerful tools that enable you to do natural language processing (NLP). In this tutorial, we’ll learn about how to do some basic NLP in Python. Looking at the data: We’ll be looking at a dataset consisting of submissions to Hacker News from 2006 to 2015. The data was taken from here. Arnaud Drizard used the Hacker News API to scrape it. We’ve sampled 10,000 rows from... »
Vik Paruchuri on tutorials

Python Counter Class and Probability Mass Functions

The Python Counter Class: The Counter class in Python is part of the collections module. Counter provides a fast way to count how many times each unique item appears in a list. The Counter class can also be extended to represent probability mass functions and suites of Bayesian hypotheses. A counter is a map from values to their frequencies. If you initialize a counter with a string, you get a map from each letter... »
Vik Paruchuri on tutorials
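
The Counter behaviour described in the excerpt above can be sketched with the standard library alone. The input string and the `pmf` dictionary are illustrative assumptions; the post itself extends the class, whereas this sketch simply normalizes the counts.

```python
from collections import Counter

# Initializing a Counter with a string maps each character to its frequency.
letters = Counter("banana")
print(letters)

# A simple probability mass function: divide each count by the total.
# (Assumption: a plain dict is used here instead of a Counter subclass.)
total = sum(letters.values())
pmf = {value: count / total for value, count in letters.items()}
print(pmf)
```

Looking up a value that was never counted, such as `letters["z"]`, returns 0 rather than raising a `KeyError`.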

Naive bayes: Predicting movie review sentiment

Sentiment analysis is a field dedicated to extracting subjective emotions and feelings from text. One common use of sentiment analysis is to figure out if a text expresses negative or positive feelings. Written reviews are great datasets for doing sentiment analysis, because they often come with a score that can be used to train an algorithm. Naive Bayes is a popular algorithm for classifying text. Although it is fairly simple, it often performs as well... »
Vik Paruchuri on tutorials

k-means clustering US Senators

Clustering is a powerful way to split up datasets into groups based on similarity. A very popular clustering algorithm is k-means clustering. In k-means clustering, we divide data up into a fixed number of clusters while trying to ensure that the items in each cluster are as similar as possible. In this post, we’ll cluster US Senators using an interactive Python environment. We’ll use the voting history from the 114th Congress to split Senators... »