Category Archives for "Data Science Tutorials"

Using Linear Regression for Predictive Modeling in R

Predictive models are extremely useful for forecasting future outcomes and estimating metrics that are impractical to measure, and they are a practical thing to build when learning the R language. For example, data scientists could use predictive models to forecast crop yields based on rainfall and temperature, or to determine whether patients with certain traits are more likely to react badly to a new […]

Using Box Plots to Explore Women’s Height Data

I’ve recently been working on the Digital Panopticon, a digital history project that has brought together (and created) massive amounts of data about British prisoners and convicts in the long 19th century, including several datasets which include heights for women. Adult height is strongly influenced by environmental factors in childhood, one of the most important […]

Visualizing Women’s Marches: Part 2

This post is the second in a series on visualizing the Women’s Marches from January 2017. In the first post, we explored the intensive data collection and data cleaning process necessary to produce clean pandas dataframes. Data Enrichment: Because we eventually want to be able to build maps visualizing the marches, we need latitude and […]

Exploring Women’s Army Auxiliary Corps Data

Today I want to go on an excursion in “catalogues as data”. The UK National Archives’ Discovery catalogue is an excellent resource for this activity, because a) it has a lot of records that have document descriptions at ‘item’ or ‘piece’ level in the catalogue, containing quite structured information (like dates, places, occupations) that can […]

Visualizing Women’s Marches: Part 1

In celebration of Women’s History Month, I wanted to better understand the scale of the Women’s Marches that occurred in January 2017. Shortly after the marches, Vox published a map visualizing the estimated turnout across the entire country. This map is excellent at displaying: locations with the highest relative turnouts; hubs and clusters of where […]

Data Science Terms and Jargon: A Glossary

Getting started in data science can be overwhelming, especially when you consider the variety of concepts and techniques a data scientist needs to master in order to do her job effectively. Even the term “data science” can be somewhat nebulous, and as the field gains popularity it seems to lose definition. To help those new […]

Introduction to AWS for Data Scientists

These days, many businesses use cloud-based services; as a result, various companies have started building and providing such services. Amazon began the trend with Amazon Web Services (AWS). While AWS began in 2006 as a side business, it now makes $14.5 billion in revenue each year. Other leaders in this area include: Google: Google Cloud […]

Introduction to Python Ensembles

Stacking models in Python efficiently. Ensembles have rapidly become one of the most popular methods in applied machine learning. Virtually every winning Kaggle solution features them, and many data science pipelines have ensembles in them. Put simply, ensembles combine predictions from different models to generate a final prediction, and the more models we […]
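
As a rough illustration of the idea in the excerpt above (combining predictions from different models into a final prediction), here is a minimal sketch that simply averages the outputs of several models. The three toy "models" and the helper name `average_ensemble` are hypothetical and chosen only for this example; they are not taken from the linked post, which covers stacking in more depth.

```python
from statistics import mean


def average_ensemble(models, x):
    """Return the mean of the predictions each model makes for input x."""
    return mean(model(x) for model in models)


# Three hypothetical toy "models": each maps an input to a numeric prediction.
def model_a(x):
    return 2.0 * x


def model_b(x):
    return 2.0 * x + 1.0


def model_c(x):
    return 2.0 * x - 1.0


# Combine the three predictions for a single input into one final prediction.
prediction = average_ensemble([model_a, model_b, model_c], 3.0)
print(prediction)
```

Plain averaging is the simplest way to combine models; stacking, as named in the post's subtitle, instead trains another model on the base models' predictions.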

Adding Axis Labels to Plots With pandas

Pandas plotting methods provide an easy way to plot pandas objects. Often, though, you’d like to add axis labels, which involves understanding the intricacies of Matplotlib syntax. Thankfully, there’s a way to do this entirely using pandas. Let’s start by importing the required libraries: import pandas as pd import numpy as np import matplotlib.pyplot as […]

Setting Up the PyData Stack on Windows

The speed of modern electronic devices allows us to crunch large amounts of data at home. However, these devices require the right software in order to reach peak performance. Luckily, it’s now easier than ever to set up your own data science environment. One of the most popular stacks for data science is PyData, a […]

Kaggle Fundamentals: The Titanic Competition

Kaggle is a site where people create algorithms and compete against machine learning practitioners around the world. Your algorithm wins the competition if it’s the most accurate on a particular data set. Kaggle is a fun way to practice your machine learning skills. This tutorial is based on part of our free, four-part course: Kaggle […]

SQL Fundamentals

The pandas workflow is a common favorite among data analysts and data scientists. The workflow looks something like this: The pandas workflow works well when: the data fits in memory (a few gigabytes but not terabytes); the data is relatively static (doesn’t need to be loaded into memory every minute because the data has changed) […]