# Getting Started with Kaggle: House Prices Competition

Founded in 2010, Kaggle is a Data Science platform where users can share, collaborate, and compete. One key feature of Kaggle is “Competitions”, which offers users the ability to practice on real-world data and to test their skills with, and against, an international community.

This guide will teach you how to approach and enter a Kaggle competition, including exploring the data, creating and engineering features, building models, and submitting predictions. We’ll use Python 3 and Jupyter Notebook.

## The Competition

We’ll work through the House Prices: Advanced Regression Techniques competition.

We’ll follow these steps to a successful Kaggle Competition submission:

• Acquire the data
• Explore the data
• Engineer and transform the features and the target variable
• Build a model
• Make and submit predictions

## Step 1: Acquire the data and create our environment

We need to acquire the data for the competition. The descriptions of the features and some other helpful information are contained in a file with an obvious name, data_description.txt.

Download the data and save it into a folder where you’ll keep everything you need for the competition.

We will first look at the train.csv data. After we’ve trained a model, we’ll make predictions using the test.csv data.

First, import Pandas, a fantastic library for working with data in Python. Next, we’ll import NumPy.


import pandas as pd
import numpy as np

We can use Pandas to read in CSV files. The pd.read_csv() function creates a DataFrame from a CSV file.


train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
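As a small, self-contained sketch of what pd.read_csv() returns, the snippet below parses a few rows from an in-memory string instead of train.csv. The four-column layout and the names `sample_csv` and `sample` are illustrative assumptions, not part of the competition files.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the real train.csv has many more rows and columns.
sample_csv = io.StringIO(
    "Id,MSZoning,LotArea,SalePrice\n"
    "1,RL,8450,208500\n"
    "2,RL,9600,181500\n"
    "3,RL,11250,223500\n"
)

# read_csv accepts a file path or any file-like object and returns a DataFrame.
sample = pd.read_csv(sample_csv)

print(type(sample).__name__)
print(sample.shape)
```

Passing a path such as 'train.csv' works the same way, provided the file sits in the notebook's working directory.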

Let’s check out the size of the data.


print("Train data shape:", train.shape)
print("Test data shape:", test.shape)

Train data shape: (1460, 81)
Test data shape: (1459, 80)

We see that test has only 80 columns, while train has 81. This is because the test data do not include the final sale price, which is the value we are asked to predict.
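One way to confirm which column is absent is to compare the column labels of the two DataFrames. The sketch below uses two toy frames (`toy_train` and `toy_test` are made-up stand-ins, not the real data), but the same call should work on train and test.

```python
import pandas as pd

# Toy stand-ins for train and test; only the column layout matters here.
toy_train = pd.DataFrame(
    {"Id": [1, 2], "LotArea": [8450, 9600], "SalePrice": [208500, 181500]}
)
toy_test = pd.DataFrame({"Id": [3, 4], "LotArea": [11250, 9550]})

# Index.difference returns the labels present in the first index but not the second.
missing_from_test = toy_train.columns.difference(toy_test.columns)

print(list(missing_from_test))
```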

Next, we’ll look at a few rows using the DataFrame.head() method.

train.head()

|   | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | SalePrice |
|---|----|------------|----------|-------------|---------|--------|-------|----------|-------------|-----------|-----------|-----|-----------|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | ... | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | ... | 250000 |

5 rows × 81 columns
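With 81 columns, pandas may abbreviate the printed output and hide the middle columns. The sketch below shows one way to lift that limit temporarily with pd.option_context. The 30-column frame `wide` is a made-up example, used only because it is wide enough to be abbreviated when the column limit is 20.

```python
import pandas as pd

# Hypothetical frame with 30 columns, wide enough to be abbreviated at a 20-column limit.
wide = pd.DataFrame([range(30)], columns=[f"col{i}" for i in range(30)])

# Inside this block there is no column limit, so every column is rendered.
with pd.option_context("display.max_columns", None):
    full_repr = repr(wide.head())

print("col15" in full_repr)
```

To change the setting for the whole notebook session instead, pd.set_option("display.max_columns", None) applies the same option globally.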

We should have the data dictionary, data_description.txt, available in our folder for the competition. It is also listed with the other data files on the competition page.

Here’s a brief version of what you’ll find in the data description file:

• SalePrice — the property’s sale price in dollars. This is the target variable that you’re trying to predict.
• MSSubClass — The building class
• MSZoning — The general zoning classification
• LotFrontage — Linear feet of street connected to property
• LotArea — Lot size in square feet
• Street — Type of road access
• Alley — Type of alley access
• LotShape — General shape of property
• LandContour — Flatness of the property
• Utilities — Type of utilities available
• LotConfig — Lot configuration

And so on.

The competition challenges you to predict the final price of each home. At this point, we should start to think about what we know about housing prices and about Ames, Iowa, and what we might expect to see in this dataset.

Looking at the data, we see features we expected, like YrSold (the year the home was last sold) and SalePrice. Others we might not have anticipated, such as LandSlope (the slope of the land the home is built upon) and RoofMatl (the materials used to construct the roof). Later, we’ll have to make decisions about how we’ll approach these and other features.
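A first step toward those decisions is to separate numeric features from categorical ones, since the two usually need different handling. The sketch below does this with DataFrame.select_dtypes on a toy frame. The frame `toy` holds only four of the columns mentioned above, and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with two numeric and two text columns named after real features.
toy = pd.DataFrame(
    {
        "YrSold": [2008, 2007, 2008],
        "SalePrice": [208500, 181500, 223500],
        "LandSlope": ["Gtl", "Gtl", "Gtl"],
        "RoofMatl": ["CompShg", "CompShg", "CompShg"],
    }
)

# Columns with a numeric dtype.
numeric_cols = toy.select_dtypes(include=[np.number]).columns.tolist()
# Everything else (here, the text columns).
categorical_cols = toy.select_dtypes(exclude=[np.number]).columns.tolist()

print(numeric_cols)
print(categorical_cols)
```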

We want to do some plotting during the exploration stage of our project, and we’ll need to import that functionality into our environment as well. Plotting allows us to visualize the distribution of the data, check for outliers, and see other patterns that we might miss otherwise. We’ll use Matplotlib, a popular visualization library.


import matplotlib.pyplot as plt
plt.style.use(style='ggplot')
plt.rcParams['figure.figsize'] = (10, 6)
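As a quick illustration of the kind of plot this setup is meant for, the sketch below draws a histogram of synthetic, right-skewed values. The numbers come from a random generator and are not the competition data. The lognormal parameters and the name `fake_prices` are arbitrary assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed values standing in for prices (not real data).
fake_prices = rng.lognormal(mean=12.0, sigma=0.4, size=500)

fig, ax = plt.subplots()
# hist returns the per-bin counts, the bin edges, and the drawn patches.
counts, bin_edges, _ = ax.hist(fake_prices, bins=30)
ax.set_xlabel("Price")
ax.set_ylabel("Number of homes")
ax.set_title("Histogram of synthetic prices")
```

In a Jupyter Notebook the figure is typically rendered when the cell finishes; in a plain script, call plt.show() afterwards.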

## Step 2: Explore the data and engineer features

The challenge is to predict the final sale price of the homes. This information is stored in the SalePrice column. The value we are trying to predict is often called the target variable.

We can use Series.describe() to get more information.

train.SalePrice.describe()

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

Series.describe() summarizes any series. count displays the number of non-null values in the series. For numerical data, Series.describe() also gives the mean, std, min, and max values, along with the 25%, 50%, and 75% quantiles.
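One detail worth checking is that count tallies only non-missing values, so it can be smaller than the length of the series. The sketch below shows this on a five-element toy series with one missing value inserted on purpose. The series `prices` is illustrative and is not the real SalePrice column.

```python
import numpy as np
import pandas as pd

# Toy series with one deliberately missing value.
prices = pd.Series([208500, 181500, np.nan, 140000, 250000], name="SalePrice")

summary = prices.describe()

print(len(prices))       # length of the series, missing value included
print(summary["count"])  # number of non-missing values
print(summary["50%"])    # median of the non-missing values
```

The result of describe() is itself a Series, so individual statistics can be read by label, as with summary["50%"] above.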