May 3, 2016

How to Get Into the Top 15 of a Kaggle Competition Using Python

Kaggle competitions are a fantastic way to learn data science and build your portfolio. I personally used Kaggle to learn many data science concepts. I started out with Kaggle a few months after learning programming, and later won several competitions. Doing well in a Kaggle competition requires more than just knowing machine learning algorithms. It requires the right mindset, the willingness to learn, and a lot of data exploration. Many of these aspects aren't typically emphasized in tutorials on getting started with Kaggle, though. In this post, I'll cover how to get started with the Kaggle Expedia hotel recommendations competition, including establishing the right mindset, setting up testing infrastructure, exploring the data, creating features, and making predictions. At the end, we'll generate a submission file using the techniques in this post. As of this writing, the submission would rank in the top 15.

Where this submission would rank as of this writing.

The Expedia Kaggle competition

The Expedia competition challenges you with predicting what hotel a user will book based on some attributes about the search the user is conducting on Expedia. Before we dive into any coding, we'll need to put in time to understand both the problem and the data.

A quick glance at the columns

The first step is to look at the description of the columns of the dataset. You can find that here. Towards the bottom of the page, you'll see a description of each column in the data. Looking over this, it appears that we have quite a bit of data about the searches users are conducting on Expedia, along with data on what hotel cluster they eventually booked in test.csv and train.csv. destinations.csv contains information about the regions users search in for hotels. We won't worry about what we're predicting just yet; for now, we'll focus on understanding the columns.

Expedia

Since the competition consists of event data from users booking hotels on Expedia, we'll need to spend some time understanding the Expedia site. Looking at the booking flow will help us contextualize the fields in the data, and how they tie into using Expedia.

The page you initially see when booking a hotel.

The box labeled Going To maps to the srch_destination_type_id, hotel_continent, hotel_country, and hotel_market fields in the data. The box labeled Check-in maps to the srch_ci field, and the box labeled Check out maps to the srch_co field. The box labeled Guests maps to the srch_adults_cnt, srch_children_cnt, and srch_rm_cnt fields. The box labeled Add a Flight maps to the is_package field. site_name is the name of the site you visited, whether it be the main Expedia.com site or another. user_location_country, user_location_region, user_location_city, is_mobile, channel, is_booking, and cnt are all attributes determined by where the user is, what their device is, or their session on the Expedia site. Just by looking at one screen, we can immediately contextualize all the variables. Playing around with the screen, filling in values, and going through the booking process can help contextualize them further.

Exploring the Kaggle data in Python

Now that we have a handle on the data at a high level, we can do some exploration to take a deeper look.

Downloading the data

You can download the data here. The datasets are fairly large, so you'll need a good amount of disk space. You'll need to unzip the files to get raw .csv files instead of .csv.gz.

Exploring the data with Pandas

Depending on the amount of memory on your system, it may or may not be feasible to read all the data in. If it isn't, you should consider creating a machine on EC2 or DigitalOcean to process the data with.

Here's a tutorial on how to get started with that. Once we download the data, we can read it in using Pandas:


import pandas as pd
destinations = pd.read_csv("destinations.csv")
test = pd.read_csv("test.csv")
train = pd.read_csv("train.csv")

Let's first look at how much data there is:

train.shape
(37670293, 24)
test.shape
(2528243, 22)

We have about 37.7 million training set rows and 2.5 million testing set rows, which will make this problem a bit challenging to work with. We can explore the first few rows of the data:

train.head(5)
date_time site_name posa_continent user_location_country user_location_region user_location_city orig_destination_distance user_id is_mobile is_package ... srch_children_cnt srch_rm_cnt srch_destination_id srch_destination_type_id is_booking cnt hotel_continent hotel_country hotel_market hotel_cluster
0 2014-08-11 07:46:59 2 3 66 348 48862 2234.2641 12 0 1 ... 0 1 8250 1 0 3 2 50 628 1
1 2014-08-11 08:22:12 2 3 66 348 48862 2234.2641 12 0 1 ... 0 1 8250 1 1 1 2 50 628 1
2 2014-08-11 08:24:33 2 3 66 348 48862 2234.2641 12 0 0 ... 0 1 8250 1 0 1 2 50 628 1
3 2014-08-09 18:05:16 2 3 66 442 35390 913.1932 93 0 0 ... 0 1 14984 1 0 1 2 50 1457 80
4 2014-08-09 18:08:18 2 3 66 442 35390 913.6259 93 0 0 ... 0 1 14984 1 0 1 2 50 1457 21

There are a few things that immediately stick out:

  • date_time could be useful in our predictions, so we'll need to convert it.
  • Most of the columns are integers or floats, so we can't do a lot of feature engineering. For example, user_location_country isn't the name of a country, it's an integer. This makes it harder to create new features, because we don't know exactly what each value means.
test.head(5)
id date_time site_name posa_continent user_location_country user_location_region user_location_city orig_destination_distance user_id is_mobile ... srch_ci srch_co srch_adults_cnt srch_children_cnt srch_rm_cnt srch_destination_id srch_destination_type_id hotel_continent hotel_country hotel_market
0 0 2015-09-03 17:09:54 2 3 66 174 37449 5539.0567 1 1 ... 2016-05-19 2016-05-23 2 0 1 12243 6 6 204 27
1 1 2015-09-24 17:38:35 2 3 66 174 37449 5873.2923 1 1 ... 2016-05-12 2016-05-15 2 0 1 14474 7 6 204 1540
2 2 2015-06-07 15:53:02 2 3 66 142 17440 3975.9776 20 0 ... 2015-07-26 2015-07-27 4 0 1 11353 1 2 50 699
3 3 2015-09-14 14:49:10 2 3 66 258 34156 1508.5975 28 0 ... 2015-09-14 2015-09-16 2 0 1 8250 1 2 50 628
4 4 2015-07-17 09:32:04 2 3 66 467 36345 66.7913 50 0 ... 2015-07-22 2015-07-23 2 0 1 11812 1 2 50 538

There are a few things we can take away from looking at test.csv:

  • It looks like all the dates in test.csv are later than the dates in train.csv, and the data page confirms this. The testing set contains dates from 2015, and the training set contains dates from 2013 and 2014.
  • It looks like the user ids in test.csv are a subset of the user ids in train.csv, given the overlapping integer ranges. We can confirm this later on.
  • The is_booking column always looks to be 1 in test.csv. The data page confirms this.

Figuring out what to predict

What we're predicting

We'll be predicting which hotel_cluster a user will book after a given search. According to the description, there are 100 clusters in total.

How we'll be scored

The evaluation page says that we'll be scored using Mean Average Precision @ 5, which means that we'll need to make 5 cluster predictions for each row, and will be scored on whether or not the correct prediction appears in our list. If the correct prediction comes earlier in the list, we get more points. For example, if the "correct" cluster is 3, and we predict 4, 43, 60, 3, 20, our score will be lower than if we predict 3, 4, 43, 60, 20. We should put predictions we're more certain about earlier in our list of predictions.
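
To make the scoring concrete, here is a minimal pure-Python sketch of average precision at 5 for the case where each row has exactly one correct cluster, as in this competition. The names apk_single and mapk_single are made up for illustration and are not part of any library; the official scorer may differ in details, so treat this as a way to build intuition only.

```python
def apk_single(actual, predicted, k=5):
    """Average precision at k when each row has exactly one correct label.

    The score is 1 / (rank of the first correct guess), or 0.0 if the
    correct label is missing from the first k predictions.
    """
    for rank, guess in enumerate(predicted[:k], start=1):
        if guess == actual:
            return 1.0 / rank
    return 0.0


def mapk_single(actuals, predictions, k=5):
    """Mean of apk_single over all rows."""
    scores = [apk_single(a, p, k) for a, p in zip(actuals, predictions)]
    return sum(scores) / len(scores)


late = apk_single(3, [4, 43, 60, 3, 20])   # correct cluster in 4th position
early = apk_single(3, [3, 4, 43, 60, 20])  # correct cluster in 1st position
print(late, early)  # 0.25 1.0
```

Moving the correct cluster from fourth to first place quadruples the score for that row, which is why the ordering of the five predictions matters as much as which clusters are included.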

Exploring hotel clusters

Now that we know what we're predicting, it's time to dive in and explore hotel_cluster. We can use the value_counts method on Series to do this:

train["hotel_cluster"].value_counts()

91    1043720
41     772743
48     754033
64     704734
65     670960
5      620194
       ...
53     134812
88     107784
27     105040
74      48355

The output above is truncated, but it shows that the number of rows in each cluster is fairly evenly distributed. There doesn't appear to be any relationship between cluster number and the number of rows.

Exploring train and test user ids

Finally, we'll confirm our hypothesis that all the test user ids are found in the train DataFrame. We can do this by finding the unique values for user_id in test, and seeing if they all exist in train. In the code below, we'll:

  • Create a set of all the unique test user ids.
  • Create a set of all the unique train user ids.
  • Figure out how many test user ids are in the train user ids.
  • See if the count matches the total number of test user ids.

test_ids = set(test.user_id.unique())
train_ids = set(train.user_id.unique())
intersection_count = len(test_ids & train_ids)
intersection_count == len(test_ids)
True

Looks like our hypothesis is correct, which will make working with this data much easier!
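
The same check can be written with the built-in set operations directly. The sketch below uses small made-up id sets rather than the real data, and also shows how to list any ids that are missing, which is handy if the check ever comes back False.

```python
# Toy stand-ins for the real id sets; the values are made up.
test_ids = {12, 93, 501}
train_ids = {12, 40, 93, 501, 777}

# issubset is True only when every test id also appears in train.
all_present = test_ids.issubset(train_ids)

# If the check ever fails, the set difference lists the offending ids.
missing = sorted(test_ids - train_ids)
print(all_present, missing)  # True []
```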

Downsampling our Kaggle data

The entire train.csv dataset contains 37 million rows, which makes it hard to experiment with different techniques. Ideally, we want a small enough dataset that lets us quickly iterate through different approaches but is still representative of the whole training data. We can do this by first randomly sampling rows from our data, then selecting new training and testing datasets from train.csv. By selecting both sets from train.csv, we'll have the true hotel_cluster label for every row, and we'll be able to calculate our accuracy as we test techniques.

Add in times and dates

The first step is to add month and year fields to train. Because the train and test data is differentiated by date, we'll need to add date fields to allow us to segment our data into two sets the same way. If we add year and month fields, we can split our data into training and testing sets using them. The code below will:

  • Convert the date_time column in train from an object to a datetime value. This makes it easier to work with as a date.
  • Extract the year and month from date_time, and assign them to their own columns.

train["date_time"] = pd.to_datetime(train["date_time"])
train["year"] = train["date_time"].dt.year
train["month"] = train["date_time"].dt.month

Pick 10000 users

Because the user ids in test are a subset of the user ids in train, we'll need to do our random sampling in a way that preserves the full data of each user. We can accomplish this by selecting a certain number of users randomly, then only picking rows from train where user_id is in our random sample of user ids.


import random

unique_users = train.user_id.unique()

# random.sample needs a sequence, so convert the NumPy array to a list first
sel_user_ids = random.sample(list(unique_users), 10000)
sel_train = train[train.user_id.isin(sel_user_ids)]

The above code creates a DataFrame called sel_train that only contains data from 10000 users.
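
The idea of sampling users rather than rows can be illustrated without pandas. The sketch below works on a handful of made-up rows and uses a seeded random.Random instance, which is an addition for reproducibility and not something the code above does.

```python
import random

# Toy (user_id, hotel_cluster) rows; the values are made up.
rows = [(1, 5), (1, 7), (2, 5), (3, 9), (3, 9), (3, 2), (4, 1)]

rng = random.Random(0)  # fixed seed so the same users are picked every run
unique_users = sorted({user_id for user_id, _ in rows})
sampled_users = set(rng.sample(unique_users, 2))

# Keep every row that belongs to a sampled user, so each user's
# history stays complete.
sampled_rows = [row for row in rows if row[0] in sampled_users]
print(sampled_rows)
```

Sampling rows directly would leave most users with only a fragment of their history, which would make the smaller dataset behave differently from the full one.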

Pick new training and testing sets

We'll now need to pick new training and testing sets from sel_train. We'll call these sets t1 and t2.


t1 = sel_train[((sel_train.year == 2013) | ((sel_train.year == 2014) & (sel_train.month < 8)))]
t2 = sel_train[((sel_train.year == 2014) & (sel_train.month >= 8))]

In the original train and test DataFrames, test contained data from 2015, and train contained data from 2013 and 2014. We split this data so that everything from August 2014 onward is in t2, and everything earlier is in t1. This gives us smaller training and testing sets with similar characteristics to train and test.

Remove click events

If is_booking is 0, the row represents a click, and if it is 1, the row represents a booking. test contains only booking events, so we'll need to filter t2 to contain only bookings as well.

t2 = t2[t2.is_booking == True]

A simple algorithm

The most simple technique we could try on this data is to find the most common clusters across the data, then use them as predictions. We can again use the value_counts method to help us here:

most_common_clusters = list(train.hotel_cluster.value_counts().head().index)

The above code will give us a list of the 5 most common clusters in train. This is because the head method returns the first 5 rows by default, and the index property returns the index of the Series produced by value_counts, which holds the hotel cluster ids.
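
For intuition, the same "most frequent values" idea can be sketched with the standard library's collections.Counter on a short made-up list of cluster ids. This is only an illustration of the counting step, not a replacement for value_counts on the full DataFrame.

```python
from collections import Counter

# Toy cluster labels; the values are made up.
clusters = [91, 41, 91, 48, 91, 41, 64, 65, 5, 64, 64, 64]

# most_common(n) returns (value, count) pairs sorted by count, highest first.
top_two = [cluster for cluster, count in Counter(clusters).most_common(2)]
print(top_two)  # [64, 91]
```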

Generating predictions

We can turn most_common_clusters into a list of predictions by making the same prediction for each row.

predictions = [most_common_clusters for i in range(t2.shape[0])]

This will create a list with as many elements as there are rows in t2. Each element will be equal to most_common_clusters.

Evaluating error

In order to evaluate error, we'll first need to figure out how to compute Mean Average Precision. Luckily, Ben Hamner has written an implementation that can be found here. It can be installed as part of the ml_metrics package, and you can find installation instructions here. We can compute our error metric with the mapk method in ml_metrics:


import ml_metrics as metrics
target = [[l] for l in t2["hotel_cluster"]]
metrics.mapk(target, predictions, k=5)
0.058020770920711007

Our target needs to be in list of lists format for mapk to work, so we convert the hotel_cluster column of t2 into a list of lists. Then, we call the mapk method with our target, our predictions, and the number of predictions we want to evaluate (5). Our result here isn't great, but we've just generated our first set of predictions, and evaluated our error! The framework we've built will allow us to quickly test out a variety of techniques and see how they score. We're well on our way to building a good-performing solution for the leaderboard.

Finding correlations

Before we move on to creating a better algorithm, let's see if anything correlates well with hotel_cluster. This will tell us if we should dive more into any particular columns. We can find linear correlations in the training set using the corr method:

train.corr()["hotel_cluster"]

site_name                   -0.022408
posa_continent               0.014938
user_location_country       -0.010477
user_location_region         0.007453
user_location_city           0.000831
orig_destination_distance    0.007260
user_id                      0.001052
is_mobile                    0.008412
is_package                   0.038733
channel                      0.000707

This tells us that no columns correlate linearly with hotel_cluster. This makes sense, because there is no linear ordering to hotel_cluster. For example, having a higher cluster number isn't tied to having a higher srch_destination_id. Unfortunately, this means that techniques like linear regression and logistic regression won't work well on our data, because they rely on linear correlations between predictors and targets.
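
A tiny worked example shows why arbitrary label ids defeat linear correlation. In the sketch below, the mapping from destination to cluster is hypothetical, and the cluster is completely determined by the destination, yet the Pearson correlation between the two id columns comes out as zero.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical mapping: each destination id always leads to one cluster id.
destination_to_cluster = {0: 2, 1: 0, 2: 3, 3: 1}
destination_ids = [0, 1, 2, 3]
clusters = [destination_to_cluster[d] for d in destination_ids]

print(pearson(destination_ids, clusters))  # 0.0
```

A lookup table would predict these clusters perfectly, which hints at why the grouping approaches later in the post do better than models that look for linear trends.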

Creating better predictions for our Kaggle entry

The data for this competition is quite difficult to make predictions on using machine learning for a few reasons:

  • There are millions of rows, which increases runtime and memory usage for algorithms.
  • There are 100 different clusters, and according to the competition admins, the boundaries are fairly fuzzy, so it will likely be hard to make predictions. As the number of clusters increases, classifiers generally decrease in accuracy.
  • Nothing is linearly correlated with the target (hotel_cluster), meaning we can't use fast machine learning techniques like linear regression.

For these reasons, machine learning probably won't work well on our data, but we can try an algorithm and find out.

Generating features

The first step in applying machine learning is to generate features. We can generate features using both what's available in the training data, and what's available in destinations. We haven't looked at destinations yet, so let's take a quick peek.

Generating features from destinations

Destinations contains an id that corresponds to srch_destination_id, along with 149 columns of latent information about that destination. Here's a sample:

srch_destination_id d1 d2 d3 d4 d5 d6 d7 d8 d9 ... d140 d141 d142 d143 d144 d145 d146 d147 d148 d149
0 0 -2.198657 -2.198657 -2.198657 -2.198657 -2.198657 -1.897627 -2.198657 -2.198657 -1.897627 ... -2.198657 -2.198657 -2.198657 -2.198657 -2.198657 -2.198657 -2.198657 -2.198657 -2.198657 -2.198657
1 1 -2.181690 -2.181690 -2.181690 -2.082564 -2.181690 -2.165028 -2.181690 -2.181690 -2.031597 ... -2.165028 -2.181690 -2.165028 -2.181690 -2.181690 -2.165028 -2.181690 -2.181690 -2.181690 -2.181690
2 2 -2.183490 -2.224164 -2.224164 -2.189562 -2.105819 -2.075407 -2.224164 -2.118483 -2.140393 ... -2.224164 -2.224164 -2.196379 -2.224164 -2.192009 -2.224164 -2.224164 -2.224164 -2.224164 -2.057548
3 3 -2.177409 -2.177409 -2.177409 -2.177409 -2.177409 -2.115485 -2.177409 -2.177409 -2.177409 ... -2.161081 -2.177409 -2.177409 -2.177409 -2.177409 -2.177409 -2.177409 -2.177409 -2.177409 -2.177409
4 4 -2.189562 -2.187783 -2.194008 -2.171153 -2.152303 -2.056618 -2.194008 -2.194008 -2.145911 ... -2.187356 -2.194008 -2.191779 -2.194008 -2.194008 -2.185161 -2.194008 -2.194008 -2.194008 -2.188037

The competition doesn't tell us exactly what each latent feature is, but it's safe to assume that it's some combination of destination characteristics, like name, description, and more. These latent features were converted to numbers, so they could be anonymized. We can use the destination information as features in a machine learning algorithm, but we'll need to compress the number of columns down first, to minimize runtime.

We can use PCA to do this. PCA reduces the number of columns in a matrix while trying to preserve as much of the variance in the data as possible. Ideally, PCA would compress all the information contained in all the columns into fewer columns, but in practice, some information is lost. In the code below, we:

  • Initialize a PCA model using scikit-learn.
  • Specify that we want to only have 3 columns in our data.
  • Transform the columns d1-d149 into 3 columns.

from sklearn.decomposition import PCA

pca = PCA(n_components=3)
dest_small = pca.fit_transform(destinations[["d{0}".format(i + 1) for i in range(149)]])
dest_small = pd.DataFrame(dest_small)
dest_small["srch_destination_id"] = destinations["srch_destination_id"]

The above code compresses the 149 columns in destinations down to 3 columns, and creates a new DataFrame called dest_small. We preserve most of the variance in destinations while doing this, so we don't lose a lot of information, but save a lot of runtime for a machine learning algorithm.

Generating features

Now that the preliminaries are done with, we can generate our features. We'll do the following:

  • Generate new date features based on date_time, srch_ci, and srch_co.
  • Remove non-numeric columns like date_time.
  • Add in features from dest_small.
  • Replace any missing values with -1.

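def calc_fast_features(df):
    df["date_time"] = pd.to_datetime(df["date_time"])
    # Dates look like 2016-05-19; errors="coerce" (an assumption) turns unparseable values into NaT
    df["srch_ci"] = pd.to_datetime(df["srch_ci"], format="%Y-%m-%d", errors="coerce")
    df["srch_co"] = pd.to_datetime(df["srch_co"], format="%Y-%m-%d", errors="coerce")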
    
    props = {}
    for prop in ["month", "day", "hour", "minute", "dayofweek", "quarter"]:
        props[prop] = getattr(df["date_time"].dt, prop)
    
    carryover = [p for p in df.columns if p not in ["date_time", "srch_ci", "srch_co"]]
    for prop in carryover:
        props[prop] = df[prop]
    
    date_props = ["month", "day", "dayofweek", "quarter"]
    for prop in date_props:
        props["ci_{0}".format(prop)] = getattr(df["srch_ci"].dt, prop)
        props["co_{0}".format(prop)] = getattr(df["srch_co"].dt, prop)
    props["stay_span"] = (df["srch_co"] - df["srch_ci"]).dt.total_seconds() / 3600  # length of stay in hours
        
    ret = pd.DataFrame(props)
    
    ret = ret.join(dest_small, on="srch_destination_id", how='left', rsuffix="dest")
    ret = ret.drop("srch_destination_iddest", axis=1)
    return ret

df = calc_fast_features(t1)
df.fillna(-1, inplace=True)

The above will calculate features such as length of stay, check in day, and check out month. These features will help us train a machine learning algorithm later on. Replacing missing values with -1 isn't the best choice, but it will work fine for now, and we can always optimize the behavior later on.

Machine learning

Now that we have features for our training data, we can try machine learning. We'll use 3-fold cross validation across the training set to generate a reliable error estimate. Cross validation splits the training set up into 3 parts, then predicts hotel_cluster for each part using the other parts to train with. We'll generate predictions using the Random Forest algorithm. Random forests build trees, which can fit to nonlinear tendencies in data. This will enable us to make predictions, even though none of our columns are linearly related. We'll first initialize the model and compute cross validation scores:


from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

predictors = [c for c in df.columns if c not in ["hotel_cluster"]]
clf = RandomForestClassifier(n_estimators=10, min_weight_fraction_leaf=0.1)
scores = cross_val_score(clf, df[predictors], df['hotel_cluster'], cv=3)
scores
array([ 0.06203556,  0.06233452,  0.06392277])

The above code doesn't give us very good accuracy, and confirms our original suspicion that machine learning isn't a great approach to this problem. However, classifiers tend to have lower accuracy when there is a high cluster count. We can instead try training 100 binary classifiers. Each classifier will just determine whether a row is in its cluster or not. This will entail training one classifier per label in hotel_cluster.
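
The one-classifier-per-cluster idea has two mechanical pieces: turning the multi-class label into a 0/1 target for each cluster, and turning per-cluster scores back into a ranked list. The sketch below shows both on made-up values; binary_target and top_k are hypothetical helper names, and no actual model is trained here.

```python
# Toy hotel_cluster labels for five rows; the values are made up.
labels = [91, 41, 91, 5, 41]


def binary_target(labels, cluster):
    """1 where the row belongs to `cluster`, 0 everywhere else."""
    return [1 if label == cluster else 0 for label in labels]


# One binary target per distinct cluster, which is what each of the
# per-cluster classifiers would be trained on.
targets = {cluster: binary_target(labels, cluster) for cluster in sorted(set(labels))}
print(targets[91])  # [1, 0, 1, 0, 0]


def top_k(scores_by_cluster, k=5):
    """Clusters with the k highest scores, best first."""
    ranked = sorted(scores_by_cluster, key=scores_by_cluster.get, reverse=True)
    return ranked[:k]


# Hypothetical per-cluster probabilities for a single row.
row_scores = {91: 0.30, 41: 0.10, 5: 0.45, 64: 0.05, 48: 0.02, 65: 0.08}
print(top_k(row_scores))  # [5, 91, 41, 65, 64]
```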

Binary classifiers

We'll again train Random Forests, but each forest will predict only a single hotel cluster. We'll use 2 fold cross validation for speed, and only train 10 trees per label. In the code below, we:

  • Loop across each unique hotel_cluster.
    • Train a Random Forest classifier using 2-fold cross validation.
    • Extract the probabilities from the classifier that the row is in the unique hotel_cluster
  • Combine all the probabilities.
  • For each row, find the 5 largest probabilities, and assign those hotel_cluster values as predictions.
  • Compute accuracy using mapk.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from itertools import chain

all_probs = []
unique_clusters = df["hotel_cluster"].unique()
for cluster in unique_clusters:
    # Binary target: 1 if the row belongs to this cluster, 0 otherwise
    df["target"] = (df["hotel_cluster"] == cluster).astype(int)
    predictors = [col for col in df if col not in ['hotel_cluster', "target"]]
    probs = []
    # Unshuffled folds keep the held-out rows in their original order
    cv = KFold(n_splits=2)
    clf = RandomForestClassifier(n_estimators=10, min_weight_fraction_leaf=0.1)
    for i, (tr, te) in enumerate(cv.split(df)):
        clf.fit(df[predictors].iloc[tr], df["target"].iloc[tr])
        preds = clf.predict_proba(df[predictors].iloc[te])
        # Column 1 holds the probability of the positive class
        probs.append([p[1] for p in preds])
    full_probs = chain.from_iterable(probs)
    all_probs.append(list(full_probs))

prediction_frame = pd.DataFrame(all_probs).T
prediction_frame.columns = unique_clusters
def find_top_5(row):
    return list(row.nlargest(5).index)

preds = []
for index, row in prediction_frame.iterrows():
    preds.append(find_top_5(row))

# The predictions were generated for the rows of df, so score them against df's labels
metrics.mapk([[l] for l in df["hotel_cluster"]], preds, k=5)
0.041083333333333326

Our accuracy here is worse than before, and people on the leaderboard have much better accuracy scores. We'll need to abandon machine learning and move to the next technique in order to compete. Machine learning can be a powerful technique, but it isn't always the right approach to every problem.

Top clusters based on hotel_cluster

There are a few Kaggle Kernels for the competition that involve aggregating hotel_cluster based on orig_destination_distance, or srch_destination_id. Aggregating on orig_destination_distance will exploit a data leak in the competition, and attempt to match the same user together. Aggregating on srch_destination_id will find the most popular hotel clusters for each destination. We'll then be able to predict that a user who searches for a destination is going to one of the most popular hotel clusters for that destination. Think of this as a more granular version of the most common clusters technique we used earlier. We can first generate scores for each hotel_cluster in each srch_destination_id. We'll weight bookings higher than clicks. This is because the test data is all booking data, and this is what we want to predict. We want to include click information, but downweight it to reflect this. Step by step, we'll:

  • Group t1 by srch_destination_id, and hotel_cluster.
  • Iterate through each group, and:
    • Assign 1 point to each hotel cluster where is_booking is True.
    • Assign .15 points to each hotel cluster where is_booking is False.
    • Assign the score to the srch_destination_id / hotel_cluster combination in a dictionary.

Here's the code to accomplish the above steps:


def make_key(items):
    return "_".join([str(i) for i in items])

match_cols = ["srch_destination_id"]
cluster_cols = match_cols + ['hotel_cluster']
groups = t1.groupby(cluster_cols)
top_clusters = {}
for name, group in groups:
    clicks = len(group.is_booking[group.is_booking == False])
    bookings = len(group.is_booking[group.is_booking == True])
    
    score = bookings + .15 * clicks
    
    clus_name = make_key(name[:len(match_cols)])
    if clus_name not in top_clusters:
        top_clusters[clus_name] = {}
    top_clusters[clus_name][name[-1]] = score

At the end, we'll have a dictionary where each key is an srch_destination_id. Each value in the dictionary will be another dictionary, containing hotel clusters as keys with scores as values. Here's how it looks:

{'39331': {20: 1.15, 30: 0.15, 81: 0.3},
'511': {17: 0.15, 34: 0.15, 55: 0.15, 70: 0.15}}

We'll next want to transform this dictionary to find the top 5 hotel clusters for each srch_destination_id. In order to do this, we'll:

  • Loop through each key in top_clusters.
  • Find the top 5 clusters for that key.
  • Assign the top 5 clusters to a new dictionary, cluster_dict.

Here's the code:


import operator

cluster_dict = {}
for n in top_clusters:
    tc = top_clusters[n]
    top = [l[0] for l in sorted(tc.items(), key=operator.itemgetter(1), reverse=True)[:5]]
    cluster_dict[n] = top
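
To see the scoring and ranking steps end to end, here is a small pure-Python sketch on made-up events for a single destination. It uses collections.defaultdict instead of a pandas groupby, so it illustrates the logic rather than the exact code above, and the names events, scores, and top_by_destination are made up for the example.

```python
from collections import defaultdict

# Toy (srch_destination_id, hotel_cluster, is_booking) events; values are made up.
events = [
    (39331, 20, 1),
    (39331, 20, 0),
    (39331, 30, 0),
    (39331, 81, 0),
    (39331, 81, 0),
]

CLICK_WEIGHT = 0.15  # same down-weighting of clicks as in the text

# scores[destination][cluster] accumulates 1 per booking and 0.15 per click.
scores = defaultdict(lambda: defaultdict(float))
for destination, cluster, is_booking in events:
    scores[destination][cluster] += 1.0 if is_booking else CLICK_WEIGHT

# Rank the clusters for each destination by score, best first, and keep five.
top_by_destination = {
    destination: sorted(cluster_scores, key=cluster_scores.get, reverse=True)[:5]
    for destination, cluster_scores in scores.items()
}
print(top_by_destination)  # {39331: [20, 81, 30]}
```

Cluster 20 wins because a single booking outweighs any small number of clicks, while cluster 81 edges out cluster 30 on click count alone.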

Making predictions based on destination

Once we know the top clusters for each srch_destination_id, we can quickly make predictions. To make predictions, all we have to do is:

  • Iterate through each row in t2.
    • Extract the srch_destination_id for the row.
    • Find the top clusters for that destination id.
    • Append the top clusters to preds.

Here's the code:


preds = []
for index, row in t2.iterrows():
    key = make_key([row[m] for m in match_cols])
    if key in cluster_dict:
        preds.append(cluster_dict[key])
    else:
        preds.append([])

At the end of the loop, preds will be a list of lists containing our predictions. It will look like this:

[
   [2, 25, 28, 10, 64],
   [25, 78, 64, 90, 60],
   ...
]

Calculating error

Once we have our predictions, we can compute our accuracy using the mapk function from earlier:

metrics.mapk([[l] for l in t2["hotel_cluster"]], preds, k=5)
0.22388136288998359

We're doing pretty well! We boosted our accuracy 4x over the best machine learning approach, and we did it with a far faster and simpler approach. You may have noticed that this value is quite a bit lower than accuracies on the leaderboard. Local testing results in a lower accuracy value than submitting, so this approach will actually do fairly well on the leaderboard. Differences in leaderboard score and local score can come down to a few factors:

  • Different data locally and in the hidden set that leaderboard scores are computed on. For example, we're computing error in a sample of the training set, and the leaderboard score is computed on the testing set.
  • Techniques that result in higher accuracy with more training data. We're only using a small subset of data for training, and it may be more accurate when we use the full training set.
  • Different randomization. With certain algorithms, random numbers are involved, but we're not using any of these.

Generating better predictions for your Kaggle submission

The forums are very important in Kaggle, and can often help you find nuggets of information that will let you boost your score. The Expedia competition is no exception.

This post details a data leak that allows you to match users in the testing set to users in the training set using a set of columns including user_location_country and user_location_region. We'll use the information from the post to match users from the testing set back to the training set, which will boost our score. Based on the forum thread, it's okay to do this, and the competition won't be updated as a result of the leak.

Finding matching users

The first step is to find users in the training set that match users in the testing set. In order to do this, we need to:

  • Split the training data into groups based on the match columns.
  • Loop through the testing data.
  • Create an index based on the match columns.
  • Get any matches between the testing data and the training data using the groups.

Here's the code to accomplish this:

match_cols = ['user_location_country', 'user_location_region', 'user_location_city', 'hotel_market', 'orig_destination_distance']

groups = t1.groupby(match_cols)
    
def generate_exact_matches(row, match_cols):
    index = tuple([row[t] for t in match_cols])
    try:
        group = groups.get_group(index)
    except Exception:
        return []
    clus = list(set(group.hotel_cluster))
    return clus

exact_matches = []
for i in range(t2.shape[0]):
    exact_matches.append(generate_exact_matches(t2.iloc[i], match_cols))

At the end of this loop, we'll have a list of lists that contain any exact matches between the training and the testing sets. However, there aren't that many matches. To accurately evaluate error, we'll have to combine these predictions with our earlier predictions. Otherwise, we'll get a very low accuracy value, because most rows have empty lists for predictions.
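
The matching step is essentially a dictionary lookup keyed on the match columns. The sketch below shows that idea in pure Python on made-up rows; clusters_by_key and exact_match are hypothetical names, and this is an alternative illustration rather than the groupby-based code above.

```python
from collections import defaultdict

# Toy training rows: (country, region, city, market, distance, hotel_cluster).
# All values are made up.
train_rows = [
    (66, 348, 48862, 628, 2234.2641, 1),
    (66, 348, 48862, 628, 2234.2641, 1),
    (66, 442, 35390, 1457, 913.1932, 80),
]

# Build the lookup once: match-column tuple -> set of clusters seen in training.
clusters_by_key = defaultdict(set)
for *key, cluster in train_rows:
    clusters_by_key[tuple(key)].add(cluster)


def exact_match(key):
    """Clusters from training rows that share the same key, if any."""
    return sorted(clusters_by_key.get(tuple(key), ()))


print(exact_match((66, 442, 35390, 1457, 913.1932)))  # [80]
print(exact_match((66, 442, 35390, 1457, 1.0)))       # []
```

Building the lookup once and then doing constant-time lookups per test row avoids repeated group retrieval, which may matter when the loop runs over millions of rows.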

Combining predictions

We can combine different lists of predictions to boost accuracy. Doing so will also help us see how good our exact match strategy is. To do this, we'll have to:

  • Combine exact_matches, preds, and most_common_clusters.
  • Only take the unique predictions, in sequential order, using the f5 function from here.
  • Ensure we have a maximum of 5 predictions for each row in the testing set.

Here's how we can do it:


def f5(seq, idfun=None): 
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        if marker in seen: continue
        seen[marker] = 1
        result.append(item)
    return result
    
full_preds = [f5(exact_matches[p] + preds[p] + most_common_clusters)[:5] for p in range(len(preds))]
metrics.mapk([[l] for l in t2["hotel_cluster"]], full_preds, k=5)
0.28400041050903119

This is looking quite good in terms of error -- we improved dramatically from earlier! We could keep going and making more small improvements, but we're probably ready to submit now.
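
The merge step is easier to follow on a single row. The sketch below uses made-up predictions and a hypothetical combine helper that relies on dict.fromkeys to drop repeats while keeping first-seen order; it is an alternative way to express the same de-duplication, not the f5 function itself.

```python
def combine(*prediction_lists, k=5):
    """Concatenate the lists in priority order, drop repeats, keep the first k."""
    merged = [cluster for predictions in prediction_lists for cluster in predictions]
    # dict.fromkeys keeps the first occurrence of each key in insertion order.
    return list(dict.fromkeys(merged))[:k]


# Made-up predictions for a single row, from most to least trusted source.
exact = [25]
by_destination = [2, 25, 28, 10, 64]
most_common = [91, 41, 48, 64, 65]

print(combine(exact, by_destination, most_common))  # [25, 2, 28, 10, 64]
print(combine([], [7], most_common))                # [7, 91, 41, 48, 64]
```

Because the lists are concatenated in order of trust, an exact match always lands in first place, and the most common clusters only fill whatever slots are left.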

Making a Kaggle submission file

Luckily, because of the way we wrote the code, all we have to do to submit is assign train to the variable t1, and test to the variable t2. Then, we just have to re-run the code to make predictions. Re-running the code over the train and test sets should take less than an hour. Once we have predictions, we just have to write them to a file:


write_p = [" ".join([str(l) for l in p]) for p in full_preds]
write_frame = ["{0},{1}".format(t2["id"][i], write_p[i]) for i in range(len(full_preds))]
write_frame = ["id,hotel_clusters"] + write_frame
with open("predictions.csv", "w+") as f:
    f.write("\n".join(write_frame))

We'll then have a submission file in the right format to submit. As of this writing, making this submission will get you into the top 15.
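
To double check the layout before uploading, it can help to build a couple of rows in memory and look at them. The sketch below uses the standard csv module with an io.StringIO buffer standing in for the file, made-up ids and predictions, and the same header string as the code above.

```python
import csv
import io

# Made-up ids and predictions, in the same shape as full_preds.
ids = [0, 1]
row_predictions = [[25, 2, 28, 10, 64], [91, 41, 48, 64, 65]]

buffer = io.StringIO()  # stands in for an open file
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "hotel_clusters"])
for row_id, clusters in zip(ids, row_predictions):
    # The five clusters go into one space-separated field.
    writer.writerow([row_id, " ".join(str(c) for c in clusters)])

print(buffer.getvalue())
```

Each line holds the row id, a comma, and then the five predicted clusters separated by spaces.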

Summary

We came a long way in this post! We went from just looking at the data all the way to creating a submission and getting onto the leaderboard. Along the way, some of the key steps we took were:

  • Exploring the data and understanding the problem.
  • Setting up a way to iterate quickly through different techniques.
  • Creating a way to figure out accuracy locally.
  • Reading the forums, scripts, and the descriptions of the contest very closely to better understand the structure of the data.
  • Trying a variety of techniques and not being afraid to not use machine learning.

These steps will serve you well in any Kaggle competition.

Next Steps

In order to iterate quickly and explore techniques, speed is key. This is difficult with this competition, but there are a few strategies to try:

  • Sampling down the data even more.
  • Parallelizing operations across multiple cores.
  • Using Spark or other tools where tasks can be run on parallel workers.
  • Exploring various ways to write code and benchmarking to find the most efficient approach.
  • Avoiding iterating over the full training and testing sets, and instead using groups.

Writing fast, efficient code is a huge advantage in this competition. Once you have a stable foundation on which to run your code, there are a few avenues to explore in terms of techniques to boost accuracy:

  • Finding similarity between users, then adjusting hotel cluster scores based on similarity.
  • Using similarity between destinations to group multiple destinations together.
  • Applying machine learning within subsets of the data.
  • Combining different prediction strategies in a less naive way.
  • Exploring the link between hotel clusters and regions more.

I hope you have fun with this competition! I'd love to hear any feedback you have. If you want to learn more before diving into the competition, check out our courses on Dataquest to learn about data manipulation, statistics, machine learning, how to work with Spark, and more.

Vik Paruchuri

About the author

Vik is the CEO and Founder of Dataquest.