June 22, 2017

The Tips and Tricks I used to succeed on Kaggle

I learned machine learning through competing in Kaggle competitions. I entered my first competitions in 2011, with almost no data science knowledge. I soon ended up in fifth place out of a hundred or so in a stock trading competition. Over the next year, I won several competitions on automated essay scoring and bond price prediction, and placed well in others.

Kaggle competitions require a unique blend of skill, luck, and teamwork to win. The exact blend varies by competition, and can often be surprising.

For example, I was in first or second place for most of the time that the Personality Prediction Competition ran, but I ended up 18th due to overfitting in the feature selection stage, something that I had never encountered before with the method I used. A good post on some of the seemingly semi-random shifts that happen at the end of a competition can be found on the Kaggle blog.

In this post, I'm going to share my tips for Kaggle success.

Be persistent

The number one factor that leads to success in Kaggle competitions is persistence. It's easy to become discouraged when you see the ranking of your first submission, but it is definitely worth it to keep trying. In one competition, I think that I literally tried every single published method on a topic.

In my first ever Kaggle competition, the Photo Quality Prediction competition, I ended up in 50th place, and had no idea what the top competitors had done differently from me.

I managed to learn from this experience, however, and did much better in my second competition, the Algorithmic Trading Challenge.

What changed the result from the Photo Quality competition to the Algorithmic Trading competition was learning and persistence. I did not really spend much time on the first competition, and it showed in the results.

Expect to make many bad submissions that do not score well. You should absolutely be reading as much relevant literature (and blog posts, etc.) as you can while the competition is running. As long as you learn something new that you can apply to the competition later, or you learn something from your failed submission (maybe that a particular algorithm or approach is ill-suited to the data), you are on the right track.

This persistence needs to come from within, though. In order to make yourself willing to do this, you have to ask yourself why you are engaging in a particular competition. Do you want to learn? Do you want to gain opportunities by placing highly? Do you just want to prove yourself? The monetary reward in most Kaggle competitions is not enough to motivate a significant time investment, so unless you clearly know what you want and how to motivate yourself, it can be tough to keep trying. Does rank matter to you? If not, you have the luxury of learning about interesting things that may or may not impact score, but you don't if you are trying for first place.

Spend time on data preparation and feature engineering

The most important data-related factor (to me) is how you prepare the data, and what features you engineer. Algorithm selection is important, but much less so.

You really just have to use intuition and common sense, figure out what works, and throw out what doesn't. What really helps in this is creating a good cross validation framework so you can get a reliable error estimate.
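To make the idea of a cross validation framework concrete, here is a minimal sketch of k-fold cross validation using only the Python standard library. It is not the setup I used in any particular competition; the function names (kfold_indices, cross_validate), the toy data, the RMSE metric, and the predict-the-mean baseline are all assumptions made for illustration.

```python
import random
import statistics


def kfold_indices(n_rows, n_folds=5, seed=0):
    """Shuffle row indices once and split them into n_folds nearly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validate(fit, predict, X, y, n_folds=5, seed=0):
    """Return the RMSE on each held-out fold for a model given as fit/predict callables."""
    folds = kfold_indices(len(X), n_folds, seed)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        # Fit only on rows outside the held-out fold.
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = predict(model, [X[i] for i in held_out])
        sq_err = [(p - y[i]) ** 2 for p, i in zip(preds, held_out)]
        scores.append((sum(sq_err) / len(sq_err)) ** 0.5)
    return scores


# Hypothetical baseline "model": always predict the mean of the training targets.
def fit_mean(X_train, y_train):
    return statistics.fmean(y_train)


def predict_mean(model, X_test):
    return [model for _ in X_test]


# Toy data (made up): one feature, target is roughly 3 * feature plus noise.
rng = random.Random(42)
X = [[rng.random()] for _ in range(100)]
y = [3.0 * row[0] + rng.gauss(0, 0.1) for row in X]

scores = cross_validate(fit_mean, predict_mean, X, y)
print("RMSE per fold:", [round(s, 3) for s in scores])
print(f"mean={statistics.fmean(scores):.3f} stdev={statistics.stdev(scores):.3f}")
```

Keeping the seed fixed means every feature idea is scored on the same folds, so a change in the mean score is more likely to come from the feature than from a different split. The spread across folds gives a rough sense of how much of a leaderboard gain might just be noise.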

Feature engineering is really why data science is so interesting/creative, and so different from some types of programming, where there is a "best" way to do something.

Don't ignore domain specific knowledge

Because feature engineering is very problem-specific, domain knowledge helps a lot.

I find you can generally pick up domain specific knowledge by learning while you are competing. For example, I learned NLP methods while I competed in the Hewlett Foundation ASAP Competition. That said, you definitely need to quickly learn the relevant domain-specific elements that you don't know, or you will not really be able to compete in most competitions.

Pick your competitions wisely

Picking a less competitive competition can definitely be useful at first. The research competitions tend to have fewer competitors than the ones with large prizes. Later on, I find it useful to enter more competitive competitions because it forces you to learn more and step outside your comfort zone.

Find a good team

Forming a good team is critical. I have been lucky enough to work with great people on two different competitions (ASAP and Bond), and I learned a lot from them. People tend to be split into those that almost always work alone and those that almost always team up, but it is useful to try to do both. You can learn a lot from working in a team, but working on your own can make you learn things that you might otherwise rely on a teammate for.

Other philosophies

There are a few things that are less strategic, but important to keep in mind as you continue.

Luck will play a part

I mentioned it before, but luck plays a part as well. In some competitions, 0.001% separates 3rd and 4th place, for example. At that point, it's hard to say whose approach is "better", but only one is generally recognized as a winner. A fact of Kaggle, I suppose.

Don't get put off by competitions on topics you don't know about

The great thing about machine learning is that you can apply similar techniques to almost any problem. I don't think that you need to pick problems that you have a particular insight about or particular knowledge about, because frankly, it's more interesting to do something new and learn about it as you go along. Even if you have a great insight on day one, others will likely think of it, but they may do so on day 20 or day 60.

Stop worrying about your Kaggle 'profile'

Don't be afraid to get a low rank. Sometimes you see an interesting competition, but think that you won't be able to spend much time on it, and may not get a decent rank. Don't worry about this. Nobody is going to judge you!

It's much more important for you to dive in and get the experience and learning that comes from preparing any submission than it is to worry about what the ranking might look like on your profile.

A winning entry is made of many small steps

Every winning Kaggle entry is the combination of dozens of small insights. There is rarely one large aha moment that wins you everything. If you do all of the above, make sure you keep learning, and keep working to iterate your solution, you will do well.

Never Stop Learning

There are literally hundreds of Kaggle tutorials and articles out there, plus thousands more machine learning articles, books and resources. Never stop learning, and don't be afraid to use Google to answer your questions.

In addition, the Kaggle forums are an excellent resource, as is the KaggleNoobs slack community.

Lastly, the amazing Eliot Andres maintains a searchable and sortable compilation of past Kaggle solutions. Once you get started, this is a great way to get some insight into how competition winners do it: Kaggle Past Solutions

In summary: persistence and learning

I think that the two main elements that I stressed here are persistence and learning. I think that these two concepts encapsulate my Kaggle experience nicely, and even if you don't win a competition, as long as you learned something, you spent your time wisely.

If you're interested in getting started with Kaggle, I highly recommend this Kaggle tutorial.

If you're interested in learning Data Science, Dataquest is the best online platform for learning Python & Data Science. We have graduates working at SpaceX, Amazon and more. If that interests you, you can sign up and complete our first course for free at Dataquest.io

This blog post is based on my Quora Answer

Vik Paruchuri

About the author

Vik is the CEO and Founder of Dataquest.