Many aspiring data scientists focus on doing Kaggle competitions as a way to build their portfolios. Kaggle is an excellent way to practice, but it should only be one of many avenues you use to work on data science projects. This is because Kaggle competitions only focus on a narrow part of data science work. To be more specific:
- Kaggle mostly deals with machine learning, which is only one aspect of Data Science.
- When you work on Kaggle you are dealing largely with pre-cleaned data, so you don't get enough experience cleaning messy data, which is often said to be 80% of what a Data Scientist does.
- Because of the large volume of people entering Kaggle competitions, getting into the top few percent or winning a competition requires not only skill, but a lot of time and some luck.
To build your skills more holistically, it's a good idea to work on your own projects. It's common to share this code on GitHub to interest potential employers, but it's important to be very purposeful about what code you put up and how.
While it's fast to throw some code up on GitHub and hope someone looks at it, it's far more effective in the long run to put time and effort into how you construct and present your portfolio.
At Dataquest, we advocate building a portfolio of projects to help our students get their first data science jobs, and many have done this successfully.
I'm going to share some strategies, based on my experience, for building a data science portfolio that will get you noticed and help you get a job, even if you don't have experience in the industry.
Let's start by examining why a portfolio is effective in the first place.
Why make a portfolio
The reason a data science portfolio is useful is that it demonstrates that you can do the things that an employer wants to hire you for. It is effectively a substitute for the job experience that you lack.
Think about it from the employer's perspective: they want to maximize the chance of hiring a great candidate and minimize the chance of hiring a weak one. As a candidate, your 'job' is to demonstrate to them that you have the skills and qualities they need for that role.
A strong data science portfolio is made up of several medium-sized data science projects that, combined, demonstrate to the employer that you have the key skills they're looking for.
Make sure your portfolio focuses on the right things
Although most data science learning focuses on machine learning, it's very unlikely that your first data science role will involve a lot of machine learning work.
This is your first job in the industry, so you should expect that you're going to be considered for junior roles, and then you can progressively work your way up from there.
The roles might not even be called 'Data Scientist', but something like 'Data Analyst', or 'Business Analyst'. Be humble and willing to do what it takes to get into the industry.
For this reason, filling your portfolio with machine learning projects is effort in the wrong direction (although I do recommend having at least one). Consider this when you think about what type of projects to include.
For even more detail on how to build a great data science portfolio, check out this six-part blog series we did on the topic.
Types of projects
Different projects can demonstrate different things. Here are a few different types of projects you can build:
- Data Cleaning Project - shows that you can take multiple messy data sets, clean them up, combine them and use them to perform analysis.
- Data Storytelling Project - shows that you can extract insights from data, communicate these insights and reason with data.
- Data Visualization Project - shows that you can communicate data visually using appropriate plots and charts.
- Machine Learning Project - shows that you can effectively build a model that makes accurate predictions with data.
- End to End Project - shows that you can build a standalone system that can take in data, process it and produce output in a specific form.
- Explanatory Post - shows that you can communicate and explain well with data by explaining a concept such as a statistical concept or a machine learning algorithm.
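To make the data cleaning project type listed above a little more concrete, here is a minimal sketch in Python using only the standard library. Everything in it is hypothetical and for illustration only: the two tiny data sets, the column names, and the helper functions normalize_name, load_scores, load_enrollment and combine. A real project would work with real, much larger files and would probably use a library such as pandas.

```python
import csv
import io
from statistics import mean

# Hypothetical raw inputs: inconsistent capitalization, stray spaces,
# different column names, and a missing score.
SCORES_CSV = """school,avg_score
 Lincoln High ,78
roosevelt high,
Jefferson High,85
"""

ENROLLMENT_CSV = """School Name,Students
lincoln high,1200
JEFFERSON HIGH,950
Roosevelt High,1100
"""


def normalize_name(name):
    """Collapse whitespace and lowercase so the same school matches across files."""
    return " ".join(name.split()).lower()


def load_scores(text):
    """Return {school: score}, skipping rows where the score is missing."""
    scores = {}
    for row in csv.DictReader(io.StringIO(text)):
        raw = row["avg_score"].strip()
        if not raw:
            continue  # drop the row rather than invent a value
        scores[normalize_name(row["school"])] = float(raw)
    return scores


def load_enrollment(text):
    """Return {school: number of students}."""
    return {
        normalize_name(row["School Name"]): int(row["Students"])
        for row in csv.DictReader(io.StringIO(text))
    }


def combine(scores, enrollment):
    """Join the two cleaned data sets on the normalized school name."""
    return [
        {"school": name, "avg_score": scores[name], "students": enrollment[name]}
        for name in sorted(scores)
        if name in enrollment
    ]


combined = combine(load_scores(SCORES_CSV), load_enrollment(ENROLLMENT_CSV))
for record in combined:
    print(record)
print("Mean score:", mean(r["avg_score"] for r in combined))
```

In a portfolio project, the write-up matters as much as the code: say why you dropped the row with the missing score instead of filling it in, and what that choice does to the analysis.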
You should think about the sort of job you want when selecting what projects to add to your portfolio. As mentioned above, they shouldn't be all machine learning projects.
If you have a particular interest in data visualization, for instance, you might make a couple of data visualization projects and maybe add some interactive visualizations to demonstrate your skills in that area.
You should familiarize yourself with the advertisements for the jobs you will be going for. Look at the skills they ask for, and use them as a guide when selecting projects for your portfolio.
If you need help finding data sets for your project, I would check out this great resource: 18 places to find data sets for data science projects.
Present your projects well
An effective project is not simply doing some analysis and uploading it. You need to put time and effort into making your project easy to understand and digest.
This means giving your project an introduction or readme file. You need to 'sell' your project, keeping in mind that it's very possible that your readme is the only thing some people will look at. Make your project feel like you were hired on a contract to do the project - explain what the aim was, what approach you took, the data you used and the outcome.
You should also make sure your readme has instructions to install or run your project, so anyone who wants to reproduce your work can do so easily.
Because of this, you need to make sure you include all relevant files and data sets, and provide a list of the libraries needed to run your project (e.g., a requirements.txt file for a Python project).
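One possible way to produce such a list for a Python project is to look up the versions installed in your own environment and pin them. The sketch below uses the standard library's importlib.metadata (Python 3.8+). The function name pinned_requirements and the package names in the example are assumptions for illustration; replace the list with the libraries your project actually imports.

```python
from importlib.metadata import PackageNotFoundError, version


def pinned_requirements(package_names, lookup=version):
    """Return 'name==version' lines for the packages that are installed.

    `lookup` defaults to importlib.metadata.version; it is a parameter only
    so the function can be tried out without the packages being installed.
    """
    lines = []
    for name in sorted(package_names, key=str.lower):
        try:
            lines.append(f"{name}=={lookup(name)}")
        except PackageNotFoundError:
            # Not installed in this environment: skip it rather than guess a version.
            continue
    return lines


if __name__ == "__main__":
    # Hypothetical list of libraries used by the project.
    project_packages = ["pandas", "matplotlib", "scikit-learn"]
    print("\n".join(pinned_requirements(project_packages)))
```

Saving the printed lines as requirements.txt lets someone else install the same versions with pip install -r requirements.txt.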
If your project is composed of standalone scripts, you should make sure they are easy to read, and that comments are used in your code to explain what you are doing and why.
If your project uses notebooks, add markdown to your project that explains what you are doing, and interprets your results as you go.
If you want more information on presenting your portfolio, this article is a great resource.
Cater for the different types of people who will view your portfolio
Be mindful that within the hiring process, different types of people are going to look at your portfolio, and they will have different levels of skill and understanding.
A hiring manager who looks at your portfolio early in the process might have limited technical understanding. You should make sure that there is plenty of explanation for this type of person, and maybe even consider putting your projects on a blog as well as GitHub so that you can write in more detail about each project and explain how it works for someone with a less technical background.
Later on in the process, your portfolio may be looked at by a company manager, and they will be interested in how you can deliver value to the company and communicate. You should make sure all your explanations are clear and that your project delivers on its aim.
Lastly, you should expect that someone technical is going to evaluate your portfolio. You should make sure your code is clean, refactored, and efficient.
For different companies, the hiring process will look different and some may have only a few of these people examine your portfolio. You should deliberately consider and prepare for each.
Let employers know about your portfolio
A common mistake is to place some projects on GitHub, and then simply add your GitHub profile URL to the top of your resume. Remember that hiring is hard work for the people reviewing applications, so you need to make it easy for them to find and evaluate your portfolio.
Rather than just 'adding the URL' and hoping someone finds it, explicitly mention your portfolio and specific projects in your cover letter. If you get someone on the phone, mention your portfolio and how it shows how you can provide value to the business. Take every opportunity you can to put your portfolio forward.
Another effective approach is to list your portfolio projects on your resume as if they were short-term contracts (although be careful not to be deceptive). Give a short summary of each project's aim and the skills it demonstrates, and provide an easy-to-follow link.
It also helps to remember that generally speaking, you will encounter less technical people early in the hiring process and more technical people later on. Because of this, your initial application might list your portfolio 'blog' more prominently.
A portfolio is an extremely effective replacement for experience when looking for your first Data Science job. For your portfolio to be effective, however, you need to put some thought and effort into how you construct and present your work.
You should build several substantial projects that demonstrate specific skills relevant to the jobs you want to get. You should take time to present these well, and be mindful of the different types of people who will view your portfolio in the hiring process.
Lastly, you should make an effort to ensure your portfolio is a prominent part of your application, and consider presenting your portfolio projects like short-term contracts.
Dataquest is the best online platform for learning to be a Data Scientist. Beyond teaching you the concepts you need, we favor a project-based approach to learning, and have many guided projects that can form the start of your data science portfolio.
You can sign up and complete our first course for free at Dataquest.io.
This post is based on this Quora answer.