March 29, 2019

How to Create a Project Portfolio for Data Science Job Applications

Start building data science projects as early as you can.

Now we’ve reached the real meat of your job application. For entry-level positions, the project portfolio is where the rubber meets the road.

In fact, if you don’t have previous experience in the data science field, your portfolio of projects is probably what will determine whether you get that all-important call back for an interview. Projects often play a crucial role in the interview phase as well.

First, a word on terminology: you’ll hear terms like projects and portfolio used differently by different people in the data science world. To some, the term “portfolio” evokes a carefully-designed package of projects, like a custom-built website.

For our purposes here, we’re going to define portfolio as the group of projects you’re showcasing in your job application, regardless of the format in which they’re presented (we’ll talk about presentation later in this article).

Before we dive into the how of putting together a portfolio, let’s take a look at the why.

(This article is a part of our in-depth Data Science Career Guide. To read the other articles, please refer to the table of contents or the links that follow this post.)

Why Data Science Projects Are Crucial

Employers won’t pay you to do something you’ve never done before. That’s a fundamental rule of the labor market in any industry, and data science is no exception.

It’s quite logical, really: would you visit the restaurant of a chef who’d never cooked before? Or step onto a plane flown by a pilot who’d never been up in the air before? Probably not.

Whether you’re transitioning into data science from full-time study, some other career, or simply trying to get a different kind of data science job, you’re going to need experience.

Even for entry-level positions, if doing a job requires skill, you need to be able to show you can do it before anyone will pay you to do it.

But most entry-level applicants have little or no professional experience in data science. So how can you prove that you’ve got the skills the job requires? Portfolio projects.

A portfolio of projects takes the place of that work experience in your job applications. It shows potential employers that you really can do the kind of data science work you’re applying for.

In fact, projects might be the single most important part of your application, because they crop up at every stage of the process. They will be mentioned on your resume and linked in your application, and you can expect them to play an important role in many job interviews, too.

Most of the recruiters we spoke to while creating this guide said that they reviewed projects and portfolios when screening candidates, and that they talked about the projects in interviews, too.

For example, G2 Crowd manager of data science and analytics Michael Hupp describes the first stage of data science interviews at his company: “We just ask them about their projects. We’ll try and quiz them on the technical skills, but we also want to make sure they’re able to talk about the project, and the results, in an understandable way.”

You might be asked to explain the statistical choices you made in a data analysis project, or talk a hiring manager through your code. You might be asked about your experiences working with others on a group project, or about challenges you faced when putting a particular project together.

Recruiters told us that they sometimes use projects to gauge everything from a candidate’s technical abilities to their level of passion for the subject matter.

Without prior professional experience in the field, you’ll probably have to lean heavily on your projects at various stages of the hiring process, so it’s vitally important you get them right.

What Your Portfolio Needs to Demonstrate

Precisely what you need to demonstrate with your portfolio is going to depend on the job you’re applying for.

Someone who’s looking for data analyst positions in marketing should have a portfolio of projects highlighting marketing-related analytical skills.

Someone who’s looking for a machine learning engineer position had better have a portfolio of impressive machine learning projects.

But whatever role you’re looking for, the mantra to remember is this: your portfolio should prove you can do the work.

Doing the work doesn’t just mean proving you have the technical skill. For most data-related positions, you’ll want your portfolio to demonstrate that you have:

  • The ability to communicate
  • The ability to collaborate with others
  • Technical competence
  • The ability to reason about data
  • The motivation and ability to take initiative

It’s also worth pointing out that the word you in “prove you can do the work” is important. Your portfolio projects should be unique.

“The point of a portfolio, and in large part the guiding principle of the entire application process, is being able to prove that you did work in a way that can be easily verified,” explains SharpestMinds co-founder Edouard Harris.

“If you choose to show off something that is commonly done and has existing tutorials out there already, it is very difficult for me as a hiring manager to evaluate whether you have actually done a bunch of work and thought, or whether you’ve simply followed along with a generic tutorial.”

Projects to Include in a Data Science Portfolio

A data science portfolio should consist of 3-5 projects that showcase your job-relevant skills. Again, the goal here is to prove you can do the work, so the more your portfolio looks like the day-to-day work of the jobs you’re applying for, the more convincing it’s going to be.

“Don’t pick just random projects to work on and add it to your resume or portfolio,” says Pramp CEO and co-founder Refael “Rafi” Zikavashvili. “Solve a problem that relates to the companies that you’re interested in.”

This applies to the kinds of tasks you’re taking on in your projects, but also the subject areas your projects examine, and the types of data sets you’re working with. Let’s take a closer look at each of these three factors:

Kinds of tasks: What sorts of things will you need to do in the job you’re applying to? Will you be doing a lot of data cleaning? Machine learning? Data visualization? Natural language processing? Will you be strictly doing analysis, or will you be building dashboards and other analytics tools for others?

Whatever the answers to these questions, they should be integrated into your portfolio.

Subject areas: Are you looking at positions in marketing? You’ll probably want to highlight projects aimed at answering marketing-related questions. If you’re looking for a data job in mobile app development, you’ll want to show off projects that demonstrate you can pull useful product insights from app data.

Using your projects to show that you have some knowledge of, or at least interest in, subjects and business problems relevant to the jobs you’ve applied to can help your application stand out.

Types of data sets: Different types of data may be common in different industries, so showing that you have some experience working with data sets similar to the ones you’d see on the job helps prove you’ve got what it takes to do the work.

If you’re likely to be looking at a lot of time series data in the target job, for example, it would be helpful to showcase some time series analysis skills in your portfolio.

When In Doubt, Include These Projects:

The more carefully tailored your portfolio is to the specific jobs you’re applying for, the better the results you’re likely to get.

But if you’re applying for entry-level positions, you’re probably casting a wide net, and you’re also likely to be looking at positions that require a lot of the same skills regardless of industry.

If you put together a portfolio with at least one project in each of the categories below, you’ll be off to an excellent start.

Data Cleaning Project: Data preparation, data munging, data cleaning – whatever you want to call it, it accounts for 60-80% of most data science jobs, so you definitely need a project that demonstrates your data scrubbing skills.

At a bare minimum, you’ll want to find a messy data set (don’t pick anything that’s already been cleaned), come up with some interesting analytical questions to examine, and then clean the data and perform some basic analysis to answer those questions.

If you want to step the difficulty up here, collecting your own data (via APIs, web scraping, or some other method) demonstrates some additional skill. Working with unstructured data of some sort (as opposed to a messy-but-still-structured data set) also looks good.
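To make the bare-minimum version concrete, here's a minimal sketch of a cleaning pass that uses only Python's standard library. The sample data, the column names, and the `clean_rows` and `average_price_by_city` helpers are all hypothetical, invented for illustration. A real project would start from a much messier file and would more likely lean on a library like pandas.

```python
import csv
import io

# Hypothetical messy data: stray whitespace, inconsistent capitalization,
# a duplicate row, a missing price, a non-numeric price, and a price
# written with a thousands separator.
RAW_CSV = '''city,price
 Boston ,1200
boston,1200
CHICAGO,950
Denver,
Austin,n/a
Chicago,"1,050"
'''


def clean_rows(raw_text):
    """Return a list of (city, price) tuples with unusable rows dropped."""
    cleaned = []
    seen = set()
    for row in csv.DictReader(io.StringIO(raw_text)):
        # Normalize whitespace and capitalization in the city name.
        city = (row["city"] or "").strip().title()
        # Remove thousands separators before checking the price.
        price_text = (row["price"] or "").replace(",", "").strip()
        # Skip rows whose price is missing or is not a whole number.
        if not price_text.isdigit():
            continue
        record = (city, int(price_text))
        # Skip rows that are exact duplicates once normalized.
        if record in seen:
            continue
        seen.add(record)
        cleaned.append(record)
    return cleaned


def average_price_by_city(records):
    """Answer one simple analytical question: the mean price per city."""
    prices = {}
    for city, price in records:
        prices.setdefault(city, []).append(price)
    return {city: sum(values) / len(values) for city, values in prices.items()}


records = clean_rows(RAW_CSV)
print(records)
print(average_price_by_city(records))
```

In a portfolio project, each of those cleaning decisions (dropping rows, merging spellings, discarding duplicates) deserves a sentence explaining why you made it, since that reasoning is what a reviewer is looking for.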

Data Storytelling and Visualization Project: Telling stories, offering real insight, and convincing others with data are key parts of any data science job. The best analysis in the world is useless if you can’t get your CEO to understand or take action based on it.

This project should take readers on an analytical journey and bring them to a conclusion that’s understandable even to a layperson with little coding or statistical background.

Data visualization and communication skills will be important here to show and explain what your code is doing. It would be fine to present this in the form of a Jupyter Notebook or in R Markdown, but you could add some extra difficulty with additional polish, like customizing your chart designs or including some interactive elements.

A Group Project: Working together in a group demonstrates you’ve got communication and collaboration skills, both of which are important for data science jobs.

Any type of project could be a group project; what’s important here is to demonstrate that you can function in a team setting both in interpersonal terms (clear communication, fair division of labor, genuine collaboration) and in technical terms (managing projects with Git and GitHub).

If you want to up the difficulty here, try to get involved with a popular open source project, like contributing to a data-science-related open-source library in a language of your choice.

This can be quite difficult, but if you do manage to make a contribution to a popular library or package, this can really make your application stand out to employers.

For example, Alina Chistyakova, the Lead IT Recruiter at Spice IT Recruitment, says that “successful commits to well-known open-source projects” are one of the things that makes a data science portfolio stand out to her.

Kitware HR Director Jeff Hall said that “What really puts a plus in the column of candidates that apply here is having contributed to our specific open-source projects.”

Other Project Types to Consider

End-to-End System Building Project: A lot of data science jobs can include building systems that can efficiently analyze regular data sets as they come in, rather than analyzing a single specific data set.

For example, you might be tasked with building a dashboard for the sales team that visualizes the company’s sales data and updates regularly as new data comes in.

This project should show that you’re capable of building a system that can perform the same analysis on new data sets as they’re input, as well as capable of building a system that can be understood and run with relative ease by others.

The simplest version of this would be well-commented code that can take data from a public, regularly-updated data set, and perform some analysis. Its README file should explain how it can be used by others, and the project should be relatively easy for other coders to run via the command line.

If you’d like to step up the difficulty here, the sky’s the limit: you could build full-fledged interactive web dashboards, or build a system that handles real-time/streaming data.

The key here is just to show that you can build an analytical system that’s reusable and that other people, or at the very least other programmers, can understand.
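As an illustration of the simplest version described above, here's a minimal sketch of a reusable command-line analysis written with Python's standard library. The `sales` column name, the `summarize` helper, and the choice of summary statistics are assumptions made for the example. Your own project would run whatever analysis fits your data.

```python
import argparse
import csv
import statistics


def summarize(rows, column):
    """Return the count, mean, and max of a numeric column, skipping blanks."""
    values = []
    for row in rows:
        text = (row.get(column) or "").strip()
        if text:
            values.append(float(text))
    if not values:
        raise ValueError(f"no numeric values found in column {column!r}")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "max": max(values),
    }


def main(argv=None):
    """Parse command-line arguments, run the analysis, and print the results."""
    parser = argparse.ArgumentParser(
        description="Summarize one numeric column of a CSV file."
    )
    parser.add_argument("csv_path", help="path to the CSV file to analyze")
    parser.add_argument(
        "--column", default="sales", help="numeric column to summarize"
    )
    args = parser.parse_args(argv)

    # The same code runs unchanged on any new file with the same layout.
    with open(args.csv_path, newline="") as handle:
        summary = summarize(csv.DictReader(handle), args.column)

    for name, value in summary.items():
        print(f"{name}: {value}")
    return summary
```

To use this as a script, you'd add the usual `if __name__ == "__main__": main()` guard at the bottom and run something like `python summarize.py sales.csv --column sales` (both file names are placeholders). Because the file path is an argument rather than a hard-coded value, anyone can point the script at a fresh download of the data set, which is the reusability this kind of project is meant to show.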

Explanatory Blog Post, Article, or Talk: Being able to explain complex technical concepts in simple, understandable terms is a valuable skill for any data scientist, so explaining some technical concept in a blog post, article, or conference talk can be a great addition to your portfolio if it’s done well.

Just be sure to pick a topic that’s suitably complex, and one that you understand and can explain.

A blog post explaining what’s happening under the hood in a machine learning algorithm that’s frequently used in your target industry, for example, could be a great addition to a portfolio.

Portfolio Project Formats and Presentation

Now that you’ve got some idea of what you might like to include in your portfolio, how should you present it? There are really only two common approaches: GitHub and personal portfolio websites.

Every recruiter we spoke with for this project agreed that applicants should have active GitHub accounts that showcase their projects, so if you’re aiming for broad appeal, that’s definitely where you should start.

Some recruiters said they were impressed by more carefully-constructed project presentations on portfolio sites, but others said they didn’t much care for separate portfolio sites, and would only look at a candidate’s GitHub.

For that reason, it makes sense to start with getting your GitHub ready.

GitHub for Data Science Projects

If you don’t know the basics of GitHub yet, check out this introductory blog post or our full, interactive course on Git and version control and get yourself up and running. If you’re creating a new GitHub account, make sure you choose a professional-sounding username (usernames are public and they’re how potential employers will find you).

Once you’re set up on GitHub, the good news is that your project presentation doesn’t have to be particularly complex: showing off your projects in Jupyter Notebook or R Markdown format is fine for most projects.

In the projects themselves, try to keep code blocks relatively short, and intersperse them with text blocks that explain clearly and concisely what the code is doing and why.

Use text formatting (headings and subheadings, bold, italics, code snippets, etc.) to keep things organized and easy to read.

You should always assume that your code will be read by someone who knows what they’re talking about. That means you should try to stick to naming conventions in your language, follow the preferred style, and try to keep your code efficient and clean.

It also means that you should add comments to your code whenever you think it might be helpful, so that it’s easy to see at a glance what’s happening.

(Commenting code is an especially important practice when working collaboratively as part of a team, so including good comments that make code easy to follow demonstrates good communication and teamwork skills, too.)

A few other potential tripping points to look out for in your code:

  • If you created a project locally, you may have hard-coded the file-path for your data, so that your code reads a very specific directory on your computer where you’ve stored the data. For public projects, it’s best to keep the data in the same folder as your notebook (or a subfolder) so that you can include a relative path that will work for anyone who’s downloading your repository and running your code.
  • You’ll probably want to list the external packages and libraries you’ve used, along with their versions, to make it easier for others to download and run your code. More information on how to do that can be found here.
  • If you’re pulling data from somewhere using an API key or other access credentials, you do not want to share those credentials publicly! This post includes a good walk-through explaining how you keep your key private while still making it easy for others to work with your code.
  • If you’re including the data you used in your project’s repository, you should check to be sure you have the legal right to redistribute it.
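Several of the points above can be sketched in a few lines of Python. This is one common approach under stated assumptions, not necessarily the method the linked posts describe: the file name `data/listings.csv`, the environment variable name `MY_SERVICE_API_KEY`, and the helper functions are all hypothetical.

```python
import os
from importlib import metadata
from pathlib import Path

# A relative path resolves against the current working directory, so it
# works for anyone who runs the code from the repository root.
# "data/listings.csv" is a placeholder file name.
DATA_PATH = Path("data") / "listings.csv"


def get_api_key(env_var="MY_SERVICE_API_KEY"):
    """Read a credential from an environment variable, not from the code.

    The variable name is a placeholder; use whatever suits your data source.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running."
        )
    return key


def installed_versions(package_names):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

Because the key lives in the environment rather than in the notebook, it never ends up in your commit history. Your README can simply tell readers which variable to set with their own key. The output of `installed_versions` is one way to collect the version details worth recording alongside your project.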

You should always include a README file, typically in Markdown format, with each project that contains a brief explanation of what the project is. That’s the file that GitHub will display by default when someone is looking at your project’s repository, so it should provide an overview of what they’re going to see.

That might include details like what your project analyzes, what your goals are with this project, what techniques you’ve used, and a summary of your conclusions. It should also include any information that someone else might need to install and run your project for themselves.

One important thing to remember with GitHub is that it will show anyone viewing your profile all of your public repositories, and it will also show all of your contribution activity.

This means that you need to keep your account clean and active. Potential employers will be put off if they click through to your profile and find hundreds of abandoned projects, or if they see you haven’t actually done anything in the past few months.

Along those lines, remember that projects aren’t cast in stone once you’ve added them to your GitHub. You can and should continue to iterate on them even as you’re applying for jobs.

If you get helpful feedback (or just come up with a great idea) there’s nothing wrong with implementing those changes in a project you’ve already published. In fact, continuing to iterate on your projects is a good idea - it shows employers you’re active, interested, and engaged with the same kind of work they’d be hiring you to do.

The final step in preparing your GitHub? Making sure it’s linked everywhere an employer might find you. As mentioned in our resume chapter, there should be a clickable GitHub link on your resume, but you also want to make sure to include one on any social sites you use (LinkedIn, Twitter, Instagram, personal websites, etc.) and to include the URL with any online application forms you submit.

You want to make it as easy as possible for anyone who’s looking you up to find your GitHub.

The Next Level: Dedicated Project or Portfolio Sites

Once you’ve got an active GitHub up and running, it may be worth taking some time to put together a more unique presentation for one or more of your projects.

Not every hiring manager will take the time to look at a dedicated project page or a special portfolio site, but for some, going this extra mile will be eye-catching.

“In general, you want something that’s visual,” SharpestMinds’s Edouard Harris says. “Ideally something that you have running on a server somewhere.”

“The optimal situation is: you’re at a meetup [talking with someone in the industry], you cleverly steer the conversation in the direction of this cool thing that you built. Then you can take out your phone and be like: check this out. Play with it. It’s right here.”

Having a web-based visual or interactive data project “sends a really good signal,” Harris says. “It sends a signal that this person knows enough to set up a server. That’s a nontrivial amount of work. [That this person knows how] to make the interface pretty enough that a human can use it. These are real, valuable things.”

Obviously, building a dedicated site for a project, particularly an interactive one, requires orders of magnitude more time than simply tossing a Jupyter Notebook up on GitHub. But while it requires more up-front investment, it can really pay dividends in the long run, particularly if you’re hustling and networking in-person at events (which you should be).

It’s going to be difficult to impress someone who’s scrolling through your GitHub on their phone in a crowded conference hall, squinting and trying to read your code. A clear, visual, data-based story or interactive project can leave a much deeper impression.

Just for the sake of inspiration, here’s an example of a very visual data story and here’s an example of a cool interactive data project. These are just for inspiration — don’t worry, entry-level job applicants aren’t expected to be able to produce this level of polished quality!

But you can see why for networking in person, having a project like those to show off would be more impactful than trying to walk somebody through your favorite GitHub repository.

Project Resources

At this point, you know why you need a project portfolio. You know what projects should be included in your portfolio, and how you should present them. Now comes the hard part: actually doing the projects.

The projects you choose will vary tremendously based on both your personal interests and your target job roles. But if you need a good starting point, virtually all of our data science courses include open-ended guided projects.

You might also find our Python Projects for Beginners article helpful. Many of the projects in it will be too basic for a professional portfolio, but you can use them as a starting point to build something more complex. (If you learned R, don't worry: most of these project ideas are adaptable to R.)

These could be useful in a portfolio if you take some time to adapt them and make them your own, and they’ll also be useful sources of inspiration. You could, for example, work through a guided project on our site and then find a new data set and attempt to apply a similar analysis on your own for a portfolio project.

Here are some additional resources that may be helpful when you’re putting together new projects or going back to improve and iterate on old ones ahead of a job search:

Data Sources

One of the most important choices you’ll make with any project is what data to analyze. If you want to work with an existing public data set, it may be best to avoid the big hits from sites like Kaggle – popular data sets on Kaggle will have been used in hundreds of projects, and employers will be sick of seeing them.

Luckily there are tons of places on the web where you can find less widely-used data to work with. Here are a few of our favorites:

  • Data Portals - a massive list of 551 (as of this writing) open data portals from all over the world, each of which has its own library of data sets to offer. You can browse geographically (or alphabetically) and you can also search by keyword. Most portals here are government-run open data portals.
  • Data.gov - the home of virtually all US government data, with nearly a quarter-million data sets on topics that range from industry to public health to finance.
  • AWS Open Data - Amazon’s portal has all sorts of interesting and unexpected things, from web-crawling data to satellite monitoring data from space.
  • Data.world - Kind of like GitHub for data. You’ll find all kinds of datasets here, although they include common and popular data sets like the Titanic passenger data, and since they’re user-uploaded they may not always be accurate or reliable.
  • /r/datasets - A subreddit for sharing datasets. Years of history to browse through, new stuff every day, and you can even make requests!
  • AcademicTorrents - A site where scientists can upload datasets from their research and publications.

Of course, the best way to ensure that you’re working on something totally unique is to grab your own data set rather than downloading something that someone else has compiled. The two easiest ways to do this are via web-scraping or via accessing an API.

Dataquest offers a course that covers both APIs and web scraping, and we also have some free tutorials for using tools like BeautifulSoup to do web scraping and using APIs. For example, you could access the Twitter API and use that to do some analysis of tweets in real-time (we’ve got a tutorial for that, too).
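To give a feel for what each approach involves, here's a minimal sketch that parses a small HTML table and a JSON response. It makes a few loud assumptions: it uses Python's built-in `html.parser` in place of BeautifulSoup so that it runs with no extra installs, the sample HTML and JSON strings stand in for a real page and a real API response, and the `team` and `wins` fields are invented. It makes no network requests. In a real project you'd fetch the content first, for example with `urllib.request.urlopen`.

```python
import json
from html.parser import HTMLParser

# Hypothetical stand-ins for a downloaded web page and an API response.
SAMPLE_HTML = """
<table>
  <tr><th>Team</th><th>Wins</th></tr>
  <tr><td>Hawks</td><td>12</td></tr>
  <tr><td>Owls</td><td>9</td></tr>
</table>
"""
SAMPLE_JSON = '{"results": [{"team": "Hawks", "wins": 12}, {"team": "Owls", "wins": 9}]}'


class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            if not self.rows:
                # Tolerate a cell that appears without an enclosing row.
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a <td> cell.
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def scrape_table(html_text):
    """Web scraping: return the table's data rows as lists of strings."""
    parser = TableCellParser()
    parser.feed(html_text)
    # Header rows contain only <th> cells, so they come back empty.
    return [row for row in parser.rows if row]


def parse_api_response(json_text):
    """API access: pull (team, wins) pairs out of a JSON payload."""
    payload = json.loads(json_text)
    return [(item["team"], item["wins"]) for item in payload["results"]]


print(scrape_table(SAMPLE_HTML))
print(parse_api_response(SAMPLE_JSON))
```

The contrast is the useful part: scraped values arrive as text you have to locate and convert yourself, while an API usually hands you structured, typed data. Whichever route you take, check the site's terms of use before collecting data from it.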

If you want to really go the extra mile, you can also collect data by doing something like conducting your own survey or collecting data manually. Collecting your own data is very time-intensive, but if it’s the only way to get an interesting and unique data set, the “wow” factor you can create with your unique analysis later will be worth all of that pain up-front.

And don’t forget that you probably generate a fair amount of your own data - with a computer and a smartphone you can collect all kinds of data about yourself, from productivity levels to sleep habits. We've got tutorials that'll help you analyze your own Amazon spending, or your Facebook usage.

Going this route could be risky (you don’t want to come off as self-centered, and your personal data might not be as interesting to others as it is to you), but there are certainly ways you could turn data from your own life into an interesting data science project with broader appeal.

Design Resources

When a project is finished, one of the easiest ways to make it stand out can be upgrading the visualizations so that they don’t have that “default” look recruiters are seeing in lots of other data science portfolios.

There are ways to do this with code - for example, check out our tutorial on how to get the FiveThirtyEight chart look in Python. But more generally, applying some basic design principles to your work will help your charts stand out and tell their stories more clearly.

Here are some other helpful data visualization resources:

Sources of Inspiration

Sometimes, you just need a little spark to get a project started, or to give you the idea that takes it from good to great. Here are some places you can find truly great data science projects:

  • FiveThirtyEight - The reigning champions of data journalism, 538 is constantly publishing new data-based work on politics and sports. They also publish a lot of their data so you can try to reverse-engineer some of their work.
  • Information is Beautiful Awards - This site awards yearly prizes for a variety of data-based project categories, but they also publish regular highlights of great projects throughout the year.
  • Data is Beautiful - This subreddit plays host to both amateur and professional data science projects and visualizations. You can also share your own projects there to get some feedback from other reddit users.
  • Kaggle - Kaggle competitions can be a great place to find completed data science projects (look for completed competitions and then browse the most upvoted “Kernels”). The beauty here is that you get to see the entire project, including all the code.
  • Data Science Teams at National Newspapers - Major national and international papers and other media organizations often have “data” sections where you can find the results of interesting data science work. In some cases, they also have GitHub accounts where they share projects and/or data, too.

Charlie Custer

About the author

Charlie is a student of data science, and also a content marketer at Dataquest. In his free time, he's learning to mountain bike and making videos about it.