July 9, 2018

An In-Depth Style Guide for Data Science Projects

When you're applying for a position related to data science, you will likely have to submit a portfolio or project. Although your abilities are mainly what the potential employer is looking at when reviewing your work, the stylistic aspects will play a role as well.

Employers usually give a lot of weight to a candidate's portfolio when hiring for a junior data science role. Although you may be capable of technically impressive projects, your job hunt will suffer if you don't pay enough attention to the stylistic aspects as well. A busy employer is not going to review poorly constructed projects.

At Dataquest, we've helped many data science students with portfolio project reviews. We learned the most common mistakes our students make, and we've put a lot of thought into what makes a project interesting to employers.

Reviewing the project of a Dataquest student.

Based on our experience, we've created the following style guide. We also created an example project so you can see these guidelines in action.

Note that our guide is aimed largely at notebook-style projects. This is the most common type of project that students send to employers, and it consists of a combination of code and narrative written in development environments such as Jupyter Notebook, Jupyter Lab, RStudio, nteract, etc.

Know your audience

Before you even start building your portfolio, have a very clear idea of who is going to review it.

In this post, we'll focus on employers as the audience of our projects, since that's our most common use case. In general, employers are risk averse — they'll be looking for cues that suggest you might be a risky investment. Adapting your work to their standards has two big challenges:

The audience is heterogeneous, made of both technical and non-technical people. Different kinds of employers may read your projects — for instance, a senior data scientist might asses your work, but someone with no technical knowledge could also take a look.
Some employers will just skim your work, while others will go into much more detail. This usually depends on the hiring stage. Initially, an employer has to deal with tens or hundreds of applications, and maybe they will spend five minutes per candidate. At some later stages, they usually go into much detail to single out the right candidate.

To address these two challenges, write projects that:

Make it obvious to both technical and non-technical employers that you can bring value to their company.
Do well on a quick scan.
Do well on a thorough reading.

Next, we'll discuss in more detail about each of these three action items above.

Writing for technical and non-technical readers

The fastest way you can show to both technical and non-technical employers that you can bring value to their company is by having relevant project themes. What counts as a relevant theme, however, depends on the industry you'd like to work in.

Let's say you have a portfolio that has several strong projects on baseball and basketball data. If you're applying for a data science role in the finance industry, an employer will almost certainly find your projects irrelevant. Your work might be outstanding, but they want to see something that indicates you're a good fit for the finance industry: investments recommendations, stock market predictions, selling recommendations, etc.

On the other side, if you're applying for a data journalist role for a publication that writes daily on sports, your work will definitely be relevant.

So before investing a huge amount of work into building some exceptional projects that are potentially irrelevant, try to figure out what are the data science industries and niches you'd like to work in.

Once you've reached a decision, start building relevant projects and including them in your portfolio.

Ideally, you'll find a project that is both relevant to your preferred industry, and sparks some passion. This should be doable, since you want to work in that industry for a reason. Otherwise, at least try to find a good trade-off.

Doing well on a quick scan

These are the main parts of a data science portfolio project that someone usually considers on a quick scan:

Title
Introduction
Subheadings
Conclusion
Graphs
Code

Let's now discuss how we can improve each of these sections such that it does well on a quick scan.

Title

It's possible that your title is all an employer will read. The title you choose tells them:

Whether your project is relevant for the role they're trying to hire for.
Whether your project is something that'd be interesting to read in depth.

Your lesson is to come up with a title that immediately tells the employer that your work is both relevant and interesting. We've already covered the part about relevance, so here are some tips to make your titles capture the attention of an employer:

Avoid vague titles. For instance, if you're analyzing real estate data for New York, don't title your work "Analyzing Real Estate Data." This is vague, and you can write an entire book under such a title. Think instead of something that reflects precisely the particularities of your analysis. Try to choose something like "Finding Investment Opportunities — Analyzing the Evolution of House Prices in New York."
Avoid emotionally-neutral titles. Between "Analyzing House Prices in New York" and "Finding Investment Opportunities — Analyzing the Evolution of House Prices in New York," you'd want to choose the latter because words like "investment" and "opportunities" spark some gain-related emotions inside a reader's mind. Generally, people are more likely to click on emotionally-active titles (whether the titles spark positive or negative emotions).
Avoid very long titles. We want to send a strong message quickly — long titles are generally harder to understand, so your goal here is to make sure the reader doesn't need to read the title again to understand what you want to say. As a rule, try to keep the title under 15 words. For instance, this seems too much: "Finding the Best Two Neighborhoods in New York to Invest in To Make Long-term Profitable Investments in the Real Estate Industry."

Introduction

You came up with a good title and convinced the employer to take a look at your data science project. The introduction is most likely the next thing they'll read. Again, you want to send the same message:

Your project is relevant for the role you're being considered.
Your project is something that'd be interesting to read in depth.

Here are some major dos and don'ts around writing a good introduction:

Have an introduction. It's worth stating the obvious here because we've seen projects without an introduction.
Describe your project concisely. In no more than three paragraphs (of maximum 3-4 sentences each), discuss concisely:
- The goal of your analysis.
- The approach you'll take to reach that goal.
- The most important result(s) you found — we mention results here to prove our ability to communicate our findings quickly.
Don't use a table of contents (TOC). Unless you're writing a huge project, don't use a TOC. A TOC is easy to scan, but it usually takes a lot of vertical space and makes the introduction look very bulky. Keeping the introduction simple and tidy will give the impression of a short fast read, and this should encourage employers to actually read it.
Be matter-of-fact. Don't be bombastic in your claims, and don't begin with quotes (especially from Einstein) — just stick with presenting concisely the goal of your analysis, the approach you'll take to reach that goal, and the most important results you found.

Subheadings

If an employer wants to skim your entire project, they'll find subheadings really useful. Again, make sure you send the message that your work is relevant and interesting. Here are some tips around writing good subheadings:

Have subheadings. Make sure you break down your project into several logical sections and add a subheading for each.
Use some of the tips for titles. Subheadings are just titles for the different logical sections of the project, which means we can use some of the tips we learned for titles:
- Avoid vague subtitles.
- Avoid very long subtitles.
- Ideally, try to avoid emotionally-neutral subtitles.

Conclusion

For our purposes here, a good conclusion:

Reminds the reader what the initial goal was and what was the main approach you took to reach that goal.
Summarizes the most important results.

Practically, the conclusion is similar to an introduction rephrased in the light of what has been done in the main body. To write a good conclusion, try to:

Be concise and don't use more than two paragraphs (of maximum 3-4 sentences each).
Don't be bombastic in your final claims. Also, it's better to stay away from trying to end triumphantly with a quote because most readers won't read all your work and, as a consequence, they won't be able to share your enthusiasm.

Graphs

If you generate graphs for your projects (we strongly recommend you do that), you can be fairly sure that each graph is going to be quickly looked at. Our brains seem to find it faster and easier to process information from images rather than from text, so we are more inclined to watch rather than read. So expect your readers to slow their scrolling down whenever they see a graph.

With this in mind, you need to make sure that your graphs communicate information fast and clearly, and look professional. Here are some tips to make that happen:

Give every graph a title which explains clearly what the graph is about. It's also a good idea to write the title bolded and with an increased font size.
Label each axis. Increase the fontsize of the label until it's easily readable. While you're selecting the font size, it's worth keeping in mind that some people may read your work on small screens (small laptops, tablets, mobile, etc.).
Ideally, customize your graphs extensively. If you're coding in Python, you may want to check out this tutorial on making FiveThirtyEight graphs.

Code

No one will read your code in depth at this stage, but experienced programmers may scan your code to asses elements like:

Code readability.
The algorithms you used.
The libraries you used; etc.

Later in this post, we offer tips on the readability aspect. These tips will help on both a quick scan and a thorough reading, so we've grouped them together in the next section.

Doing well on a thorough reading

Later in the hiring process, employers usually go into much more detail to single out the right candidate.

At this point, you should expect someone with technical knowledge to give your project an in-depth read. Doing well on a thorough reading depends mostly on the technical parts of the project:

The algorithms you used.
The problem you tried to solve.
The approach you took to solve the problem.
The depth of your approach; etc.

However, stylistic elements are important too, and they can really make a difference. What you code and write is crucial, but how you code and write is also important. So we'll focus next on learning a couple of tips that can help us code and write better in terms of style.

Code

How you write code matters because it shows an employer whether you can write clean code that can be read and understood by your potential future colleagues. Writing code that other people can easily understand is quite challenging.

Below we'll discuss some of the best practices around writing code that is easy to understand. You can start by reading the official style guide of the programming language you code in, where you should be able to find plenty of tips on improving code readability.

If there isn't an official guide, search for a guide that most people coding in that language use. For Python, you can find the official style guide here, and for R you can find a guide here.

Below we explore some major dos and don'ts with respect to code readability (we'll use Python code, but the guidelines apply to other languages as well):

Use a block comment to add a short description for each block of code with a distinct functionality. While you're writing code, most of the time adding block comments will seem redundant. Add them nonetheless because they help the reader, and they'll also help you easily recollect what you were trying to do if you read your code months later.

No:


opened_file = open('Super_Bowl.csv')
read_file = opened_file.read()
super_bowl_split = read_file.split('\\n')
super_bowl = []
for row in super_bowl_split:
    super_bowl.append(row.split(','))

Yes:


# Read in the data
opened_file = open('Super_Bowl.csv')
read_file = opened_file.read()
# Transform read_file into a list of lists
super_bowl_split = read_file.split('\n')
super_bowl =  []
for row in super_bowl_split:
    super_bowl.append(row.split(','))

Space vertically the blocks of code with distinct functionalities.

No:


# Read in the data
opened_file = open('Super_Bowl.csv')
read_file = opened_file.read()
# Transform read_file into a list of lists
super_bowl_split = read_file.split('\n')
super_bowl =  []
for row in super_bowl_split:
    super_bowl.append(row.split(','))

Yes:


# Read in the data
opened_file = open('Super_Bowl.csv')
read_file = opened_file.read()

# Transform read_file into a list of listssuper_bowl_split = read_file.split('\n')
super_bowl =  []
for row in super_bowl_split:
    super_bowl.append(row.split(','))

Add inline comments, but make sure they're useful and don't state the obvious.

No:


for row in super_bowl[1:]:    # Iterate through super_bowl[1:]
    row[2] = int(row[2])      # Convert to integer
    row[2] -= 5000            # Subtract 5000 from each row element in position 2

Yes:


for row in super_bowl[1:]:   # Omit the first row - it contains the headers
    row[2] = int(row[2])
    row[2] -= 5000            # Values in the "Attendance" column are wrong - correct by subtracting 5000

Name variables clearly — don't sacrifice precious code readability for saving a few keystrokes.

No:


o_f = open('Super_Bowl.csv')
r_f = opened_file.read()

Yes:


opened_file = open('Super_Bowl.csv')
read_file = opened_file.read()

Text

As we mentioned in the introduction, the most common type of data science projects that students can send to employers consists of a combination of code and narrative.

The narrative part is important because it can show an employer whether or not you have the right communication skills. Communication is key in most data science roles because data scientists frequently find themselves working in cross-functional teams, along with colleagues from marketing, product, engineering, etc. The team members are working together on a single complex project, and they need to communicate effectively with each other.

Writing a narrative that does well on a thorough reading depends on several factors, like:

Readability.
The depth and clarity of your explanations.
The flow of the ideas you're writing about.
The connection between narrative and code; etc.

Below, we'll explore a few major dos and don'ts that are only related to the style of the narrative and that are meant to improve readability:

Don't use long paragraphs. Walls of text scare readers off, and they significantly decrease readability. Ideally, a paragraph has three to four sentences, but if you think it makes sense, you can even use one-sentence paragraphs. Here is an example of what not to do:

Source: Wikipedia

Use the right sentence length. Long sentences (anything close to 40 words) are hard to understand on a first read and can make your writing feel heavy. Very short sentences (anything less than 5 words) are easy to understand and give a good flow to your writing, but they can also annoy some readers by fragmenting the narrative too much.

Ideally, avoid long sentences altogether, use very short sentences sparsely, and go with mid-sized sentences most of the time.
Other small but important details:
- Make sure your writing doesn't have typos.
- Stay in line with the rules of grammar and orthography.
- Avoid using passive voice.
- Avoid using unnecessary adverbs.

To write a good narrative in terms of style, you can help yourself by using the Hemingway App. This app can help you with:

Condensing your sentences and paragraphs.
Avoid typos.
Avoid passive voice and unnecessary adverbs.

Example project

We built an example project in Jupyter Lab that shows how to implement most of the guidelines we discussed here. While reading the project, notice that:

The title is specific, emotionally-active (it sparks gain-related emotions), and has under 15 words.
The introduction is matter-of-fact and it describes the goal of the analysis, the approach that we're going to take to reach that goal, and the most important results we found.
The project has subheadings.
The conclusion reiterates the initial goal, the main approach we took to reach that goal, and summarizes the most important results.
The project has graphs, and each graph has a title and axis labels. One graph is customized using the FiveThirtyEight style.
The code and the text are in line with the guidelines we discussed.

Also observe that:

We alternate one markdown cell with one code cell. Each markdown cell introduces and explains what happens in the code cell below it (except for the conclusion, which doesn't have a code cell below it). As a rule, alternate the types of cell (in other words, avoid having two cells of the same kind in a row).
The modules are imported elegantly right in the code cells they are needed (instead of cramming up everything in the first cell). This way we avoid overcrowding the first code cell unnecessarily. We also prevent confusion later on — imagine using in the last code cell a low-level function you imported in the first code cell from a lesser-known package. The reader won't remember all the imports you did, and they'll be confused.
We tell the reader where they can download the data.

Next steps

To recap briefly, a data science portfolio project should:

Do well on a quick scan.
- The project theme should be relevant for an employer.
- The project should be well-polished in terms of style. These are the elements that are usually considered on a quick scan: title, introduction, subheadings, conclusion, graphs, and code.
Do well on a thorough reading.
- The technical part should be strong and accurate.
- The code and the narrative should be well-polished in terms of style.

Here are some of the next steps you can take:

Build from scratch a project that respects the guidelines discussed in this guide. Check out our Guided Project Page for inspiration.
Adjust one of your projects to the guidelines discussed in this guide.
Read this blog post for a more specific discussion on what kind of projects you should add to your portfolio.

If you're still struggling to put together a polished and professional data science project, our team at Dataquest can help. You can take advantage of our helpful data science courses or work through one of our many guided projects.

An In-Depth Style Guide for Data Science Projects

Know your audience

Writing for technical and non-technical readers

Doing well on a quick scan

Title

Introduction

Subheadings

Conclusion

Graphs

Code

Doing well on a thorough reading

Code

Text

Example project

Next steps

21 Data Science Projects for Beginners (with Source Code)

Generating Climate Temperature Spirals in Python

An In-Depth Style Guide for Data Science Projects

Know your audience

Writing for technical and non-technical readers

Doing well on a quick scan

Title

Introduction

Subheadings

Conclusion

Graphs

Code

Doing well on a thorough reading

Code

Text

Example project

Next steps

More learning resources

21 Data Science Projects for Beginners (with Source Code)

Generating Climate Temperature Spirals in Python