June 14, 2016

Building a Data Science Blog for Your Portfolio

Data science blogs can be a fantastic way to demonstrate your skills, learn topics in more depth, and build an audience. There are quite a few examples of data science and programming blogs that have helped their authors land jobs or make important connections. Writing a data science blog is thus one of the most important things that any aspiring programmer or data scientist should be doing on a regular basis. (This is the second in a series of posts on how to build a Data Science Portfolio. You can find links to the other posts in this series at the bottom of the post.) Unfortunately, one very arbitrary barrier to blogging can be knowing how to set up a blog in the first place. In this post, we'll cover how to create a blog using Python, how to create posts using Jupyter notebook, and how to deploy the blog live using GitHub Pages. After reading this post, you'll be able to create your own data science blog, and author posts in a familiar and simple interface.

Static Sites

Fundamentally, a static site is just a folder full of HTML files. We can run a server that allows others to connect to this folder and retrieve files. The nice thing about this is that it doesn't require a database or any other moving parts, and it's very easy to host on sites like GitHub. It's a great idea to have your data science blog be a static site, because it makes maintaining it very simple. One way to create a static site is to manually edit HTML, then upload the folder full of HTML to a server. In this scenario, you would at the minimum need an

index.html file. If your website URL was thebestblog.com, and visitors visited https://www.thebestblog.com, they would be shown the contents of index.html. Here's how a folder of HTML might look for thebestblog.com:


thebestblog.com
│   index.html
│   first-post.html
│   how-to-use-python.html
│   how-to-do-machine-learning.html
│   styles.css

On the above site, visiting

https://www.thebestblog.com/first-post.html would show you the content in first-post.html, and so on. first-post.html might look like this:


<html>
<head>
  <title>The best blog!</title>
  <meta name="description" content="The best blog!"/>
  <link rel="stylesheet" href="styles.css" />
</head>
<body>
  <h1>First post!</h1>
  <p>This is the first post in what will soon become (if it already isn't) the best blog.</p>
  <p>Future posts will teach you about data science.</p>

<div class="footer">
  <p>Thanks for visiting!</p>
</div>
</body>
</html>

You might immediately notice a few problems with manually editing HTML:

  • Manually editing HTML is incredibly painful.
  • If you want to make multiple posts, you have to copy over styles, and other elements, like the title and footer.
  • If you want to integrate comments or other plugins, you'll have to write JavaScript.

Generally, when you're blogging, you want to focus on the content, not spend time fighting with HTML. Thankfully, you can create a data science blog without hand editing HTML using tools known as static site generators.

Static Site Generators

Static site generators allow you to write blog posts in simple formats, usually markdown, then define some settings. The generators then convert your posts into HTML automatically. Using a static site generator, we'd be able to dramatically simplify

first-post.html into first-post.md:


# First post!

This is the first post in what will soon become (if it already isn't) the best blog.

Future posts will teach you about data science.

This is much easier to manage than the HTML file! Common elements, like the title and the footer, can be placed into templates, so they can be easily changed. There are a few different static site generators. The most popular is called

Jekyll, and is written in Ruby. Since we'll be making a data science blog, we want a static site generator that can process Jupyter notebooks. Pelican is a static site generator that is written in Python that can take in Jupyter notebook files and convert them to HTML blog posts. Pelican also makes it easy to deploy our blog to GitHub Pages, where other people can read our blog.

Installing Pelican

Before we get started,

here's a repo that's an example of what we'll eventually get to. If you don't have Python installed, you'll need to do some preliminary setup before we get started. Here are setup instructions for Python. We recommend using Python 3.5. Once you have Python installed:

  • Create a folder — we'll put our blog content and styles in this folder. We'll refer to it in this tutorial as jupyter-blog, but you can call it whatever you want.
  • cd into jupyter-blog.
  • Create a file called .gitignore, and add in the content from this file. We'll need to eventually commit our repo to git, and this will exclude some files when we do.
  • Create and activate a virtual environment.
  • Create a file called requirements.txt in jupyter-blog, with the following content:

Markdown==2.6.6
pelican==3.6.3
jupyter>=1.0
ipython>=4.0
nbconvert>=4.0
beautifulsoup4
ghp-import==0.4.1
matplotlib==1.5.1
  • Run pip install -r requirements.txt in jupyter-blog to install all of the packages in requirements.txt.

Creating Your Data Science Blog

Once you've done the preliminary setup, you're ready to create your blog! Run

pelican-quickstart in jupyter-blog to start an interactive setup sequence for your blog. You'll get a sequence of questions that will help you setup your blog properly. For most of the questions, it's okay to just hit Enter and accept the default value. The only ones you should fill out are the title of the website, the author of the website, n for the URL prefix, and the timezone. Here's an example:


(jupyter-blog)➜  jupyter-blog ✗ pelican-quickstart
Welcome to pelican-quickstart v3.6.3.

This script will help you create a new Pelican-based website.

Please answer the following questions so this script can generate the files
needed by Pelican.


> Where do you want to create your new web site? [.]
> What will be the title of this web site? Vik's Blog
> Who will be the author of this web site? Vik Paruchuri
> What will be the default language of this web site? [en]
> Do you want to specify a URL prefix? e.g., https://example.com   (Y/n) n
> Do you want to enable article pagination? (Y/n)
> How many articles per page do you want? [10]
> What is your time zone? [Europe/Paris] America/Los_Angeles
> Do you want to generate a Fabfile/Makefile to automate generation and publishing? (Y/n)
> Do you want an auto-reload & simpleHTTP script to assist with theme and site development? (Y/n)
> Do you want to upload your website using FTP? (y/N)
> Do you want to upload your website using SSH? (y/N)
> Do you want to upload your website using Dropbox? (y/N)
> Do you want to upload your website using S3? (y/N)
> Do you want to upload your website using Rackspace Cloud Files? (y/N)
> Do you want to upload your website using GitHub Pages? (y/N)

After running

pelican-quickstart, you should have two new folders in jupyter-blog, content, and output, along with several files, such as pelicanconf.py and publishconf.py. Here's an example of what should be in the folder:


jupyter-blog
│   output
│   content
│   .gitignore
│   develop_server.sh
│   fabfile.py
│   Makefile
│   requirements.txt
│   pelicanconf.py
│   publishconf.py

Installing the Jupyter Plugin

Pelican doesn't support writing blog posts using Jupyter by default — we'll need to install a

plugin that enables this behavior. We'll install the plugin as a git submodule to make it easier to manage. If you don't have git installed, you can find instructions here. Once you have git installed:

  • Run git init to initialize the current folder as a git repository.
  • Create the folder plugins.
  • Run git submodule add git://github.com/danielfrg/pelican-ipynb.git plugins/ipynb to add in the plugin.

You should now have a

.gitmodules file and a plugins folder:


jupyter-blog
│   output
│   content
│   plugins
│   .gitignore
│   .gitmodules
│   develop_server.sh
│   fabfile.py
│   Makefile
│   requirements.txt
│   pelicanconf.py
│   publishconf.py

In order to activate the plugin, we'll need to modify

pelicanconf.py and add these lines at the bottom:


MARKUP = ('md', 'ipynb')

PLUGIN_PATH = './plugins'
PLUGINS = ['ipynb.markup']

These lines tell Pelican to activate the plugin when generating HTML.

Writing your First Post

Once the plugin is installed, we can create the first post:

  • Create a Jupyter notebook with some basic content. Here's an example you can download if you want.
  • Copy the notebook file into the content folder.
  • Create a file that has the same name as your notebook, but with the extension .ipynb-meta. Here's an example.
  • Add the following content to the ipynb-meta file, but change the fields to match your own post:

Title: First Post
Slug: first-post
Date: 2016-06-08 20:00
Category: posts
Tags: python firsts
author: Vik Paruchuri
Summary: My first post, read it to find out.

Here's an explanation of the fields:

  • Title — the title of the post.
  • Slug — the path at which the post will be accessed on the server. If the slug is first-post, and your server is jupyter-blog.com, you'd access the post at .
  • Date — the date the post will be published.
  • Category — a category for the post (this can be anything).
  • Tags — a space-separated list of tags to use for the post. These can be anything.
  • Author — the name of the author of the post.
  • Summary — a short summary of your post.

You'll need to copy in a notebook file, and create an

ipynb-meta file whenever you want to add a new post to your blog. Once you've created the notebook and the meta file, you're ready to generate your blog HTML files. Here's an example of what the jupyter-blog folder should look like now:


jupyter-blog
│   output
│   content
    │   first-post.ipynb
    │   first-post.ipynb-meta
│   plugins
│   .gitignore
│   .gitmodules
│   develop_server.sh
│   fabfile.py
│   Makefile
│   requirements.txt
│   pelicanconf.py
│   publishconf.py

Generating HTML

In order to generate HTML from our post, we'll need to run Pelican to convert the notebooks to HTML, then run a local server to be able to view them:

  • Switch to the jupyter-blog folder.
  • Run pelican content to generate the HTML.
  • Switch to the output directory.
  • Run python -m pelican.server.
  • Visit localhost:8000 in your browser to preview the blog.

You should be able to browse a listing of all the posts in your data science blog, along with the specific post you created.

Creating a GitHub Page

GitHub Pages is a feature of GitHub that allows you to quickly deploy a static site and let anyone access it using a unique URL. In order to set it up, you'll need to:

  • Sign up for GitHub if you haven't already.
  • Create a repository called username.github.io, where username is your GitHub username. Here's a more detailed guide on how to do this.
  • Switch to the jupyter-blog folder.
  • Add the repository as a remote for your local git repository by running git remote add origin [email protected]:username/username.github.io.git — replace both references to username with your GitHub username.

A GitHub page will display whatever HTML files are pushed up to the

master branch of the repository username.github.io at the URL username.github.io (the repository name and the URL are the same). First, we'll need to modify Pelican so that URLs point to the right spot:

  • Edit SITEURL in publishconf.py, so that it is set to https://username.github.io, where username is your GitHub username.
  • Run pelican content -s publishconf.py. When you want to preview your blog locally, run pelican content. Before you deploy, run pelican content -s publishconf.py. This uses the correct settings file for deployment.

Committing Your Files

If you want to store your actual notebooks and other files in the same Git repo as a GitHub Page, you can use git branches.

  • Run git checkout -b dev to create and switch to a branch called dev. We can't use master to store our notebooks, since that's the branch that's used by GitHub Pages.
  • Create a commit and push to GitHub like normal (using git add, git commit, and git push).

Deploy to GitHub Pages

We'll need to add the content of the blog to the

master branch for GitHub Pages to work properly. Currently, the HTML content is inside the folder output, but we need it to be at the root of the repository, not in a subfolder. We can use the ghp-import tool for this:

  • Run ghp-import output -b master to import everything in the output folder to the master branch.
  • Use git push origin master to push your content to GitHub.
  • Try visiting username.github.io — you should see your page!

Whenever you make a change to your data science blog, just re-run the

pelican content -s publishconf.py, ghp-import and git push commands above, and your GitHub Page will be updated.

Add Comments

Comments are one way to interact with your guests. Disqus is a good tool for this, and it integrates with Pelican seamlessly. Follow these steps:

  1. Go to the Disqus site and register.
  2. Click "Get Started", then choose "I want to install Disqus on my site".
  3. Enter your Website Name. This will serve as a unique key to link Disqus to your blog, by passing it into the publishconf.py file. (More on this in a future step.)
  4. Choose a Disqus subscription plan — a basic plan is perfect for a personal blog.
  5. When Disqus asks which platform your site is on, scroll down and choose "I don't see my platform listed, install manually with Universal Code".
  6. On the Universal Code page, scroll down again and click "Configure".
  7. On the Configure page, fill in the "Website URL" section with your actual website address (https://username.github.io). You can also add information about your comment policy (if you don't have one, Disqus gives suggestions), and enter a description for your site. Click "Complete Setup".
  8. You'll now have the ability to configure your site's community settings. Click into this section and look around. Among other things, you'll be able to control whether guests can comment, and activate ads.
  9. In the toolbar on the left, click "Advanced" and add your website into Trusted Domains as username.github.io.
  10. Lastly, update publishconf.py. Make sure to specify DISQUS_SITENAME = "website-name", where "website-name" comes from step 3.

Now rerun the

pelican content -s publishconf.py, ghp-import output -b master and git push origin master commands to update your GitHub Page. Refresh your website and you'll see Disqus appearing under each post.

Choose a Theme

The Pelican community offers a variety of themes at

pelicanthemes.com. You can choose any theme you like, but here are some quick tips:

  • Keep it simple. The design should not distract from the actual content.
  • Remember the "Rule of Three Colors". According to University of Toronto study, most people prefer combinations of two to three colors. This way colors don’t fight for attention.
  • Pay attention to the width of your page — it should be enough to contain infographics and code that you may want to publish.

Once you've picked a theme, go to the folder where you wish to store your theme and create a repo:

git clone --recursive https://github.com/getpelican/pelican-themes pelican-themes. Create a THEME variable in your pelicanconf.py file and set its value to the location of the theme: THEME = 'E:\\Pelican\\pelican-themes\\flex Here, we are using a nice flex theme by Alexandre Vincenzi. Run the usual finishing commands — pelican content, ghp-import, and git push — and enjoy a new look!

Next Steps

We've come a long way! You now should be able to author blog posts and push them to GitHub Pages. Anyone should be able to access your data science blog at

username.github.io (replacing username with your GitHub username). This gives you a great way to show off your data science portfolio. As you write more posts and gain an audience, you may want to dive more into a few areas:

  • Your own custom URL.

    • Using username.github.io is nice, but sometimes you want a more custom domain. Here's a guide on using a custom domain with GitHub Pages.
  • Plugins

    • Check out the list of plugins here. Plugins can help you setup analytics, commenting, and more.
  • Promotion

    • Trying promoting your blog posts on sites Twitter, Quora, and others to build an audience.

At

Dataquest, our interactive guided projects are designed to help you start building a data science portfolio to demonstrate your skills to employers and get a job in data. If you're interested, you can signup and do our first module for free.


If you liked this, you might like to read the other posts in our 'Build a Data Science Portfolio' series:

Vik Paruchuri

About the author

Vik Paruchuri

Vik is the CEO and Founder of Dataquest.

Learn data skills 10x faster

Headshot Headshot

Join 1M+ learners

Enroll for free