Building a Data Science Blog for Your Portfolio
Data science blogs can be a fantastic way to demonstrate your skills, learn topics in more depth, and build an audience. There are quite a few examples of data science and programming blogs that have helped their authors land jobs or make important connections. Writing a data science blog is thus one of the most important things that any aspiring programmer or data scientist should be doing on a regular basis.
(This is the second in a series of posts on how to build a Data Science Portfolio. You can find links to the other posts in this series at the bottom of the post.)
Unfortunately, one very arbitrary barrier to blogging can be knowing how to set up a blog in the first place. In this post, we’ll cover how to create a blog using Python, how to create posts using Jupyter notebook, and how to deploy the blog live using GitHub Pages.
After reading this post, you’ll be able to create your own data science blog, and author posts in a familiar and simple interface.
Fundamentally, a static site is just a folder full of HTML files. We can run a server that allows others to connect to this folder and retrieve files. The nice thing about this is that it doesn’t require a database or any other moving parts, and it’s very easy to host on sites like GitHub. It’s a great idea to have your data science blog be a static site, because it makes maintaining it very simple.
One way to create a static site is to manually edit HTML, then upload the folder full of HTML to a server. In this scenario, you would at the minimum need an index.html file. If your website URL was thebestblog.com, and visitors visited http://www.thebestblog.com, they would be shown the contents of index.html. Here’s how a folder of HTML might look for thebestblog.com:

    thebestblog.com
    │   index.html
    │   first-post.html
    │   how-to-use-python.html
    │   how-to-do-machine-learning.html
    │   styles.css
On the above site, visiting http://www.thebestblog.com/first-post.html would show you the content in first-post.html, and so on.
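To make the mapping from URL to file concrete, here is a minimal sketch of the lookup a static file server performs. The function name resolve_static_path is hypothetical, and real servers handle many more cases (missing files, query strings, and so on); this only illustrates the idea described above.

```python
def resolve_static_path(url_path, index_name="index.html"):
    """Map a request path to a file name relative to the site folder."""
    # Strip the leading slash so the result is relative to the site folder.
    relative = url_path.lstrip("/")
    # A request for the site root (or any folder) falls back to the index file.
    if relative == "" or relative.endswith("/"):
        return relative + index_name
    return relative


print(resolve_static_path("/"))
print(resolve_static_path("/first-post.html"))
```

Visiting the root of the site resolves to index.html, while any other path resolves to the file with the same name.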
first-post.html might look like this:

    <html>
      <head>
        <title>The best blog!</title>
        <meta name="description" content="The best blog!"/>
        <link rel="stylesheet" href="styles.css" />
      </head>
      <body>
        <h1>First post!</h1>
        <p>This is the first post in what will soon become (if it already isn't) the best blog.</p>
        <p>Future posts will teach you about data science.</p>
        <div class="footer">
          <p>Thanks for visiting!</p>
        </div>
      </body>
    </html>
You might immediately notice a few problems with manually editing HTML:
- Manually editing HTML is incredibly painful.
- If you want to make multiple posts, you have to copy over styles, and other elements, like the title and footer.
Generally, when you’re blogging, you want to focus on the content, not spend time fighting with HTML. Thankfully, you can create a data science blog without hand editing HTML using tools known as static site generators.
Static Site Generators
Static site generators allow you to write blog posts in simple formats, usually markdown, then define some settings. The generators then convert your posts into HTML automatically. Using a static site generator, we’d be able to dramatically simplify first-post.html into a short markdown file:

    # First post!

    This is the first post in what will soon become (if it already isn't) the best blog.

    Future posts will teach you about data science.
This is much easier to manage than the HTML file! Common elements, like the title and the footer, can be placed into templates, so they can be easily changed.
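To see why templates help, here is a toy sketch of what a static site generator does. It is not how Pelican works internally: the PAGE template and the render_post function are hypothetical, and the converter only understands "# " headings and plain paragraphs.

```python
import html
from string import Template

# Shared page template: the title and footer live here, not in each post.
PAGE = Template("""<html>
<head><title>$site_title</title></head>
<body>
$body
<div class="footer"><p>$footer</p></div>
</body>
</html>""")


def render_post(markdown_text, site_title, footer):
    """Convert a very small subset of markdown into a full HTML page."""
    parts = []
    # Blocks are separated by blank lines.
    for block in markdown_text.strip().split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("# "):
            parts.append("<h1>{}</h1>".format(html.escape(block[2:])))
        else:
            parts.append("<p>{}</p>".format(html.escape(block)))
    return PAGE.substitute(
        site_title=site_title, body="\n".join(parts), footer=footer
    )


post = "# First post!\n\nFuture posts will teach you about data science."
print(render_post(post, "The best blog!", "Thanks for visiting!"))
```

Changing the footer for every post now means editing one template rather than every HTML file.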
There are a few different static site generators. The most popular is called Jekyll, and is written in Ruby. Since we’ll be making a data science blog, we want a static site generator that can process Jupyter notebooks.
Pelican is a static site generator that is written in Python that can take in Jupyter notebook files and convert them to HTML blog posts. Pelican also makes it easy to deploy our blog to GitHub Pages, where other people can read our blog.
Before we get started, here’s a repo that’s an example of what we’ll eventually get to.
If you don’t have Python installed, you’ll need to do some preliminary setup before we get started. Here are setup instructions for Python. We recommend using Python 3.5. Once you have Python installed:
- Create a folder. We’ll put our blog content and styles in this folder. We’ll refer to it in this tutorial as jupyter-blog, but you can call it whatever you want.
- Create a file called .gitignore, and add in the content from this file. We’ll need to eventually commit our repo to git, and this will exclude some files when we do.
- Create and activate a virtual environment.
- Create a file called requirements.txt, with the following content:

    Markdown==2.6.6
    pelican==3.6.3
    jupyter>=1.0
    ipython>=4.0
    nbconvert>=4.0
    beautifulsoup4
    ghp-import==0.4.1
    matplotlib==1.5.1

- Run pip install -r requirements.txt in jupyter-blog to install all of the packages in requirements.txt.
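If you haven’t created a virtual environment before, the standard library’s venv module is one way to do it. This is a minimal sketch: the create_blog_env helper and the env folder name are assumptions for illustration, not something the rest of this post depends on.

```python
import venv
from pathlib import Path


def create_blog_env(blog_dir, env_name="env", with_pip=True):
    """Create a virtual environment inside the blog folder and return its path."""
    env_path = Path(blog_dir) / env_name
    # with_pip=True installs pip into the new environment so that
    # pip install -r requirements.txt works once it is activated.
    venv.create(str(env_path), with_pip=with_pip)
    return env_path
```

Calling create_blog_env("jupyter-blog") would create jupyter-blog/env. Activation happens in your shell rather than in Python: source env/bin/activate on macOS and Linux, or env\Scripts\activate on Windows.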
Creating Your Data Science Blog
Once you’ve done the preliminary setup, you’re ready to create your blog! Run pelican-quickstart in jupyter-blog to start an interactive setup sequence for your blog. You’ll get a sequence of questions that will help you set up your blog properly. For most of the questions, it’s okay to just hit Enter and accept the default value. The only ones you should fill out are the title of the website, the author of the website, n for the URL prefix, and the timezone. Here’s an example:
    (jupyter-blog)➜ jupyter-blog ✗ pelican-quickstart
    Welcome to pelican-quickstart v3.6.3.

    This script will help you create a new Pelican-based website.

    Please answer the following questions so this script can generate the files needed by Pelican.

    > Where do you want to create your new web site? [.]
    > What will be the title of this web site? Vik's Blog
    > Who will be the author of this web site? Vik Paruchuri
    > What will be the default language of this web site? [en]
    > Do you want to specify a URL prefix? e.g., http://example.com (Y/n) n
    > Do you want to enable article pagination? (Y/n)
    > How many articles per page do you want?
    > What is your time zone? [Europe/Paris] America/Los_Angeles
    > Do you want to generate a Fabfile/Makefile to automate generation and publishing? (Y/n)
    > Do you want an auto-reload & simpleHTTP script to assist with theme and site development? (Y/n)
    > Do you want to upload your website using FTP? (y/N)
    > Do you want to upload your website using SSH? (y/N)
    > Do you want to upload your website using Dropbox? (y/N)
    > Do you want to upload your website using S3? (y/N)
    > Do you want to upload your website using Rackspace Cloud Files? (y/N)
    > Do you want to upload your website using GitHub Pages? (y/N)
After running pelican-quickstart, you should have two new folders in jupyter-blog, content and output, along with several files, such as pelicanconf.py and publishconf.py. Here’s an example of what should be in the folder:
    jupyter-blog
    │   output
    │   content
    │   .gitignore
    │   develop_server.sh
    │   fabfile.py
    │   Makefile
    │   requirements.txt
    │   pelicanconf.py
    │   publishconf.py
Installing the Jupyter Plugin
Pelican doesn’t support writing blog posts using Jupyter by default — we’ll need to install a plugin that enables this behavior. We’ll install the plugin as a git submodule to make it easier to manage. If you don’t have git installed, you can find instructions here. Once you have git installed:
- Run git init to initialize the current folder as a git repository.
- Create the folder plugins.
- Run git submodule add git://github.com/danielfrg/pelican-ipynb.git plugins/ipynb to add in the plugin.
You should now have a .gitmodules file and a plugins folder:

    jupyter-blog
    │   output
    │   content
    │   plugins
    │   .gitignore
    │   .gitmodules
    │   develop_server.sh
    │   fabfile.py
    │   Makefile
    │   requirements.txt
    │   pelicanconf.py
    │   publishconf.py
In order to activate the plugin, we’ll need to modify pelicanconf.py and add these lines at the bottom:

    MARKUP = ('md', 'ipynb')
    PLUGIN_PATH = './plugins'
    PLUGINS = ['ipynb.markup']
These lines tell Pelican to activate the plugin when generating HTML.
Writing your First Post
Once the plugin is installed, we can create the first post:
- Create a Jupyter notebook with some basic content. Here’s an example you can download if you want.
- Copy the notebook file into the content folder.
- Create a file that has the same name as your notebook, but with the extension .ipynb-meta. Here’s an example.
- Add the following content to the ipynb-meta file, but change the fields to match your own post:

    Title: First Post
    Slug: first-post
    Date: 2016-06-08 20:00
    Category: posts
    Tags: python firsts
    author: Vik Paruchuri
    Summary: My first post, read it to find out.
Here’s an explanation of the fields:
- Title: the title of the post.
- Slug: the path at which the post will be accessed on the server. If the slug is first-post, and your server is jupyter-blog.com, you’d access the post under jupyter-blog.com at a path derived from first-post.
- Date: the date the post will be published.
- Category: a category for the post (this can be anything).
- Tags: a space-separated list of tags to use for the post. These can be anything.
- Author: the name of the author of the post.
- Summary: a short summary of your post.
You’ll need to copy in a notebook file, and create an ipynb-meta file whenever you want to add a new post to your blog.
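Since every post needs a meta file, a small helper can save some typing. The sketch below is hypothetical (write_meta is not part of Pelican or the plugin); it simply writes the same fields shown above next to a notebook, using the notebook’s file name plus -meta.

```python
from pathlib import Path


def write_meta(notebook_path, title, slug, date, category, tags, author, summary):
    """Write an .ipynb-meta file next to the given notebook and return its path."""
    notebook_path = Path(notebook_path)
    # first-post.ipynb becomes first-post.ipynb-meta in the same folder.
    meta_path = notebook_path.with_name(notebook_path.name + "-meta")
    lines = [
        "Title: " + title,
        "Slug: " + slug,
        "Date: " + date,
        "Category: " + category,
        "Tags: " + " ".join(tags),  # tags are space-separated
        "author: " + author,
        "Summary: " + summary,
    ]
    meta_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return meta_path
```

For example, write_meta("content/first-post.ipynb", "First Post", "first-post", "2016-06-08 20:00", "posts", ["python", "firsts"], "Vik Paruchuri", "My first post, read it to find out.") would reproduce the meta file from earlier, assuming the content folder already exists.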
Once you’ve created the notebook and the meta file, you’re ready to generate your blog HTML files. Here’s an example of what the jupyter-blog folder should look like now:

    jupyter-blog
    │   output
    │   content
        │   first-post.ipynb
        │   first-post.ipynb-meta
    │   plugins
    │   .gitignore
    │   .gitmodules
    │   develop_server.sh
    │   fabfile.py
    │   Makefile
    │   requirements.txt
    │   pelicanconf.py
    │   publishconf.py
In order to generate HTML from our post, we’ll need to run Pelican to convert the notebooks to HTML, then run a local server to be able to view them:
- Switch to the jupyter-blog folder.
- Run pelican content to generate the HTML.
- Switch to the output folder.
- Run python -m pelican.server.
- Visit localhost:8000 in your browser to preview the blog.
You should be able to browse a listing of all the posts in your data science blog, along with the specific post you created.
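If you’d like to see what the preview server is doing, here is an alternative sketch that serves the output folder using only the standard library. It is not Pelican’s own server; make_preview_server is a hypothetical helper, and it assumes Python 3.7 or newer (for the directory argument and ThreadingHTTPServer).

```python
import functools
import http.server


def make_preview_server(folder="output", host="127.0.0.1", port=8000):
    """Build an HTTP server that serves the files in the given folder."""
    # Bind the handler to the folder so we don't have to change directories.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(folder)
    )
    return http.server.ThreadingHTTPServer((host, port), handler)


# Usage, run from the jupyter-blog folder (stop with Ctrl+C):
#     make_preview_server().serve_forever()
```

With the server running, the blog would be available at http://127.0.0.1:8000.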
Creating a GitHub Page
GitHub Pages is a feature of GitHub that allows you to quickly deploy a static site and let anyone access it using a unique URL. In order to set it up, you’ll need to:
- Sign up for GitHub if you haven’t already.
- Create a repository called username.github.io, where username is your GitHub username. Here’s a more detailed guide on how to do this.
- Switch to the jupyter-blog folder.
- Add the repository as a remote for your local git repository by running git remote add origin git@github.com:username/username.github.io.git, replacing both references to username with your GitHub username.
A GitHub page will display whatever HTML files are pushed up to the master branch of the repository username.github.io at the URL username.github.io (the repository name and the URL are the same).
First, we’ll need to modify Pelican so that URLs point to the right spot:
- Edit SITEURL in publishconf.py, so that it is set to https://username.github.io, where username is your GitHub username.
- Run pelican content -s publishconf.py. When you want to preview your blog locally, run pelican content. Before you deploy, run pelican content -s publishconf.py. This uses the correct settings file for deployment.
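To illustrate why the site URL matters, here is a small sketch of how a post’s link changes between a local preview and a deployed build. The article_url helper is hypothetical and is not Pelican’s code; it assumes posts are written to files named after their slug with an .html extension.

```python
# Value for publishconf.py; replace username with your GitHub username.
SITEURL = "https://username.github.io"


def article_url(slug, siteurl=SITEURL):
    """Build the link to a post, assuming it is saved as <slug>.html."""
    return "{}/{}.html".format(siteurl.rstrip("/"), slug)


print(article_url("first-post"))      # link in a deployed build
print(article_url("first-post", ""))  # link with an empty site URL (local preview)
```

With an empty site URL the link is relative to whatever server you are previewing on, while the deployed build points at your GitHub Page.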
Committing Your Files
If you want to store your actual notebooks and other files in the same Git repo as a GitHub Page, you can use git branches.
- Run git checkout -b dev to create and switch to a branch called dev. We can’t use master to store our notebooks, since that’s the branch that’s used by GitHub Pages.
- Create a commit and push to GitHub like normal (using git add, git commit, and git push).
Deploy to GitHub Pages
We’ll need to add the content of the blog to the master branch for GitHub Pages to work properly. Currently, the HTML content is inside the folder output, but we need it to be at the root of the repository, not in a subfolder. We can use the ghp-import tool for this:
- Run ghp-import output -b master to import everything in the output folder to the master branch.
- Run git push origin master to push your content to GitHub.
- Try visiting username.github.io. You should see your page!
Whenever you make a change to your data science blog, just re-run the pelican content -s publishconf.py, ghp-import output -b master, and git push origin master commands above, and your GitHub Page will be updated.
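Because these three commands always run together, you may want to wrap them in a small script. The following is a sketch under that assumption; the deploy function and its dry_run and runner parameters are hypothetical conveniences, and by default it only prints the commands instead of running them.

```python
import subprocess

# The deployment commands from this post, in the order they should run.
DEPLOY_COMMANDS = [
    ["pelican", "content", "-s", "publishconf.py"],
    ["ghp-import", "output", "-b", "master"],
    ["git", "push", "origin", "master"],
]


def deploy(dry_run=True, runner=subprocess.run):
    """Print each deployment command and, unless dry_run is set, execute it."""
    for command in DEPLOY_COMMANDS:
        print("$ " + " ".join(command))
        if not dry_run:
            # check=True stops the deployment if any command fails.
            runner(command, check=True)
    return DEPLOY_COMMANDS


deploy(dry_run=True)
```

Calling deploy(dry_run=False) from the jupyter-blog folder, with your virtual environment active, would run the commands for real.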
Comments are one way to interact with your guests. Disqus is a good tool for this, and it integrates with Pelican seamlessly.
Follow these steps:
- Go to the Disqus site and register.
- Click “Get Started”, then choose “I want to install Disqus on my site”.
- Enter your Website Name. This will serve as a unique key to link Disqus to your blog, by passing it into the publishconf.py file. (More on this in a future step.)
- Choose a Disqus subscription plan — a basic plan is perfect for a personal blog.
- When Disqus asks which platform your site is on, scroll down and choose “I don’t see my platform listed, install manually with Universal Code”.
- On the Universal Code page, scroll down again and click “Configure”.
- On the Configure page, fill in the “Website URL” section with your actual website address (https://username.github.io). You can also add information about your comment policy (if you don’t have one, Disqus gives suggestions), and enter a description for your site. Click “Complete Setup”.
- You’ll now have the ability to configure your site’s community settings. Click into this section and look around. Among other things, you’ll be able to control whether guests can comment, and activate ads.
- In the toolbar on the left, click “Advanced” and add your website’s domain to Trusted Domains.
- Lastly, update publishconf.py. Make sure to specify DISQUS_SITENAME = "website-name", where “website-name” comes from step 3.
Now rerun the pelican content -s publishconf.py, ghp-import output -b master and git push origin master commands to update your GitHub Page. Refresh your website and you’ll see Disqus appearing under each post.
Choose a Theme
The Pelican community offers a variety of themes at pelicanthemes.com.
You can choose any theme you like, but here are some quick tips:
- Keep it simple. The design should not distract from the actual content.
- Remember the “Rule of Three Colors”. According to a University of Toronto study, most people prefer combinations of two to three colors. This way colors don’t fight for attention.
- Pay attention to the width of your page — it should be enough to contain infographics and code that you may want to publish.
Once you’ve picked a theme, go to the folder where you wish to store your themes and clone the themes repository:
- Run git clone --recursive https://github.com/getpelican/pelican-themes pelican-themes.
- Create the THEME variable in your pelicanconf.py file and set its value to the location of the theme:

    THEME = 'E:\\Pelican\\pelican-themes\\flex'

Here, we are using the Flex theme by Alexandre Vicenzi.
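The path above is specific to one Windows machine. If you work on several machines, one option is to build the path in pelicanconf.py instead of hard-coding it. This is only a sketch: THEMES_ROOT is a hypothetical location, so point it at wherever you cloned pelican-themes, and check the theme’s actual folder name in your clone.

```python
from pathlib import Path

# Hypothetical location: the folder where pelican-themes was cloned.
THEMES_ROOT = Path.home() / "pelican-themes"

# Folder name as written in this post; adjust it to match your clone.
THEME = str(THEMES_ROOT / "flex")

print(THEME)
```

Using pathlib means the separators come out right on Windows, macOS, and Linux without escaping backslashes by hand.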
Run the usual finishing commands (pelican content -s publishconf.py, ghp-import output -b master, and git push origin master) and enjoy a new look!
We’ve come a long way! You now should be able to author blog posts and push them to GitHub Pages. Anyone should be able to access your data science blog at username.github.io (replace username with your GitHub username). This gives you a great way to show off your data science portfolio.
As you write more posts and gain an audience, you may want to dive more into a few areas:
- Your own custom URL. username.github.io is nice, but sometimes you want a more custom domain. Here’s a guide on using a custom domain with GitHub Pages.
- Check out the list of plugins here. Plugins can help you set up analytics, commenting, and more.
At Dataquest, our interactive guided projects are designed to help you start building a data science portfolio to demonstrate your skills to employers and get a job in data. If you’re interested, you can sign up and do our first module for free.
If you liked this, you might like to read the other posts in our ‘Build a Data Science Portfolio’ series:
Vik is the CEO and Founder of Dataquest.