
DigitalOcean and Docker for Data Science

Creating a cloud-based data science environment for faster analysis

There are times when working on data science problems with your local machine just doesn't cut it anymore. Maybe your computer is old, and can't work with larger datasets. Or maybe you want to be able to access your work from anywhere, and collaborate with others. Or maybe you have an analysis that will take a long time to run, and you don't want to tie up your own computer. In these cases, it is useful to run Jupyter on a server, so you can access it through a browser.

We can do this easily by using Docker. See our earlier post on how to set up a data science environment using Docker for background. This post builds on that one, and sets up Docker and Jupyter on a server.

Cloud hosting

The first step is to initialize a server. You can provision servers in the cloud using sites like Amazon Web Services or DigitalOcean. Both of these are cloud hosting providers -- they have a pool of servers, and they rent them out by the hour to people who want to run programs. When you rent a server, you get full access to it, and can install and run programs, just like on your local computer. A nice way to think of a server is as a computer that is physically located somewhere else. There's nothing special about a server; it's just another computer running an operating system that you can access over the internet.

To initialize our server, we'll use DigitalOcean, because it's a bit cheaper and simpler than Amazon Web Services.

Starting a server

With DigitalOcean, starting a server is fast and easy. The first step is to sign up for an account here. The second step is to create a droplet, which is the DigitalOcean term for a server. There's a good tutorial on how to do that here. As you go through the tutorial, make sure you select a droplet with at least 2GB of RAM, pick Ubuntu 14.04 64-bit as the OS, and make sure you add an ssh key by following this tutorial.

Once you create the server, be sure to note down the IP address. It should be at the top left of the page after you create the server, and will look like 162.243.1.205. This address is how your server is located on the internet, and you'll be using it to access it later.

[Screenshot: the droplet's IP address, shown at the top left of the page]

Logging into the server

Once the server is set up, you can log in to it via SSH, or Secure Shell. SSH enables you to remotely log in to a machine using a special key to authenticate (you generated this key in the tutorial earlier). Once you're logged in, you'll be given access to the command line on the server. You can execute commands just like you could on your local computer.

See this tutorial for how to SSH into your server. If you get an authentication error, you may need to add the right SSH key first. You can do this using the ssh-add command on Linux and OS X. Type ssh-add /home/vik/.ssh/id_rsa (replace /home/vik/.ssh/id_rsa with the path to your key) to add the right SSH key, and retry the tutorial steps to SSH into the server.
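The key-loading and login steps might look like the following sketch. KEY_PATH and SERVER_IP are placeholders (203.0.113.10 is an address reserved for documentation), and the script only prints the login command rather than opening a session.

```shell
# Placeholders: replace with your own key path and droplet IP address.
KEY_PATH="$HOME/.ssh/id_rsa"
SERVER_IP="203.0.113.10"

# ssh-add needs a running ssh-agent, which exposes its socket in SSH_AUTH_SOCK.
if [ -f "$KEY_PATH" ] && [ -n "${SSH_AUTH_SOCK:-}" ]; then
    ssh-add "$KEY_PATH" || echo "ssh-add could not load $KEY_PATH."
else
    echo "No key at $KEY_PATH or no ssh-agent running; skipping ssh-add."
fi

# The first login is as root, the account a new droplet starts with.
LOGIN_CMD="ssh root@$SERVER_IP"
echo "To log in, run: $LOGIN_CMD"
```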

Creating a new user account

We're currently logged into the server as the root user. This user has full access to everything on the system. This is fine for an initial login, but installing and running software as root is risky, because any mistake or misbehaving program has unrestricted access to the machine. We'll instead create a new user, called ds (short for data science). Follow this tutorial to create a new user. Remember to use the user name ds instead of demo, the name used in the tutorial.

When you're done making a new user, quit the SSH session by typing exit. Log in to the server again as the new user ds by typing ssh ds@SERVER_IP (replace SERVER_IP with the IP address of your server).
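For orientation, the usual Ubuntu steps for this can be sketched as below. This assumes the tutorial relies on adduser and the sudo group, which is common on Ubuntu but isn't confirmed here. create_ds_user is a hypothetical helper name; the block only defines it, and nothing runs until you call it as root on the server.

```shell
# Hypothetical helper: run as root on the server, e.g. `create_ds_user ds`.
create_ds_user() {
    user="${1:-ds}"

    # adduser prompts interactively for the new account's password.
    adduser "$user"

    # Membership in the sudo group lets the user run commands with sudo.
    usermod -aG sudo "$user"

    # Copy root's authorized SSH keys so key-based login works for the new user.
    mkdir -p "/home/$user/.ssh"
    cp /root/.ssh/authorized_keys "/home/$user/.ssh/authorized_keys"
    chown -R "$user:$user" "/home/$user/.ssh"
    chmod 700 "/home/$user/.ssh"
    chmod 600 "/home/$user/.ssh/authorized_keys"
}
```

Copying authorized_keys is what allows ssh ds@SERVER_IP to work with the same key you used for root.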

Installing Docker

Once you're connected to the server via SSH as the user ds, you should see a command prompt. This prompt will let you execute any bash shell commands, such as cd to change directories, and mv to move files. The server is running Ubuntu 14.04, which is a Linux distribution.

The first thing we'll do is install Docker. Follow the instructions here to install Docker on the server.

Make sure to run sudo usermod -aG docker ds after installing Docker, then exit the ssh session using exit, and ssh back in with ssh ds@SERVER_IP.
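The usermod command adds ds to the docker group, which lets that user run docker without sudo. Group membership is only picked up at login, which is why the SSH session has to be restarted. A quick way to confirm it took effect, as a small sketch (IN_DOCKER_GROUP is just an illustrative variable name):

```shell
# List the current user's groups, one per line, and look for an exact "docker" match.
if id -nG | tr ' ' '\n' | grep -qx docker; then
    IN_DOCKER_GROUP="yes"
else
    IN_DOCKER_GROUP="no"
fi
echo "In docker group: $IN_DOCKER_GROUP"
```

If this prints "no" after logging back in, re-run the usermod command and reconnect.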

Creating a notebook directory

The second step is to make a directory on the machine to hold your notebook files. You should be logged in as the ds user, so just type mkdir -p /home/ds/notebooks to create the notebooks directory.

Downloading the appropriate Docker image

The next step is to download the Docker image you want. Here are our currently available data science images:

  • dataquestio/python3-starter -- This contains a python 3 installation, jupyter notebook, and many popular data science libraries such as numpy, pandas, scipy, scikit-learn, and nltk.
  • dataquestio/python2-starter -- This contains a python 2 installation, jupyter notebook, and many popular data science libraries such as numpy, pandas, scrapy, scipy, scikit-learn, and nltk.

You can download the images by typing docker pull IMAGE_NAME.

Starting the container

Once the image is downloaded, you can start your Docker container with docker run -d -p 8888:8888 -v /home/ds/notebooks:/home/ds/notebooks dataquestio/python3-starter. Replace dataquestio/python3-starter with the name of the image you want to use.

Once this is executed, we'll have a Jupyter server running on port 8888 of the server.
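To confirm the container actually started, something like the following sketch can be run on the server. It only inspects state and doesn't start or pull anything; IMAGE_NAME should match whichever image you chose.

```shell
IMAGE_NAME="dataquestio/python3-starter"

if command -v docker >/dev/null 2>&1; then
    # Show running containers that were started from the chosen image.
    docker ps --filter "ancestor=$IMAGE_NAME" || echo "Could not query Docker."
else
    echo "docker is not installed on this machine."
fi

# Request only the response headers; any HTTP status line means the port is answering.
STATUS_LINE="$(curl -sI --max-time 5 http://127.0.0.1:8888 2>/dev/null | head -n 1)"
echo "${STATUS_LINE:-Nothing is answering on port 8888 yet.}"
```

One design point worth knowing: -p 8888:8888 publishes the port on all of the server's network interfaces, so Jupyter may also be reachable directly at SERVER_IP:8888. If you want requests to go only through a proxy running on the same machine, -p 127.0.0.1:8888:8888 is a variant that binds the port to localhost only.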

Installing nginx

Nginx is an HTTP and reverse proxy server. What this means is that it can take requests from the internet and pass them through to our Jupyter server. Putting nginx in front of Jupyter makes it easier to add the things a public-facing web application needs, such as encryption and password protection.

The first thing we'll need to do is install nginx, which can be done with sudo apt-get install nginx.

You should now be able to type your server IP address into the browser's address bar, and see a generic message that says "Welcome to nginx".

Setting up certificates

We'll want to encrypt traffic between our browser and Jupyter notebook. This will prevent anyone from intercepting sensitive data or passwords we send back and forth.

In order to do this, we'll need to generate an SSL Certificate. This certificate enables our browser to establish a secure link to a remote server.

We'll have to generate and install the certificate on our server. You can do this by following step one in this guide. Be sure to specify your server IP address instead of the domain name. Make sure to stop after step one -- we have a different nginx configuration we'll need to do.
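For reference, a self-signed certificate is typically generated with openssl req. The sketch below assumes that is roughly what the guide does (not confirmed here). It writes into a scratch directory so it can be tried safely; on the server, the files would go in /etc/nginx/ssl, created and written with sudo, to match the paths in the nginx configuration used in this post. SERVER_IP and CERT_DIR are placeholders.

```shell
SERVER_IP="203.0.113.10"              # placeholder: use your droplet's IP address
CERT_DIR="${CERT_DIR:-$(mktemp -d)}"  # on the server: /etc/nginx/ssl

# -x509 emits a self-signed certificate rather than a signing request,
# -nodes leaves the private key unencrypted so nginx can read it at startup,
# -subj supplies the certificate subject so openssl doesn't prompt for it.
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
    -keyout "$CERT_DIR/nginx.key" \
    -out "$CERT_DIR/nginx.crt" \
    -subj "/CN=$SERVER_IP"

# Print the subject and expiry date to confirm the certificate was written.
openssl x509 -in "$CERT_DIR/nginx.crt" -noout -subject -enddate
```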

Creating a password

We'll also generate a password for nginx, so only you can access your Jupyter notebooks.

To do this, we'll run:

  • sudo apt-get install apache2-utils
  • sudo htpasswd -c /etc/nginx/.htpasswd ds
    • Enter a password at the prompt -- your username will be ds, and your password will be set to what you enter here.
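If you'd rather not install apache2-utils, a password file entry in the same format can be produced with openssl passwd. This is a substitute technique, not what the steps above use. HTPASSWD_FILE and the sample password are placeholders, and the sketch writes to a temporary file rather than /etc/nginx/.htpasswd.

```shell
HTPASSWD_FILE="${HTPASSWD_FILE:-$(mktemp)}"   # on the server: /etc/nginx/.htpasswd
NGINX_USER="ds"
NGINX_PASSWORD="change-me"                    # placeholder only; pick your own

# -apr1 selects the Apache MD5 scheme, which nginx basic auth understands;
# -stdin keeps the password off the openssl command line.
HASH="$(printf '%s\n' "$NGINX_PASSWORD" | openssl passwd -apr1 -stdin)"

# Each line of the file has the form user:hash.
printf '%s:%s\n' "$NGINX_USER" "$HASH" > "$HTPASSWD_FILE"
cat "$HTPASSWD_FILE"
```

The htpasswd route has the advantage of prompting for the password, so it never appears in a script or in your shell history.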

Setting up nginx configuration

The final step is to set up the nginx configuration.

First, we'll remove the default nginx welcome message by typing sudo rm /etc/nginx/sites-enabled/default.

Then, we'll make our own configuration file. Type sudo nano /etc/nginx/sites-enabled/ds. This will open a text editor. Paste the following into the text editor:

server {
    listen 80;
    return 301 https://$host$request_uri;
}

server {

    set $custom_host $host;
    listen 443 ssl;

    ssl_certificate /etc/nginx/ssl/nginx.crt;
    ssl_certificate_key /etc/nginx/ssl/nginx.key;

    client_max_body_size 10M;

    location / {
        proxy_set_header Host $custom_host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Protocol $scheme;

        auth_basic "Restricted";
        auth_basic_user_file /etc/nginx/.htpasswd;
        
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Origin "";
        
        proxy_pass http://127.0.0.1:8888;
    }
}

Finally, save and close the file by hitting Control + X, then Y, then Enter to confirm the file name.

Then, type sudo service nginx restart to restart nginx.
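A slightly more defensive variant of the restart step checks the configuration first, so a typo in the pasted file doesn't take nginx down. This is a sketch; restart_nginx_safely is a hypothetical helper name.

```shell
# Hypothetical helper: validate the configuration, then restart only if it parses.
restart_nginx_safely() {
    # nginx -t checks the configuration files for syntax errors without
    # affecting the running server.
    sudo nginx -t && sudo service nginx restart
}

if command -v nginx >/dev/null 2>&1; then
    restart_nginx_safely || echo "nginx was not restarted; see the errors above."
else
    echo "nginx is not installed on this machine."
fi
```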

Trying it out

Now, you can visit your server IP address in your browser. You'll automatically be redirected to the secure version of the site. You may see a screen like this (this is from Chrome):

[Screenshot: Chrome's certificate warning screen]

This is happening because we used a self-signed SSL certificate. Most sites use certificates signed by certificate authorities, but that costs money, and we don't want to spend that on a site few people will use. Bypass this screen to keep connecting to the site -- with Chrome, you can do this by clicking Advanced, then Proceed at the bottom.

Once you click Proceed, you'll see a prompt for a password. Type ds as the username, and enter the password you set up for nginx in an earlier step.

[Screenshot: the browser's username and password prompt]
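The same checks can be made from a terminal with curl, which is handy when the browser isn't behaving as expected. This is a sketch: http_status is a hypothetical helper, and nothing is contacted unless you set SERVER_IP to your droplet's address first.

```shell
# Hypothetical helper: print the HTTP status code returned for a URL.
http_status() {
    # -k accepts the self-signed certificate, -s hides progress output,
    # -o /dev/null discards the body, -w prints only the status code.
    curl -k -s -o /dev/null -w '%{http_code}' --max-time 10 "$@"
}

# Set SERVER_IP to your droplet's address to run the checks.
if [ -n "${SERVER_IP:-}" ]; then
    # The port 80 server block answers every request with a 301 redirect.
    echo "http:  $(http_status "http://$SERVER_IP/")"
    # Without credentials, basic auth should reject the request with a 401.
    echo "https: $(http_status "https://$SERVER_IP/")"
fi
```

To test with credentials, extra curl options are passed straight through, for example http_status -u ds "https://$SERVER_IP/", which prompts for the password.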

Connecting to the server

You've now finished all the needed setup! You should be through to the Jupyter server, which will look like this:

[Screenshot: the Jupyter notebook server home page]

You can upload and download data files and notebooks via the Jupyter interface, and should be able to start doing data science in the cloud.

Next Steps

Please read our earlier post for information on how to install new packages or modify the Docker containers.

If you want to build on the images we've discussed in this post, you can contribute to our GitHub repository here, which contains the Dockerfiles. We welcome improvements to our current images, or the addition of new images focusing on tools other than Python.

If you're interested in learning data science, check out our interactive data science course at Dataquest.