October 1, 2019

Tutorial: Transforming Data with Python Scripts and the Command Line

In this tutorial, we're going to dig into how to transform data using Python scripts and the command line.

But first, it's worth asking the question you may be thinking: "How does Python fit into the command line, and why would I ever want to interact with Python using the command line when I know I can do all my data science work using IPython notebooks or JupyterLab?"

Notebooks are great for quick data visualization and exploration, but Python scripts are the way to put anything we learn into production. Let's say you want to make a website to help people make Hacker News posts with ideal headlines and submission times. To do this, you'll need scripts.

This tutorial assumes basic knowledge of functions, and a little command line experience wouldn't hurt either. If you haven't worked with Python before, feel free to check out our lesson covering the fundamentals of Python functions, or dig deeper into some of our data science courses. Recently, we've released two new, interactive command line courses: Command Line Elements and Text Processing in the Command Line, so if you want to dig deeper into the command line, we recommend those as well.

That said, don't worry too much about prerequisites! We'll explain everything we're doing as we go along, so let's dive in!

Getting Familiar with the Data

Hacker News is a site where users can submit articles from across the internet (usually about technology and startups), and others can "upvote" the articles, signifying that they like them. The more upvotes a submission gets, the more popular it was in the community. Popular articles get to the "front page" of Hacker News, where they're more likely to be seen by others.

This raises the question: what makes a Hacker News article successful? Are there patterns we can find in terms of what's most likely to get upvoted?

The data set we'll be using was compiled by Arnaud Drizard using the Hacker News API, and can be found here. We've randomly sampled 10,000 rows from the data and removed all extraneous columns. Our data set only has four columns:

  • sublesson_time: when the story was submitted.
  • upvotes: number of upvotes the submission got.
  • url: the base domain of the submission.
  • headline: the headline of the submission. Users can edit this, and it doesn't have to match the headline of the original article.

We'll be writing scripts to answer three key questions:

  • What words appear most often in the headlines?
  • What domains were submitted most often to Hacker News?
  • At what times are most articles submitted?

Remember: There are multiple ways to approach a task when it comes to programming. In this tutorial, we'll walk through one way to approach these questions, but there are certainly others that may be just as valid, so feel free to experiment and try to come up with your own approach!

Reading in Data with the Command Line and a Python Script

To start, let's create a folder called Transforming_Data_with_Python on the Desktop. To create a folder using the command line, you can use the mkdir command followed by the name of the folder. For example, if you wanted to make a folder called test, you could navigate to the Desktop directory and then type mkdir test.

We'll talk about why we created a folder a bit later, but for now, let's navigate to the created folder using the cd command. The cd command allows us to change directories using the command line.

While there are multiple ways to create a file using the command line, we can take advantage of a technique called output redirection to accomplish two things at once: creating a new file and sending output from stdout (the standard output stream, which the command line normally prints to the screen) into that file. In other words, instead of having the command line just print its output, we can have it create a new file and make its output the contents of that file.

To do this, we can use either > or >>, depending on what we want to do with the file. Both will create the file if it does not exist; however, > will overwrite any text already in the file with the redirected output, while >> will append the redirected output to the end of the file.
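For readers more at home in Python than in the shell, the difference between the two operators mirrors Python's 'w' (write) and 'a' (append) file modes. The sketch below is only an analogy, not what the shell does internally; the file name demo.txt is made up, and it lives in a temporary folder so no real files are touched.

```python
import tempfile
from pathlib import Path

# Work in a throwaway folder; demo.txt is a hypothetical file name.
demo_file = Path(tempfile.mkdtemp()) / "demo.txt"

# Like `>`: create the file if needed and overwrite whatever is there.
with open(demo_file, "w") as f:
    f.write("first\n")
with open(demo_file, "w") as f:
    f.write("second\n")
after_overwrite = demo_file.read_text()

# Like `>>`: create the file if needed and add to the end of it.
with open(demo_file, "a") as f:
    f.write("third\n")
after_append = demo_file.read_text()

print(repr(after_overwrite))
print(repr(after_append))
```

After the two 'w' writes only the second line survives, while the 'a' write keeps it and adds a new line underneath.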

We want to read in our data with a script that has a descriptive filename and function name, so we'll create a function called load_data() inside a file called read.py. Let's use the command line to create this function, which reads in the data. For this, we'll use the printf command. (We'll use printf because it allows us to print newline characters and tab characters, which we'll want to use to make our scripts more readable for ourselves and others.)

To do this, we can type the following into the command line:

printf "import pandas as pd\n\ndef load_data():\n\thn_stories = pd.read_csv('hn_stories.csv')\n\thn_stories.columns = ['sublesson_time', 'upvotes', 'url', 'headline']\n\treturn hn_stories\n" > read.py

Examining the code above, there's a lot going on. Let's break it down piece by piece. In this command, we are:

  • Using the printf command to generate the text of the script, which lets us preserve formatting such as line breaks and indentation.
  • Importing pandas.
  • Creating a function called load_data() that contains the code to read in and process the data set.
  • Reading our data set (hn_stories.csv) into a pandas dataframe.
  • Using df.columns to add column names to our dataframe.
  • Returning the dataframe so that other scripts can use it.
  • Using newline characters (\n) and tab characters (\t) so that the script is readable and Python can parse its indentation.
  • Redirecting the output of printf to a file called read.py using the > operator. Because read.py doesn't already exist, the file is created.
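To see what the \n and \t escapes turn into, the script text that the printf command is meant to produce can also be built as a Python string. This is a rough sketch for illustration only: it writes a scratch copy of read.py to a temporary folder (an assumption, so nothing in the project folder is overwritten) and uses the built-in compile() simply to confirm the text is syntactically valid Python, without running it or needing pandas.

```python
import tempfile
from pathlib import Path

# The intended script text: \n starts a new line and \t indents
# the body of load_data().
script_text = (
    "import pandas as pd\n\n"
    "def load_data():\n"
    "\thn_stories = pd.read_csv('hn_stories.csv')\n"
    "\thn_stories.columns = ['sublesson_time', 'upvotes', 'url', 'headline']\n"
    "\treturn hn_stories\n"
)

# Scratch copy in a temporary folder (hypothetical location).
scratch_path = Path(tempfile.mkdtemp()) / "read.py"
scratch_path.write_text(script_text)

# compile() only parses the source. It raises SyntaxError if the line
# breaks or indentation are wrong, and it never imports pandas.
code_object = compile(scratch_path.read_text(), str(scratch_path), "exec")

print(scratch_path.read_text())
```

If one of the \t escapes were missing, compile() would raise an error instead of returning a code object, which is the same kind of error Python would raise when running the script.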

After we run the code above, we can type cat read.py in the command line and execute the command to examine the contents of read.py. If everything ran properly, our read.py file will look like this:

import pandas as pd

def load_data():
    hn_stories = pd.read_csv('hn_stories.csv')
    hn_stories.columns = ['sublesson_time', 'upvotes', 'url', 'headline']
    return hn_stories

After we create this file, our directory structure should look something like this:

 | read.py

Creating __init__.py

For the rest of this project, we will be creating more scripts to answer our questions, and each of them will use the load_data() function. While we could copy and paste this function into every file that uses it, this can get quite cumbersome if the project we're working on is large.

To get around this problem, we can create a file called __init__.py. Essentially, __init__.py tells Python to treat the folder that contains it as a package whose files can be imported as modules. In its simplest form, __init__.py can be an empty file; it just has to exist. You can find more information about packages and modules in the Python docs.

Because load_data() is a function in read.py, we can import that function in the same way we import from packages: from read import load_data (note that the import statement takes the function name without parentheses).
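Here is a small, self-contained sketch of that import. It assumes a stand-in read.py whose load_data() simply returns a list (so the example runs without pandas or the data file), placed in a temporary folder that is added to sys.path. In the tutorial's own folder that extra step isn't needed, because Python already searches the folder of the script being run.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for the tutorial's read.py (hypothetical contents).
project_dir = Path(tempfile.mkdtemp())
(project_dir / "read.py").write_text(
    "def load_data():\n"
    "\treturn ['placeholder row']\n"
)

# Make the temporary folder importable, then import the function by name.
sys.path.insert(0, str(project_dir))
importlib.invalidate_caches()
from read import load_data  # no parentheses in the import statement

rows = load_data()  # parentheses appear only when calling the function
print(rows)
```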

Remember how there are multiple ways to create a file using the command line? We can use another command to create __init__.py. This time, we'll use the touch command, which creates an empty file as soon as you run it:

touch __init__.py

After we create this file, the directory structure will look something like this:

 | __init__.py
 | read.py

Exploring Words In Headlines

Now that we've created a script to read in and process the data as well as created __init__.py, we can start analyzing the data! The first thing we want to explore is the unique words that appear in the headlines. To do this, we want to do the following:

  • Make a file called count.py, using the command line.
  • Import load_data from read.py, and call the function to read in the data set.
  • Combine all of the headlines together into one long string. We'll want to leave a space between each headline when we combine them. For this step, we'll use the Series.str.cat method to join the strings.
  • Split that long string into words.
  • Use the Counter class to count how many times each word occurs in the string.
  • Use the .most_common() method to store the 100 most common words in wordCount, then print the result.
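The counting steps can be tried on a couple of made-up headlines before touching the real data. This sketch swaps pandas' Series.str.cat for the built-in ' '.join so that it runs with the standard library alone; the headlines are invented for illustration and are not rows from the data set.

```python
from collections import Counter

# Invented sample headlines (not taken from the data set).
sample_headlines = [
    "Show HN: A new app",
    "Ask HN: Why use a new  framework?",  # contains a double space
]

# Join with a single space, lowercase, then split on single spaces.
combined = " ".join(sample_headlines).lower()
words = combined.split(" ")

# Count every word and keep the three most frequent.
word_count = Counter(words).most_common(3)
print(word_count)

# Splitting on ' ' turns a double space into an empty string,
# which is one way '' can end up among the counted "words".
empty_count = Counter(words)[""]
print(empty_count)
```

Calling split() with no argument instead would split on any run of whitespace and avoid the empty strings.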

Here's how it looks if you're creating this file using the command line:

printf "from read import load_data\nfrom collections import Counter\n\nstories = load_data()\nheadlines = stories['headline'].str.cat(sep = ' ').lower()\nwordCount = Counter(headlines.split(' ')).most_common(100)\nprint(wordCount)\n" > count.py

After you run the code above, you can type cat count.py in the command line and execute the command to examine the contents of count.py. If everything ran properly, your count.py file will look like this:

from read import load_data
from collections import Counter

stories = load_data()
headlines = stories['headline'].str.cat(sep = ' ').lower()
wordCount = Counter(headlines.split(' ')).most_common(100)
print(wordCount)

Here's what the directory should look like now:

 | __init__.py
 | read.py
 | count.py

Now that we've created our Python script, we can run our script from the command line to get a list of the one hundred most common words. To run the script, we type the command python count.py from the command line.

After the script has run, you'll see these results printed:

[('the', 2045), ('to', 1641), ('a', 1276), ('of', 1170), ('for', 1140), ('in', 1036), ('and', 936), ('', 733), ('is', 620), ('on', 568), ('hn:', 537), ('with', 537), ('how', 526), ('-', 487), ('your', 480), ('you', 392), ('ask', 371), ('from', 310), ('new', 304), ('google', 303), ('why', 262), ('what', 258), ('an', 243), ('are', 223), ('by', 219), ('at', 213), ('show', 205), ('web', 192), ('it', 192), ('–', 184), ('do', 183), ('app', 178), ('i', 173), ('as', 161), ('not', 160), ('that', 160), ('data', 157), ('about', 154), ('be', 154), ('facebook', 150), ('startup', 147), ('my', 131), ('|', 127), ('using', 125), ('free', 125), ('online', 123), ('apple', 123), ('get', 122), ('can', 115), ('open', 114), ('will', 112), ('android', 110), ('this', 110), ('out', 109), ('we', 106), ('its', 102), ('now', 101), ('best', 101), ('up', 100), ('code', 98), ('have', 97), ('or', 96), ('one', 95), ('more', 93), ('first', 93), ('all', 93), ('software', 93), ('make', 92), ('iphone', 91), ('twitter', 91), ('should', 91), ('video', 90), ('social', 89), ('&', 88), ('internet', 88), ('us', 88), ('mobile', 88), ('use', 86), ('has', 84), ('just', 80), ('world', 79), ('design', 79), ('business', 79), ('5', 78), ('apps', 77), ('source', 77), ('cloud', 76), ('into', 76), ('api', 75), ('top', 74), ('tech', 73), ('javascript', 73), ('like', 72), ('programming', 72), ('windows', 72), ('when', 71), ('ios', 70), ('live', 69), ('future', 69), ('most', 68)]

Scrolling through them on our site is a bit awkward, but you may notice that the most common words are things like the, to, a, for, etc. These words are known as stop words: words that are useful in human speech but add nothing to data analysis. You can find more about stop words in our tutorial on spaCy; removing the stop words from our analysis here would be a fun next step if you want to expand this project.
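As a starting point for that next step, here is a minimal sketch that filters a hand-picked set of stop words out of the counts. The STOP_WORDS set and the sample text are assumptions made for illustration; a library such as spaCy ships a much fuller list.

```python
from collections import Counter

# A tiny, hand-picked stop word set (an assumption for this sketch).
STOP_WORDS = {"the", "to", "a", "of", "for", "in", "and", "is", "on"}

# Invented sample text standing in for the combined headlines.
headlines = "the future of the web is a free and open web"

# split() with no argument splits on any run of whitespace,
# so it never produces empty strings.
words = [word for word in headlines.split() if word not in STOP_WORDS]

# Count what is left and keep the two most frequent words.
filtered_count = Counter(words).most_common(2)
print(filtered_count)
```

In count.py, the same filter could be applied to the list produced by headlines.split(' ') before it is passed to Counter.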

Even with the stop words included, though, we can spot some trends. Aside from the stop words, the vast majority of these words are tech- and startup-related terms. That's not a big surprise given Hacker News's focus on tech startups, but we can see some interesting specific trends. For example, Google is the most frequently mentioned brand in this data set. Facebook, Apple, and Twitter are other brands that are top topics of discussion.

Exploring Domain Submissions

Now that we've explored the headlines and displayed the 100 most common words, we can explore the domain submissions! To do this, we can do the following:

  • Make a file called domains.py, using the command line.
  • Import load_data from read.py, and call the function to read in the data set.
  • Use the value_counts() method in pandas to count the number of occurrences of each value in a column.
  • Loop through the series and print the index value and its associated total.
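On a small scale, the counting and looping steps look like the sketch below. The URLs are invented for illustration, and the example assumes pandas is installed in a version that provides Series.items().

```python
import pandas as pd

# Invented sample standing in for the url column.
urls = pd.Series([
    "github.com", "medium.com", "github.com",
    "nytimes.com", "github.com", "medium.com",
])

# value_counts() returns a Series indexed by domain,
# sorted from most to least frequent.
domains = urls.value_counts()

# items() yields (index label, value) pairs, one per domain.
for name, total in domains.items():
    print("{0}: {1}".format(name, total))
```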

Here's how that looks in command line form:

printf "from read import load_data\n\nstories = load_data()\ndomains = stories['url'].value_counts()\nfor name, row in domains.items():\n\tprint('{0}: {1}'.format(name, row))\n" > domains.py

And again, if we type cat domains.py in the command line to examine domains.py, we should see this:

from read import load_data

stories = load_data()
domains = stories['url'].value_counts()
for name, row in domains.items():
    print('{0}: {1}'.format(name, row))

After creating this file, here's what our directory looks like:

 | __init__.py
 | read.py
 | count.py
 | domains.py

Exploring Submission Times

We want to know when most articles are submitted. One easy way to reframe this is to look at what hour articles are submitted. To figure this out, we'll need to use the sublesson_time column.

The sublesson_time column contains timestamps that look like this: 2011-11-09T21:56:22Z. These times are expressed in UTC, which is a universal time zone used by most software for consistency (imagine a database populated with times all having different time zones; it would be a huge pain to work with).

To get the hour from a timestamp, we can use the dateutil library. The parser module in dateutil contains the parse function, which takes in a timestamp and returns a datetime object. Here's a link to the documentation. After parsing the timestamp, the hour property of the resulting datetime object will tell you the hour the article was submitted.
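If dateutil isn't installed, the same idea can be tried with the standard library. The sketch below is an alternative to dateutil.parser.parse, not the approach the script in this tutorial uses: it relies on datetime.strptime with a format string that only matches timestamps shaped exactly like the example above. The function name extract_hour_stdlib is made up for this illustration.

```python
from datetime import datetime

# Format matching timestamps such as 2011-11-09T21:56:22Z.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def extract_hour_stdlib(timestamp):
    """Return the hour (0-23) from a timestamp string in the format above."""
    parsed = datetime.strptime(timestamp, TIMESTAMP_FORMAT)
    return parsed.hour

hour = extract_hour_stdlib("2011-11-09T21:56:22Z")
print(hour)
```

Unlike parse, strptime raises a ValueError for any timestamp that doesn't match the format, so dateutil remains the more forgiving choice for messy data.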

To do this, we can do the following:

  • Make a file called times.py, using the command line.
  • Write a function to extract the hour from a timestamp. This function should first use dateutil.parser.parse to parse the timestamp, then read the hour from the resulting datetime object using .hour, and finally return it.
  • Use the pandas apply() method to make a column of submission hours.
  • Use the value_counts() method in pandas to count the number of occurrences of each hour.
  • Print out the results.

Here's how we do that in the command line:

printf "from dateutil.parser import parse\nfrom read import load_data\n\n\ndef extract_hour(timestamp):\n\tdatetime = parse(timestamp)\n\thour = datetime.hour\n\treturn hour\n\nstories = load_data()\nstories['hour'] = stories['sublesson_time'].apply(extract_hour)\ntime = stories['hour'].value_counts()\nprint(time)" > times.py

And here's how it looks as a separate .py file (which, as discussed above, you can confirm by running cat times.py from the command line to inspect the file):

from dateutil.parser import parse
from read import load_data

def extract_hour(timestamp):
    datetime = parse(timestamp)
    hour = datetime.hour
    return hour

stories = load_data()
stories['hour'] = stories['sublesson_time'].apply(extract_hour)
time = stories['hour'].value_counts()
print(time)

Once again, let's update our directory:

 | __init__.py
 | read.py
 | count.py
 | domains.py
 | times.py

Now that we've created our Python script, we can run our script from the command line to get a list of how many articles were posted in a certain hour. To do this, you can type the command python times.py from the command line. Running this script, you will see the following result:

17    646
16    627
15    618
14    602
18    575
19    563
20    538
13    531
21    497
12    398
23    394
22    386
11    347
10    324
7     320
0     317
1     314
2     298
9     298
3     296
4     282
6     279
5     275
8     274

You will notice that most submissions are posted in the afternoon. Keep in mind, however, that these times are in UTC. If you're interested in expanding on this project, try adding a section to your script to convert the output from UTC into your local time zone.
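One rough way to start on that conversion is to shift each hour by a fixed offset. The sketch below assumes a hypothetical offset of five hours behind UTC and ignores daylight saving time, so treat it as an approximation; the three counts are copied from the output above.

```python
# A few hour -> count pairs copied from the script's output (hours in UTC).
utc_counts = {17: 646, 16: 627, 0: 317}

# Hypothetical fixed offset from UTC in hours (ignores daylight saving).
UTC_OFFSET_HOURS = -5

# Shift each hour and wrap around midnight with modulo 24.
local_counts = {
    (hour + UTC_OFFSET_HOURS) % 24: count
    for hour, count in utc_counts.items()
}

for hour, count in sorted(local_counts.items()):
    print(hour, count)
```

A more careful version would convert each timestamp to a time zone aware datetime before extracting the hour, so that daylight saving changes are handled correctly.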

Next Steps

In this tutorial, we've explored the data and built a directory of short scripts that work with each other to provide the answers we want. This is the first step in building a production version of our data analysis project.

But of course, this is just the start! We haven't made any use of the upvotes data in this tutorial, so that would be a great next step for expanding your analysis:

  • What headline length leads to the most upvotes?
  • What submission time leads to the most upvotes?
  • How is the total number of upvotes changing over time?

We encourage you to think of your own questions, and be creative as you continue exploring this data set!

Celeste Grupman

About the author

Celeste Grupman is the CEO at Dataquest. She is passionate about creating affordable access to high-quality skills training for students across the globe.