Python has some powerful tools that enable you to do natural language processing (NLP). In this tutorial, we'll learn how to do some basic NLP in Python.

## Looking at the data

We’ll be looking at a dataset consisting of submissions to Hacker News from 2006 to 2015, scraped by Arnaud Drizard using the Hacker News API. We’ve randomly sampled 10000 rows from the data and removed all the extraneous columns. Our data only has four columns:

• submission_time – when the story was submitted.
• url – the base url of the submission.
• upvotes – number of upvotes the submission got.
• headline – the headline of the submission.

We’ll be using the headlines to predict the number of upvotes. The data is stored in the submissions variable.
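The loading code isn't shown in this excerpt; here is a minimal sketch of what it might look like, using a couple of made-up rows in place of the real file (in practice you'd pass the path of the downloaded CSV to `pd.read_csv` instead of a `StringIO`):

```python
import io
import pandas as pd

# Two made-up rows standing in for the real file, in the same
# four-column layout described above.
csv_data = io.StringIO(
    "submission_time,url,upvotes,headline\n"
    "2015-06-01T23:45:00Z,example.com,10,Show HN: a toy headline\n"
    "2015-06-02T08:30:00Z,example.org,2,Ask HN: another toy headline\n"
)
submissions = pd.read_csv(csv_data)
print(submissions.shape)  # (2, 4)
```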

## Natural Language Processing - First steps

We want to eventually train a machine learning algorithm to take in a headline and tell us how many upvotes it would receive. However, machine learning algorithms only understand numbers, not words. How do we translate our headlines into something an algorithm can understand?

The first step is to create something called a bag of words matrix. A bag of words matrix gives us a numerical representation of which words are in which headlines.

In order to construct a bag of words matrix, we first find the unique words across the whole set of headlines. Then, we set up a matrix where each row is a headline and each column is one of the unique words. Finally, we fill in each cell with the number of times that word occurred in that headline.

This will result in a matrix where a lot of the cells have a value of zero, unless the vocabulary is mostly shared between the headlines.
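The steps above can be sketched in plain Python, using a few made-up headlines for illustration (scikit-learn will do this for us automatically later):

```python
from collections import Counter

# Toy headlines standing in for the real dataset.
headlines = [
    "Why the Go language is doomed",
    "Why I like Python",
    "Ask HN: why do people like Go",
]

# Tokenize naively on whitespace; punctuation handling comes later.
tokenized = [h.split() for h in headlines]

# The columns of the matrix: every unique token across all headlines.
vocabulary = sorted(set(token for tokens in tokenized for token in tokens))

# One row per headline, one column per unique word, cells are counts.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
print(matrix)
```

Note that `Why` and `why` end up as separate columns here, which is exactly the problem the punctuation and lowercasing step below addresses.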

   why  HN:  for  people  immediately  like  I  Go  80  meets  ...   my  who  \
0    0    0    0       0            0     0  0   0   0      0  ...    0    0
1    0    0    0       0            0     0  0   0   0      0  ...    0    0
2    0    0    0       0            0     0  1   0   0      0  ...    0    0
3    0    0    0       0            0     0  1   0   0      0  ...    0    0
4    0    0    0       0            0     0  0   0   0      0  ...    0    0

PretzelBros,  whatever  language  do  $2  still  than  soul
0             0         0         0   0   0      0     0     0
1             0         0         0   0   0      0     0     0
2             0         0         0   0   0      0     0     0
3             0         0         0   0   0      0     0     0
4             0         0         0   0   0      0     0     0

[5 rows x 51 columns]


## Removing punctuation

The matrix we just made is very sparse – that means that a lot of the values are zero. This is unavoidable to some extent, because the headlines don’t have much shared vocabulary. We can take some steps to make the problem better, though. Right now, Why and why (and likewise use and use., with its trailing period) are treated as different entities, but we know they refer to the same word.

We can help the parser recognize that these are in fact the same by lowercasing every word and removing all punctuation.
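A small sketch of this cleanup step, using a regular expression to drop anything that isn't a letter, digit, or space (the exact pattern is a judgment call; this one also strips characters like `$`):

```python
import re

headlines = [
    "Why the Go language is doomed",
    "Why I like Python.",
    "Ask HN: why do people like Go?",
]

def clean(headline):
    # Lowercase first, then remove everything that isn't a-z, 0-9, or a space.
    return re.sub(r"[^a-z0-9 ]", "", headline.lower())

cleaned = [clean(h) for h in headlines]
print(cleaned)
```

After this step, `Why` and `why?` both map to the single token `why`.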

   2  why  top  hn  for  people  immediately  like  python  80  ...   should  \
0  1    0    0   0    0       0            0     0       0   0  ...        0
1  0    0    0   0    0       0            0     0       0   0  ...        0
2  0    0    0   0    0       0            0     0       0   0  ...        0
3  0    0    0   0    0       0            0     0       0   0  ...        0
4  0    0    0   0    0       0            0     0       0   0  ...        0

my  who  go  whatever  language  do  still  than  soul
0   0    0   0         0         0   0      0     0     0
1   0    0   0         0         0   0      0     0     0
2   0    0   0         0         0   0      0     0     0
3   0    0   0         0         0   0      0     0     0
4   0    0   0         0         0   0      0     0     0

[5 rows x 47 columns]


## Removing stopwords

Certain words don’t help you discriminate between good and bad headlines. Words such as the, a, and also occur commonly enough in all contexts that they don’t really tell us much about whether something is good or not. They are generally equally likely to appear in both good and bad headlines.

By removing these, we can reduce the size of the matrix, and make training an algorithm faster.
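Filtering stopwords is a simple set-membership test. The list below is a tiny sample for illustration; real stopword lists (such as the ones that ship with scikit-learn or NLTK) are much longer:

```python
# A small sample of common English stopwords.
stopwords = {"the", "a", "an", "and", "i", "is", "do", "of", "to"}

tokens = "why do people like the go language".split()
filtered = [t for t in tokens if t not in stopwords]
print(filtered)  # ['why', 'people', 'like', 'go', 'language']
```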

   2  top  hn  people  immediately  like  python  80  meets  pretzels  ...   \
0  1    0   0       0            0     0       0   0      0         0  ...
1  0    0   0       0            0     0       0   0      0         0  ...
2  0    0   0       0            0     0       0   0      0         0  ...
3  0    0   0       0            0     0       0   0      0         0  ...
4  0    0   0       0            0     0       0   0      0         0  ...

gta  10  uber  raises  pretzelbros  go  whatever  language  still  soul
0    0   0     0       0            0   0         0         0      0     0
1    0   0     0       0            0   0         0         0      0     0
2    0   0     0       0            0   0         0         0      0     0
3    0   0     0       0            0   0         0         0      0     0
4    0   0     0       0            0   0         0         0      0     0

[5 rows x 34 columns]


## Generating a matrix for all the headlines

Now that we know the basics, we can make a bag of words matrix for the whole set of headlines.

We don’t want to have to code everything out manually every time, so we’ll use a class from scikit-learn to do it automatically. Using the vectorizers from scikit-learn to construct your bag of words matrices will make the process much easier and faster.
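A sketch of how this might look with scikit-learn's `CountVectorizer`, again on toy headlines (the parameter choices here are illustrative; the class has many options for tokenization and filtering):

```python
from sklearn.feature_extraction.text import CountVectorizer

headlines = [
    "Why the Go language is doomed",
    "Why I like Python",
    "Ask HN: why do people like Go",
]

# lowercase=True and the built-in tokenizer handle the cleanup from the
# previous sections; stop_words="english" drops common English words.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
matrix = vectorizer.fit_transform(headlines)

print(matrix.toarray())
print(matrix.shape)
```

`fit_transform` returns a sparse matrix, which is a natural fit given how many cells are zero.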

[[0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 1 0 1 0 0 0 0 0 0 0 0]
[1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0]
[0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1]
[0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 1 0]
[0 0 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0]]
(9356, 13631)


## Reducing dimensionality

We’ve constructed a matrix, but it now has 13631 unique words, or columns. This will take a very long time to make predictions with. We want to speed it up, so we’ll need to cut down the column count somehow.

One way to do this is to pick a subset of the columns that are the most informative – that is, the columns that differentiate between good and bad headlines the best. A good way to figure out the most informative columns is to use something called a chi-squared test.

A chi-squared test finds the words that discriminate the most between highly upvoted posts and posts that weren’t upvoted. This can be words that occur a lot in highly upvoted posts, and not at all in posts without upvotes, or words that occur a lot in posts that aren’t upvoted, but don’t occur in posts that are upvoted.

A chi-squared test only works on binary values, so we’ll make our upvotes column binary by setting anything with more upvotes than average to 1 and anything with fewer upvotes than average to 0.
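A sketch of the binarization and selection steps with scikit-learn's `SelectKBest` and `chi2`, on small synthetic data (the counts and upvote totals below are made up; `k` would be much larger in practice):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy data: 6 "headlines" x 4 "words" (word counts), plus upvote totals.
counts = np.array([
    [2, 0, 1, 0],
    [3, 0, 0, 1],
    [0, 2, 1, 0],
    [0, 3, 0, 1],
    [1, 0, 2, 0],
    [0, 1, 0, 2],
])
upvotes = np.array([50, 40, 1, 2, 45, 3])

# Binarize the target: 1 if above the mean number of upvotes, else 0.
labels = (upvotes > upvotes.mean()).astype(int)

# Keep the 2 columns whose counts are most strongly associated with the label.
selector = SelectKBest(chi2, k=2)
reduced = selector.fit_transform(counts, labels)
print(reduced.shape)  # (6, 2)
```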

One downside of this is that we are using knowledge from the dataset to select features, and thus introducing some overfitting. We could get around the overfitting in the “real world” by using a subset of the data for feature selection, and using a different subset for training the algorithm. We’ll make things a bit simpler for now and skip that step.

If we ignore the “meta” features of the headlines, we miss out on a lot of good information. These features include things like length, amount of punctuation, average word length, and other sentence-specific features.

Adding these in can greatly increase prediction accuracy.

To add them in, we’ll loop over our headlines, and apply a function to each one. Some functions will count the length of the headline in characters, and others will do more advanced things, like counting the number of digits.
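A sketch of such a loop, with a few simple feature functions (the particular features and names here are illustrative):

```python
headlines = [
    "Why the Go language is doomed",
    "10 things I learned from 5 years at a startup",
]

def meta_features(headline):
    words = headline.split()
    return {
        "length": len(headline),                        # characters
        "num_words": len(words),
        "num_digits": sum(c.isdigit() for c in headline),
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }

features = [meta_features(h) for h in headlines]
print(features[1]["num_digits"])  # 3
```

Each dictionary becomes a few extra columns appended alongside the bag of words matrix.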

There are more features we can work with than just text features. We have a column called submission_time, that tells us when a story was submitted, and could add more information.
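For example, the hour of day and day of week can be pulled out of the timestamp with the standard library. The timestamp format string below is an assumption; adjust it to match how `submission_time` is actually stored in your copy of the data:

```python
from datetime import datetime

def time_features(submission_time):
    # Assumed ISO-8601-style format, e.g. "2015-06-01T23:45:00Z".
    dt = datetime.strptime(submission_time, "%Y-%m-%dT%H:%M:%SZ")
    return {
        "hour": dt.hour,          # time of day the story was posted
        "weekday": dt.weekday(),  # 0 = Monday ... 6 = Sunday
    }

print(time_features("2015-06-01T23:45:00Z"))
```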

Often when doing NLP work, you’ll be able to add outside features that make your predictions much better. Some machine learning algorithms can figure out how these features interact with your textual features (i.e. “Posting at midnight with the word ‘tacos’ in the headline results in a high-scoring post”).

## Making predictions

Now that we can translate words to numbers, we can make predictions using an algorithm. We’ll randomly pick 7500 headlines as a training set, and then evaluate the performance of the algorithm on the test set of 2500 headlines.

Predicting the results on the same set that we train on will result in overfitting, where the algorithm is overly optimized to the training set – the error rate will look good, but could actually be much higher on new data.

For the algorithm, we’ll use ridge regression. As compared to ordinary linear regression, ridge regression introduces a penalty on the coefficients, which prevents them from becoming too large. This can help it work with large numbers of predictors (columns) that are correlated to each other, like we have.
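A sketch of the split-and-fit step with scikit-learn's `Ridge`, using synthetic data in place of the real feature matrix (the `alpha` value here is arbitrary; it controls the strength of the coefficient penalty):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)

# Synthetic stand-in for the feature matrix: 100 rows, 20 columns,
# with "upvotes" loosely driven by the first column.
X = rng.rand(100, 20)
y = X[:, 0] * 10 + rng.rand(100)

# Train on the first 75 rows, hold out the rest for evaluation.
X_train, X_test = X[:75], X[75:]
y_train, y_test = y[:75], y[75:]

model = Ridge(alpha=0.1)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions.shape)  # (25,)
```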

## Evaluating error

We now have predictions, but how do we determine how good they are? One way is to calculate the error rate between the predictions on the test set and the actual upvote counts for the test set.

We’ll also want a baseline to compare the error against, to see whether the results are good. We can do this by using a simple method to make baseline estimates for the test set, and comparing the error rate of our predictions to the error rate of the baseline estimates. One very simple baseline is to take the average number of upvotes per submission in the training set and use that as a prediction for every submission.

We’ll use mean absolute error as an error metric. It’s very simple – just subtract the actual value from the prediction, take the absolute value of the difference, then find the mean of all the differences.
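The computation is short enough to write by hand (scikit-learn also provides `mean_absolute_error` for this):

```python
actual = [10, 0, 5, 20]
predicted = [12, 1, 5, 14]

# Mean absolute error: average of |prediction - actual|.
mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
print(mae)  # 2.25
```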

13.6606593988
17.2759421912


## Next Steps

This method worked reasonably but not stunningly well on this dataset. We found that the headlines and other columns have some predictive value.

We could improve this approach by using a different predictive algorithm, like a random forest or a neural network. We could also use ngrams, such as bigrams and trigrams, when we are generating our bag of words matrix. Trying a tf-idf transform on the matrix could also help – scikit-learn has a class that does this automatically.
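For instance, the ngram and tf-idf ideas combine into a one-line change using scikit-learn's `TfidfVectorizer` (the `ngram_range` choice here is illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

headlines = [
    "Why the Go language is doomed",
    "Why I like Python",
    "Ask HN: why do people like Go",
]

# ngram_range=(1, 2) adds bigrams alongside single words;
# TfidfVectorizer applies tf-idf weighting on top of the counts.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(headlines)
print(matrix.shape)
```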

We could also take other data into account, like the user who submitted the article, and generate features indicating things like the karma of the user, and the recent activity of the user. Other statistics on the submitted url, like the average number of upvotes submissions from that url received would also be potentially useful. Be careful when doing these to only take into account information that existed before the submission you’re predicting for was made.

All of these additions will take much longer to run than what we have so far, but will reduce error. Hopefully you’ll have some time to try them out!

If you’d like to work more with NLP, you can check out our interactive Natural Language Processing course.

Image Credit: Up by The Impekables from the Noun Project.