/ Python

Python vs R: head to head data analysis

Which is better for data analysis?

There have been dozens of articles written comparing Python and R from a subjective standpoint. We'll add our own views at some point, but this article aims to look at the languages more objectively. We'll analyze a dataset side by side in Python and R, and show what code is needed in both languages to achieve the same result. This will let us understand the strengths and weaknesses of each language without the conjecture. At Dataquest, we teach both languages, and think both have a place in a data science toolkit.

We'll be analyzing a dataset of NBA players and their performance in the 2013-2014 season. You can download the file here. For each step in the analysis, we'll show the Python and R code, along with some explanation and discussion of the different approaches. Without further ado, let's get this head to head Python vs R matchup started!

Importing a CSV

R

nba <- read.csv("nba_2013.csv")

Python

import pandas
nba = pandas.read_csv("nba_2013.csv")

The above code will load the csv file nba_2013.csv, which contains data on NBA players from the 2013-2014 season, into the variable nba in both languages. The only real difference is that in Python, we need to import the pandas library to get access to Dataframes. Dataframes are available in both R and Python, and are two-dimensional arrays (matrices) where each column can be of a different datatype. At the end of this step, the csv file has been loaded by both languages into a dataframe.

Finding the number of rows

R

dim(nba)
[1] 481  31

Python

nba.shape
(481, 31)

This prints out the number of players and the number of columns in each. We have 481 rows, or players, and 31 columns containing data on the players.

Looking at the first row of the data

R

head(nba, 1)
player pos age bref_team_id
1 Quincy Acy  SF  23          TOT
[output truncated]

Python

nba.head(1)
player pos  age bref_team_id
0  Quincy Acy  SF   23          TOT
[output truncated]

This is pretty much identical. Both print out the first row of the data, and the syntax is very similar. Python is more object-oriented here, and head is a method on the dataframe object, and R has a separate head function. This is a common theme you'll see as you start to do analysis with these languages, where Python is more object-oriented, and R is more functional.

Find the average of each statistic

Let's find the average value for each statistic. The columns, as you can see, have names like fg (field goals made), and ast (assists). These are the season statistics for the player. If you want a fuller explanation of all the stats, look here.

R

sapply(nba, mean, na.rm=TRUE)
player NA
pos NA
age 26.5093555093555
bref_team_id NA
[output truncated]

Python

nba.mean()
age             26.509356
g               53.253638
gs              25.571726
[output truncated]

There are some major differences in approach here. In both, we're applying a function across the dataframe columns. In python, the mean method on dataframes will find the mean of each column by default.

In R, taking the mean of string values will just result in NA -- not available. However, we do need to ignore NA values when we take the mean (requiring us to pass na.rm=TRUE into the mean function). If we don't, we end up with NA for the mean of columns like x3p.. This column is three point percentage. Some players didn't take three point shots, so their percentage is missing. If we try the mean function in R, we get NA as a response, unless we specify na.rm=TRUE, which ignores NA values when taking the mean. The .mean() method in Python already ignores these values by default.

Make pairwise scatterplots

One common way to explore a dataset is to see how different columns correlate to others. We'll compare the ast, fg, and trb columns.

R

library(GGally)
ggpairs(nba[,c("ast", "fg", "trb")])

r_pairs

Python

import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(nba[["ast", "fg", "trb"]])
plt.show()

python_pairs

We get very similar plots in the end, but this shows how the R data science ecosystem has many smaller packages (GGally is a helper package for ggplot2, the most-used R plotting package), and many more visualization packages in general. In Python, matplotlib is the primary plotting package, and seaborn is a widely used layer over matplotlib. With visualization in Python, there is usually one main way to do something, whereas in R, there are many packages supporting different methods of doing things (there are at least a half dozen packages to make pair plots, for instance).

Make clusters of the players

One good way to explore this kind of data is to generate cluster plots. These will show which players are most similar.

R

library(cluster)
set.seed(1)
isGoodCol <- function(col){
   sum(is.na(col)) == 0 && is.numeric(col) 
}
goodCols <- sapply(nba, isGoodCol)
clusters <- kmeans(nba[,goodCols], centers=5)
labels <- clusters$cluster

Python

from sklearn.cluster import KMeans
kmeans_model = KMeans(n_clusters=5, random_state=1)
good_columns = nba._get_numeric_data().dropna(axis=1)
kmeans_model.fit(good_columns)
labels = kmeans_model.labels_

In order to cluster properly, we remove any non-numeric columns, or columns with missing values (NA, Nan, etc). In R, we do this by applying a function across each column, and removing it if it has any missing values or isn't numeric. We then use the cluster package to perform k-means and find 5 clusters in our data. We set a random seed using set.seed to be able to reproduce our results.

In Python, we use the main Python machine learning package, scikit-learn, to fit a k-means clustering model and get our cluster labels. We perform very similar methods to prepare the data that we used in R, except we use the get_numeric_data and dropna methods to remove non-numeric columns and columns with missing values.

Plot players by cluster

We can now plot out the players by cluster to discover patterns. One way to do this is to first use PCA to make our data 2-dimensional, then plot it, and shade each point according to cluster association.

R

nba2d <- prcomp(nba[,goodCols], center=TRUE)
twoColumns <- nba2d$x[,1:2]
clusplot(twoColumns, labels)

r_clus

Python

from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(good_columns)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=labels)
plt.show()

python_clus

Made a scatter plot of our data, and shaded or changed the icon of the data according to cluster. In R, the clusplot function was used, which is part of the cluster library. We performed PCA via the pccomp function that is builtin to R.

With Python, we used the PCA class in the scikit-learn library. We used matplotlib to create the plot.

Split into training and testing sets

If we want to do supervised machine learning, it's a good idea to split the data into training and testing sets so we don't overfit.

R

trainRowCount <- floor(0.8 * nrow(nba))
set.seed(1)
trainIndex <- sample(1:nrow(nba), trainRowCount)
train <- nba[trainIndex,]
test <- nba[-trainIndex,]

Python

train = nba.sample(frac=0.8, random_state=1)
test = nba.loc[~nba.index.isin(train.index)]

You'll notice that R has many more data-analysis focused builtins, like floor, sample, and set.seed, whereas these are called via packages in Python (math.floor, random.sample, random.seed). In Python, the recent version of pandas came with a sample method that returns a certain proportion of rows randomly sampled from a source dataframe -- this makes the code much more concise. In R, there are packages to make sampling simpler, but aren't much more concise than using the built-in sample function. In both cases, we set a random seed to make the results reproducible.

Univariate linear regression

Let's say we want to predict number of assists per player from field goals made per player.

R

fit <- lm(ast ~ fg, data=train)
predictions <- predict(fit, test)

Python

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(train[["fg"]], train["ast"])
predictions = lr.predict(test[["fg"]])

Scikit-learn has a linear regression model that we can fit and generate predictions from. R relies on the built-in lm and predict functions. predict will behave differently depending on the kind of fitted model that is passed into it -- it can be used with a variety of fitted models.

Calculate summary statistics for the model

R

summary(fit)
Call:
lm(formula = ast ~ fg, data = train)

Residuals:
Min      1Q  Median      3Q     Max 
-228.26  -35.38  -11.45   11.99  559.61
[output truncated]

Python

import statsmodels.formula.api as sm
model = sm.ols(formula='ast ~ fga', data=train)
fitted = model.fit()
fitted.summary()


OLS Regression Results

Dep. Variable: ast
R-squared: 0.568
Model: OLS
Adj. R-squared: 0.567
[output truncated]

If we want to get summary statistics about the fit, like r-squared value, we'll need to do a bit more in Python than in R. With R, we can use the builtin summary function to get information on the model. With Python, we need to use the statsmodels package, which enables many statistical methods to be used in Python. We get similar results, although generally it's a bit harder to do statistical analysis in Python, and some statistical methods that exist in R don't exist in Python.

Fit a random forest model

Our linear regression worked well in the single variable case, but we suspect there may be nonlinearities in the data. Thus, we want to fit a random forest model.

R

library(randomForest)
predictorColumns <- c("age", "mp", "fg", "trb", "stl", "blk")
rf <- randomForest(train[predictorColumns], train$ast, ntree=100)
predictions <- predict(rf, test[predictorColumns])

Python

from sklearn.ensemble import RandomForestRegressor
predictor_columns = ["age", "mp", "fg", "trb", "stl", "blk"]
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=3)
rf.fit(train[predictor_columns], train["ast"])
predictions = rf.predict(test[predictor_columns])

The main difference here is that we needed to use the randomForest library in R to use the algorithm, whereas it was built in to scikit-learn in Python. scikit-learn has a unified interface for working with many different machine learning algorithms in Python, and there's usually only one main implementation of each algorithm in Python. With R, there are many smaller packages containing individual algorithms, often with inconsistent ways to access them. This results in a greater diversity of algorithms (many have several implementations, and many are fresh out of research labs), but with a bit of a usability hit.

Calculate error

Now that we've fit two models, let's calculate error. We'll use MSE.

R

mean((test["ast"] - predictions)^2)
4573.86778567462

Python

from sklearn.metrics import mean_squared_error
mean_squared_error(test["ast"], predictions)
4166.9202475632374

In Python, the scikit-learn library has a variety of error metrics that we can use. In R, there are likely some smaller libraries that calculate MSE, but doing it manually is pretty easy in either language. There's a small difference in errors that almost certainly due to parameter tuning, and isn't a big deal.

Download a webpage

Now that we have data on NBA players from 2013-2014, let's scrape some additional data to supplement it. We'll just look at one box score from the NBA Finals here to save time.

R

library(RCurl)
url <- "http://www.basketball-reference.com/boxscores/201506140GSW.html"
data <- readLines(url)

Python

import requests
url = "http://www.basketball-reference.com/boxscores/201506140GSW.html"
data = requests.get(url).content

In Python, the requests package makes downloading web pages easy, with a consistent API for all request types. In R, RCurl provides a similarly simple way to make requests. Both download the webpage to a character datatype. Note: this step is unnecessary for the next step in R, but is shown for comparisons's sake.

Extract player box scores

Now that we have the web page, we'll need to parse it to extract scores for players.

R

library(rvest)
page <- read_html(url)
table <- html_nodes(page, ".stats_table")[3]
rows <- html_nodes(table, "tr")
cells <- html_nodes(rows, "td a")
teams <- html_text(cells)

extractRow <- function(rows, i){
    if(i == 1){
        return
    }
    row <- rows[i]
    tag <- "td"
    if(i == 2){
        tag <- "th"
    }
    items <- html_nodes(row, tag)
    html_text(items)
}

scrapeData <- function(team){
    teamData <- html_nodes(page, paste("#",team,"_basic", sep=""))
    rows <- html_nodes(teamData, "tr")
    lapply(seq_along(rows), extractRow, rows=rows) 
}

data <- lapply(teams, scrapeData)

Python

from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(data, 'html.parser')
box_scores = []
for tag in soup.find_all(id=re.compile("[A-Z]{3,}_basic")):
    rows = []
    for i, row in enumerate(tag.find_all("tr")):
        if i == 0:
            continue
        elif i == 1:
            tag = "th"
        else:
            tag = "td"
        row_data = [item.get_text() for item in row.find_all(tag)]
        rows.append(row_data)
    box_scores.append(rows)

This will create a list containing two lists, the first with the box score for CLE, and the second with the box score for GSW. Both contain the headers, along with each player and their in-game stats. We won't turn this into more training data now, but it could easily be transformed into a format that could be added to our nba dataframe.

The R code is more complex than the Python code, because there isn't a convenient way to use regular expressions to select items, so we have to do additional parsing to get the team names from the HTML. R also discourages using for loops in favor of applying functions along vectors. We use lapply to do this, but since we need to treat each row different depending on whether it's a header or not, we pass the index of the item we want, and the entire rows list into the function.

We use rvest, a new and widely used R web scraping package to extract the data we need. Note that we can pass a url directly into rvest, so the last step wasn't needed in R.

In Python, we use BeautifulSoup, the most commonly used web scraping package. It enables us to loop through the tags and construct a list of lists in a straightforward way.

Python vs R in Conclusion

We've taken a look at how to analyze a dataset with R and Python. There are many tasks we didn't dive into, such as persisting the results of our analysis, sharing the results with others, testing and making things production-ready, and making more visualizations. We'll dive into these at a later date, which will let us make some more definitive conclusions. For now, here's what we can say:

R is more functional, Python is more object-oriented

As we saw from functions like lm, predict, and others, R lets functions do most of the work. Contrast this to the LinearRegression class in Python, and the sample method on dataframes.

R has more data analysis built-ins, Python relies on packages

When we looked at summary statistics, we could use the summary built-in function in R, but had to import the statsmodels package in Python. The dataframe is a built-in construct in R, but must be imported via the pandas package in Python.

Python has "main" packages for data analysis tasks, R has a larger ecosystem of small packages

With Python, we can do linear regression, random forests, and more with the scikit-learn package. It offers a consistent API, and is well-maintained. In R, we have a greater diversity of packages, but also greater fragmentation and less consistency (linear regression is a builtin, lm, randomForest is a separate package, etc).

R has more statistical support in general

R was built as a statistical language, and it shows. statsmodels in Python and other packages provide decent coverage for statistical methods, but the R ecosystem is far larger.

It's usually more straightforward to do non-statistical tasks in Python

With well-maintained libraries like BeautifulSoup and requests, web scraping in Python is far easier than in R. This applies to other tasks that we didn't look into closely, like saving to databases, deploying web servers, or running complex workflows.

There are many parallels between the data analysis workflow in both

There are clear points of inspiration between both R and Python (pandas Dataframes were inspired by R dataframes, the rvest package was inspired by BeautifulSoup), and both ecosystems continue to grow stronger. It's remarkable how similar the syntax and approaches are for many common tasks in both languages.

Last word

At Dataquest, we primarily teach Python, but have recently been adding lessons on R. We see both languages as complementary, and although we think Python is stronger in more areas, R is an effective language. It can be used either as a complement for Python in areas like data exploration and statistics, or as your sole data analysis tool. As this walkthrough proves, both languages have a lot of similarities in syntax and approach, and you can't go wrong with one, the other, or both.

Vik Paruchuri

Read more posts by this author.

Read More