Which is better for data analysis?
There have been dozens of articles written comparing Python and R from a subjective standpoint. We’ll add our own views at some point, but this article aims to look at the languages more objectively. We’ll analyze a dataset side by side in Python and R, and show what code is needed in both languages to achieve the same result. This will let us understand the strengths and weaknesses of each language without the conjecture. At Dataquest, we teach both languages, and think both have a place in a data science toolkit.
We’ll be analyzing a dataset of NBA players and their performance in the 2013-2014 season. You can download the file here. For each step in the analysis, we’ll show the Python and R code, along with some explanation and discussion of the different approaches. Without further ado, let’s get this head-to-head Python vs R matchup started!
Importing a CSV
The code for this step loads the csv file nba_2013.csv, which contains data on NBA players from the 2013-2014 season, into the variable nba in both languages. The only real difference is that in Python, we need to import the pandas library to get access to Dataframes. Dataframes are available in both R and Python, and are two-dimensional arrays (matrices) where each column can be of a different datatype. At the end of this step, the csv file has been loaded by both languages into a dataframe.
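As a rough sketch of the Python side of this step, the snippet below reads CSV text into a dataframe. The inline rows are made-up placeholder values (not real player data) so the example runs without the file; with the real file on disk, the call would presumably be pd.read_csv("nba_2013.csv").

```python
import io

import pandas as pd

# Hypothetical stand-in for nba_2013.csv: two rows of made-up values,
# used only so the example runs without the real file.
csv_text = """player,pos,age,bref_team_id,fg,ast
Player A,SF,23,TOT,66,27
Player B,C,20,OKC,93,43
"""

# read_csv accepts a path or any file-like object.
nba = pd.read_csv(io.StringIO(csv_text))

print(type(nba).__name__)
print(nba.dtypes)  # each column keeps its own datatype
```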
Finding the number of rows
 481 31
This prints out the number of rows (players) and the number of columns in each dataframe. We have 481 rows, or players, and 31 columns containing data on the players.
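In pandas, the row and column counts are exposed through the shape attribute. A minimal sketch on a toy dataframe (made-up values standing in for the real nba dataframe):

```python
import pandas as pd

# Toy dataframe with made-up values, standing in for the real nba data.
nba = pd.DataFrame({
    "player": ["Player A", "Player B", "Player C"],
    "age": [23, 20, 27],
    "fg": [66, 93, 143],
})

# shape is an attribute, not a method: (number of rows, number of columns).
rows, cols = nba.shape
print(rows, cols)
```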
Looking at the first row of the data
player pos age bref_team_id 1 Quincy Acy SF 23 TOT [output truncated]
player pos age bref_team_id 0 Quincy Acy SF 23 TOT [output truncated]
This is pretty much identical. Both print out the first row of the data, and the syntax is very similar. Python is more object-oriented here: head is a method on the dataframe object, whereas R has a separate head function. This is a common theme you’ll see as you start to do analysis with these languages, where Python is more object-oriented, and R is more functional.
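To illustrate the method-call style on the Python side, here is a small sketch using a toy dataframe with made-up values:

```python
import pandas as pd

# Toy dataframe with made-up values, standing in for the real nba data.
nba = pd.DataFrame({
    "player": ["Player A", "Player B", "Player C"],
    "pos": ["SF", "C", "PF"],
    "age": [23, 20, 27],
})

# head is called on the dataframe object itself; 1 asks for the first row only.
first_row = nba.head(1)
print(first_row)
```

Note that pandas labels the first row 0, while R labels it 1, which matches the two outputs shown above.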
Find the average of each statistic
Let’s find the average value for each statistic. The columns, as you can see, have names like fg (field goals made) and ast (assists). These are the season statistics for each player. If you want a fuller explanation of all the stats, look here.
player NA pos NA age 26.5093555093555 bref_team_id NA [output truncated]
age 26.509356 g 53.253638 gs 25.571726 [output truncated]
There are some major differences in approach here. In both, we’re applying a function across the dataframe columns. In Python, the mean method on dataframes will find the mean of each column by default.
In R, taking the mean of string values will just result in NA (not available). However, we do need to ignore NA values when we take the mean, which requires passing na.rm=TRUE into the mean function. If we don’t, we end up with NA for the mean of columns like x3p., the three-point percentage column. Some players didn’t take three-point shots, so their percentage is missing. The .mean() method in Python already ignores these values by default.
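The missing-value behavior can be sketched in pandas as follows. The data is made up, and numeric_only=True is an addition here: newer pandas releases may raise an error on a dataframe that mixes string and numeric columns unless it is passed.

```python
import numpy as np
import pandas as pd

# Toy dataframe with made-up values. The NaN marks a player who took
# no three-point shots, so the percentage is undefined.
nba = pd.DataFrame({
    "player": ["Player A", "Player B", "Player C"],
    "age": [23, 20, 27],
    "x3p.": [0.25, np.nan, 0.35],
})

# numeric_only=True skips the string column; NaN values are skipped by default.
means = nba.mean(numeric_only=True)
print(means)

# skipna=False mimics R's mean() without na.rm=TRUE: one missing value
# makes the whole column mean missing.
strict_means = nba.mean(numeric_only=True, skipna=False)
print(strict_means)
```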
Make pairwise scatterplots
One common way to explore a dataset is to see how different columns correlate to others. We’ll compare the
We get very similar plots in the end, but this shows how the R data science ecosystem has many smaller packages (GGally is a helper package for ggplot2, the most-used R plotting package), and many more visualization packages in general. In Python, matplotlib is the primary plotting package, and seaborn is a widely used layer over matplotlib. With visualization in Python, there is usually one main way to do something, whereas in R, there are many packages supporting different methods of doing things (there are at least a half dozen packages to make pair plots, for instance).
Make clusters of the players
One good way to explore this kind of data is to generate cluster plots. These will show which players are most similar.
In order to cluster properly, we remove any non-numeric columns and any columns with missing values (NaN, etc). In R, we do this by applying a function across each column, and removing the column if it has any missing values or isn’t numeric. We then use the cluster package to perform k-means and find 5 clusters in our data. We set a random seed using set.seed to be able to reproduce our results.
In Python, we use the main Python machine learning package, scikit-learn, to fit a k-means clustering model and get our cluster labels. We prepare the data in much the same way as in R, except we use dataframe methods, including dropna, to remove non-numeric columns and columns with missing values.
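A minimal sketch of the Python clustering step, on randomly generated placeholder data. select_dtypes is used here as one public pandas way to keep only numeric columns; that choice is an assumption, and the original code may have used a different call.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up data: 30 placeholder players with random stats.
rng = np.random.default_rng(1)
nba = pd.DataFrame({
    "player": [f"Player {i}" for i in range(30)],
    "fg": rng.integers(0, 800, size=30),
    "ast": rng.integers(0, 600, size=30),
    # Every fifth player has a missing three-point percentage.
    "x3p.": np.where(np.arange(30) % 5 == 0, np.nan, 0.3),
})

# Keep numeric columns, then drop any column containing a missing value.
good_columns = nba.select_dtypes(include="number").dropna(axis=1)

# random_state plays the same role as set.seed in R.
kmeans_model = KMeans(n_clusters=5, random_state=1, n_init=10)
labels = kmeans_model.fit_predict(good_columns)
print(labels)
```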
Plot players by cluster
We can now plot out the players by cluster to discover patterns. One way to do this is to first use PCA to make our data 2-dimensional, then plot it, and shade each point according to cluster association.
We made a scatter plot of our data, and shaded or changed the icon of each point according to its cluster. In R, the clusplot function was used, which is part of the cluster library. We performed PCA via the prcomp function that is built in to R.
With Python, we used the PCA class in the scikit-learn library. We used matplotlib to create the plot.
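The PCA step on the Python side might look roughly like this. The input is random placeholder data, and the cluster labels are random stand-ins rather than real k-means output.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
stats = rng.normal(size=(30, 4))       # 30 made-up players, 4 numeric stats
labels = rng.integers(0, 5, size=30)   # placeholder cluster labels

# Project the 4 stat columns down to 2 components for plotting.
pca_2 = PCA(n_components=2)
plot_columns = pca_2.fit_transform(stats)
print(plot_columns.shape)

# With matplotlib installed, the points could then be drawn with
# plt.scatter(plot_columns[:, 0], plot_columns[:, 1], c=labels).
```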
Split into training and testing sets
If we want to do supervised machine learning, it’s a good idea to split the data into training and testing sets so we don’t overfit.
You’ll notice that R has many more data-analysis focused builtins, like set.seed, whereas these are called via packages in Python (random.seed). In Python, a recent version of pandas added a sample method that returns a given proportion of rows randomly sampled from a source dataframe, which makes the code much more concise. In R, there are packages that make sampling simpler, but they aren’t much more concise than using the built-in sample function. In both cases, we set a random seed to make the results reproducible.
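One way to write the split with the pandas sample method is sketched below. The 80/20 proportion and the toy data are assumptions for illustration.

```python
import pandas as pd

# Toy dataframe with made-up values.
nba = pd.DataFrame({"fg": range(10), "ast": range(10, 20)})

# Randomly sample 80% of the rows; random_state acts as the random seed.
train = nba.sample(frac=0.8, random_state=1)

# Every row whose index was not sampled into train goes to the test set.
test = nba.loc[~nba.index.isin(train.index)]

print(len(train), len(test))
```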
Univariate linear regression
Let’s say we want to predict number of assists per player from field goals made per player.
Scikit-learn has a linear regression model that we can fit and generate predictions from. R relies on the built-in lm and predict functions. predict will behave differently depending on the kind of fitted model that is passed into it, so it can be used with a variety of fitted models.
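A small sketch of the scikit-learn side, predicting ast from fg. The training data is made up and follows an exact line, so the fitted model is easy to check by hand.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up training data where ast = 0.5 * fg + 10 exactly.
train = pd.DataFrame({"fg": [100, 200, 300, 400]})
train["ast"] = 0.5 * train["fg"] + 10
test = pd.DataFrame({"fg": [250, 500]})

lr = LinearRegression()
# scikit-learn expects a 2-D input, hence the double brackets.
lr.fit(train[["fg"]], train["ast"])
predictions = lr.predict(test[["fg"]])
print(predictions)
```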
Calculate summary statistics for the model
Call: lm(formula = ast ~ fg, data = train) Residuals: Min 1Q Median 3Q Max -228.26 -35.38 -11.45 11.99 559.61 [output truncated]
OLS Regression Results ============================ Dep. Variable: ast R-squared: 0.568 Model: OLS Adj. R-squared: 0.567 [output truncated]
If we want to get summary statistics about the fit, like the r-squared value, we’ll need to do a bit more in Python than in R. With R, we can use the builtin summary function to get information on the model. With Python, we need to use the statsmodels package, which enables many statistical methods to be used in Python. We get similar results, although generally it’s a bit harder to do statistical analysis in Python, and some statistical methods that exist in R don’t exist in Python.
Fit a random forest model
Our linear regression worked well in the single variable case, but we suspect there may be nonlinearities in the data. Thus, we want to fit a random forest model.
The main difference here is that we needed to use the randomForest library in R to use the algorithm, whereas it was built in to scikit-learn in Python. scikit-learn has a unified interface for working with many different machine learning algorithms in Python, and there’s usually only one main implementation of each algorithm in Python. With R, there are many smaller packages containing individual algorithms, often with inconsistent ways to access them. This results in a greater diversity of algorithms (many have several implementations, and many are fresh out of research labs), but with a bit of a usability hit.
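As an illustration of the unified scikit-learn interface, a random forest can be fit with the same fit/predict calls as the linear regression. The data, the predictor column names (age, mp, fg), and the hyperparameter values below are all assumptions for the sake of a runnable sketch.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Made-up training data; the column names are hypothetical predictors.
rng = np.random.default_rng(1)
train = pd.DataFrame({
    "age": rng.integers(19, 40, size=40),
    "mp": rng.integers(0, 3000, size=40),
    "fg": rng.integers(0, 800, size=40),
})
train["ast"] = 0.3 * train["fg"] + rng.normal(0, 20, size=40)

predictor_columns = ["age", "mp", "fg"]

# Same fit/predict interface as LinearRegression.
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=3, random_state=1)
rf.fit(train[predictor_columns], train["ast"])
predictions = rf.predict(train[predictor_columns].head(5))
print(predictions)
```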
Now that we’ve fit two models, let’s calculate error. We’ll use MSE.
In Python, the scikit-learn library has a variety of error metrics that we can use. In R, there are likely some smaller libraries that calculate MSE, but doing it manually is pretty easy in either language. There’s a small difference in errors that is almost certainly due to parameter tuning, and isn’t a big deal.
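Both routes to MSE can be sketched side by side. The actual and predicted values here are made up.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values.
actual = np.array([10.0, 20.0, 30.0])
predictions = np.array([12.0, 18.0, 33.0])

# Library version.
mse_sklearn = mean_squared_error(actual, predictions)

# Manual version: mean of the squared differences.
mse_manual = np.mean((actual - predictions) ** 2)

print(mse_sklearn, mse_manual)
```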
Download a webpage
Now that we have data on NBA players from 2013-2014, let’s scrape some additional data to supplement it. We’ll just look at one box score from the NBA Finals here to save time.
In Python, the requests package makes downloading web pages easy, with a consistent API for all request types. In R, RCurl provides a similarly simple way to make requests. Both download the webpage to a character datatype. Note: this step is unnecessary for the next step in R, but is shown for comparison’s sake.
Extract player box scores
Now that we have the web page, we’ll need to parse it to extract scores for players.
This will create a list containing two lists, the first with the box score for CLE, and the second with the box score for GSW. Both contain the headers, along with each player and their in-game stats. We won’t turn this into more training data now, but it could easily be transformed into a format that could be added to our nba dataframe.
The R code is more complex than the Python code, because there isn’t a convenient way to use regular expressions to select items, so we have to do additional parsing to get the team names from the HTML. R also discourages using for loops in favor of applying functions along vectors. We use lapply to do this, but since we need to treat each row differently depending on whether it’s a header or not, we pass the index of the item we want, and the entire rows list, into the function.
We use rvest, a new and widely used R web scraping package, to extract the data we need. Note that we can pass a url directly into rvest, so the last step wasn’t needed in R.
In Python, we use BeautifulSoup, the most commonly used web scraping package. It enables us to loop through the tags and construct a list of lists in a straightforward way.
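The loop-over-tags approach can be sketched on a small inline HTML table. The markup below is a made-up stand-in, not the structure of the real box score page, so the tag and class names are assumptions.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for one team's box score table.
html = """
<table>
  <tr class="header"><th>Player</th><th>PTS</th></tr>
  <tr><td>Player A</td><td>20</td></tr>
  <tr><td>Player B</td><td>15</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Build a list of lists: one inner list per table row, header included.
box_score = []
for row in soup.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    box_score.append(cells)

print(box_score)
```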
Python vs R in Conclusion
We’ve taken a look at how to analyze a dataset with R and Python. There are many tasks we didn’t dive into, such as persisting the results of our analysis, sharing the results with others, testing and making things production-ready, and making more visualizations. We’ll dive into these at a later date, which will let us make some more definitive conclusions. For now, here’s what we can say:
R is more functional, Python is more object-oriented
As we saw from functions like predict and others, R lets functions do most of the work. Contrast this with the LinearRegression class in Python, and the sample method on dataframes.
R has more data analysis built-ins, Python relies on packages
When we looked at summary statistics, we could use the summary built-in function in R, but had to import the statsmodels package in Python. The dataframe is a built-in construct in R, but must be imported via the pandas package in Python.
Python has “main” packages for data analysis tasks, R has a larger ecosystem of small packages
With Python, we can do linear regression, random forests, and more with the scikit-learn package. It offers a consistent API, and is well-maintained. In R, we have a greater diversity of packages, but also greater fragmentation and less consistency (linear regression is a builtin, randomForest is a separate package, etc).
R has more statistical support in general
R was built as a statistical language, and it shows. statsmodels and other packages in Python provide decent coverage for statistical methods, but the R ecosystem is far larger.
It’s usually more straightforward to do non-statistical tasks in Python
With well-maintained libraries like BeautifulSoup and requests, web scraping in Python is far easier than in R. This applies to other tasks that we didn’t look into closely, like saving to databases, deploying web servers, or running complex workflows.
There are many parallels between the data analysis workflow in both
There are clear points of inspiration between both R and Python (pandas Dataframes were inspired by R dataframes, the rvest package was inspired by BeautifulSoup), and both ecosystems continue to grow stronger. It’s remarkable how similar the syntax and approaches are for many common tasks in both languages.
At Dataquest, we primarily teach Python, but have recently been adding lessons on R. We see both languages as complementary, and although we think Python is stronger in more areas, R is an effective language. It can be used either as a complement to Python in areas like data exploration and statistics, or as your sole data analysis tool. As this walkthrough shows, both languages have a lot of similarities in syntax and approach, and you can’t go wrong with one, the other, or both.