Tutorial: Better Blog Post Analysis with googleAnalyticsR

In my previous role as a marketing data analyst for a blogging company, one of my most important tasks was to track how blog posts performed.

On the surface, it’s a fairly straightforward goal. With Google Analytics, you can quickly get just about any metric you need for your blog posts, for any date range. 

But when it comes to comparing blog post performance, things get a bit trickier. 

For example, let’s say we want to compare the performance of the blog posts we published on the Dataquest blog in June (using the month of June as our date range). 

But wait… two blog posts with more than 1,000 pageviews were published earlier in the month, and the two with fewer than 500 pageviews were published at the end of the month. That’s hardly a fair comparison!

My first solution to this problem was to look up each post individually, so that I could make an even comparison of how each post performed in its first day, first week, first month, etc. 

However, that required a lot of manual copy-and-paste work, which was extremely tedious if I wanted to compare more than a few posts, date ranges, or metrics at a time. 

But then, I learned R, and realized that there was a much better way.

In this post, we'll walk through how it's done, so you can do better blog post analysis for yourself!

What we'll need

To complete this tutorial, you’ll need basic knowledge of R syntax and the tidyverse, and access to a Google Analytics account.

Not yet familiar with the basics of R? We can help with that! Our interactive online courses teach you R from scratch, with no prior programming experience required. Sign up and start today!

You’ll also need the dplyr, lubridate, and stringr packages installed (plus readr, ggplot2, and scales, which the code below also loads), which, as a reminder, you can do with the install.packages() command.

Finally, you will need a CSV of the blog posts you want to analyze. Here’s what’s in my dataset:

post_url: the page path of the blog post
post_date: the date the post was published (formatted m/d/yy)
category: the blog category the post was published in (optional)
title: the title of the blog post (optional)

Depending on your content management system, there may be a way for you to automate gathering this data — but that’s out of the scope of this tutorial!

For this tutorial, we’ll use a manually-gathered dataset of the past ten Dataquest blog posts.
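If you don't have a CSV like this yet, a minimal sketch of building one by hand in base R might look like the following. The two rows are made-up placeholders (not real Dataquest posts); only the column names and the articles.csv file name come from this tutorial.

```r
# Hypothetical example rows: replace these with your own posts.
example_posts <- data.frame(
  post_url  = c("www.example.com/blog/first-post/", "www.example.com/blog/second-post/"),
  post_date = c("6/1/20", "6/15/20"),   # m/d/yy, as described above
  category  = c("tutorials", "news"),
  title     = c("My first example post", "My second example post"),
  stringsAsFactors = FALSE
)

# Written to a temporary folder here; use "articles.csv" in your own project
csv_path <- file.path(tempdir(), "articles.csv")
write.csv(example_posts, csv_path, row.names = FALSE)
```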

Setting up the googleAnalyticsR package

To access data from the Google Analytics API, we’ll use the excellent googleAnalyticsR package by Mark Edmondson. 

As described in the documentation, there are two "modes" to the googleAnalyticsR package. The first mode, which we’ll use here, is a “Try it out” mode, which uses a shared Google Project to authorize your Google Analytics account. 

If you want to make this report a recurring tool for your blog or client, be sure to create your own Google Project, which will help keep the traffic on the shared Project to a minimum. To find out how to set this up, head over to the package setup documentation.

For now, though, we’ll stick with “Try it out” mode. 

First, we'll install the package using this code:

install.packages('googleAnalyticsR', dependencies = TRUE)

This installs the package, as well as the required dependencies.

Next, we'll load the library, and authorize it with a Google Analytics account using the ga_auth() function.

library(googleAnalyticsR)
ga_auth()

When you run this code the first time, it will open a browser window and prompt you to log in to your Google account. Then, it will give you a code to paste into your R console. After that, it will save an authorization token so you only have to do this once!

Once you’ve completed the Google Analytics authorization, we’re ready to set up the rest of the libraries and load in our blog posts. We’ll also use dplyr::mutate() to change the post_date to a Date class while we’re at it!

library(dplyr)
library(lubridate)
library(stringr)
library(readr)

blog_posts <- read_csv("articles.csv") %>%
  mutate(
    post_date = as.Date(post_date, "%m/%d/%y") # changes the post_date column to a Date
  )
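To confirm the format string matches your data, it can help to try it on a single value first. This is just a quick check in base R, using a made-up date:

```r
# "%m/%d/%y" reads month/day/two-digit-year
parsed <- as.Date("6/15/20", format = "%m/%d/%y")
format(parsed, "%Y-%m-%d")
# [1] "2020-06-15"

# A value in a different layout (here, year-month-day) becomes NA rather than an error,
# so it is worth checking the column for NAs after converting
is.na(as.Date("2020-06-15", format = "%m/%d/%y"))
# [1] TRUE
```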

Here’s what the blog post data frame looks like: 

Finally, to get data from your Google Analytics account, you will need the ID of the Google Analytics view you want to access. ga_account_list() returns a data frame of the accounts, properties, and views available to you.

accounts <- ga_account_list()

# select the view ID by view and property name, and store it for ease of use
view_id <- accounts$viewId[which(accounts$viewName == "All Web Site Data" & accounts$webPropertyName == "Dataquest")]
# be sure to change this out with your own view and/or property name!
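If the view or property name is misspelled, the selection above quietly returns an empty vector, and the error only shows up later. A minimal sketch of a safety check, using a mock data frame with made-up IDs and names in place of the real ga_account_list() result:

```r
# Mock of the columns used above; a real ga_account_list() result has more columns
accounts_example <- data.frame(
  viewId          = c("11111111", "22222222"),
  viewName        = c("All Web Site Data", "All Web Site Data"),
  webPropertyName = c("Example Blog", "Example Shop"),
  stringsAsFactors = FALSE
)

matches <- accounts_example$viewId[
  accounts_example$viewName == "All Web Site Data" &
    accounts_example$webPropertyName == "Example Blog"
]

# Stop early if the names matched zero views or more than one
stopifnot(length(matches) == 1)
view_id_example <- matches
```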

Now, we’re ready to do our first Google Analytics API requests!

Accessing blog post data with googleAnalyticsR

In this tutorial, our goal is to gather data for the first week each post was active, and compile it in a dataframe for analysis. To do this, we’ll create a function that runs a for loop and requests this data for each post in our blog_posts dataframe.

So, let’s take a look at how to send a request to the Google Analytics API using googleAnalyticsR.

google_analytics(view_id,
                  date_range = c(as.Date("2020-06-01"), as.Date("2020-06-30")),
                  metrics = c("pageviews"),
                  dimensions = c("pagePath")
)

This request has a few components. First, enter the view_id, which we already stored from our ga_account_list() dataframe.

Next, specify the date range, which needs to be passed in as a vector of two dates: the start date and the end date.

Then, we input the metrics (like pageviews, landing page sessions, or time on page) and dimensions (like page path, channel, or device). We can use any dimension or metric that’s available in the Google Analytics UI — here’s a useful reference for finding the API name of any UI metric or dimension.

So, the request above will return a dataframe of all pageviews in June, by page path (by default googleAnalyticsR will only return the first 1,000 results).
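If you do need more than the first 1,000 rows, google_analytics() takes a max argument, and passing max = -1 asks for all rows. A minimal sketch, wrapped in a hypothetical helper (fetch_all_pageviews is a made-up name, not part of the package) so that nothing is sent until you call it with your own view ID:

```r
# Hypothetical wrapper: same request as above, but without the 1,000-row cap
fetch_all_pageviews <- function(view_id, start_date, end_date) {
  googleAnalyticsR::google_analytics(
    view_id,
    date_range = c(as.Date(start_date), as.Date(end_date)),
    metrics    = c("pageviews"),
    dimensions = c("pagePath"),
    max        = -1   # -1 requests every row instead of the default 1,000
  )
}

# Example call (requires ga_auth() and a real view ID):
# all_june <- fetch_all_pageviews(view_id, "2020-06-01", "2020-06-30")
```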

But, in our case, we only want to retrieve pageviews for a specific page – so we need to filter on the pagePath dimension using a dimension filter, which looks like this:

page_filter <- dim_filter(dimension = "pagePath",
                          operator = "REGEXP",
                          expressions = "^www.dataquest.io/blog/r-markdown-guide-cheatsheet/$")
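Because the expression is a regular expression, the ^ and $ anchors are what keep the filter from matching other pages whose paths merely contain this one. The sketch below illustrates that locally with base R's grepl() on made-up paths. Google Analytics evaluates the pattern on its own servers, so treat this only as a way to reason about the pattern, not as a guarantee of identical matching.

```r
pattern <- "^www.dataquest.io/blog/r-markdown-guide-cheatsheet/$"

# Made-up page paths for illustration
paths <- c(
  "www.dataquest.io/blog/r-markdown-guide-cheatsheet/",
  "www.dataquest.io/blog/r-markdown-guide-cheatsheet/?utm_source=example",
  "www.dataquest.io/blog/"
)

grepl(pattern, paths)
# [1]  TRUE FALSE FALSE

# Without the anchors, the second path (with a query string) matches too
grepl("www.dataquest.io/blog/r-markdown-guide-cheatsheet/", paths)
# [1]  TRUE  TRUE FALSE
```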

To use this filter in our request, googleAnalyticsR wants us to create a filter clause – which is how you would combine filters if you wanted to use multiple dimension filters. But in our case, we just need the one: 

page_filter_clause <- filter_clause_ga4(list(page_filter))

Now, let’s try sending a request with this filter:

google_analytics(view_id,
              date_range = c(as.Date("2020-07-01"), Sys.Date()),
              metrics = c("pageviews"),
              dimensions = c("pagePath"),
              dim_filters = page_filter_clause)

The result is a dataframe with the pageviews for the R Markdown post!

Creating the for loop

Now that we can gather data and filter it by dimension, we are ready to build out our function to run our for loop! The steps to the function are:

  • Set up a data frame to hold the results
  • Begin the loop based on the number of rows in the data frame
  • Access the post URL and post date for each post
  • Create a page filter based on the post URL
  • Send a request to Google Analytics using the post_date as the start date, and the date one week later as the end date
  • Add the post URL and pageview data to the final data frame

I have also added a print() command to let us know how far along the loop is (because it can take a while) and a Sys.sleep() command to keep us from hitting the Google Analytics API rate limit.

Here’s what that looks like all put together!

get_pageviews <- function(posts) {

  # set up dataframe to be returned, using the same variable names as our original dataframe
  final <- tibble(pageviews = numeric(),
                      post_url = character())

  # begin the loop for each row in the posts dataframe
  for (i in seq_len(nrow(posts))) {

    # select the post URL and post date for this loop — also using the same variable names as our original dataframe
    post_url <- posts$post_url[i]
    post_date <- posts$post_date[i]

    # set up the page filter and page filter clause with the current post URL
    page_filter <- dim_filter(dimension = "pagePath",
                              operator = "REGEXP",
                              expressions = post_url)

    page_filter_clause <- filter_clause_ga4(list(page_filter))

    # send the request, and set the date range to the week following the date the post was shared
    page_data <- google_analytics(view_id,
                                    date_range = c(post_date, post_date %m+% weeks(1)),
                                    metrics = c("pageviews"),
                                    dim_filters = page_filter_clause)

    # add the post url to the returned dataframe
    page_data$post_url <- post_url

    # add the returned data to the data frame we created outside the loop
    final <- rbind(final, page_data)

    # print loop status
    print(paste("Completed row", i, "of", nrow(posts)))

    # wait two seconds
    Sys.sleep(2)

  }

  return(final)

}
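One detail worth checking is the date window itself. Google Analytics date ranges include both the start and the end date, so a range from post_date to one week later spans eight calendar days. The base R sketch below, using a made-up publish date, shows the arithmetic. If you want exactly seven days, ending six days after the publish date is one option.

```r
post_date_example <- as.Date("2020-06-01")   # made-up publish date

# Adding 7 days gives the same end date as post_date %m+% weeks(1)
end_date <- post_date_example + 7
end_date
# [1] "2020-06-08"

# Counting both endpoints, that window covers 8 days
length(seq(post_date_example, end_date, by = "day"))
# [1] 8

# Ending 6 days later covers exactly 7 days
length(seq(post_date_example, post_date_example + 6, by = "day"))
# [1] 7
```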

We could potentially write this more compactly with a “functional” in R, such as purrr::map(). The map() function takes a vector or list and a function as inputs, applies the function to each element, and returns a list. Check out Dataquest's interactive online lesson on the map function if you'd like to deepen your knowledge!

For this tutorial, though, we'll use a for loop because it's a bit less abstract. 
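For comparison, here is a rough sketch of the functional style using base R's Map(), which resembles purrr::map2() in that it calls a function once per pair of inputs and returns a list. To keep it runnable without a Google Analytics login, fake_fetch() is a stand-in that returns made-up numbers; in practice its body would be the google_analytics() request from the loop above.

```r
# Stand-in for the API request: returns a one-row data frame of made-up pageviews
fake_fetch <- function(post_url, post_date) {
  data.frame(
    pageviews = nchar(post_url) * 10,   # placeholder value, not real data
    post_url  = post_url,
    stringsAsFactors = FALSE
  )
}

posts_example <- data.frame(
  post_url  = c("www.example.com/blog/a/", "www.example.com/blog/bb/"),
  post_date = as.Date(c("2020-06-01", "2020-06-15")),
  stringsAsFactors = FALSE
)

# Map() calls fake_fetch() once per (url, date) pair and returns a list of data frames
results <- Map(fake_fetch, posts_example$post_url, posts_example$post_date)

# Stack the list into a single data frame, as rbind() does inside the loop
final_example <- do.call(rbind, unname(results))
```

The result has one row per post, just like the data frame the for loop builds up.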

Now, we’ll run the loop on our blog_posts dataframe, and merge the results to our blog_posts data.

recent_posts_first_week <- get_pageviews(blog_posts)
recent_posts_first_week <- merge(blog_posts, recent_posts_first_week)

recent_posts_first_week
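By default, merge() joins on the columns the two data frames share (here, post_url) and keeps only rows found in both. A small sketch with made-up data shows the effect, along with all.x = TRUE, which keeps posts that returned no pageview data:

```r
posts_example <- data.frame(
  post_url = c("/blog/a/", "/blog/b/", "/blog/c/"),
  title    = c("Post A", "Post B", "Post C"),
  stringsAsFactors = FALSE
)
views_example <- data.frame(
  post_url  = c("/blog/a/", "/blog/b/"),   # no row came back for /blog/c/
  pageviews = c(1200, 450),
  stringsAsFactors = FALSE
)

# Default inner join: /blog/c/ is dropped
inner <- merge(posts_example, views_example, by = "post_url")
nrow(inner)
# [1] 2

# Left join: /blog/c/ is kept, with NA pageviews
left <- merge(posts_example, views_example, by = "post_url", all.x = TRUE)
nrow(left)
# [1] 3
```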

And that’s it! Now, we can get on to the good stuff — analyzing and visualizing the data.

Blog post data, visualized!

For demonstration, here's a ggplot bar chart that shows how many pageviews each of our most recent 10 blog posts got in the first week after they were published: 

library(ggplot2)
library(scales)

recent_posts_first_week %>%
  arrange(
    post_date
  ) %>%
  mutate(
    pretty_title = str_c(str_extract(title, "^(\\S+\\s+\\n?){1,5}"), "..."),
    pretty_title = factor(pretty_title, levels = pretty_title[order(post_date)])
  ) %>%
  ggplot(aes(pretty_title, pageviews)) +
  geom_bar(stat = "identity", fill = "#39cf90") +
  coord_flip() +
  theme_minimal() +
  theme(axis.title = element_blank()) +
  labs(title = "Recent Dataquest blog posts by first week pageviews") +
  scale_y_continuous(labels = comma)

Now we can see how useful it is to be able to compare blog posts on "even footing"! 
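One thing to watch in the pretty_title step: the pattern requires whitespace after each word, so titles of five words or fewer may lose their last word or come back as NA. A hypothetical base R helper (truncate_title is a made-up name) that handles short titles might look like this:

```r
# Hypothetical helper: keep the first n words, adding "..." only when something was cut
truncate_title <- function(title, n = 5) {
  words <- strsplit(trimws(title), "\\s+")[[1]]
  if (length(words) <= n) {
    return(title)
  }
  paste0(paste(words[seq_len(n)], collapse = " "), "...")
}

truncate_title("Tutorial: Better Blog Post Analysis with googleAnalyticsR")
# [1] "Tutorial: Better Blog Post Analysis..."

truncate_title("Short title")
# [1] "Short title"
```

Since the helper handles one title at a time, it would need to be applied per row (for example with vapply()) inside mutate().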

For more information on the googleAnalyticsR package and what you can do with it, check out its very helpful resource page.

