December 11, 2018

Historic Wildfire Data: Exploratory Visualization in R

In recent weeks, the devastating wildfires sweeping parts of the US state of California have featured prominently in the news. While most wildfires are started accidentally by humans, weather conditions like wind and drought can exacerbate fires’ spread and intensity. An improved understanding of historical wildfire trends and causes can inform fire management and save lives and property.

In this exercise, we’ll use R to perform exploratory data visualization as we learn about historical wildfire data for the state of California. Exploratory data visualization is an important first step when deciding on further analyses of a new data set. You can follow along on your own machine if you have R or RStudio installed. For a primer on using R and setting up RStudio, check out our Introduction to R course. To learn to perform some of the more advanced data wrangling techniques we’ll use in this exercise, we recommend Intermediate R Programming.


Identifying Data Sources

To start, we’ll need to find data to work with. The California Department of Forestry and Fire Protection (Cal Fire) provides a wealth of resources containing historic fire data, including information about the size, duration, and cause of large wildfires (greater than 300 acres) dating back to the year 2000.

We were happy to find that Lazaro Gamio and Jill Hubley had already scraped data from CAL FIRE and made it available here. The data set contains the following variables:

  • id: The Incident Number, an ID used to uniquely identify the fire.
  • unit: The Ranger Unit, used by Cal Fire to identify management regions.
  • name: The name used to refer to the fire.
  • start: The date the fire began.
  • end: The date when the fire was contained.
  • agency: The agency that provided the data (for example, USFS or CDF).
  • acres: The number of acres burned by the fire. This data set contains information for fires greater than 300 acres.
  • cause: What caused the fire (if known).

We’ll begin by importing the data into R as a data frame named ca_fires:

library(readr)
ca_fires <- read_csv("axios-calfire-wildfire-data.csv")

Let’s have a look at the first few lines of the data:

id unit name start end agency acres cause

TGU-803 Mendocino NF MNF–154 Town 03-31-2000 04-05-2000 USFS 1500 escape control
MNF-177 Mendocino NF Cabbage 04-01-2000 04-01-2000 USFS 1540 ?
TGU-851 Mendocino NF MNF_178 Franklin 04-02-2000 04-03-2000 USFS 400 na
LMU-630 Lassen NF PNF 151 (Robinson) 04-04-2000 04-05-2000 USFS 500 na
BDU-4016 Mono County Assist (Azusa) 05-29-2000 06-01-2000 USFS 740 Campfire
TUU-4863 Tulare Elephant 06-01-2000 06-01-2000 CDF 497 Control Burn

We can use these data to answer a variety of questions to help us better understand California wildfires over the past eighteen years:

  • How large are they?
  • How long do they last?
  • When are they most likely to occur?
  • Has their severity changed over time?

Preparing the Data for Analysis

We’ll create visualizations to answer some of these questions. The data are fairly clean, but we’ll need to perform some manipulations before we begin our exploratory visualization:

  • The start and end variables are currently character data, but we need to change them to dates.
  • From start and end, we can calculate a new variable, duration, to tell us how long each fire was active.

To convert the start and end variables from character data to dates, and later to pull months and years out of them, we’ll use the lubridate package. We’ll also use tools for data wrangling from the dplyr package:

library(lubridate)
library(dplyr)

Using the dplyr function mutate() and the base R function as.Date(), we can change the start and end variables from character to date:

ca_fires <- ca_fires %>%
  mutate(start = as.Date(start, "%m-%d-%Y"),
         end = as.Date(end, "%m-%d-%Y"))
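As a quick check of the format string, the sketch below parses two start dates copied from the data preview. The raw vector is only an illustration, and lubridate’s mdy() is shown as one possible alternative rather than the approach used in this post.

```r
library(lubridate)

# Two start dates copied from the data preview (month-day-year order).
raw <- c("03-31-2000", "05-29-2000")

# Base R: the format string has to match the layout of the text.
parsed_base <- as.Date(raw, format = "%m-%d-%Y")

# lubridate alternative: mdy() parses month-day-year input without a format string.
parsed_lubridate <- mdy(raw)

class(parsed_base)                     # "Date"
all(parsed_base == parsed_lubridate)   # do both approaches agree?
```

If the format string does not match the text, as.Date() returns NA for that value, so counting NAs after the conversion is a cheap sanity check.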

Since it may be useful to calculate summary statistics by month or by year, let’s extract information from the start column to create new month and year variables. We can do this using the lubridate functions month() and year():

ca_fires <- ca_fires %>%
  mutate(month = month(start, label = TRUE),
         year = year(start))
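To see what these two lubridate functions return, here is a minimal sketch on two dates taken from the data preview. The start vector below stands in for the start column and is not the full data set.

```r
library(lubridate)

# Two start dates from the data preview, already converted to Date.
start <- as.Date(c("2000-03-31", "2000-06-01"))

month(start)                 # numeric months: 3 6
month(start, label = TRUE)   # ordered factor of abbreviated month names
year(start)                  # 2000 2000
```

Because label = TRUE yields an ordered factor, months plotted on an axis stay in calendar order instead of being sorted alphabetically.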

From the start and end dates, we can also calculate the length of time the fire was active. We’ll create a new variable, duration, using mutate():

ca_fires <- ca_fires %>%  mutate(duration = end - start)
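Subtracting one Date from another gives a difftime object rather than a plain number. The sketch below uses the dates of the first fire in the preview to show this, along with one optional way to convert the result to a numeric count of days, which some plotting and summary functions handle more predictably.

```r
# Start and end dates of the first fire in the data preview.
start <- as.Date("2000-03-31")
end   <- as.Date("2000-04-05")

duration <- end - start
class(duration)   # "difftime"

# Optional: convert to a plain number of days.
duration_days <- as.numeric(duration, units = "days")
duration_days     # 5
```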

So that we can visualize trends in the number of acres burned by wildfires over time, we’ll create a new variable, acres_cumulative, containing the cumulative sum of acres burned using mutate() and the base R function cumsum():

ca_fires <- ca_fires %>%  mutate(acres_cumulative = cumsum(acres))
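A cumulative sum depends on row order, so it only reads as "acres burned so far" when the rows are sorted by date. The sketch below assumes nothing about the order of the real file: it takes three fires from the preview, shuffles them on purpose, and sorts with dplyr’s arrange() before calling cumsum().

```r
library(dplyr)

# Three fires from the data preview, deliberately out of date order.
fires <- data.frame(
  start = as.Date(c("2000-04-02", "2000-03-31", "2000-04-01")),
  acres = c(400, 1500, 1540)
)

fires_cumulative <- fires %>%
  arrange(start) %>%                         # sort chronologically first
  mutate(acres_cumulative = cumsum(acres))

fires_cumulative$acres_cumulative   # 1500 3040 3440
```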

We are now ready to begin exploring the data by creating visualizations. We’ll use the ggplot2 package, an enormously popular data visualization package in R.

Visualizing Data Distributions

Let’s begin by considering the first two questions we asked about the wildfire data:

  • How large are CA wildfires?
  • How long do they last?

As we explore the data to answer these questions, a good first step is to create histograms to understand how the data are distributed, or where in the range of values most of the data fall. We’ll use the ggplot() function to create histograms to visualize the distributions of the variables acres and duration:

library(ggplot2)

## acres
ggplot(data = ca_fires) +
  aes(x = acres) +
  geom_histogram(bins = 30) +
  theme_classic() +
  scale_x_continuous(labels = scales::comma)

## duration
ggplot(data = ca_fires) +
  aes(x = duration) +
  geom_histogram(bins = 30) +
  theme_classic() +
  scale_x_continuous(labels = scales::comma)

Let’s look first at the histogram for acres:


In histograms, the height of the bars corresponds to the number of observations (shown on the y-axis) that fall into a range of values for a variable (on the x-axis). It’s clear from this histogram that the data are highly skewed: That is, the vast majority of fires are smaller than a few thousand acres, but there are some very large values for area, nearly 300,000 acres. What can a histogram tell us about how long fires last?
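When a variable is this skewed, two common follow-ups are to compare the mean with the median and to redraw the histogram on a logarithmic x-axis. The sketch below is one possible way to do both. It uses only the six acreage values from the data preview, so the numbers illustrate the technique rather than describe the full data set.

```r
library(ggplot2)

# Acres for the six fires in the data preview; swap in ca_fires for the real plot.
fires <- data.frame(acres = c(1500, 1540, 400, 500, 740, 497))

median(fires$acres)   # 620
mean(fires$acres)     # about 863, pulled upward by the two largest fires

# Same histogram as before, but with a log10 x-axis to spread out small fires.
acres_log_hist <- ggplot(data = fires) +
  aes(x = acres) +
  geom_histogram(bins = 30) +
  theme_classic() +
  scale_x_log10(labels = scales::comma)
```

A mean well above the median is a typical sign of right skew. The log scale keeps the largest fires on the plot without squeezing everything else into a single bar.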

Let’s take a look at the histogram for duration:

Take a look at the x-axis scale on this plot. All values of duration should be positive numbers, but the histogram indicates the presence of some large negative values. Let’s use the dplyr function filter() to take a look at observations for which values of duration are negative:

ca_fires %>%  filter(duration < 0)

If we look at the start and end dates for these observations, we can see that they appear to be incorrect, possibly due to a data entry error:

id         unit     name            start      end
<chr>      <chr>    <chr>           <date>     <date>
TUU-007233 Tulare   Frazier         2003-07-08 2003-07-07
TCU-006534 TUOLUMNE EARLY           2004-08-11 2004-08-09
BTU-9805   BUTTE    EMPIRE          2008-08-13 2008-08-03
BTU-7042   PLUMAS   BOULDER COMPLEX 2016-06-25 2006-07-05

Let’s remove these clearly inaccurate observations from the data set:

ca_fires <- ca_fires %>%
  filter(duration >= 0)
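Whether fires that start and end on the same day (a duration of zero) survive this step depends on the comparison operator. The sketch below shows the difference using the four suspect rows from the table above plus one same-day fire from the data preview. Converting duration to numeric days is a choice made for this example only.

```r
library(dplyr)

# The four rows with end before start, plus one same-day fire (Elephant).
fires <- data.frame(
  name  = c("Frazier", "EARLY", "EMPIRE", "BOULDER COMPLEX", "Elephant"),
  start = as.Date(c("2003-07-08", "2004-08-11", "2008-08-13", "2016-06-25", "2000-06-01")),
  end   = as.Date(c("2003-07-07", "2004-08-09", "2008-08-03", "2006-07-05", "2000-06-01"))
) %>%
  mutate(duration = as.numeric(end - start, units = "days"))

n_negative    <- fires %>% filter(duration < 0) %>% nrow()    # 4
n_strict_keep <- fires %>% filter(duration > 0) %>% nrow()    # 0, same-day fire dropped
n_loose_keep  <- fires %>% filter(duration >= 0) %>% nrow()   # 1, same-day fire kept
```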

If we re-run the code for creating a histogram of duration now that we have omitted the rows with inaccurate date data, we can see that the vast majority of fires burn for less than 100 days.

Visualizing Annual Patterns

Now that we’ve used histograms to understand the most common sizes and durations for wildfires, let’s visualize the data to answer another question:

  • When are fires most likely to occur?

Are wildfires more common at certain times of the year? Since we have a month variable, we can visualize trends in wildfire occurrence and number of acres burned by month. First, let’s use the dplyr functions group_by() and summarize() to create a count of the number of fires that occurred during each month:

month_summary <- ca_fires %>%
  group_by(month) %>%
  summarize(fires = n())
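For simple tallies like this one, dplyr’s count() is a shorter alternative to group_by() followed by summarize(). The sketch below compares the two on a small made-up data frame. The month values are hypothetical and chosen only to keep the example short.

```r
library(dplyr)

# Hypothetical month labels for five fires; not taken from the real data.
toy_fires <- data.frame(month = c("Jul", "Jul", "Aug", "Aug", "Aug"))

by_summarize <- toy_fires %>%
  group_by(month) %>%
  summarize(fires = n())

# count() groups, tallies, and ungroups in one step.
by_count <- toy_fires %>%
  count(month, name = "fires")

by_count$fires   # 3 2 (Aug, Jul)
```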

Then, we’ll create a bar chart to visualize the number of fires per month:

ggplot(data = month_summary) +
  aes(x = month, y = fires) +
  geom_bar(stat = "identity") +
  theme_classic() +
  scale_y_continuous(labels = scales::comma)

This bar graph allows us to clearly see that most fires occur during the spring, summer, and fall, seasons that tend to be characterized by hot, dry conditions conducive to fire spread. However, it does not allow us to visualize the variability in the number of fires that occur per month among different years.

Instead, let’s summarize counts of fire occurrence in the ca_fires data frame by month and by year and create a boxplot that will let us visualize the variability of monthly fire counts across years:

month_year_summary <- ca_fires %>%
  group_by(month, year) %>%
  summarize(fires = n())

ggplot(data = month_year_summary) +
  aes(x = month, y = fires) +
  geom_boxplot() +
  theme_classic() +
  scale_y_continuous(labels = scales::comma)

Boxplots graphically depict a data set’s median, quartiles, and range, with unusually extreme values drawn as individual points. They are therefore excellent visualizations for comparing groups of data without obscuring variability. Let’s have a look at our boxplot of the number of wildfires by month:

As the bar graph we created earlier made clear, wildfires are most common during the summer months and are usually rare during the winter. The presence of points representing outliers demonstrates that, while not common, wildfires can occur year-round.
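To make the outlier points less mysterious, the sketch below reproduces by hand the rule that geom_boxplot() applies by default: any value more than 1.5 times the interquartile range beyond the quartiles is drawn as a separate point. The counts are hypothetical and stand for one calendar month observed across seven years.

```r
# Hypothetical fire counts for one calendar month across seven years.
fires <- c(2, 3, 3, 4, 5, 6, 15)

quartiles <- quantile(fires, probs = c(0.25, 0.5, 0.75))
quartiles                        # 3.0 4.0 5.5 (box edges and median line)

iqr <- IQR(fires)                # 2.5
upper_fence <- unname(quartiles[3]) + 1.5 * iqr

outliers <- fires[fires > upper_fence]
outliers                         # 15 would be drawn as an individual point
```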

Visualizing Changes Over Time

Let’s move on to another question we can answer using these data:

  • Has wildfire severity changed over time?

To answer this question, we’ll create a line graph of cumulative acres burned over time:

ca_fires %>%
  ggplot() +
  aes(x = start, y = acres_cumulative) +
  geom_line() +
  theme_classic() +
  scale_y_continuous(labels = scales::comma) +
  ylab("Acres Burned (cumulative)") +
  xlab("date") +
  ggtitle("Cumulative Acres Burned Since 2000")

A straight, upward-trending line would indicate that acres have burned at a constant rate over time, whereas any increase or decrease in that rate will be noticeable as a change in the line’s slope.

As we look at the line graph of cumulative acres burned, we can see a pattern that looks a bit like a staircase: The number of acres burned increases rapidly, and then levels off again. Based on earlier explorations, we can infer that this pattern is likely due to seasonal differences. More acres are burned during the summer months, and the increase is smaller during the cooler months. We can also see that some “steps” are taller than others, indicating a larger number of acres burned at once.
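One way to put numbers on the height of each “step” is to total the acres burned per year. The sketch below shows the idea on a made-up data frame. The column names match the ones created earlier, but the values are hypothetical.

```r
library(dplyr)

# Hypothetical fires across two years; acreages are made up for illustration.
toy_fires <- data.frame(
  year  = c(2000, 2000, 2001, 2001, 2001),
  acres = c(1500, 400, 740, 500, 12000)
)

annual_acres <- toy_fires %>%
  group_by(year) %>%
  summarize(fires = n(), acres = sum(acres))

annual_acres$acres   # 1900 13240
```

Applied to ca_fires, a summary like this could feed a bar chart of acres burned per year, which makes it easier to compare years than reading step heights off the cumulative line.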

Conclusions and Next Steps

In this exercise, our exploratory data visualization provided us with information about California wildfires greater than 300 acres:

  • While there are some large outliers, most wildfires are smaller than a few thousand acres and last for less than 100 days.
  • Most wildfires occur during the summer months, although they can occur at any time of year.
  • Some years are characterized by more destructive wildfires.

The purpose of exploratory data visualization is to familiarize yourself with a new data set, get an idea of trends, and plan future analyses. An interesting next step might be to explore the effect of climate and weather conditions on forest fire frequency and severity. There are many sources for such data:

  • You can acquire drought data for the state of California from the National Oceanic and Atmospheric Administration’s (NOAA) National Integrated Drought Information System (NIDIS) program.
  • Historic weather and climate data such as temperature and precipitation are available from NOAA.
  • NIDIS offers data on soil moisture, an indicator of drought and a factor that can exacerbate wildfires.


If you’re interested in learning more about using R for exploratory data visualization, including ways to investigate relationships between environmental variables and wildfire data, be sure to check out our Data Visualization in R course.

Vik Paruchuri

About the author

Vik is the CEO and Founder of Dataquest.
