In recent weeks, news of the devastating wildfires sweeping parts of the US state of California have featured prominently in the news.
While most wildfires are started accidentally by humans, weather conditions like wind and drought can exacerbate fires' spread and intensity. Improved understanding of historical wildfire trends and causes can inform fire management and save lives and property. In this exercise, we'll use R to perform exploratory data visualization as we learn about historical wildfire data for the state of California. Exploratory data visualization is an important first step to take when deciding an further analyses of a new data set.
If you'd like, you can follow along on your own machine if you have R or RStudio installed. For a primer on using R and setting up RStudio, check out our Introduction to R course. To learn to perform some of the more advanced data wrangling techniques we'll use in this exercise, we recommend Intermediate R Programming.
Identifying Data Sources
To start, we'll need to find data to work with.
The California Department of Forestry and Fire Protection (Cal Fire) provides a wealth of resources containing historic fire data, including information about the size, duration, and cause of large wildfires (greater than 300 acres) dating back to the year 2000. Unfortunately, most of the data are stored in PDF files, like this. It is certainly possible to programatically extract the data and clean it, and often, this is a necessary step. However, we were happy to find that Lazaro Gamio and Jill Hubley had already scraped data from CAL FIRE and made it available here.
The data set contains the following variables:
- id: The Incident Number, an ID used to uniquely identify the fire.
- unit: The Ranger Unit, used by Cal Fire to identify management regions. * name: The name used to refer to the fire.
- start: The date the fire began.
- end: The date when the fire was contained.
- agency: The federal agency that provided the data.
- acres: The number of acres burned by the fire. This data set contains information for fires greater than 300 acres.
- cause: What caused the fire (if known).
We'll begin by importing the data into R as a data frame names
library(readr) ca_fires <- read_csv("axios-calfire-wildfire-data.csv")
Let's have a look at the first few lines of the data:
|TGU-803||Mendocino NFMNF--154||Town||03-31-2000||04-05-2000||USFS||1500||escape control|
|TGU-851||Mendocino NF MNF_178||Franklin||04-02-2000||04-03-2000||USFS||400||na|
|LMU-630||Lassen NF||PNF 151 (Robinson)||04-04-2000||04-05-2000||USFS||500||na|
|BDU-4016||Mono County||Assist (Azusa)||05-29-2000||06-01-2000||USFS||740||Campfire|
We can use these data to answer a variety of questions to help us better understand California wildfires over the past eighteen years:
- How large are they?
- How long do they last?
- When are they most likely to occur?
- Have their severity changed over time?
Preparing the Data for Analysis
We'll create visualizations to answer some of these questions. The data are fairly clean, but we'll need to perform some manipulations before we begin our exploratory visualization:
endvariables are currently character data, but we need to change them to dates.
end, we can calculate a new variable,
duration, to tell us how long each fire was active.
To change the
end variable dates from character data to date data, we'll use the package lubridate. We'll also use tools for data wrangling from the
mutate() and the
as.Date(), we can change the
end dates from character to numeric:
ca_fires <- ca_fires %>% mutate(start = as.Date(start, "%m-%d-%Y"), end = as.Date(end, "%m-%d-%Y"))
Since it may be useful to calculate summary statistics by month or by year, let's extract information from the
start column to create new
year variables. We can do this using the
ca_fires <- ca_fires %>% mutate(Month = month(start, label = TRUE), Year = year(start))
end dates, we can also calculate the length of time the fire was active. We'll create a new variable,
ca_fires <- ca_fires %>% mutate(duration = end - start)
So that we can visualize trends in the number of acres burned by wildfires over time, we'll create a new variable,
acres_cumulative, containing the cumulative sum of acres burned using
mutate() and the base R function
ca_fires <- ca_fires %>% mutate(acres_cumulative = cumsum(acres))
We are now ready to begin exploring the data by creating visualizations. We'll use the
ggplot2 package, an enormously popular data visualization package in R.
Visualizing Data Distributions
Let's begin by considering the first two questions we asked about the wildfire data:
- How large are CA wildfires?
- How long do they last?
As we explore the data to try to understand our answers to this question, a good first step is to create histograms to understand how the data are distributed, or where in the range of values most of the data fall.
We'll use the ggplot() function to create histograms to visualize the distributions of the variables
## area ggplot(data = ca_fires) + aes(x = acres) + geom_histogram(bins = 30) + theme_classic() + scale_x_continuous(labels = scales::comma) ## duration ggplot(data = ca_fires) + aes(x = duration) + geom_histogram(bins = 30) + theme_classic() + scale_x_continuous(labels = scales::comma)
Let's look first at the histogram for
In histograms, the height of the bars corresponds to the number of observations (shown on the y-axis) that fall into a range of values for a variable (on the x-axis). It's clear from this histogram that the data are highly skewed: That is, the vast majority of fires are smaller than a few thousand acres, but there are some very large values for
area, nearly 300,000 acres.
What can a histogram tell us about how long fires last? Let's take a look at the histogram for
Take a look at the x-axis scale on this plot. All values of
duration should be positive numbers, but the histogram indicates the presence of some large negative values. Let's use the
filter() to take a look at observations for which values of
duration are negative:
ca_fires %>% filter(duration < 0)
If we look at the
end dates for these observations, we can see that they appear to be incorrect, possibly due to a data entry error:
id unit name start end <chr> <chr> <chr> <date> <date> TUU-007233 Tulare Frazier 2003-07-08 2003-07-07 TCU-006534 TUOLUMNE EARLY 2004-08-11 2004-08-09 BTU-9805 BUTTE EMPIRE 2008-08-13 2008-08-03 BTU-7042 PLUMAS BOULDER COMPLEX 2016-06-25 2006-07-05
Let's remove these clearly inaccurate observations from the data set:
ca_fires <- ca_fires %>% filter(duration >0)
If we re-run the code for creating a histogram of
duration now that we have omitted the rows with inaccurate date data, we can see that the vast majority of fires burn for less than 100 days.
Visualizing Annual Patterns
Now that we've used histograms to understand the most common sizes and durations for wildfires, let's visualize the data to answer another question:
- When are fires most likely to occur?
Are wildfires more common at certain times of the year? Since we have a
month variable, we can visualize trends in wildfire occurrence and number of acres burned by month.
First, let's use the
summarize() to create a count of the number of fires that occurred during each month:
month_summary <- ca_fires %>% group_by(month) %>% summarize(fires = n())
Then, we'll create a bar chart to visualize the number of fires per month:
ggplot(data = month_summary) + aes(x = month, y = fires) + geom_bar(stat = "identity")+ theme_classic() + scale_y_continuous(labels = scales::comma)
This bar graph allows us to clearly see that most fires occur during the spring, summer, and fall, seasons that tend be characterized by hot, dry conditions conducive to fire spread. However, it does not allow us to visualize the variability in the number of fires that occur per month among different years.
Instead, let's summarize counts of fire occurrence in the
ca_fires data frame by month and by year and create a boxplot that will let us visualize the variability of monthly fire counts across years:
month_year_summary <- ca_fires %>% group_by(month, year) %>% summarize(fires = n()) ggplot(data = month_year_summary) + aes(x = month, y = fires) + geom_boxplot() + theme_classic() + scale_y_continuous(labels = scales::comma)
Boxplots graphically depict data's minimum, maximum, and median values, and therefore are excellent visualizations for comparing groups of data without obscuring variability. Let's have a look at our boxplot of the number of wildfires by month:
As the bar graph we created earlier made clear, wildfires are most common during the summer months and are usually rare during the winter. The presence of points representing outliers demonstrates that, while not common, wildfires can occur year-round.
Visualizing Changes Over Time
Let's move on to another question we can answer using these data:
- Has wildfire severity changed over time?
To answer this question, we'll create a line graph of cumulative acres burned over time:
ca_fires %>% ggplot() + aes(x = start, y = acres_cumulative) + geom_line() + theme_classic() + scale_y_continuous(labels = scales::comma) + ylab("Acres Burned (cumulative)") + xlab("date") + ggtitle("Cumulative Acres Burned Since 2000")
A smooth, upward-trending line would indicate that the number of acres burned by wildfires has remained consistent over time, whereas any increases or decreases in the number of acres burned will be noticeable as a change in the line's slope.
As we look at the line graph of cumulative acres burned, we can see that a pattern that looks a bit like a staircase: The number of acres burned increases rapidly, and then levels off again. Based on earlier explorations, we can infer that this pattern is likely due to seasonal differences. More acres are burned during the summer months, and the increase is smaller during the cooler months. We can also see that some "steps" are taller than others, indicating a larger number of acres burned at once.
Conclusions and Next Steps
In this exercise, our exploratory data visualization provided us with information about California wildfires greater than 300 acres:
- While there are some large outliers, most wildfires are smaller than a few hundred acres and last for less than 50 days.
- Most wildfires occur during the summer months, although they can occur at any time of year.
- Some years are characterized by more destructive wildfires.
The purpose of exploratory data visualization is to familiarize yourself with a new data set, get an idea of trends, and plan future analyses. An interesting next step might be to explore the effect of climate and weather conditions on forest fire frequency and severity. There are many sources for such data:
- You can acquire drought data for the state of California from the National Oceanic and Atmospheric Administration’s (NOAA) National Integrated Drought Information System (NIDIS) program.
- Historic weather and climate data such as temperature and precipitation is available from NOAA.
- NIDIS offers data on soil moisture, an indicator of drought and a factor that can exacerbate wildfires.
If you're interested in learning more about using R for explortory data visualization, including ways to investigate relationships between environmental variables and wildfire data, be sure to check out our Data Visualization in R course, which just launched!