Explore Happiness Data Using Python Pivot Tables
One of the biggest challenges when facing a new data set is knowing where to start and what to focus on. Being able to quickly summarize hundreds of rows and columns can save you a lot of time and frustration. A simple tool you can use to achieve this is a pivot table, which helps you slice, filter, and group data at the speed of inquiry and represent the information in a visually appealing way.
Pivot table, what is it good for?
You may already be familiar with the concept of pivot tables from Excel, where they were introduced in 1994 by the trademarked name PivotTable. This tool enabled users to automatically sort, count, total, or average the data stored in one table. In the image below we used the PivotTable functionality to quickly summarize the Titanic data set. The larger table below displays the first ~30 rows of the data set, and the smaller tables are the PivotTables we created.
The pivot table on the left grouped the data according to the Sex and Survived columns. As a result, this table displays the percentage of each gender among the different survival statuses (0: didn't survive, 1: survived). This allows us to quickly see that women had better chances of survival than men. The table on the right also uses the Survived column, but this time the data is grouped by Class.
Introducing our data set: World Happiness Report
We used Excel for the above examples, but this post will demonstrate the advantages of the pivot_table built-in function in pandas. We'll use the World Happiness Report, which is a survey about the state of global happiness. The report ranks more than 150 countries by their happiness levels, and has been published almost every year since 2012. We'll use data collected in the years 2015, 2016, and 2017, which is available for download if you'd like to follow along. We're running Python 3.6 and pandas 0.19.
Some interesting questions we might like to answer are:
- Which are the happiest and least happy countries and regions in the world?
- Is happiness affected by region?
- Did the happiness score change significantly over the past three years?
Let's import our data and take a quick first look:
import pandas as pd
import numpy as np
# reading the data
data = pd.read_csv('data.csv', index_col=0)
# sort the df by ascending years and descending happiness scores
data.sort_values(['Year', "Happiness Score"], ascending=[True, False], inplace=True)
# display first 10 rows
data.head(10)
Each country's Happiness Score is calculated by summing the seven other variables in the table; we'll verify this with a quick check after the list below. Each of these variables reveals a population-weighted average score on a scale running from 0 to 10 that is tracked over time and compared against other countries.
These variables are:
- Economy: real GDP per capita
- Family: social support
- Health: healthy life expectancy
- Freedom: freedom to make life choices
- Trust: perceptions of corruption
- Generosity: perceptions of generosity
- Dystopia: each country is compared against a hypothetical nation that represents the lowest national averages for each key variable and is, along with residual error, used as a regression benchmark
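Since the Happiness Score is described as the sum of these components, we can sanity-check that on our own copy of the data. Here's a minimal sketch, assuming the component columns are named exactly as in the list above (adjust the names if your CSV differs):

# a sketch: check that the score is (approximately) the sum of its components
# NOTE: assumes the component columns carry the names listed above
components = ['Economy', 'Family', 'Health', 'Freedom', 'Trust', 'Generosity', 'Dystopia']
complete = data.dropna()
print(np.allclose(complete[components].sum(axis=1), complete['Happiness Score']))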
Each country's Happiness Score determines its Happiness Rank, which is its relative position among other countries in that specific year. For example, the first row indicates that Switzerland was ranked the happiest country in 2015 with a happiness score of 7.587. Switzerland was ranked first just ahead of Iceland, which scored 7.561. Denmark was ranked third in 2015, and so on. It's interesting to note that Western Europe took seven of the top eight rankings in 2015.

We'll concentrate on the final Happiness Score to demonstrate the technical aspects of pivot tables.
# getting an overview of our data
print("Our data has {0} rows and {1} columns".format(data.shape[0], data.shape[1]))
# checking for missing values
print("Are there missing values? {}".format(data.isnull().any().any()))
data.describe()
Our data has 495 rows and 12 columns
Are there missing values? True
The describe() method reveals that Happiness Rank ranges from 1 to 158, which means that the largest number of surveyed countries for a given year was 158. It's worth noting that Happiness Rank was originally of type int. The fact that it's displayed as a float here implies we have NaN values in this column (we can also determine this from the count row, which only amounts to 470, as opposed to the 495 rows in our data set).

The Year column doesn't have any missing values: first, because it's displayed in the data set as int, and second, because the count for Year amounts to 495, which is the number of rows in our data set. By comparing the count value for Year to the other columns, it seems we can expect 25 missing values in each column (495 in Year vs. 470 in all other columns).
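To confirm that expectation directly, we can count the missing values per column (a quick check of our own, not part of the original output):

# count the missing values in each column
data.isnull().sum()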
Categorizing the data by Year and Region
The fun thing about pandas pivot_table is that you can get another point of view on your data with only one line of code. Most of the pivot_table parameters use default values, so the only mandatory parameters you must add are data and index. Though it isn't mandatory, we'll also use the values parameter in the next example.

- data is self-explanatory: it's the DataFrame you'd like to use
- index is the column, grouper, array (or list of the previous) you'd like to group your data by. It will be displayed in the index column (or columns, if you're passing in a list)
- values (optional) is the column you'd like to aggregate. If you do not specify this, the function will aggregate all numeric columns (see the quick example right after this list)
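For example, omitting values makes pivot_table average every numeric column in the table (a quick illustration of our own):

# with no values argument, all numeric columns are aggregated
pd.pivot_table(data, index='Year')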
Let's first look at the output, and then explain how the table was produced:
pd.pivot_table(data, index= 'Year', values= "Happiness Score")
By passing Year as the index parameter, we chose to group our data by Year. The output is a pivot table that displays the three different values for Year as the index, and the Happiness Score as the values. It's worth noting that the default aggregation function is the mean (or average), so the values displayed in the Happiness Score column are the yearly averages across all countries. The table shows that the average for all countries was highest in 2016, and that the 2017 average is the lowest of the three years.
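You can make that default explicit by passing the aggregation function yourself; this produces the same table as the call above:

# equivalent to the call above - the default mean is just made explicit
pd.pivot_table(data, index='Year', values='Happiness Score', aggfunc=np.mean)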
Here's a detailed diagram of how this pivot table was created:
Next, let's use the Region column as the index:
pd.pivot_table(data, index = 'Region', values="Happiness Score")
The numbers displayed in the Happiness Score column in the pivot table above are means, just as before - but this time each one is a region's mean across all documented years (2015, 2016, and 2017). This display makes it easier to see that Australia and New Zealand have the highest average score, while North America is ranked close behind. It's interesting that despite the initial impression we got from reading the data, which showed Western Europe in most of the top places, Western Europe is actually ranked third when calculating the average for the past three years. The lowest-ranked region is Sub-Saharan Africa, followed closely by Southern Asia.
Creating a multi-index pivot table
You may have used groupby() to achieve some of the pivot table functionality (we've previously demonstrated how to use groupby() to analyze your data). However, the pivot_table() built-in function offers straightforward parameter names and default values that can help simplify complex procedures like multi-indexing.
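For comparison, here's a rough groupby() equivalent of the single-index pivot table we built above - pivot_table essentially wraps this pattern in one readable call:

# roughly equivalent to pd.pivot_table(data, index='Region', values='Happiness Score')
data.groupby('Region')['Happiness Score'].mean().to_frame()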
In order to group the data by more than one column, all we have to do is pass in a list of column names. Let's categorize the data by Region and Year.
pd.pivot_table(data, index = ['Region', 'Year'], values="Happiness Score")
These examples also reveal where the pivot table got its name from: it allows you to rotate or pivot the summary table, and this rotation gives us a different perspective of the data, a perspective that can very well help you quickly gain valuable insights.
This is one way to look at the data, but we can use the columns parameter to get a better display:

- columns is the column, grouper, array, or list of the previous you'd like to group your data by. Using it will spread the different values horizontally.
Using Year as the columns argument will display the different values for Year, and will make for a much better display, like so:
pd.pivot_table(data, index= 'Region', columns='Year', values="Happiness Score")
Visualizing the pivot table using plot()
If you want to look at the visual representation of the previous pivot table we created, all you need to do is add plot() at the end of the pivot_table function call (you'll also need to import the relevant plotting libraries).
import matplotlib.pyplot as plt
import seaborn as sns
# use Seaborn styles
sns.set()
pd.pivot_table(data, index= 'Region', columns= 'Year', values= "Happiness Score").plot(kind= 'bar')
plt.ylabel("Happiness Score")
The visual representation helps reveal that the differences between years are minor. Having said that, it also shows a steady decrease in the happiness score of both regions located in the Americas.
Manipulating the data using aggfunc
Up until now we've used the average to get insights about the data, but there are other important values to consider. Time to experiment with the aggfunc parameter:

- aggfunc (optional) accepts a function or list of functions you'd like to use on your group (default: numpy.mean). If a list of functions is passed, the resulting pivot table will have hierarchical columns whose top level are the function names.
Let's add the median, minimum, maximum, and the standard deviation for each region. This can help us evaluate how accurate the average is, and if it's really representative of the real picture.
pd.pivot_table(data, index= 'Region', values= "Happiness Score", aggfunc= [np.mean, np.median, min, max, np.std])
Looks like some regions have extreme values that might affect our average more than we'd like them to. For example, the Middle East and Northern Africa region has a high standard deviation, so we might want to remove extreme values. Let's also see how many values we're calculating for each region, as this might affect the representation we're seeing. For example, Australia and New Zealand have a very low standard deviation and are ranked happiest for all three years, but we can also assume they only account for two countries.
Applying a custom function to remove outliers
pivot_table allows you to pass your own custom aggregation functions as arguments. You can either use a lambda function, or create a function. Let's calculate the average number of countries in each region in a given year. We can do this easily using a lambda function, like so:
pd.pivot_table(data, index = 'Region', values="Happiness Score", aggfunc= [np.mean, min, max, np.std, lambda x: x.count()/3])
Both of the highest-ranking regions with the lowest standard deviation account for only two countries each. Sub-Saharan Africa, on the other hand, has the lowest Happiness Score, but it accounts for 43 countries. An interesting next step would be to remove extreme values from the calculation and see if the ranking changes significantly. Let's create a function that ignores the extreme values by averaging the values at the 0.25 and 0.75 quantiles of each group (this robust estimate is sometimes called the midhinge). We'll use this function to calculate the average for each region and check whether the ranking stays the same or not.
def remove_outliers(values):
    # average the 0.25 and 0.75 quantiles (the midhinge),
    # so extreme values don't pull the estimate around
    mid_quantiles = values.quantile([.25, .75])
    return np.mean(mid_quantiles)
pd.pivot_table(data, index = 'Region', values="Happiness Score", aggfunc= [np.mean, remove_outliers, lambda x: x.count()/3])
Removing the outliers mostly affected the regions with a higher number of countries, which makes sense. We can see that Western Europe (average of 21 countries surveyed per year) improved its ranking. Unfortunately, Sub-Saharan Africa (average of 39 countries surveyed per year) received an even lower ranking when we removed the outliers.
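If you'd rather average every value that falls between the two quantiles (a trimmed mean) instead of just the two quantile points themselves, a variant might look like this (a sketch of ours; the trimmed_mean name is hypothetical):

def trimmed_mean(values):
    # keep only the values between the 0.25 and 0.75 quantiles, then average
    low, high = values.quantile([.25, .75])
    return values[(values >= low) & (values <= high)].mean()

pd.pivot_table(data, index='Region', values='Happiness Score', aggfunc=[np.mean, trimmed_mean])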
Categorizing using string manipulation
Up until now we've grouped our data according to the categories in the original table. However, we can search the strings in the categories to create our own groups. For example, it would be interesting to look at the results by continent. We can do this by looking for region names that contain Asia, Europe, etc. To do this, we can first assign our pivot table to a variable, and then add our filter:
table = pd.pivot_table(data, index = 'Region', values="Happiness Score", aggfunc= [np.mean, remove_outliers])
table[table.index.str.contains('Asia')]
Let's see the results for Europe:
table[table.index.str.contains('Europe')]
The comparison shows that the two European regions have a larger difference in happiness score than the Asian regions do. In most cases, removing the outliers makes the score higher, but not in Eastern Asia.
If you'd like to extract specific values from more than one column, then it's better to use df.query, because the previous method won't work for filtering on multi-indexes. For example, we can choose to view specific years and specific regions in the Africa area.
table = pd.pivot_table(data, index = ['Region', 'Year'], values='Happiness Score', aggfunc= [np.mean, remove_outliers])
table.query('Year == [2015, 2017] and Region == ["Sub-Saharan Africa", "Middle East and Northern Africa"]')
In this example the differences are minor, but an interesting exercise would be to compare data from earlier years, since the report has been published since 2012.
Handling missing data
We've covered the most powerful parameters of pivot_table thus far, so you can already get a lot out of it if you experiment with this method on your own project. Having said that, it's useful to quickly go through the remaining parameters (which are all optional and have default values). The first thing to talk about is missing values.
- dropna is type boolean, and is used to indicate that you do not want to include columns whose entries are all NaN (default: True)
- fill_value is type scalar, and is used to choose a value to replace missing values with (default: None)
We don't have any columns where all entries are NaN, but it's worth knowing that if we did, pivot_table would drop them by default, per the dropna definition.
We have been letting pivot_table treat our NaNs according to the default settings. Since the fill_value default is None, we didn't replace any missing values in our data set. To demonstrate this, we'll need to produce a pivot table with NaN values in it. We can split the Happiness Score of each region into quantiles, and check how many countries fall into each quantile (hoping at least one of the quantiles will have missing values in it).
To do this, we'll use qcut(), which is a built-in pandas function that allows you to split your data into any number of quantiles you choose. For example, specifying pd.qcut(data["Happiness Score"], 4) will result in four quantile bins:

- 0-25%
- 25-50%
- 50-75%
- 75-100%
# splitting the happiness score into 4 quantiles
score = pd.qcut(data["Happiness Score"], 4)
pd.pivot_table(data, index= ['Region', score], values= "Happiness Score", aggfunc= 'count').head(9)
Regions where there are no countries in a specific quantile show NaN. This isn't ideal, because a count of NaN doesn't give us any useful information. It's less confusing to display 0, so let's substitute the NaNs with zeros using fill_value:
# splitting the happiness score into 3 quantiles
score = pd.qcut(data["Happiness Score"], 3)
pd.pivot_table(data, index= ['Region', score], values= "Happiness Score", aggfunc= 'count', fill_value= 0)
Adding total rows/columns
The last two parameters are both optional and mostly useful to improve display:
- margins is type boolean and allows you to add an 'All' row/column, e.g. for subtotals and grand totals (default: False)
- margins_name is type string and accepts the name of the row/column that will contain the totals when margins is True (default: 'All')
Let's use these to add a total to our last table.
# splitting the happiness score into 3 quantiles
score = pd.qcut(data['Happiness Score'], 3)
# creating a pivot table with total rows/columns added
pd.pivot_table(data, index= ['Region', score], values= "Happiness Score", aggfunc= 'count', fill_value= 0, margins = True, margins_name= 'Total count')
Let's summarize
If you're looking for a way to inspect your data from a different perspective, then pivot_table is the answer. It's easy to use, it's useful for both numeric and categorical values, and it can get you results in one line of code.
If you enjoyed digging into this data and you're interested in investigating it further, we suggest adding survey results from previous years, and/or combining additional columns of country information, such as poverty, terrorism, and unemployment figures. Feel free to share your notebook, and enjoy your learning!