Tutorial: Using Pandas with Large Data Sets in Python
Did you know Python and pandas can reduce your memory usage by up to 90% when you're working with big data sets?
When working in Python using pandas with small data (under 100 megabytes), performance is rarely a problem. When we move to larger data (100 megabytes to multiple gigabytes), performance issues can make run times much longer, and cause code to fail entirely due to insufficient memory.
While tools like Spark can handle large data sets (100 gigabytes to multiple terabytes), taking full advantage of their capabilities usually requires more expensive hardware. And unlike pandas, they lack rich feature sets for high quality data cleaning, exploration, and analysis. For medium-sized data, we're better off trying to get more out of pandas, rather than switching to a different tool.
In this post, we'll learn about memory usage with pandas, and how to reduce a dataframe's memory footprint by almost 90% simply by selecting the appropriate data types for its columns.
Working with baseball game logs
We'll be working with data from 130 years of major league baseball games, originally sourced from Retrosheet.
The data was originally in 127 separate CSV files; however, we used csvkit to merge the files and added column names in the first row. If you'd like to download our version of the data to follow along with this post, we have made it available here.
Let's start by importing both pandas and our data in Python and taking a look at the first five rows.
import pandas as pd
gl = pd.read_csv('game_logs.csv')
gl.head()
We've summarized some of the important columns below; if you'd like a guide to all of the columns, we have created a data dictionary for the whole data set:
- date — Date of the game.
- v_name — Visiting team name.
- v_league — Visiting team league.
- h_name — Home team name.
- h_league — Home team league.
- v_score — Visiting team score.
- h_score — Home team score.
- v_line_score — Visiting team line score, e.g. 010000(10)00.
- h_line_score — Home team line score, e.g. 010000(10)0X.
- park_id — ID of the park where the game was held.
- attendance — Game attendance.
We can use the DataFrame.info() method to give us some high-level information about our dataframe, including its size, information about data types, and memory usage.
By default, pandas approximates the memory usage of the dataframe to save time. Because we're interested in accuracy, we'll set the memory_usage parameter to 'deep' to get an accurate number.
gl.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171907 entries, 0 to 171906
Columns: 161 entries, date to acquisition_info
dtypes: float64(77), int64(6), object(78)
memory usage: 861.6 MB
We can see that we have 171,907 rows and 161 columns. Pandas has automatically detected types for us, with 83 numeric columns and 78 object columns. Object columns are used for strings or where a column contains mixed data types.
So that we can better understand where we can reduce this memory usage, let's take a look at how Python and pandas store data in memory.
The Internal Representation of a Dataframe
Under the hood, pandas groups the columns into blocks of values of the same type. Here's a preview of how pandas stores the first twelve columns of our dataframe.
You'll notice that the blocks don't maintain references to the column names. This is because blocks are optimized for storing the actual values in the dataframe. The BlockManager class is responsible for maintaining the mapping between the row and column indexes and the actual blocks. It acts as an API that provides access to the underlying data. Whenever we select, edit, or delete values, the dataframe class interfaces with the BlockManager class to translate our requests to function and method calls.
Each type has a specialized class in the pandas.core.internals module. Pandas uses the ObjectBlock class to represent the block containing string columns, and the FloatBlock class to represent the block containing float columns. For blocks representing numeric values like integers and floats, pandas combines the columns and stores them as a NumPy ndarray. The NumPy ndarray is built around a C array, and the values are stored in a contiguous block of memory. Due to this storage scheme, accessing a slice of values is incredibly fast.
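If you're curious, you can peek at this grouping yourself. The sketch below prints the BlockManager of a small, made-up dataframe via the private _mgr attribute (older pandas versions expose it as _data); since this is internal API, the exact output will vary between pandas versions.
import pandas as pd
# A tiny illustrative dataframe with float, int and object columns
df = pd.DataFrame({'a': [1.0, 2.0],
                   'b': [3.0, 4.0],
                   'c': [5, 6],
                   'd': ['x', 'y']})
# Private/internal attribute: the BlockManager groups the columns into
# one block per type. On older pandas versions use df._data instead.
print(df._mgr)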
Because each data type is stored separately, we're going to examine the memory usage by data type. Let's start by looking at the average memory usage for each data type.
for dtype in ['float', 'int', 'object']:
    selected_dtype = gl.select_dtypes(include=[dtype])
    mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
    mean_usage_mb = mean_usage_b / 1024 ** 2
    print("Average memory usage for {} columns: {:03.2f} MB".format(dtype, mean_usage_mb))
Average memory usage for float columns: 1.29 MB
Average memory usage for int columns: 1.12 MB
Average memory usage for object columns: 9.53 MB
Immediately we can see that most of our memory is used by our 78 object columns. We'll look at those later, but first let's see if we can improve the memory usage of our numeric columns.
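The averages hide the fact that we have very different numbers of columns of each type, so it's also worth checking the totals. Here's a small variation of the loop above that sums, rather than averages, the usage per type:
for dtype in ['float', 'int', 'object']:
    selected_dtype = gl.select_dtypes(include=[dtype])
    total_usage_mb = selected_dtype.memory_usage(deep=True).sum() / 1024 ** 2
    print("Total memory usage for {} columns: {:03.2f} MB".format(dtype, total_usage_mb))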
Understanding Subtypes
As we mentioned briefly before, under the hood pandas represents numeric values as NumPy ndarrays and stores them in a contiguous block of memory. This storage model consumes less space and allows us to access the values themselves quickly. Because pandas represents each value of the same type using the same number of bytes, and a NumPy ndarray stores the number of values, pandas can return the number of bytes a numeric column consumes quickly and accurately.
Many types in pandas have multiple subtypes that can use fewer bytes to represent each value. For example, the float type has the float16, float32, and float64 subtypes. The number portion of a type's name indicates the number of bits that type uses to represent values. For example, the subtypes we just listed use 2, 4, and 8 bytes, respectively. The following table shows the subtypes for the most common pandas types:
memory usage | float | int | uint | datetime | bool | object
---|---|---|---|---|---|---
1 byte | | int8 | uint8 | | bool |
2 bytes | float16 | int16 | uint16 | | |
4 bytes | float32 | int32 | uint32 | | |
8 bytes | float64 | int64 | uint64 | datetime64 | |
variable | | | | | | object
An int8 value uses 1 byte (or 8 bits) to store a value, and can represent 256 values (2^8) in binary. This means that we can use this subtype to represent values ranging from -128 to 127 (including 0).
We can use the numpy.iinfo class to verify the minimum and maximum values for each integer subtype. Let's look at an example:
import numpy as np
int_types = ["uint8", "int8", "int16"]
for it in int_types:
    print(np.iinfo(it))
Machine parameters for uint8
---------------------------------------------------------
min = 0
max = 255

Machine parameters for int8
---------------------------------------------------------
min = -128
max = 127

Machine parameters for int16
---------------------------------------------------------
min = -32768
max = 32767
We can see here the difference between uint (unsigned integers) and int (signed integers). Both types have the same capacity for storage, but by only storing positive values, unsigned integers allow us to be more efficient with our storage of columns that only contain positive values.
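As a quick illustration on a toy series: a column whose values run from 0 to 200 fits in uint8 (1 byte per value), but needs int16 (2 bytes per value) if we restrict ourselves to signed types, because 200 is larger than int8's maximum of 127.
s = pd.Series(range(201))  # values 0 to 200, stored as int64 by default
print(s.astype('uint8').memory_usage(deep=True))   # 1 byte per value (plus the index)
print(s.astype('int16').memory_usage(deep=True))   # 2 bytes per value; int8 can't hold 200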
Optimizing Numeric Columns with Subtypes
We can use the function pd.to_numeric() to downcast our numeric types. We'll use DataFrame.select_dtypes to select only the integer columns, then we'll optimize the types and compare the memory usage.
# We're going to be calculating memory usage a lot,
# so we'll create a function to save us some time!
def mem_usage(pandas_obj):
    if isinstance(pandas_obj, pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else:  # we assume if not a df it's a series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2  # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)
gl_int = gl.select_dtypes(include=['int'])
converted_int = gl_int.apply(pd.to_numeric,downcast='unsigned')
print(mem_usage(gl_int))
print(mem_usage(converted_int))
compare_ints = pd.concat([gl_int.dtypes,converted_int.dtypes],axis=1)
compare_ints.columns = ['before','after']
compare_ints.apply(pd.Series.value_counts)
7.87 MB
1.48 MB
We can see a drop from 7.9 to 1.5 megabytes in memory usage, which is a more than 80% reduction. The overall impact on our original dataframe isn't massive though, because there are so few integer columns.
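For reference, pd.to_numeric's downcast parameter accepts 'integer', 'signed', 'unsigned', and 'float'. A quick sketch on toy series of how the choice affects the resulting dtype:
ints = pd.Series([1, 2, 3, 4, 5])
floats = pd.Series([1.5, 2.5, 3.5])
print(pd.to_numeric(ints, downcast='unsigned').dtype)  # uint8
print(pd.to_numeric(ints, downcast='integer').dtype)   # int8
print(pd.to_numeric(floats, downcast='float').dtype)   # float32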
Let's do the same thing with our float columns.
gl_float = gl.select_dtypes(include=['float'])
converted_float = gl_float.apply(pd.to_numeric,downcast='float')
print(mem_usage(gl_float))
print(mem_usage(converted_float))
compare_floats = pd.concat([gl_float.dtypes,converted_float.dtypes],axis=1)
compare_floats.columns = ['before','after']
compare_floats.apply(pd.Series.value_counts)
100.99 MB
50.49 MB
We can see that all our float columns were converted from float64 to float32, giving us a 50% reduction in memory usage.
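Note that downcast='float' stops at float32. If a column can tolerate the much lower precision (roughly three significant digits) and the ~65,500 maximum value of float16, you could cast further yourself with astype, though that's a lossy step to apply with care. A sketch on a toy series:
s32 = pd.Series([0.5, 1.25, 2.75], dtype='float32')
s16 = s32.astype('float16')  # half the bytes per value, much lower precision
print(mem_usage(s32))
print(mem_usage(s16))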
Let's create a copy of our original dataframe, assign these optimized numeric columns in place of the originals, and see what our overall memory usage is now.
optimized_gl = gl.copy()
optimized_gl[converted_int.columns] = converted_int
optimized_gl[converted_float.columns] = converted_float
print(mem_usage(gl))
print(mem_usage(optimized_gl))
861.57 MB
804.69 MB
While we've dramatically reduced the memory usage of our numeric columns, overall we've only reduced the memory usage of our dataframe by 7%. Most of our gains are going to come from optimizing the object types.
Before we do, let's take a closer look at how strings are stored in pandas compared to the numeric types.
Comparing Numeric to String storage
The object type represents values using Python string objects, partly due to the lack of support for missing string values in NumPy. Because Python is a high-level, interpreted language, it doesn't have fine-grained control over how values are stored in memory.
This limitation causes strings to be stored in a fragmented way that consumes more memory and is slower to access. Each element in an object column is really a pointer that contains the "address" for the actual value's location in memory.
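We can see this pointer-based storage directly in NumPy: for an object array, ndarray.nbytes only counts the pointers (8 bytes each on a 64-bit system), not the strings they point to.
import numpy as np
floats = np.ones(1000, dtype='float64')
strings = np.array(['pandas'] * 1000, dtype=object)
print(floats.nbytes)   # 8000: the float values themselves, 8 bytes each
print(strings.nbytes)  # 8000: just the pointers; the string objects live elsewhere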
Below is a diagram showing how numeric data is stored in NumPy data types vs how strings are stored using Python's inbuilt types.
Diagram adapted from the excellent post Why Python Is Slow.
You may have noticed our chart earlier described object types as using a variable amount of memory. While each pointer takes up 8 bytes of memory (on a 64-bit system), each actual string value uses the same amount of memory that string would use if stored individually in Python. Let's use sys.getsizeof() to prove that out, first by looking at individual strings, and then at items in a pandas series.
from sys import getsizeof
s1 = 'working out'
s2 = 'memory usage for'
s3 = 'strings in python is fun!'
s4 = 'strings in python is fun!'
for s in [s1, s2, s3, s4]:
    print(getsizeof(s))
60
65
74
74
obj_series = pd.Series(['working out',
'memory usage for',
'strings in python is fun!',
'strings in python is fun!'])
obj_series.apply(getsizeof)
0 60
1 65
2 74
3 74
dtype: int64
You can see that the sizes of strings when stored in a pandas series are identical to their sizes as separate strings in Python.
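This is also why we've been passing deep=True throughout this post: without it, pandas only counts the 8-byte pointers in an object column and ignores the strings they point to.
print(obj_series.memory_usage(deep=False))  # counts only the pointers (plus the index)
print(obj_series.memory_usage(deep=True))   # also counts the string objects themselves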
Optimizing object types using categoricals
Pandas introduced Categoricals in version 0.15. The category type uses integer values under the hood to represent the values in a column, rather than the raw values. Pandas uses a separate mapping dictionary that maps the integer values to the raw ones. This arrangement is useful whenever a column contains a limited set of values. When we convert a column to the category dtype, pandas uses the most space-efficient int subtype that can represent all of the unique values in a column.
To get an overview of where we might be able to use this type to reduce memory, let's take a look at the number of unique values of each of our object types.
gl_obj = gl.select_dtypes(include=['object']).copy()
gl_obj.describe()
A quick glance reveals many columns where there are few unique values relative to the overall ~172,000 games in our data set.
Before we dive too far in, we'll start by selecting just one of our object columns, and looking at what happens behind the scenes when we convert it to the categorical type. We'll use the second column of our data set, day_of_week.
Looking at the table above, we can see that it only contains seven unique values. We'll convert it to categorical by using the .astype() method.
dow = gl_obj.day_of_week
print(dow.head())
dow_cat = dow.astype('category')
print(dow_cat.head())
0 Thu
1 Fri
2 Sat
3 Mon
4 Tue
Name: day_of_week, dtype: object
0 Thu
1 Fri
2 Sat
3 Mon
4 Tue
Name: day_of_week, dtype: category
Categories (7, object): [Fri, Mon, Sat, Sun, Thu, Tue, Wed]
As you can see, apart from the fact that the type of the column has changed, the data looks exactly the same. Let's take a look under the hood at what's happening.
In the following code, we use the Series.cat.codes attribute to return the integer values the category type uses to represent each value.
dow_cat.head().cat.codes
0 4
1 0
2 2
3 1
4 5
dtype: int8
You can see that each unique value has been assigned an integer, and that the underlying datatype for the column is now int8. This column doesn't have any missing values, but if it did, the category subtype handles missing values by setting them to -1.
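Here's a quick sketch showing that sentinel code in action on a toy column with a missing value:
toy = pd.Series(['Mon', 'Tue', None, 'Mon'], dtype='category')
print(toy.cat.codes)  # the missing value gets the code -1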
Lastly, let's look at the memory usage for this column before and after converting to the category type.
print(mem_usage(dow))
print(mem_usage(dow_cat))
9.84 MB
0.16 MB
We've gone from 9.8MB of memory usage to 0.16MB of memory usage, a 98% reduction! Note that this particular column probably represents one of our best-case scenarios: a column with ~172,000 items of which there are only 7 unique values.
While converting all of the columns to this type sounds appealing, it's important to be aware of the trade-offs. The biggest one is the inability to perform numerical computations. We can't do arithmetic with category columns or use methods like Series.min() and Series.max() without converting to a true numeric dtype first.
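For example, calling Series.min() on an unordered categorical raises a TypeError in recent pandas versions; converting back to the underlying type first works. A quick sketch on a toy series:
cat = pd.Series([1, 5, 10], dtype='category')
try:
    print(cat.min())  # unordered categoricals refuse min/max
except TypeError as err:
    print('TypeError:', err)
print(cat.astype('int64').min())  # convert back to a numeric dtype first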
We should stick to using the category type primarily for object columns where less than 50% of the values are unique. If all of the values in a column are unique, the category type will end up using more memory. That's because the column is storing all of the raw string values in addition to the integer category codes. You can read more about the limitations of the category type in the pandas documentation.
We'll write a loop to iterate over each object column, check if the number of unique values is less than 50%, and if so, convert it to the category type.
converted_obj = pd.DataFrame()
for col in gl_obj.columns:
    num_unique_values = len(gl_obj[col].unique())
    num_total_values = len(gl_obj[col])
    if num_unique_values / num_total_values < 0.5:
        converted_obj.loc[:, col] = gl_obj[col].astype('category')
    else:
        converted_obj.loc[:, col] = gl_obj[col]
As before, let's compare the memory usage before and after the conversion, and look at how the column types changed.
print(mem_usage(gl_obj))
print(mem_usage(converted_obj))
compare_obj = pd.concat([gl_obj.dtypes,converted_obj.dtypes],axis=1)
compare_obj.columns = ['before','after']
compare_obj.apply(pd.Series.value_counts)
752.72 MB
51.67 MB
In this case, all our object columns were converted to the category type. However, this won't be the case with all data sets, so be sure to use the process above to check.
What's more, our memory usage for our object columns has gone from 752MB to 52MB, or a reduction of 93%. Let's combine this with the rest of our dataframe and see where we sit in relation to the 861MB memory usage we started with.
optimized_gl[converted_obj.columns] = converted_obj
mem_usage(optimized_gl)
'103.64 MB'
Wow, we've really made some progress! We have one more optimization we can make - if you remember back to our table of types, there was a datetime type that we can use for the first column of our data set.
date = optimized_gl.date
print(mem_usage(date))
date.head()
0.66 MB
0 18710504
1 18710505
2 18710506
3 18710508
4 18710509
Name: date, dtype: uint32
You may remember that this was read in as an integer type and already optimized to uint32. Because of this, converting it to datetime will actually double its memory usage, as the datetime type is a 64-bit type. There's value in converting it to datetime anyway, since it will allow us to more easily do time series analysis.
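The doubling comes straight from the per-value width of the two types:
import numpy as np
print(np.dtype('uint32').itemsize)          # 4 bytes per value
print(np.dtype('datetime64[ns]').itemsize)  # 8 bytes per value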
We'll convert using the pandas.to_datetime() function, using the format parameter to tell it that our date data is stored as YYYYMMDD.
optimized_gl['date'] = pd.to_datetime(date, format='%Y%m%d')
print(mem_usage(optimized_gl))
optimized_gl.date.head()
104.29 MB
0 1871-05-04
1 1871-05-05
2 1871-05-06
3 1871-05-08
4 1871-05-09
Name: date, dtype: datetime64[ns]
Selecting Types While Reading the Data In
So far, we've explored ways to reduce the memory footprint of an existing dataframe. By reading the dataframe in first and then iterating on ways to save memory, we were able to better understand the amount of memory we can expect to save from each optimization. As we mentioned earlier in the lesson, however, we often won't have enough memory to represent all the values in a data set. How can we apply memory-saving techniques when we can't even create the dataframe in the first place?
Fortunately, we can specify the optimal column types when we read the data set in. The pandas.read_csv() function has a few different parameters that allow us to do this. The dtype parameter accepts a dictionary that has (string) column names as the keys and NumPy type objects as the values.
First, we'll store the final types of every column in a dictionary with keys for column names, first removing the date column since that needs to be treated separately.
dtypes = optimized_gl.drop('date',axis=1).dtypes
dtypes_col = dtypes.index
dtypes_type = [i.name for i in dtypes.values]
column_types = dict(zip(dtypes_col, dtypes_type))
# rather than print all 161 items, we'll
# sample 10 key/value pairs from the dict
# and print it nicely using pprint
import pprint

preview = {key: value for key, value in list(column_types.items())[:10]}

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(preview)
{
'acquisition_info': 'category',
'h_caught_stealing': 'float32',
'h_player_1_name': 'category',
'h_player_9_name': 'category',
'v_assists': 'float32',
'v_first_catcher_interference': 'float32',
'v_grounded_into_double': 'float32',
'v_player_1_id': 'category',
'v_player_3_id': 'category',
'v_player_5_id': 'category'
}
Now we can use the dictionary, along with a few parameters for the date, to read in the data with the correct types in a few lines:
read_and_optimized = pd.read_csv('game_logs.csv',
                                 dtype=column_types,
                                 parse_dates=['date'],
                                 infer_datetime_format=True)
print(mem_usage(read_and_optimized))
read_and_optimized.head()
104.28 MB
By optimizing the columns, we've managed to reduce the memory usage in pandas from 861.6 MB to 104.28 MB - an impressive 88% reduction!
Analyzing baseball games
Now that we've optimized our data, we can perform some analysis. Let's start by looking at the distribution of game days.
import matplotlib.pyplot as plt

optimized_gl['year'] = optimized_gl.date.dt.year
games_per_day = optimized_gl.pivot_table(index='year', columns='day_of_week', values='date', aggfunc=len)
games_per_day = games_per_day.divide(games_per_day.sum(axis=1), axis=0)

ax = games_per_day.plot(kind='area', stacked=True)
ax.legend(loc='upper right')
ax.set_ylim(0, 1)
plt.show()
We can see that Sunday baseball games were rare before the 1920s, and then gradually became more popular through the latter half of the last century.
We can also see pretty clearly that the distribution of game days has been relatively static for the last 50 years.
Let's also look at how game length has varied over the years.
game_lengths = optimized_gl.pivot_table(index='year', values='length_minutes')
game_lengths.reset_index().plot.scatter('year','length_minutes')
plt.show()
Looks like baseball games have continued to get longer from the 1940s onwards.
Summary and Next Steps
We've learned how pandas stores data using different types, and then we used that knowledge to reduce the memory usage of our pandas dataframe by almost 90%, just by using a few simple techniques:
- Downcasting numeric columns to more efficient types.
- Converting string columns to the categorical type.
If you'd like to dive into working with larger data in pandas some more, a lot of the content in this blog post is available in our interactive Processing Large Datasets In Pandas course, which you can start for free.