August 4, 2017

Tutorial: Using Pandas with Large Data Sets in Python

Did you know Python and pandas can reduce your memory usage by up to 90% when you're working with big data sets?

When working in Python using pandas with small data (under 100 megabytes), performance is rarely a problem. When we move to larger data (100 megabytes to multiple gigabytes), performance issues can make run times much longer, and cause code to fail entirely due to insufficient memory.

While tools like Spark can handle large data sets (100 gigabytes to multiple terabytes), taking full advantage of their capabilities usually requires more expensive hardware. And unlike pandas, they lack rich feature sets for high quality data cleaning, exploration, and analysis. For medium-sized data, we're better off trying to get more out of pandas, rather than switching to a different tool.

In this post, we'll learn about Python's memory usage with pandas, how to reduce a dataframe's memory footprint by almost 90%, simply by selecting the appropriate data types for columns.

Working with baseball game logs

We'll be working with data from 130 years of major league baseball games, originally sourced from Retrosheet.

Originally the data was in 127 separate CSV files, however we have used csvkit to merge the files, and have added column names into the first row. If you'd like to download our version of the data to follow along with this post, we have made it available here.

Let's start by importing both pandas and our data in Python and taking a look at the first five rows.


import pandas as pd
gl = pd.read_csv('game_logs.csv')
gl.head()
date number_of_game day_of_week v_name v_league v_game_number h_name h_league h_game_number v_score h_score length_outs day_night completion forefeit protest park_id attendance length_minutes v_line_score h_line_score v_at_bats v_hits v_doubles v_triples v_homeruns v_rbi v_sacrifice_hits v_sacrifice_flies v_hit_by_pitch v_walks v_intentional walks v_strikeouts v_stolen_bases v_caught_stealing v_grounded_into_double v_first_catcher_interference v_left_on_base v_pitchers_used v_individual_earned_runs v_team_earned_runs v_wild_pitches v_balks v_putouts v_assists v_errors v_passed_balls v_double_plays v_triple_plays h_at_bats h_hits h_doubles h_triples h_homeruns h_rbi h_sacrifice_hits h_sacrifice_flies h_hit_by_pitch h_walks h_intentional walks h_strikeouts h_stolen_bases h_caught_stealing h_grounded_into_double h_first_catcher_interference h_left_on_base h_pitchers_used h_individual_earned_runs h_team_earned_runs h_wild_pitches h_balks h_putouts h_assists h_errors h_passed_balls h_double_plays h_triple_plays hp_umpire_id hp_umpire_name 1b_umpire_id 1b_umpire_name 2b_umpire_id 2b_umpire_name 3b_umpire_id 3b_umpire_name lf_umpire_id lf_umpire_name rf_umpire_id rf_umpire_name v_manager_id v_manager_name h_manager_id h_manager_name winning_pitcher_id winning_pitcher_name losing_pitcher_id losing_pitcher_name saving_pitcher_id saving_pitcher_name winning_rbi_batter_id winning_rbi_batter_id_name v_starting_pitcher_id v_starting_pitcher_name h_starting_pitcher_id h_starting_pitcher_name v_player_1_id v_player_1_name v_player_1_def_pos v_player_2_id v_player_2_name v_player_2_def_pos v_player_3_id v_player_3_name v_player_3_def_pos v_player_4_id v_player_4_name v_player_4_def_pos v_player_5_id v_player_5_name v_player_5_def_pos v_player_6_id v_player_6_name v_player_6_def_pos v_player_7_id v_player_7_name v_player_7_def_pos v_player_8_id v_player_8_name v_player_8_def_pos v_player_9_id v_player_9_name v_player_9_def_pos h_player_1_id h_player_1_name h_player_1_def_pos h_player_2_id h_player_2_name h_player_2_def_pos h_player_3_id h_player_3_name h_player_3_def_pos h_player_4_id h_player_4_name h_player_4_def_pos h_player_5_id h_player_5_name h_player_5_def_pos h_player_6_id h_player_6_name h_player_6_def_pos h_player_7_id h_player_7_name h_player_7_def_pos h_player_8_id h_player_8_name h_player_8_def_pos h_player_9_id h_player_9_name h_player_9_def_pos additional_info acquisition_info
0 18710504 0 Thu CL1 na 1 FW1 na 1 0 2 54.0 D NaN NaN NaN FOR01 200.0 120.0 000000000 010010000 30.0 4.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 -1.0 6.0 1.0 -1.0 -1.0 -1.0 4.0 1.0 1.0 1.0 0.0 0.0 27.0 9.0 0.0 3.0 0.0 0.0 31.0 4.0 1.0 0.0 0.0 2.0 0.0 0.0 0.0 1.0 -1.0 0.0 0.0 -1.0 -1.0 -1.0 3.0 1.0 0.0 0.0 0.0 0.0 27.0 3.0 3.0 1.0 1.0 0.0 boakj901 John Boake NaN (none) NaN (none) NaN (none) NaN (none) NaN (none) paboc101 Charlie Pabor lennb101 Bill Lennon mathb101 Bobby Mathews prata101 Al Pratt NaN (none) NaN (none) prata101 Al Pratt mathb101 Bobby Mathews whitd102 Deacon White 2.0 kimbg101 Gene Kimball 4.0 paboc101 Charlie Pabor 7.0 allia101 Art Allison 8.0 white104 Elmer White 9.0 prata101 Al Pratt 1.0 sutte101 Ezra Sutton 5.0 carlj102 Jim Carleton 3.0 bassj101 John Bass 6.0 selmf101 Frank Sellman 5.0 mathb101 Bobby Mathews 1.0 foraj101 Jim Foran 3.0 goldw101 Wally Goldsmith 6.0 lennb101 Bill Lennon 2.0 caret101 Tom Carey 4.0 mince101 Ed Mincher 7.0 mcdej101 James McDermott 8.0 kellb105 Bill Kelly 9.0 NaN Y
1 18710505 0 Fri BS1 na 1 WS3 na 1 20 18 54.0 D NaN NaN NaN WAS01 5000.0 145.0 107000435 640113030 41.0 13.0 1.0 2.0 0.0 13.0 0.0 0.0 0.0 18.0 -1.0 5.0 3.0 -1.0 -1.0 -1.0 12.0 1.0 6.0 6.0 1.0 0.0 27.0 13.0 10.0 1.0 2.0 0.0 49.0 14.0 2.0 0.0 0.0 11.0 0.0 0.0 0.0 10.0 -1.0 2.0 1.0 -1.0 -1.0 -1.0 14.0 1.0 7.0 7.0 0.0 0.0 27.0 20.0 10.0 2.0 3.0 0.0 dobsh901 Henry Dobson NaN (none) NaN (none) NaN (none) NaN (none) NaN (none) wrigh101 Harry Wright younn801 Nick Young spala101 Al Spalding braia102 Asa Brainard NaN (none) NaN (none) spala101 Al Spalding braia102 Asa Brainard wrigg101 George Wright 6.0 barnr102 Ross Barnes 4.0 birdd102 Dave Birdsall 9.0 mcvec101 Cal McVey 2.0 wrigh101 Harry Wright 8.0 goulc101 Charlie Gould 3.0 schah101 Harry Schafer 5.0 conef101 Fred Cone 7.0 spala101 Al Spalding 1.0 watef102 Fred Waterman 5.0 forcd101 Davy Force 6.0 mille105 Everett Mills 3.0 allid101 Doug Allison 2.0 hallg101 George Hall 7.0 leona101 Andy Leonard 4.0 braia102 Asa Brainard 1.0 burrh101 Henry Burroughs 9.0 berth101 Henry Berthrong 8.0 HTBF Y
2 18710506 0 Sat CL1 na 2 RC1 na 1 12 4 54.0 D NaN NaN NaN RCK01 1000.0 140.0 610020003 010020100 49.0 11.0 1.0 1.0 0.0 8.0 0.0 0.0 0.0 0.0 -1.0 1.0 0.0 -1.0 -1.0 -1.0 10.0 1.0 0.0 0.0 2.0 0.0 27.0 12.0 8.0 5.0 0.0 0.0 36.0 7.0 2.0 1.0 0.0 2.0 0.0 0.0 0.0 0.0 -1.0 3.0 5.0 -1.0 -1.0 -1.0 5.0 1.0 3.0 3.0 1.0 0.0 27.0 12.0 13.0 3.0 0.0 0.0 mawnj901 J.H. Manny NaN (none) NaN (none) NaN (none) NaN (none) NaN (none) paboc101 Charlie Pabor hasts101 Scott Hastings prata101 Al Pratt fishc102 Cherokee Fisher NaN (none) NaN (none) prata101 Al Pratt fishc102 Cherokee Fisher whitd102 Deacon White 2.0 kimbg101 Gene Kimball 4.0 paboc101 Charlie Pabor 7.0 allia101 Art Allison 8.0 white104 Elmer White 9.0 prata101 Al Pratt 1.0 sutte101 Ezra Sutton 5.0 carlj102 Jim Carleton 3.0 bassj101 John Bass 6.0 mackd101 Denny Mack 3.0 addyb101 Bob Addy 4.0 fishc102 Cherokee Fisher 1.0 hasts101 Scott Hastings 8.0 ham-r101 Ralph Ham 5.0 ansoc101 Cap Anson 2.0 sagep101 Pony Sager 6.0 birdg101 George Bird 7.0 stirg101 Gat Stires 9.0 NaN Y
3 18710508 0 Mon CL1 na 3 CH1 na 1 12 14 54.0 D NaN NaN NaN CHI01 5000.0 150.0 101403111 077000000 46.0 15.0 2.0 1.0 2.0 10.0 0.0 0.0 0.0 0.0 -1.0 1.0 0.0 -1.0 -1.0 -1.0 7.0 1.0 6.0 6.0 0.0 0.0 27.0 15.0 11.0 6.0 0.0 0.0 43.0 11.0 2.0 0.0 0.0 8.0 0.0 0.0 0.0 4.0 -1.0 2.0 1.0 -1.0 -1.0 -1.0 6.0 1.0 4.0 4.0 0.0 0.0 27.0 14.0 7.0 2.0 0.0 0.0 willg901 Gardner Willard NaN (none) NaN (none) NaN (none) NaN (none) NaN (none) paboc101 Charlie Pabor woodj106 Jimmy Wood zettg101 George Zettlein prata101 Al Pratt NaN (none) NaN (none) prata101 Al Pratt zettg101 George Zettlein whitd102 Deacon White 2.0 kimbg101 Gene Kimball 4.0 paboc101 Charlie Pabor 7.0 allia101 Art Allison 8.0 white104 Elmer White 9.0 prata101 Al Pratt 1.0 sutte101 Ezra Sutton 5.0 carlj102 Jim Carleton 3.0 bassj101 John Bass 6.0 mcatb101 Bub McAtee 3.0 kingm101 Marshall King 8.0 hodec101 Charlie Hodes 2.0 woodj106 Jimmy Wood 4.0 simmj101 Joe Simmons 9.0 folet101 Tom Foley 7.0 duffe101 Ed Duffy 6.0 pinke101 Ed Pinkham 5.0 zettg101 George Zettlein 1.0 NaN Y
4 18710509 0 Tue BS1 na 2 TRO na 1 9 5 54.0 D NaN NaN NaN TRO01 3250.0 145.0 000002232 101003000 46.0 17.0 4.0 1.0 0.0 6.0 0.0 0.0 0.0 2.0 -1.0 0.0 1.0 -1.0 -1.0 -1.0 12.0 1.0 2.0 2.0 0.0 0.0 27.0 12.0 5.0 0.0 1.0 0.0 36.0 9.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 3.0 -1.0 0.0 2.0 -1.0 -1.0 -1.0 7.0 1.0 3.0 3.0 1.0 0.0 27.0 11.0 7.0 3.0 0.0 0.0 leroi901 Isaac Leroy NaN (none) NaN (none) NaN (none) NaN (none) NaN (none) wrigh101 Harry Wright pikel101 Lip Pike spala101 Al Spalding mcmuj101 John McMullin NaN (none) NaN (none) spala101 Al Spalding mcmuj101 John McMullin wrigg101 George Wright 6.0 barnr102 Ross Barnes 4.0 birdd102 Dave Birdsall 9.0 mcvec101 Cal McVey 2.0 wrigh101 Harry Wright 8.0 goulc101 Charlie Gould 3.0 schah101 Harry Schafer 5.0 conef101 Fred Cone 7.0 spala101 Al Spalding 1.0 flync101 Clipper Flynn 9.0 mcgem101 Mike McGeary 2.0 yorkt101 Tom York 8.0 mcmuj101 John McMullin 1.0 kings101 Steve King 7.0 beave101 Edward Beavens 4.0 bells101 Steve Bellan 5.0 pikel101 Lip Pike 3.0 cravb101 Bill Craver 6.0 HTBF Y

We've summarized some of the important columns below, however if you'd like to see a guide to all columns we have created a data dictionary for the whole data set:

  • date — Date of the game.
  • v_name — Visiting team name.
  • v_league — Visiting team league.
  • h_name — Home team name.
  • h_league — Home team league.
  • v_score — Visiting team score.
  • h_score — Home team score.
  • v_line_score — Visiting team line score, eg 010000(10)00.
  • h_line_score — Home team line score, eg 010000(10)0X.
  • park_id — ID of the park where the game was held.
  • attendance— Game attendance.

We can use the DataFrame.info() method to give us some high level information about our dataframe, including its size, information about data types and memory usage.

By default, pandas approximates of the memory usage of the dataframe to save time. Because we're interested in accuracy, we'll set the memory_usage parameter to 'deep' to get an accurate number.

gl.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>RangeIndex: 171907 entries, 0 to 171906
Columns: 161 entries, date to acquisition_infodtypes: float64(77), int64(6), object(78)
memory usage: 861.6 MB

We can see that we have 171,907 rows and 161 columns. Pandas has automatically detected types for us, with 83 numeric columns and 78 object columns. Object columns are used for strings or where a column contains mixed data types.

So we can get a better understanding of where we can reduce this memory usage, let's take a look into how Python and pandas store data in memory.

The Internal Representation of a Dataframe

Under the hood, pandas groups the columns into blocks of values of the same type. Here's a preview of how pandas stores the first twelve columns of our dataframe.

You'll notice that the blocks don't maintain references to the column names. This is because blocks are optimized for storing the actual values in the dataframe. The BlockManager class is responsible for maintaining the mapping between the row and column indexes and the actual blocks. It acts as an API that provides access to the underlying data. Whenever we select, edit, or delete values, the dataframe class interfaces with the BlockManager class to translate our requests to function and method calls.

Each type has a specialized class in the pandas.core.internals module. Pandas uses the ObjectBlock class to represent the block containing string columns, and the FloatBlock class to represent the block containing float columns. For blocks representing numeric values like integers and floats, pandas combines the columns and stores them as a NumPy ndarray. The NumPy ndarray is built around a C array, and the values are stored in a contiguous block of memory. Due to this storage scheme, accessing a slice of values is incredibly fast.

Because each data type is stored separately, we're going to examine the memory usage by data type. Let's start by looking at the average memory usage for data type.


for dtype in ['float','int','object']:
    selected_dtype = gl.select_dtypes(include=[dtype])
    mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
    mean_usage_mb = mean_usage_b / 1024 ** 2
    print("Average memory usage for {} columns: {:03.2f} MB".format(dtype,mean_usage_mb))
Average memory usage for float columns: 1.29 MB
Average memory usage for int columns: 1.12 MB
Average memory usage for object columns: 9.53 MB

Immediately we can see that most of our memory is used by our 78 object columns. We'll look at those later, but first lets see if we can improve on the memory usage for our numeric columns.

Understanding Subtypes

As we mentioned briefly before, under the hood pandas represents numeric values as NumPy ndarrays and stores them in a continuous block of memory. This storage model consumes less space and allows us to access the values themselves quickly. Because pandas represents each value of the same type using the same number of bytes, and a NumPy ndarray stores the number of values, pandas can return the number of bytes a numeric column consumes quickly and accurately.

Many types in pandas have multiple subtypes that can use fewer bytes to represent each value. For example, the float type has the float16, float32, and float64 subtypes. The number portion of a type's name indicates the number of bits that type uses to represent values. For example, the subtypes we just listed use 2, 4, 8 and 16 bytes, respectively. The following table shows the subtypes for the most common pandas types:

memory usage float int uint datetime bool object
1 bytes int8 uint8 bool
2 bytes float16 int16 uint16
4 bytes float32 int32 uint32
8 bytes float64 int64 uint64 datetime64
variable object

An int8 value uses 1 byte (or 8 bits) to store a value, and can represent 256 values (2^8) in binary. This means that we can use this subtype to represent values ranging from -128 to 127 (including 0).

We can use the numpy.iinfo class to verify the minimum and maximum values for each integer subtype. Let's look at an example:


import numpy as np
int_types = ["uint8", "int8", "int16"]
for it in int_types:
    print(np.iinfo(it))

Machine parameters for uint8----------------------------------------
min = 0
max = 255
Machine parameters for int8-----------------------------------------
min = -128
max = 127
Machine parameters for int16----------------------------------------
min = -32768
max = 32767

We can see here the difference between uint (unsigned integers) and int (signed integers). Both types have the same capacity for storage, but by only storing positive values, unsigned integers allow us to be more efficient with our storage of columns that only contain positive values.

Optimizing Numeric Columns with Subtypes

We can use the function pd.to_numeric() to downcast our numeric types. We'll use DataFrame.select_dtypes to select only the integer columns, then we'll optimize the types and compare the memory usage.


# We're going to be calculating memory usage a lot,
# so we'll create a function to save us some time!
def mem_usage(pandas_obj):
    if isinstance(pandas_obj,pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else: # we assume if not a df it's a series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2 # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)
gl_int = gl.select_dtypes(include=['int'])
converted_int = gl_int.apply(pd.to_numeric,downcast='unsigned')
print(mem_usage(gl_int))
print(mem_usage(converted_int))
compare_ints = pd.concat([gl_int.dtypes,converted_int.dtypes],axis=1)
compare_ints.columns = ['before','after']
compare_ints.apply(pd.Series.value_counts)
7.87 MB
1.48 MB
before after
uint8 NaN 5.0
uint32 NaN 1.0
int64 6.0 NaN

We can see a drop from 7.9 to 1.5 megabytes in memory usage, which is a more than 80% reduction. The overall impact on our original dataframe isn't massive though, because there are so few integer columns.

Lets do the same thing with our float columns.


gl_float = gl.select_dtypes(include=['float'])
converted_float = gl_float.apply(pd.to_numeric,downcast='float')
print(mem_usage(gl_float))
print(mem_usage(converted_float))
compare_floats = pd.concat([gl_float.dtypes,converted_float.dtypes],axis=1)
compare_floats.columns = ['before','after']
compare_floats.apply(pd.Series.value_counts)
100.99 MB
50.49 MB
before after
float32 NaN 77.0
float64 77.0 NaN

We can see that all our float columns were converted from float64 to float32, giving us a 50% reduction in memory usage.

Let's create a copy of our original dataframe, assign these optimized numeric columns in place of the originals, and see what our overall memory usage is now.


optimized_gl = gl.copy()
optimized_gl[converted_int.columns] = converted_int
optimized_gl[converted_float.columns] = converted_float
print(mem_usage(gl))
print(mem_usage(optimized_gl))
861.57 MB
804.69 MB

While we've dramatically reduced the memory usage of our numeric columns, overall we've only reduced the memory usage of our dataframe by 7%. Most of our gains are going to come from optimizing the object types.

Before we do, let's take a closer look at how strings are stored in pandas compared to the numeric types

Comparing Numeric to String storage

The object type represents values using Python string objects, partly due to the lack of support for missing string values in NumPy. Because Python is a high-level, interpreted language, it doesn't have fine grained-control over how values in memory are stored.

This limitation causes strings to be stored in a fragmented way that consumes more memory and is slower to access. Each element in an object column is really a pointer that contains the "address" for the actual value's location in memory.

Below is a diagram showing how numeric data is stored in NumPy data types vs how strings are stored using Python's inbuilt types.

Diagram adapted from the excellent post Why Python Is Slow.

You may have noticed our chart earlier described object types as using a variable amount of memory. While each pointer takes up 1 byte of memory, each actual string value uses the same amount of memory that string would use if stored individually in Python. Let's use sys.getsizeof() to prove that out, first by looking at individual strings, and then items in a pandas series.


from sys import getsizeof
s1 = 'working out'
s2 = 'memory usage for'
s3 = 'strings in python is fun!'
s4 = 'strings in python is fun!'
for s in [s1, s2, s3, s4]:
    print(getsizeof(s))

60
65
74
74
obj_series = pd.Series(['working out',
    'memory usage for',
    'strings in python is fun!',
    'strings in python is fun!'])
obj_series.apply(getsizeof)

0 60
1 65
2 74
3 74
dtype: int64

You can see that the size of strings when stored in a pandas series are identical to their usage as separate strings in Python.

Optimizing object types using categoricals

Pandas introduced Categoricals in version 0.15. The category type uses integer values under the hood to represent the values in a column, rather than the raw values. Pandas uses a separate mapping dictionary that maps the integer values to the raw ones. This arrangement is useful whenever a column contains a limited set of values. When we convert a column to the category dtype, pandas uses the most space efficient int subtype that can represent all of the unique values in a column.

To get an overview of where we might be able to use this type to reduce memory, let's take a look at the number of unique values of each of our object types.


gl_obj = gl.select_dtypes(include=['object']).copy()
gl_obj.describe()
day_of_week v_name v_league h_name h_league day_night completion forefeit protest park_id v_line_score h_line_score hp_umpire_id hp_umpire_name 1b_umpire_id 1b_umpire_name 2b_umpire_id 2b_umpire_name 3b_umpire_id 3b_umpire_name lf_umpire_id lf_umpire_name rf_umpire_id rf_umpire_name v_manager_id v_manager_name h_manager_id h_manager_name winning_pitcher_id winning_pitcher_name losing_pitcher_id losing_pitcher_name saving_pitcher_id saving_pitcher_name winning_rbi_batter_id winning_rbi_batter_id_name v_starting_pitcher_id v_starting_pitcher_name h_starting_pitcher_id h_starting_pitcher_name v_player_1_id v_player_1_name v_player_2_id v_player_2_name v_player_3_id v_player_3_name v_player_4_id v_player_4_name v_player_5_id v_player_5_name v_player_6_id v_player_6_name v_player_7_id v_player_7_name v_player_8_id v_player_8_name v_player_9_id v_player_9_name h_player_1_id h_player_1_name h_player_2_id h_player_2_name h_player_3_id h_player_3_name h_player_4_id h_player_4_name h_player_5_id h_player_5_name h_player_6_id h_player_6_name h_player_7_id h_player_7_name h_player_8_id h_player_8_name h_player_9_id h_player_9_name additional_info acquisition_info
count 171907 171907 171907 171907 171907 140150 116 145 180 171907 147271 147271 171888 171891 147040 171891 88540 171127 116723 171135 203 171902 9 171902 171907 171907 171907 171907 140229 140229 140229 140229 48018 140838 105699 140838 171863 171863 171863 171863 140838 140838 140838 140838 140838 140838 140838 140838 140838 140838 140838 140838 140838 140838 140838 140838 140835 140835 140838 140838 140838 140838 140838 140838 140838 140838 140838 140838 140838 140838 140838 140838 140838 140838 140838 140838 1456 140841
unique 7 148 7 148 7 2 116 3 5 245 36367 37859 1149 1146 678 678 324 325 362 363 31 32 8 9 648 648 659 659 5123 5084 5653 5606 3133 3117 5739 5674 5193 5129 5170 5125 2870 2847 3709 3673 2989 2964 2581 2563 3757 3722 4794 4736 5301 5241 4812 4763 5643 5585 2802 2782 3648 3614 2881 2858 2533 2517 3696 3660 4774 4720 5253 5197 4760 4710 5193 5142 332 1
top Sat CHN NL CHN NL D 19510725,,6,6,46 H V STL07 000000000 000000000 klemb901 Bill Klem connt901 (none) westj901 (none) mcgob901 (none) sudoe901 (none) gormt101 (none) mackc101 Connie Mack mackc101 Connie Mack johnw102 Walter Johnson rixee101 Dutch Leonard rivem002 (none) pujoa001 (none) younc102 Young younc102 Young suzui001 Ichiro Suzuki fox-n101 Nellie Fox speat101 Tris Speaker bottj101 Jim Bottomley heilh101 Harry Heilmann grimc101 Charlie Grimm grimc101 Charlie Grimm lopea102 Al Lopez grifa001 Alfredo Griffin suzui001 Ichiro Suzuki fox-n101 Nellie Fox speat101 Tris Speaker gehrl101 Lou Gehrig heilh101 Harry Heilmann grimc101 Charlie Grimm grimc101 Charlie Grimm lopea102 Al Lopez spahw101 Warren Spahn HTBF Y
freq 28891 8870 88866 9024 88867 82724 1 69 90 7022 10102 8028 3545 3545 2029 24851 815 82587 1129 54412 30 171699 2 171893 3901 3901 3848 3848 385 385 251 295 523 92820 288 35139 403 441 412 451 893 893 852 852 1224 1224 816 816 663 663 465 465 485 485 687 687 333 333 927 927 859 859 1165 1165 752 752 612 612 427 427 491 491 676 676 339 339 1112 140841

A quick glance reveals many columns where there are few unique values relative to the overall ~172,000 games in our data set.

Before we dive too far in, we'll start by selecting just one of our object columns, and looking at what happens behind the scenes when we convert it to the categorical type. We'll use the second column of our data set, day_of_week.

Looking at the table above. we can see that it only contains seven unique values. We'll convert it to categorical by using the .astype() method.


dow = gl_obj.day_of_week
print(dow.head())
dow_cat = dow.astype('category')
print(dow_cat.head())

0 Thu
1 Fri
2 Sat
3 Mon
4 Tue
Name: day_of_week, dtype: object
0 Thu
1 Fri
2 Sat
3 Mon
4 Tue
Name: day_of_week, dtype: category
Categories (7, object): [Fri, Mon, Sat, Sun, Thu, Tue, Wed]

As you can see, apart from the fact that the type of the column has changed, the data looks exactly the same. Let's take a look under the hood at what's happening.

In the following code, we use the Series.cat.codes attribute to return the integer values the category type uses to represent each value.

dow_cat.head().cat.codes

0 4
1 0
2 2
3 1
4 5
dtype: int8

You can see that each unique value has been assigned an integer, and that the underlying datatype for the column is now int8. This column doesn't have any missing values, but if it did, the category subtype handles missing values by setting them to -1.

Lastly, let's look at the memory usage for this column before and after converting to the category type.


print(mem_usage(dow))
print(mem_usage(dow_cat))
9.84 MB
0.16 MB

We've gone from 9.8MB of memory usage to 0.16MB of memory usage, or a 98% reduction! Note that this particular column probably represents one of our best-case scenarios - a column with ~172,000 items of which there only 7 unique values.

While converting all of the columns to this type sounds appealing, it's important to be aware of the trade-offs. The biggest one is the inability to perform numerical computations. We can't do arithmetic with category columns or use methods like Series.min() and Series.max() without converting to a true numeric dtype first.

We should stick to using the category type primarily for object columns where less than 50% of the values are unique. If all of the values in a column are unique, the category type will end up using more memory. That's because the column is storing all of the raw string values in addition to the integer category codes. You can read more about the limitations of the category type in the pandas documentation.

We'll write a loop to iterate over each object column, check if the number of unique values is less than 50%, and if so, convert it to the category type.


converted_obj = pd.DataFrame()
for col in gl_obj.columns:
    num_unique_values = len(gl_obj[col].unique())
    num_total_values = len(gl_obj[col])
    if num_unique_values / num_total_values < 0.5:
        converted_obj.loc[:,col] = gl_obj[col].astype('category')
    else:
        converted_obj.loc[:,col] = gl_obj[col]

As before,


print(mem_usage(gl_obj))
print(mem_usage(converted_obj))
compare_obj = pd.concat([gl_obj.dtypes,converted_obj.dtypes],axis=1)
compare_obj.columns = ['before','after']
compare_obj.apply(pd.Series.value_counts)
752.72 MB
51.67 MB
before after
object 78.0 NaN
category NaN 78.0

In this case, all our object columns were converted to the category type, however this won't be the case with all data sets, so you should be sure to use the process above to check.

What's more, our memory usage for our object columns has gone from 752MB to 52MB, or a reduction of 93%. Let's combine this with the rest of our dataframe and see where we sit in relation to the 861MB memory usage we started with.


optimized_gl[converted_obj.columns] = converted_obj
mem_usage(optimized_gl)
'103.64 MB'

Wow, we've really made some progress! We have one more optimization we can make - if you remember back to our table of types, there was a datetime type that we can use for the first column of our data set.


date = optimized_gl.date
print(mem_usage(date))
date.head()
0.66 MB
0 18710504
1 18710505
2 18710506
3 18710508
4 18710509
Name: date, dtype: uint32

You may remember that this was read in as an integer type and already optimized to unint32. Because of this, converting it to datetime will actually double it's memory usage, as the datetime type is a 64 bit type. There's value in converting it to datetime anyway since it will allow us to more easily do time series analysis.

We'll convert using pandas.to_datetime() function, using the format parameter to tell it that our date data is stored YYYY-MM-DD.


optimized_gl['date'] = pd.to_datetime(date,format=
print(mem_usage(optimized_gl))
optimized_gl.date.head()
104.29 MB

0 1871-05-04
1 1871-05-05
2 1871-05-06
3 1871-05-08
4 1871-05-09
Name: date, dtype: datetime64[ns]

Selecting Types While Reading the Data In

So far, we've explored ways to reduce the memory footprint of an existing dataframe. By reading the dataframe in first and then iterating on ways to save memory, we were able to understand the amount of memory we can expect to save from each optimization better. As we mentioned earlier in the lesson, however, we often won't have enough memory to represent all the values in a data set. How can we apply memory-saving techniques when we can't even create the dataframe in the first place?

Fortunately, we can specify the optimal column types when we read the data set in. The pandas.read_csv() function has a few different parameters that allow us to do this. The dtype parameter accepts a dictionary that has (string) column names as the keys and NumPy type objects as the values.

First, we'll store the final types of every column in a dictionary with keys for column names, first removing the date column since that needs to be treated separately.


dtypes = optimized_gl.drop('date',axis=1).dtypes
dtypes_col = dtypes.index
dtypes_type = [i.name for i in dtypes.values]
column_types = dict(zip(dtypes_col, dtypes_type))
# rather than print all 161 items, we'll
# sample 10 key/value pairs from the dict
# and print it nicely using prettyprint
preview = first2pairs = {key:value for key,value in list(column_types.items())[:10]}
import pprintpp
pp = pp = pprint.PrettyPrinter(indent=4)
pp.pprint(preview)

{
'acquisition_info': 'category',
'h_caught_stealing': 'float32',
'h_player_1_name': 'category',
'h_player_9_name': 'category',
'v_assists': 'float32',
'v_first_catcher_interference': 'float32',
'v_grounded_into_double': 'float32',
'v_player_1_id': 'category',
'v_player_3_id': 'category',
'v_player_5_id': 'category'
}

Now we can use the dictionary, along with a few parameters for the date to read in the data with the correct types in a few lines:


read_and_optimized = pd.read_csv('game_logs.csv',dtype=column_types,parse_dates=['date'],infer_datetime_format=True)
print(mem_usage(read_and_optimized))
read_and_optimized.head()
104.28 MB
date number_of_game day_of_week v_name v_league v_game_number h_name h_league h_game_number v_score h_score length_outs day_night completion forefeit protest park_id attendance length_minutes v_line_score h_line_score v_at_bats v_hits v_doubles v_triples v_homeruns v_rbi v_sacrifice_hits v_sacrifice_flies v_hit_by_pitch v_walks v_intentional walks v_strikeouts v_stolen_bases v_caught_stealing v_grounded_into_double v_first_catcher_interference v_left_on_base v_pitchers_used v_individual_earned_runs v_team_earned_runs v_wild_pitches v_balks v_putouts v_assists v_errors v_passed_balls v_double_plays v_triple_plays h_at_bats h_hits h_doubles h_triples h_homeruns h_rbi h_sacrifice_hits h_sacrifice_flies h_hit_by_pitch h_walks h_intentional walks h_strikeouts h_stolen_bases h_caught_stealing h_grounded_into_double h_first_catcher_interference h_left_on_base h_pitchers_used h_individual_earned_runs h_team_earned_runs h_wild_pitches h_balks h_putouts h_assists h_errors h_passed_balls h_double_plays h_triple_plays hp_umpire_id hp_umpire_name 1b_umpire_id 1b_umpire_name 2b_umpire_id 2b_umpire_name 3b_umpire_id 3b_umpire_name lf_umpire_id lf_umpire_name rf_umpire_id rf_umpire_name v_manager_id v_manager_name h_manager_id h_manager_name winning_pitcher_id winning_pitcher_name losing_pitcher_id losing_pitcher_name saving_pitcher_id saving_pitcher_name winning_rbi_batter_id winning_rbi_batter_id_name v_starting_pitcher_id v_starting_pitcher_name h_starting_pitcher_id h_starting_pitcher_name v_player_1_id v_player_1_name v_player_1_def_pos v_player_2_id v_player_2_name v_player_2_def_pos v_player_3_id v_player_3_name v_player_3_def_pos v_player_4_id v_player_4_name v_player_4_def_pos v_player_5_id v_player_5_name v_player_5_def_pos v_player_6_id v_player_6_name v_player_6_def_pos v_player_7_id v_player_7_name v_player_7_def_pos v_player_8_id v_player_8_name v_player_8_def_pos v_player_9_id v_player_9_name v_player_9_def_pos h_player_1_id h_player_1_name h_player_1_def_pos h_player_2_id h_player_2_name h_player_2_def_pos h_player_3_id h_player_3_name h_player_3_def_pos h_player_4_id h_player_4_name h_player_4_def_pos h_player_5_id h_player_5_name h_player_5_def_pos h_player_6_id h_player_6_name h_player_6_def_pos h_player_7_id h_player_7_name h_player_7_def_pos h_player_8_id h_player_8_name h_player_8_def_pos h_player_9_id h_player_9_name h_player_9_def_pos additional_info acquisition_info
0 1871-05-04 0 Thu CL1 na 1 FW1 na 1 0 2 54.0 D NaN NaN NaN FOR01 200.0 120.0 000000000 010010000 30.0 4.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 -1.0 6.0 1.0 -1.0 -1.0 -1.0 4.0 1.0 1.0 1.0 0.0 0.0 27.0 9.0 0.0 3.0 0.0 0.0 31.0 4.0 1.0 0.0 0.0 2.0 0.0 0.0 0.0 1.0 -1.0 0.0 0.0 -1.0 -1.0 -1.0 3.0 1.0 0.0 0.0 0.0 0.0 27.0 3.0 3.0 1.0 1.0 0.0 boakj901 John Boake NaN (none) NaN (none) NaN (none) NaN (none) NaN (none) paboc101 Charlie Pabor lennb101 Bill Lennon mathb101 Bobby Mathews prata101 Al Pratt NaN (none) NaN (none) prata101 Al Pratt mathb101 Bobby Mathews whitd102 Deacon White 2.0 kimbg101 Gene Kimball 4.0 paboc101 Charlie Pabor 7.0 allia101 Art Allison 8.0 white104 Elmer White 9.0 prata101 Al Pratt 1.0 sutte101 Ezra Sutton 5.0 carlj102 Jim Carleton 3.0 bassj101 John Bass 6.0 selmf101 Frank Sellman 5.0 mathb101 Bobby Mathews 1.0 foraj101 Jim Foran 3.0 goldw101 Wally Goldsmith 6.0 lennb101 Bill Lennon 2.0 caret101 Tom Carey 4.0 mince101 Ed Mincher 7.0 mcdej101 James McDermott 8.0 kellb105 Bill Kelly 9.0 NaN Y
1 1871-05-05 0 Fri BS1 na 1 WS3 na 1 20 18 54.0 D NaN NaN NaN WAS01 5000.0 145.0 107000435 640113030 41.0 13.0 1.0 2.0 0.0 13.0 0.0 0.0 0.0 18.0 -1.0 5.0 3.0 -1.0 -1.0 -1.0 12.0 1.0 6.0 6.0 1.0 0.0 27.0 13.0 10.0 1.0 2.0 0.0 49.0 14.0 2.0 0.0 0.0 11.0 0.0 0.0 0.0 10.0 -1.0 2.0 1.0 -1.0 -1.0 -1.0 14.0 1.0 7.0 7.0 0.0 0.0 27.0 20.0 10.0 2.0 3.0 0.0 dobsh901 Henry Dobson NaN (none) NaN (none) NaN (none) NaN (none) NaN (none) wrigh101 Harry Wright younn801 Nick Young spala101 Al Spalding braia102 Asa Brainard NaN (none) NaN (none) spala101 Al Spalding braia102 Asa Brainard wrigg101 George Wright 6.0 barnr102 Ross Barnes 4.0 birdd102 Dave Birdsall 9.0 mcvec101 Cal McVey 2.0 wrigh101 Harry Wright 8.0 goulc101 Charlie Gould 3.0 schah101 Harry Schafer 5.0 conef101 Fred Cone 7.0 spala101 Al Spalding 1.0 watef102 Fred Waterman 5.0 forcd101 Davy Force 6.0 mille105 Everett Mills 3.0 allid101 Doug Allison 2.0 hallg101 George Hall 7.0 leona101 Andy Leonard 4.0 braia102 Asa Brainard 1.0 burrh101 Henry Burroughs 9.0 berth101 Henry Berthrong 8.0 HTBF Y
2 1871-05-06 0 Sat CL1 na 2 RC1 na 1 12 4 54.0 D NaN NaN NaN RCK01 1000.0 140.0 610020003 010020100 49.0 11.0 1.0 1.0 0.0 8.0 0.0 0.0 0.0 0.0 -1.0 1.0 0.0 -1.0 -1.0 -1.0 10.0 1.0 0.0 0.0 2.0 0.0 27.0 12.0 8.0 5.0 0.0 0.0 36.0 7.0 2.0 1.0 0.0 2.0 0.0 0.0 0.0 0.0 -1.0 3.0 5.0 -1.0 -1.0 -1.0 5.0 1.0 3.0 3.0 1.0 0.0 27.0 12.0 13.0 3.0 0.0 0.0 mawnj901 J.H. Manny NaN (none) NaN (none) NaN (none) NaN (none) NaN (none) paboc101 Charlie Pabor hasts101 Scott Hastings prata101 Al Pratt fishc102 Cherokee Fisher NaN (none) NaN (none) prata101 Al Pratt fishc102 Cherokee Fisher whitd102 Deacon White 2.0 kimbg101 Gene Kimball 4.0 paboc101 Charlie Pabor 7.0 allia101 Art Allison 8.0 white104 Elmer White 9.0 prata101 Al Pratt 1.0 sutte101 Ezra Sutton 5.0 carlj102 Jim Carleton 3.0 bassj101 John Bass 6.0 mackd101 Denny Mack 3.0 addyb101 Bob Addy 4.0 fishc102 Cherokee Fisher 1.0 hasts101 Scott Hastings 8.0 ham-r101 Ralph Ham 5.0 ansoc101 Cap Anson 2.0 sagep101 Pony Sager 6.0 birdg101 George Bird 7.0 stirg101 Gat Stires 9.0 NaN Y
3 1871-05-08 0 Mon CL1 na 3 CH1 na 1 12 14 54.0 D NaN NaN NaN CHI01 5000.0 150.0 101403111 077000000 46.0 15.0 2.0 1.0 2.0 10.0 0.0 0.0 0.0 0.0 -1.0 1.0 0.0 -1.0 -1.0 -1.0 7.0 1.0 6.0 6.0 0.0 0.0 27.0 15.0 11.0 6.0 0.0 0.0 43.0 11.0 2.0 0.0 0.0 8.0 0.0 0.0 0.0 4.0 -1.0 2.0 1.0 -1.0 -1.0 -1.0 6.0 1.0 4.0 4.0 0.0 0.0 27.0 14.0 7.0 2.0 0.0 0.0 willg901 Gardner Willard NaN (none) NaN (none) NaN (none) NaN (none) NaN (none) paboc101 Charlie Pabor woodj106 Jimmy Wood zettg101 George Zettlein prata101 Al Pratt NaN (none) NaN (none) prata101 Al Pratt zettg101 George Zettlein whitd102 Deacon White 2.0 kimbg101 Gene Kimball 4.0 paboc101 Charlie Pabor 7.0 allia101 Art Allison 8.0 white104 Elmer White 9.0 prata101 Al Pratt 1.0 sutte101 Ezra Sutton 5.0 carlj102 Jim Carleton 3.0 bassj101 John Bass 6.0 mcatb101 Bub McAtee 3.0 kingm101 Marshall King 8.0 hodec101 Charlie Hodes 2.0 woodj106 Jimmy Wood 4.0 simmj101 Joe Simmons 9.0 folet101 Tom Foley 7.0 duffe101 Ed Duffy 6.0 pinke101 Ed Pinkham 5.0 zettg101 George Zettlein 1.0 NaN Y
4 1871-05-09 0 Tue BS1 na 2 TRO na 1 9 5 54.0 D NaN NaN NaN TRO01 3250.0 145.0 000002232 101003000 46.0 17.0 4.0 1.0 0.0 6.0 0.0 0.0 0.0 2.0 -1.0 0.0 1.0 -1.0 -1.0 -1.0 12.0 1.0 2.0 2.0 0.0 0.0 27.0 12.0 5.0 0.0 1.0 0.0 36.0 9.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 3.0 -1.0 0.0 2.0 -1.0 -1.0 -1.0 7.0 1.0 3.0 3.0 1.0 0.0 27.0 11.0 7.0 3.0 0.0 0.0 leroi901 Isaac Leroy NaN (none) NaN (none) NaN (none) NaN (none) NaN (none) wrigh101 Harry Wright pikel101 Lip Pike spala101 Al Spalding mcmuj101 John McMullin NaN (none) NaN (none) spala101 Al Spalding mcmuj101 John McMullin wrigg101 George Wright 6.0 barnr102 Ross Barnes 4.0 birdd102 Dave Birdsall 9.0 mcvec101 Cal McVey 2.0 wrigh101 Harry Wright 8.0 goulc101 Charlie Gould 3.0 schah101 Harry Schafer 5.0 conef101 Fred Cone 7.0 spala101 Al Spalding 1.0 flync101 Clipper Flynn 9.0 mcgem101 Mike McGeary 2.0 yorkt101 Tom York 8.0 mcmuj101 John McMullin 1.0 kings101 Steve King 7.0 beave101 Edward Beavens 4.0 bells101 Steve Bellan 5.0 pikel101 Lip Pike 3.0 cravb101 Bill Craver 6.0 HTBF Y

By optimizing the columns, we've managed to reduce the memory usage in pandas from 861.6 MB to 104.28 MB - an impressive 88% reduction!

Analyzing baseball games

Now that we've optimized our data, we can perform some analysis. Let's start by looking at the distribution of game days.


optimized_gl['year'] = optimized_gl.date.dt.year
games_per_day = optimized_gl.pivot_table(index='year',columns='day_of_week',values='date',aggfunc=len)
games_per_day = games_per_day.divide(games_per_day.sum(axis=1),axis=0)
ax = games_per_day.plot(kind='area',stacked='true')
ax.legend(loc='upper right')
ax.set_ylim(0,1)
plt.show()

We can see that before the 1920s, Sunday baseball games were rare on Sundays before coming gradually more popular through the latter half of last century.

We can also see pretty clearly that the distribution of game days has been relatively static for the last 50 years.

Let's also look at how game length has varied over the years.


game_lengths = optimized_gl.pivot_table(index='year', values='length_minutes')
game_lengths.reset_index().plot.scatter('year','length_minutes')
plt.show()

Looks like baseball games have continued to get longer from the 1940s onwards.

Summary and Next Steps

We've learned how pandas stores data using different types, and then we used that knowledge to reduce the memory usage of our pandas dataframe by almost 90%, just by using a few simple techniques:

  • Downcasting numeric columns to more efficient types.
  • Converting string columns to the categorical type.

If you'd like to dive into working with larger data in pandas some more, a lot of the content in this blog post is available in our interactive course Processing Large Datasets In Pandas Course, which you can start for free.

Celeste Grupman

About the author

Celeste Grupman

Celeste Grupman is the CEO at Dataquest She is passionate about creating affordable access to high-quality skills training for students across the globe.