Tutorial: Using Pandas with Large Data Sets in Python

Did you know Python and pandas can reduce your memory usage by up to 90% when you’re working with big data sets?

When working in Python using pandas with small data (under 100 megabytes), performance is rarely a problem. When we move to larger data (100 megabytes to multiple gigabytes), performance issues can make run times much longer, and cause code to fail entirely due to insufficient memory.

While tools like Spark can handle large data sets (100 gigabytes to multiple terabytes), taking full advantage of their capabilities usually requires more expensive hardware. And unlike pandas, they lack rich feature sets for high quality data cleaning, exploration, and analysis. For medium-sized data, we’re better off trying to get more out of pandas, rather than switching to a different tool.

In this post, we’ll learn about Python’s memory usage with pandas, and how to reduce a dataframe’s memory footprint by almost 90% simply by selecting the appropriate data types for columns.

Working with baseball game logs

We’ll be working with data from 130 years of major league baseball games, originally sourced from Retrosheet.

The data was originally in 127 separate CSV files; however, we’ve used csvkit to merge the files and added column names in the first row. If you’d like to download our version of the data to follow along with this post, we have made it available here.

Let’s start by importing both pandas and our data in Python and taking a look at the first five rows.


import pandas as pd
gl = pd.read_csv('game_logs.csv')
gl.head()
(output: the first five rows of the dataframe — games from May 1871 — across all 161 columns, from date through acquisition_info; too wide to render here)

We’ve summarized some of the important columns below; if you’d like a guide to all of the columns, we have created a data dictionary for the whole data set:

  • date — Date of the game.
  • v_name — Visiting team name.
  • v_league — Visiting team league.
  • h_name — Home team name.
  • h_league — Home team league.
  • v_score — Visiting team score.
  • h_score — Home team score.
  • v_line_score — Visiting team line score, e.g. 010000(10)00.
  • h_line_score — Home team line score, e.g. 010000(10)0X.
  • park_id — ID of the park where the game was held.
  • attendance — Game attendance.

We can use the DataFrame.info() method to give us some high level information about our dataframe, including its size, information about data types and memory usage.

By default, pandas approximates the memory usage of the dataframe to save time. Because we’re interested in accuracy, we’ll set the memory_usage parameter to 'deep' to get an exact number.
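To see what the deep measurement changes, here’s a minimal sketch on a hypothetical toy frame (the column contents are made up for illustration). The default report counts only the 8-byte pointers in object columns; deep=True also counts the string objects they point to.

```python
import pandas as pd

# A toy frame with one numeric and one string column (illustrative only).
df = pd.DataFrame({'a': range(1000), 'b': ['some string'] * 1000})

# Shallow usage counts just the pointers for column 'b';
# deep usage adds the size of every string object as well.
shallow = df.memory_usage().sum()
deep = df.memory_usage(deep=True).sum()
print(shallow, deep)  # deep is considerably larger
```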

gl.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171907 entries, 0 to 171906
Columns: 161 entries, date to acquisition_info
dtypes: float64(77), int64(6), object(78)
memory usage: 861.6 MB

We can see that we have 171,907 rows and 161 columns. Pandas has automatically detected types for us, with 83 numeric columns and 78 object columns. Object columns are used for strings or where a column contains mixed data types.

To get a better understanding of where we can reduce this memory usage, let’s take a look at how Python and pandas store data in memory.

The Internal Representation of a Dataframe

Under the hood, pandas groups the columns into blocks of values of the same type. Here’s a preview of how pandas stores the first twelve columns of our dataframe.

You’ll notice that the blocks don’t maintain references to the column names. This is because blocks are optimized for storing the actual values in the dataframe. The BlockManager class is responsible for maintaining the mapping between the row and column indexes and the actual blocks. It acts as an API that provides access to the underlying data. Whenever we select, edit, or delete values, the dataframe class interfaces with the BlockManager class to translate our requests to function and method calls.

Each type has a specialized class in the pandas.core.internals module. Pandas uses the ObjectBlock class to represent the block containing string columns, and the FloatBlock class to represent the block containing float columns. For blocks representing numeric values like integers and floats, pandas combines the columns and stores them as a NumPy ndarray. The NumPy ndarray is built around a C array, and the values are stored in a contiguous block of memory. Due to this storage scheme, accessing a slice of values is incredibly fast.

Because each data type is stored separately, we’re going to examine the memory usage by data type. Let’s start by looking at the average memory usage per data type.


for dtype in ['float','int','object']:
    selected_dtype = gl.select_dtypes(include=[dtype])
    mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
    mean_usage_mb = mean_usage_b / 1024 ** 2
    print("Average memory usage for {} columns: {:03.2f} MB".format(dtype,mean_usage_mb))
Average memory usage for float columns: 1.29 MB
Average memory usage for int columns: 1.12 MB
Average memory usage for object columns: 9.53 MB

Immediately we can see that most of our memory is used by our 78 object columns. We’ll look at those later, but first let’s see if we can improve on the memory usage for our numeric columns.

Understanding Subtypes

As we mentioned briefly before, under the hood pandas represents numeric values as NumPy ndarrays and stores them in a contiguous block of memory. This storage model consumes less space and allows us to access the values themselves quickly. Because pandas represents each value of the same type using the same number of bytes, and a NumPy ndarray stores the number of values, pandas can return the number of bytes a numeric column consumes quickly and accurately.
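A quick sketch of that fixed-size arithmetic, using a made-up series: with a fixed-width type, a column’s memory usage is simply count × bytes-per-value.

```python
import numpy as np
import pandas as pd

# 1,000 float64 values at 8 bytes each: excluding the index,
# the column's memory usage is exactly count * itemsize.
s = pd.Series(np.zeros(1000, dtype='float64'))
print(s.memory_usage(index=False, deep=True))  # 8000
```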

Many types in pandas have multiple subtypes that can use fewer bytes to represent each value. For example, the float type has the float16, float32, and float64 subtypes. The number portion of a type’s name indicates the number of bits that type uses to represent values; the subtypes we just listed use 2, 4, and 8 bytes, respectively. The following table shows the subtypes for the most common pandas types:

| memory usage | float   | int   | uint   | datetime   | bool | object |
|--------------|---------|-------|--------|------------|------|--------|
| 1 bytes      |         | int8  | uint8  |            | bool |        |
| 2 bytes      | float16 | int16 | uint16 |            |      |        |
| 4 bytes      | float32 | int32 | uint32 |            |      |        |
| 8 bytes      | float64 | int64 | uint64 | datetime64 |      |        |
| variable     |         |       |        |            |      | object |

An int8 value uses 1 byte (or 8 bits) to store a value, and can represent 256 values (2^8) in binary. This means that we can use this subtype to represent values ranging from -128 to 127 (including 0).
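One caveat worth demonstrating: casting values outside a subtype’s range doesn’t raise an error with NumPy’s default (unsafe) cast — the values silently wrap around. This is why we need to check a column’s minimum and maximum before downcasting. A small illustration with made-up values:

```python
import numpy as np

# 100 fits in int8's range (-128 to 127), but 200 does not;
# the unsafe cast wraps 200 around to 200 - 256 = -56.
arr = np.array([100, 200])
print(arr.astype('int8').tolist())  # [100, -56]
```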

We can use the numpy.iinfo class to verify the minimum and maximum values for each integer subtype. Let’s look at an example:


import numpy as np
int_types = ["uint8", "int8", "int16"]
for it in int_types:
    print(np.iinfo(it))

Machine parameters for uint8
----------------------------------------------------------------
min = 0
max = 255
----------------------------------------------------------------
Machine parameters for int8
----------------------------------------------------------------
min = -128
max = 127
----------------------------------------------------------------
Machine parameters for int16
----------------------------------------------------------------
min = -32768
max = 32767
----------------------------------------------------------------

We can see here the difference between uint (unsigned integers) and int (signed integers). Both types have the same capacity for storage, but by only storing positive values, unsigned integers allow us to be more efficient with our storage of columns that only contain positive values.
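As a quick illustration on a hypothetical series of small non-negative integers, pd.to_numeric() with the downcast parameter picks the smallest subtype that can hold every value — and the unsigned family can often go one size smaller:

```python
import pandas as pd

s = pd.Series([0, 140, 250])

# A signed downcast needs int16 (250 exceeds int8's max of 127),
# but since every value is non-negative, uint8 (max 255) suffices.
print(pd.to_numeric(s, downcast='integer').dtype)   # int16
print(pd.to_numeric(s, downcast='unsigned').dtype)  # uint8
```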

Optimizing Numeric Columns with Subtypes

We can use the function pd.to_numeric() to downcast our numeric types. We’ll use DataFrame.select_dtypes to select only the integer columns, then we’ll optimize the types and compare the memory usage.


# We're going to be calculating memory usage a lot,
# so we'll create a function to save us some time!
def mem_usage(pandas_obj):
    if isinstance(pandas_obj,pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else: # we assume if not a df it's a series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2 # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)
gl_int = gl.select_dtypes(include=['int'])
converted_int = gl_int.apply(pd.to_numeric,downcast='unsigned')
print(mem_usage(gl_int))
print(mem_usage(converted_int))
compare_ints = pd.concat([gl_int.dtypes,converted_int.dtypes],axis=1)
compare_ints.columns = ['before','after']
compare_ints.apply(pd.Series.value_counts)
7.87 MB
1.48 MB
        before  after
uint8      NaN    5.0
uint32     NaN    1.0
int64      6.0    NaN

We can see a drop from 7.9 to 1.5 megabytes in memory usage, which is a more than 80% reduction. The overall impact on our original dataframe isn’t massive though, because there are so few integer columns.

Let’s do the same thing with our float columns.


gl_float = gl.select_dtypes(include=['float'])
converted_float = gl_float.apply(pd.to_numeric,downcast='float')
print(mem_usage(gl_float))
print(mem_usage(converted_float))
compare_floats = pd.concat([gl_float.dtypes,converted_float.dtypes],axis=1)
compare_floats.columns = ['before','after']
compare_floats.apply(pd.Series.value_counts)
100.99 MB
50.49 MB
         before  after
float32     NaN   77.0
float64    77.0    NaN

We can see that all our float columns were converted from float64 to float32, giving us a 50% reduction in memory usage.

Let’s create a copy of our original dataframe, assign these optimized numeric columns in place of the originals, and see what our overall memory usage is now.


optimized_gl = gl.copy()
optimized_gl[converted_int.columns] = converted_int
optimized_gl[converted_float.columns] = converted_float
print(mem_usage(gl))
print(mem_usage(optimized_gl))
861.57 MB
804.69 MB

While we’ve dramatically reduced the memory usage of our numeric columns, overall we’ve only reduced the memory usage of our dataframe by 7%. Most of our gains are going to come from optimizing the object types.

Before we do, let’s take a closer look at how strings are stored in pandas compared to the numeric types.

Comparing Numeric to String Storage

The object type represents values using Python string objects, partly due to the lack of support for missing string values in NumPy. Because Python is a high-level, interpreted language, it doesn’t offer fine-grained control over how values are stored in memory.

This limitation causes strings to be stored in a fragmented way that consumes more memory and is slower to access. Each element in an object column is really a pointer that contains the “address” for the actual value’s location in memory.

Below is a diagram showing how numeric data is stored in NumPy data types vs how strings are stored using Python’s inbuilt types.

Diagram adapted from the excellent post Why Python Is Slow.

You may have noticed our chart earlier described object types as using a variable amount of memory. While each pointer takes up 8 bytes of memory (on a 64-bit system), each actual string value uses the same amount of memory that string would use if stored individually in Python. Let’s use sys.getsizeof() to prove that out, first by looking at individual strings, and then at items in a pandas series.


from sys import getsizeof
s1 = 'working out'
s2 = 'memory usage for'
s3 = 'strings in python is fun!'
s4 = 'strings in python is fun!'
for s in [s1, s2, s3, s4]:
    print(getsizeof(s))

60
65
74
74
obj_series = pd.Series(['working out',
    'memory usage for',
    'strings in python is fun!',
    'strings in python is fun!'])
obj_series.apply(getsizeof)

0 60
1 65
2 74
3 74
dtype: int64

You can see that the size of each string stored in a pandas series is identical to its size as a separate string in Python.
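We can also separate the two parts of an object column’s footprint — the fixed-size pointers and the string objects they point to — by comparing the shallow and deep measurements. A sketch, assuming a 64-bit system:

```python
import pandas as pd
from sys import getsizeof

obj_series = pd.Series(['working out',
                        'memory usage for',
                        'strings in python is fun!',
                        'strings in python is fun!'])

# Without deep=True, pandas counts only the 8-byte pointers;
# deep=True adds the size of each string object itself.
shallow = obj_series.memory_usage(index=False)
deep = obj_series.memory_usage(index=False, deep=True)
print(shallow)         # 32: four 8-byte pointers
print(deep - shallow)  # the combined size of the four string objects
```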

Optimizing object types using categoricals

Pandas introduced Categoricals in version 0.15. The category type uses integer values under the hood to represent the values in a column, rather than the raw values. Pandas uses a separate mapping dictionary that maps the integer values to the raw ones. This arrangement is useful whenever a column contains a limited set of values. When we convert a column to the category dtype, pandas uses the most space efficient int subtype that can represent all of the unique values in a column.

To get an overview of where we might be able to use this type to reduce memory, let’s take a look at the number of unique values of each of our object types.


gl_obj = gl.select_dtypes(include=['object']).copy()
gl_obj.describe()
(output: the describe() summary — count, unique, top, and freq — for all 78 object columns; too wide to render here. Notably, day_of_week has only 7 unique values.)

A quick glance reveals many columns where there are few unique values relative to the overall ~172,000 games in our data set.

Before we dive too far in, we’ll start by selecting just one of our object columns, and looking at what happens behind the scenes when we convert it to the categorical type. We’ll use the second column of our data set, day_of_week.

Looking at the table above, we can see that it only contains seven unique values. We’ll convert it to categorical by using the .astype() method.


dow = gl_obj.day_of_week
print(dow.head())
dow_cat = dow.astype('category')
print(dow_cat.head())

0 Thu
1 Fri
2 Sat
3 Mon
4 Tue
Name: day_of_week, dtype: object
0 Thu
1 Fri
2 Sat
3 Mon
4 Tue
Name: day_of_week, dtype: category
Categories (7, object): [Fri, Mon, Sat, Sun, Thu, Tue, Wed]

As you can see, apart from the fact that the type of the column has changed, the data looks exactly the same. Let’s take a look under the hood at what’s happening.

In the following code, we use the Series.cat.codes attribute to return the integer values the category type uses to represent each value.

dow_cat.head().cat.codes

0 4
1 0
2 2
3 1
4 5
dtype: int8

You can see that each unique value has been assigned an integer, and that the underlying datatype for the column is now int8. This column doesn’t have any missing values, but if it did, the category subtype handles missing values by setting them to -1.
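A minimal sketch of that sentinel behavior, using a made-up series containing a missing value:

```python
import pandas as pd

s = pd.Series(['Mon', 'Tue', None, 'Mon']).astype('category')

# The missing value isn't given a category of its own;
# it gets the sentinel code -1 instead.
print(s.cat.codes.tolist())       # [0, 1, -1, 0]
print(s.cat.categories.tolist())  # ['Mon', 'Tue']
```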

Lastly, let’s look at the memory usage for this column before and after converting to the category type.


print(mem_usage(dow))
print(mem_usage(dow_cat))
9.84 MB
0.16 MB

We’ve gone from 9.8MB of memory usage to 0.16MB of memory usage — a 98% reduction! Note that this particular column probably represents one of our best-case scenarios: a column with ~172,000 items of which there are only 7 unique values.

While converting all of the columns to this type sounds appealing, it’s important to be aware of the trade-offs. The biggest one is the inability to perform numerical computations. We can’t do arithmetic with category columns or use methods like Series.min() and Series.max() without converting to a true numeric dtype first.
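A quick illustration of that trade-off on a toy series (the numeric values are made up): arithmetic on a category column raises an error until we convert back to a true numeric dtype.

```python
import pandas as pd

cat = pd.Series([1, 2, 3]).astype('category')

# Arithmetic isn't defined for categoricals, even numeric-looking ones...
try:
    cat + 1
except TypeError as e:
    print('TypeError:', e)

# ...but converting back to a numeric dtype restores it.
print((cat.astype('int64') + 1).tolist())  # [2, 3, 4]
```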

We should stick to using the category type primarily for object columns where less than 50% of the values are unique. If all of the values in a column are unique, the category type will end up using more memory. That’s because the column is storing all of the raw string values in addition to the integer category codes. You can read more about the limitations of the category type in the pandas documentation.
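A sketch of that worst case, using a hypothetical column of entirely unique ID strings: the category version still stores every raw string (as categories) plus an integer code per row, so it ends up costing more.

```python
import pandas as pd

# 10,000 distinct strings: no repetition for the category type to exploit.
unique = pd.Series(['id_%d' % i for i in range(10000)])

obj_mem = unique.memory_usage(index=False, deep=True)
cat_mem = unique.astype('category').memory_usage(index=False, deep=True)
print(obj_mem, cat_mem)  # the category version is larger
```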

We’ll write a loop to iterate over each object column, check if the number of unique values is less than 50%, and if so, convert it to the category type.


converted_obj = pd.DataFrame()
for col in gl_obj.columns:
    num_unique_values = len(gl_obj[col].unique())
    num_total_values = len(gl_obj[col])
    if num_unique_values / num_total_values < 0.5:
        converted_obj.loc[:,col] = gl_obj[col].astype('category')
    else:
        converted_obj.loc[:,col] = gl_obj[col]

As before, let’s compare the memory usage before and after the conversion, along with the dtype counts.


print(mem_usage(gl_obj))
print(mem_usage(converted_obj))
compare_obj = pd.concat([gl_obj.dtypes,converted_obj.dtypes],axis=1)
compare_obj.columns = ['before','after']
compare_obj.apply(pd.Series.value_counts)
752.72 MB
51.67 MB
          before  after
object      78.0    NaN
category     NaN   78.0

In this case, all our object columns were converted to the category type; however, this won’t be true of every data set, so be sure to use the process above to check first.

What’s more, our memory usage for our object columns has gone from 752MB to 52MB, or a reduction of 93%. Let’s combine this with the rest of our dataframe and see where we sit in relation to the 861MB memory usage we started with.


optimized_gl[converted_obj.columns] = converted_obj
mem_usage(optimized_gl)
'103.64 MB'

Wow, we’ve really made some progress! We have one more optimization we can make – if you remember back to our table of types, there was a datetime type that we can use for the first column of our data set.


date = optimized_gl.date
print(mem_usage(date))
date.head()
0.66 MB
0 18710504
1 18710505
2 18710506
3 18710508
4 18710509
Name: date, dtype: uint32

You may remember that this column was read in as an integer type and already optimized to uint32. Because of this, converting it to datetime will actually double its memory usage, as the datetime type is a 64-bit type. There’s value in converting it to datetime anyway, since it will allow us to more easily do time series analysis.

We’ll convert using the pandas.to_datetime() function, using the format parameter to tell it that our date data is stored as YYYYMMDD.


optimized_gl['date'] = pd.to_datetime(date,format='%Y%m%d')
print(mem_usage(optimized_gl))
optimized_gl.date.head()
104.29 MB

0 1871-05-04
1 1871-05-05
2 1871-05-06
3 1871-05-08
4 1871-05-09
Name: date, dtype: datetime64[ns]

Selecting Types While Reading the Data In

So far, we’ve explored ways to reduce the memory footprint of an existing dataframe. By reading the dataframe in first and then iterating on ways to save memory, we got a better sense of how much memory each optimization can save. As we mentioned earlier in the post, however, we often won’t have enough memory to represent all the values in a data set. How can we apply memory-saving techniques when we can’t even create the dataframe in the first place?

Fortunately, we can specify the optimal column types when we read the data set in. The pandas.read_csv() function has a few different parameters that allow us to do this. The dtype parameter accepts a dictionary that has (string) column names as the keys and NumPy type objects as the values.
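Here’s a minimal sketch of the shape that dictionary takes, using a tiny made-up CSV with three of our columns (the rows are invented for illustration):

```python
import pandas as pd
from io import StringIO

# A hypothetical three-column CSV, standing in for game_logs.csv.
csv = StringIO("day_of_week,v_score,attendance\n"
               "Thu,0,200.0\n"
               "Fri,20,5000.0\n")

# Keys are column names; values are the dtypes to apply while reading.
df = pd.read_csv(csv, dtype={'day_of_week': 'category',
                             'v_score': 'uint8',
                             'attendance': 'float32'})
print(df.dtypes)
```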

First, we’ll store the final types of every column in a dictionary with keys for column names, first removing the date column since that needs to be treated separately.


dtypes = optimized_gl.drop('date',axis=1).dtypes
dtypes_col = dtypes.index
dtypes_type = [i.name for i in dtypes.values]
column_types = dict(zip(dtypes_col, dtypes_type))
# rather than print all 161 items, we'll
# sample 10 key/value pairs from the dict
# and print them nicely using pprint
import pprint
preview = {key: value for key, value in list(column_types.items())[:10]}
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(preview)

{
'acquisition_info': 'category',
'h_caught_stealing': 'float32',
'h_player_1_name': 'category',
'h_player_9_name': 'category',
'v_assists': 'float32',
'v_first_catcher_interference': 'float32',
'v_grounded_into_double': 'float32',
'v_player_1_id': 'category',
'v_player_3_id': 'category',
'v_player_5_id': 'category'
}

Now we can use the dictionary, along with a few parameters for the date, to read in the data with the correct types in a few lines:


read_and_optimized = pd.read_csv('game_logs.csv',
                                 dtype=column_types,
                                 parse_dates=['date'],
                                 infer_datetime_format=True)
print(mem_usage(read_and_optimized))
read_and_optimized.head()
104.28 MB
(output: the first five rows again, now with the date column parsed as datetime — 1871-05-04, 1871-05-05, and so on — and all 161 columns read in with their optimized types; too wide to render here)

By optimizing the columns, we’ve managed to reduce the memory usage in pandas from 861.6 MB to 104.28 MB – an impressive 88% reduction!
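As a quick check, pandas can report that footprint directly. Below is a minimal sketch using a small stand-in dataframe (the full game_logs.csv isn't loaded here); with the real data you'd pass `gl` or `optimized_gl` instead:

```python
import pandas as pd

# Small stand-in dataframe; substitute the real game logs dataframe here.
df = pd.DataFrame({
    'v_name': ['CL1', 'BS1', 'CL1'],   # object (string) column
    'v_score': [0, 20, 12],            # int64 column
})

# deep=True counts the actual bytes held by the string objects,
# not just the 8-byte pointers pandas stores for them.
total_bytes = df.memory_usage(deep=True).sum()
print(f"{total_bytes / 1024 ** 2:.4f} MB")
```

Calling `df.info(memory_usage='deep')` prints the same deep total alongside the column dtypes.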

Analyzing baseball games

Now that we’ve optimized our data, we can perform some analysis. Let’s start by looking at the distribution of game days.


import matplotlib.pyplot as plt

optimized_gl['year'] = optimized_gl['date'].dt.year
games_per_day = optimized_gl.pivot_table(index='year', columns='day_of_week', values='date', aggfunc=len)
games_per_day = games_per_day.divide(games_per_day.sum(axis=1), axis=0)

ax = games_per_day.plot(kind='area', stacked=True)
ax.legend(loc='upper right')
ax.set_ylim(0, 1)
plt.show()

We can see that before the 1920s, Sunday baseball games were rare, before gradually becoming more popular through the latter half of the last century.

We can also see pretty clearly that the distribution of game days has been relatively static for the last 50 years.

Let’s also look at how game length has varied over the years.


# pivot_table aggregates with the mean by default, giving the average game length per year
game_lengths = optimized_gl.pivot_table(index='year', values='length_minutes')
game_lengths.reset_index().plot.scatter('year', 'length_minutes')
plt.show()

Looks like baseball games have continued to get longer from the 1940s onwards.

Summary and Next Steps

We’ve learned how pandas stores data using different types, and then we used that knowledge to reduce the memory usage of our pandas dataframe by almost 90%, just by using a few simple techniques:

  • Downcasting numeric columns to more efficient types.
  • Converting string columns to the categorical type.
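Put together, the two techniques look like this. This is a minimal sketch on a toy dataframe with the same shape of problem as the game logs (small integers stored as int64, repetitive strings stored as objects):

```python
import pandas as pd

df = pd.DataFrame({
    'v_score': [0, 20, 12, 12, 9],                        # small non-negative ints
    'day_of_week': ['Thu', 'Fri', 'Sat', 'Mon', 'Tue'],   # low-cardinality strings
})

# Downcast to the smallest unsigned integer type that can hold the values.
df['v_score'] = pd.to_numeric(df['v_score'], downcast='unsigned')

# Store each distinct string once and reference it with integer codes.
df['day_of_week'] = df['day_of_week'].astype('category')

print(df.dtypes)  # v_score becomes uint8, day_of_week becomes category
```

The categorical conversion pays off only when a column has far fewer unique values than rows, which is why it works so well for columns like day_of_week and team names.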

If you’d like to dive deeper into working with large data in pandas, much of the content in this blog post is also available in our interactive Processing Large Datasets In Pandas course, which you can start for free.


Tags

big data, intermediate, large data set, Learn Python, Pandas, python, tutorial, Tutorials

