MISSION 163

Optimizing DataFrame Memory Footprint

In previous courses in the Data Scientist track, we used pandas to explore and analyze data sets without much consideration for performance. While performance is rarely a problem with small data sets (under 100 megabytes), it can start to become an issue with larger data sets (100 megabytes to multiple gigabytes). Performance issues can make run times much longer or cause code to fail entirely due to insufficient memory.

While tools like Spark can handle large data sets (100 gigabytes to multiple terabytes), taking full advantage of their capabilities usually requires more expensive hardware. And unlike pandas, these tools generally lack rich feature sets for high-quality data cleaning, exploration, and analysis. For medium-sized data sets, we're better off trying to get more performance out of pandas rather than switching to a different tool.

In this course, we'll explore different techniques for processing large data sets in pandas. In this lesson, we'll learn how pandas represents the values in a data set in memory, and how to reduce a dataframe's memory footprint by selecting the appropriate data types for its columns.
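
To make this concrete, here's a minimal sketch of how to inspect a dataframe's memory footprint. The frame built here is a hypothetical example; info(memory_usage="deep") and memory_usage(deep=True) are the pandas calls that report true sizes, including the full cost of string columns:

    import pandas as pd
    import numpy as np

    # A hypothetical example frame standing in for a real data set.
    df = pd.DataFrame({
        "id": np.arange(100_000),                                # int64
        "price": np.random.rand(100_000) * 100,                  # float64
        "category": np.random.choice(["a", "b", "c"], 100_000),  # object
    })

    # memory_usage="deep" measures object (string) columns by their
    # actual contents instead of a cheap pointer-based estimate.
    df.info(memory_usage="deep")

    # Per-column breakdown in bytes.
    print(df.memory_usage(deep=True))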

As you work through the lesson, you'll apply what you've learned from within your browser, so there's no need to use your own machine to do the exercises (although you can download the data and work on your own machine if you'd prefer!). The Python environment in this course includes answer checking, so you can make sure you've fully mastered each concept before moving on to the next.

Objectives

  • Learn how to use pandas for large data sets.
  • Learn how much memory a pandas dataframe uses.
  • Learn how to optimize pandas data types.

Mission Outline

1. Introduction
2. How Pandas Represents Values in a Dataframe
3. Different Types Have Different Memory Footprints
4. Calculating the True Memory Footprint
5. Optimizing Integer Columns With Subtypes
6. Optimizing Float Columns With Subtypes
7. Converting To DateTime
8. Converting to Categorical to Save Memory
9. Selecting Types While Reading the Data In
10. Next Steps
11. Takeaways
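
As a preview of steps 5 through 9, here's a minimal sketch of what these conversions typically look like in pandas. The column names and the file name data.csv are hypothetical placeholders:

    import pandas as pd

    # A hypothetical frame with one column of each kind.
    df = pd.DataFrame({
        "id": [1, 2, 3],
        "price": [9.99, 14.50, 3.25],
        "opened": ["2020-01-01", "2020-06-15", "2021-03-09"],
        "category": ["a", "b", "a"],
    })

    # Downcast numeric columns to the smallest subtype that fits.
    df["id"] = pd.to_numeric(df["id"], downcast="unsigned")     # int64 -> uint8 here
    df["price"] = pd.to_numeric(df["price"], downcast="float")  # float64 -> float32

    # Parse string timestamps into the datetime64 type.
    df["opened"] = pd.to_datetime(df["opened"])

    # Store a low-cardinality string column as categorical, which keeps
    # each distinct value only once.
    df["category"] = df["category"].astype("category")

    # Or choose types up front while reading the data in, so the
    # unoptimized frame never exists (assumes a data.csv with these columns).
    df = pd.read_csv(
        "data.csv",
        dtype={"id": "uint32", "price": "float32", "category": "category"},
        parse_dates=["opened"],
    )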


Course Info:

Intermediate

The median completion time for this course is 5.3 hours.

This course requires a premium subscription and includes three missions and two guided projects. It is the third course in the Data Engineer path.
