March Madness Challenge – Compete, Learn, and Win – Register Now
The Dataquest Download
Level up your data and AI skills, one newsletter at a time.
Hello, Dataquesters!
Here’s what we have in store for you in this edition:
- Concept of the Week: Speed up data filtering in pandas with Boolean indexing—no loops required! Learn more
- From the Community: Explore hyperparameter tuning, SQL-powered profit optimization, and a data-driven approach to winning Jeopardy. Join the discussion
- New Resource: Profitable App Profiles—a hands-on project in Python to analyze mobile app markets. Learn now
Have you ever found yourself writing a loop to filter rows in a pandas DataFrame, only to realize that it’s taking way too long? Or maybe you’ve struggled with getting the right rows when filtering data? If so, you’re not alone. Many beginners (and even experienced analysts) overlook one of pandas’ most powerful features: Boolean indexing.
Boolean indexing lets you filter data using conditions, eliminating the need for inefficient loops. Once you get the hang of it, you’ll be able to select the exact data you need in just one line of code. Let’s walk through the best ways to filter data in pandas, the right way!
The Dataset: Fortune 500 Companies
To practice Boolean indexing, we’ll use the Fortune 500 dataset, which contains financial data on the top 500 companies in the world. This dataset includes columns such as:
- company: Name of the company
- revenues: Total revenue (in millions)
- profits: Net income (in millions)
- country: Country where the company is headquartered
You can download the dataset and follow along locally using this link: Fortune 500 dataset.
Now, let’s learn some Boolean indexing techniques!
Boolean Filtering Basics
Boolean indexing works by applying a condition to a DataFrame or Series, which results in a Boolean mask—a series of True and False values. When the mask is used as a filter, only the rows where the condition evaluated to True are returned.
Example: Selecting Profitable Companies
Say we have a DataFrame f500 that contains financial data for Fortune 500 companies, including revenue and profit columns. If we want to select only the companies that reported a profit, we can load the data and use:
import pandas as pd
f500 = pd.read_csv("f500.csv", index_col=0)
f500.index.name = None
bool_profitable = f500["profits"] > 0
profitable = f500[bool_profitable]
Here’s what’s happening:
- f500["profits"] > 0 creates a Boolean mask (bool_profitable) where each row is marked as True if the profits are positive.
- f500[bool_profitable] filters the DataFrame, returning only the rows where the mask is True.
And just like that—no loops needed!
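If you want to peek at the mask itself before filtering, or skip the intermediate variable entirely, a quick sketch using the bool_profitable mask created above looks like this:
# Inspect the first few True/False values of the mask
print(bool_profitable.head())

# The mask can also be built inline, filtering in a single expression
profitable = f500[f500["profits"] > 0]
print(profitable.shape[0], "companies reported a profit")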
Combining Conditions with Boolean Operators
You’re not limited to filtering by a single condition. You can combine multiple conditions using the & (AND), | (OR), and ~ (NOT) operators. Just be sure to wrap each condition in parentheses to avoid syntax errors.
Example: Companies with High Revenue and Negative Profit
big_rev_neg_profit = f500[(f500["revenues"] > 100000) & (f500["profits"] < 0)]
This selects companies that:
- Have revenues greater than $100B (100000 in millions)
- Reported negative profits
Want to find companies with high revenue OR negative profits? Just swap & for |:
big_or_neg_profit = f500[(f500["revenues"] > 100000) | (f500["profits"] < 0)]
Or if you want to exclude companies in the USA:
non_us_companies = f500[~(f500["country"] == "USA")]
The ~ operator negates the condition, selecting everything except companies in the USA.
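As a side note, the same rows could also be selected with the != comparison operator; ~ really pays off when you need to negate a compound condition:
# Equivalent filter using != instead of ~
non_us_companies = f500[f500["country"] != "USA"]

# ~ is handy for negating a combined condition, e.g. everything that is
# NOT (headquartered in the USA and profitable)
others = f500[~((f500["country"] == "USA") & (f500["profits"] > 0))]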
Filtering Missing Values with .notnull()
Since it is such a common task, let’s talk about handling missing values in pandas. Often, datasets contain NaN (Not a Number) values, which represent missing or undefined data. The .notnull() method is a quick way to filter out these missing values.
Example: Selecting Companies with a Previous Rank
previously_ranked = f500[f500["previous_rank"].notnull()]
print(previously_ranked.shape[0])  # Output: 467
This selects only the rows where the previous_rank column contains valid (non-null) values. Using .notnull() ensures that we don’t perform calculations on missing data, preventing unexpected errors.
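The complementary method, .isnull(), selects the rows with missing values instead, which is handy for checking what you are about to exclude. A small sketch, building on the example above:
# Companies without a valid previous rank (the rows .notnull() filtered out)
no_previous_rank = f500[f500["previous_rank"].isnull()]
print(no_previous_rank.shape[0])  # 33 rows, if 467 of the 500 companies have a previous rank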
Understanding pandas Index Alignment
One of the lesser-known but powerful features of pandas is index alignment. When performing operations on DataFrames or Series, pandas automatically aligns values based on their index labels. This makes merging, updating, and performing calculations across different datasets much easier.
How Index Alignment Works
Let’s say we want to calculate how much each company’s ranking has changed compared to the previous year. We can compute the rank_change column by subtracting the current rank from the previous_rank:
previously_ranked = f500[f500["previous_rank"].notnull()]
rank_change = previously_ranked["previous_rank"] - previously_ranked["rank"]
f500["rank_change"] = rank_change
Even though the previously_ranked DataFrame contains fewer rows than f500, pandas automatically aligns the index labels. This ensures that only the rows that exist in previously_ranked are updated in f500, while all other rows remain unchanged. For these unaffected rows, the rank_change column will contain NaN values, since no corresponding previous rank exists to compute a change.
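A quick way to verify this alignment, using the f500 DataFrame and rank_change Series from the example above, is to count the non-null values in the new column:
# rank_change was computed from 467 previously ranked companies, so only
# those 467 index labels receive a value; the rest are filled with NaN
print(f500["rank_change"].notnull().sum())  # 467
print(f500["rank_change"].isnull().sum())   # the remaining rows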
Advantages of Index Alignment
- Prevents incorrect data merges – Ensures values correspond to the correct rows.
- No need for manual sorting – pandas automatically aligns values based on index labels.
- Works even when indexes are non-sequential – Making transformations and calculations easier.
This built-in behavior allows seamless operations, reducing errors and saving time when working with large datasets.
Filtering Multiple Values with .isin()
What if you want to filter by a list of values? Instead of chaining multiple OR conditions, we can use .isin().
Example: Selecting Companies from Brazil or Venezuela
latin_american_companies = f500[f500["country"].isin(["Brazil", "Venezuela"])]
This is much cleaner than writing multiple | conditions!
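For comparison, here is the same filter written with chained | conditions. It works, but it grows unwieldy as the list of values gets longer:
# Equivalent filter using explicit OR conditions (harder to read and extend)
latin_american_companies = f500[
    (f500["country"] == "Brazil") | (f500["country"] == "Venezuela")
]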
Filtering Numeric Ranges with .between()
If you’re filtering numeric values within a range, .between() is a more readable alternative to >= and <= comparisons.
Example: Finding Mid-Sized Companies by Revenue
mid_size_companies = f500[f500["revenues"].between(50000, 100000)]
This selects companies with revenue between $50B and $100B (inclusive).
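The comparison-based equivalent is shown below; both return the same rows because .between() includes its endpoints by default:
# Same filter written with explicit >= and <= comparisons
mid_size_companies = f500[
    (f500["revenues"] >= 50000) & (f500["revenues"] <= 100000)
]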
Why Boolean Indexing Matters
- Faster than loops – Filtering happens in a single, optimized operation.
- More readable – Cleaner code that’s easier to understand.
- More scalable – Works efficiently even on large datasets.
If you’re still writing loops to filter DataFrames, it’s time to level up with Boolean indexing!
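If you want to see the difference on your own machine, a rough timing sketch like the one below (exact numbers will vary with hardware and dataset size) contrasts a row-by-row loop with the vectorized mask:
import time

# Slow: loop over rows with .iterrows() and collect matching labels
start = time.perf_counter()
matching_labels = [label for label, row in f500.iterrows() if row["profits"] > 0]
looped = f500.loc[matching_labels]
loop_time = time.perf_counter() - start

# Fast: a single vectorized Boolean mask
start = time.perf_counter()
vectorized = f500[f500["profits"] > 0]
mask_time = time.perf_counter() - start

print(f"loop: {loop_time:.6f}s, boolean indexing: {mask_time:.6f}s")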
Want to sharpen your pandas skills even more? Learn all about data selection and exploration in our lesson on Exploring Data with Pandas: Intermediate or enroll in the Junior Data Analyst path to build on your skills.
Happy coding, and keep experimenting!
From the Community
Fine-Tuning for Success: Neha breaks down hyperparameter tuning and why it’s crucial for better model performance. What’s your go-to tuning method?
Profit Optimization with SQL: Linky’s project skillfully combines SQL and Python to analyze customers and products—packed with clear insights and a structured narrative. A great SQL learning resource!
Winning Jeopardy with Data: Dimitar explores 200,000 Jeopardy questions in this function-based project, featuring clean code, well-structured insights, and engaging storytelling.
DQ Resources
Profitable App Profiles—A Data Analysis Project: Learn to analyze mobile app market data and recommend profitable development strategies in this guided project. Gain hands-on experience with Python, data cleaning, and exploratory analysis. Learn now
Breaking Into a Data Career—Lessons From a CEO: In this video, Kishawna Peck, CEO of Womxn in Data Science, shares insights on navigating the data industry, building a strong portfolio, and breaking into a data career. Learn more
Build Your First Data Project: Learn how to start your first data project with this beginner-friendly guide, featuring step-by-step walkthroughs for Python and SQL projects. Read more
What We're Reading
Make Your Charts Look Glorious: Improve your Matplotlib charts with simple tweaks like better titles, annotations, and layout adjustments to create clearer, more polished visuals.
The Lady Tasting Tea & Fisher’s Exact Test: Discover how a simple tea-tasting experiment led to Fisher’s exact test, illustrating how small datasets can provide powerful statistical insights.
The Future of AI, LLMs, and Observability on Google Cloud: Google’s AI Director and Datadog’s VP of Engineering share key insights on AI, ML, and LLMs, covering upskilling, observability, and future trends.
Give 20%, Get $20: Time to Refer a Friend!
Now is the perfect time to share Dataquest with a friend. Gift a 20% discount, and for every friend who subscribes, earn a $20 bonus. Use your bonuses for digital gift cards, prepaid cards, or donate to charity. Your choice! Click here
High-fives from Vik, Celeste, Anna P, Anna S, Anishta, Bruno, Elena, Mike, Daniel, and Brayan.