The Dataquest Download

Level up your data and AI skills, one newsletter at a time.

Each week, the Dataquest Download brings the latest behind-the-scenes developments at Dataquest directly to your inbox. Discover our top tutorial of the week to boost your data skills, get the scoop on any course changes, and pick up a useful tip to apply in your projects. We also spotlight standout projects from our students and share their personal learning journeys.

Hello, Dataquesters!

Here’s what we have in store for you in this edition:

  • Concept of the Week: Speed up data filtering in pandas with Boolean indexing—no loops required! Learn more

  • From the Community: Explore hyperparameter tuning, SQL-powered profit optimization, and a data-driven approach to winning Jeopardy. Join the discussion

  • New Resource: Profitable App Profiles—a hands-on project in Python to analyze mobile app markets. Learn now

Have you ever found yourself writing a loop to filter rows in a pandas DataFrame, only to realize that it’s taking way too long? Or maybe you’ve struggled with getting the right rows when filtering data? If so, you’re not alone. Many beginners (and even experienced analysts) overlook one of pandas’ most powerful features: Boolean indexing.

Boolean indexing lets you filter data using conditions, eliminating the need for inefficient loops. Once you get the hang of it, you’ll be able to select the exact data you need in just one line of code. Let’s walk through the best ways to filter data in pandas, the right way!

The Dataset: Fortune 500 Companies

To practice Boolean indexing, we’ll use the Fortune 500 dataset, which contains financial data on the top 500 companies in the world. This dataset includes columns such as:

  • company: Name of the company
  • revenues: Total revenue (in millions)
  • profits: Net income (in millions)
  • country: Country where the company is headquartered

You can download the dataset and follow along locally using this link: Fortune 500 dataset.

Now, let’s learn some Boolean indexing techniques!

Boolean Filtering Basics

Boolean indexing works by applying a condition to a DataFrame or Series, which results in a Boolean mask—a series of True and False values. When it’s used as a filter, only the rows where the condition evaluated to True are returned.

Example: Selecting Profitable Companies

Say we have a DataFrame f500 that contains financial data for Fortune 500 companies, including revenue and profit columns. If we want to select only the companies that reported a profit, we can load the data and use:

import pandas as pd

f500 = pd.read_csv("f500.csv", index_col=0)
f500.index.name = None

bool_profitable = f500["profits"] > 0
profitable = f500[bool_profitable]

Here’s what’s happening:

  • f500["profits"] > 0 creates a Boolean mask (bool_profitable) where each row is marked as True if the profits are positive.
  • f500[bool_profitable] filters the DataFrame, returning only the rows where the mask is True.

And just like that—no loops needed!

Combining Conditions with Boolean Operators

You’re not limited to filtering by a single condition. You can combine multiple conditions using & (AND), | (OR), and ~ (NOT) operators. Just be sure to wrap each condition in parentheses to avoid syntax errors.

Example: Companies with High Revenue and Negative Profit

big_rev_neg_profit = f500[(f500["revenues"] > 100000) & (f500["profits"] < 0)]

This selects companies that:

  • Have revenues greater than $100B (100000 in millions)
  • Reported negative profits

Want to find companies with high revenue OR negative profits? Just swap & for |:

big_or_neg_profit = f500[(f500["revenues"] > 100000) | (f500["profits"] < 0)]

Or if you want to exclude companies in the USA:

non_us_companies = f500[~(f500["country"] == "USA")]

The ~ operator negates the condition, selecting everything except companies in the USA.

Filtering Missing Values with .notnull()

Since it is such a common task, let’s talk about handling missing values in pandas. Often, datasets contain NaN (Not a Number) values, which represent missing or undefined data. The .notnull() method is a quick way to filter out these missing values.

Example: Selecting Companies with a Previous Rank

previously_ranked = f500[f500["previous_rank"].notnull()]
print(previously_ranked.shape)  # Output: (467,)

This selects only the rows where the previous_rank column contains valid (non-null) values. Using .notnull() ensures that we don’t perform calculations on missing data, preventing unexpected errors.

Understanding pandas Index Alignment

One of the lesser-known but powerful features of pandas is index alignment. When performing operations on DataFrames or Series, pandas automatically aligns values based on their index labels. This makes merging, updating, and performing calculations across different datasets much easier.

How Index Alignment Works

Let’s say we want to calculate how much each company’s ranking has changed compared to the previous year. We can compute the rank_change column by subtracting the current rank from the previous_rank:

previously_ranked = f500[f500["previous_rank"].notnull()]
rank_change = previously_ranked["previous_rank"] - previously_ranked["rank"]
f500["rank_change"] = rank_change

Even though the previously_ranked DataFrame contains fewer rows than f500, pandas automatically aligns the index labels. This ensures that only the rows that exist in previously_ranked are updated in f500, while all other rows remain unchanged. For these unaffected rows, the rank_change column will contain NaN values, since no corresponding previous rank exists to compute a change.

Advantages of Index Alignment

  • Prevents incorrect data merges – Ensures values correspond to the correct rows.
  • No need for manual sorting – pandas automatically aligns values based on index labels.
  • Works even when indexes are non-sequential – Making transformations and calculations easier.

This built-in behavior allows seamless operations, reducing errors and saving time when working with large datasets.

Filtering Multiple Values with .isin()

What if you want to filter by a list of values? Instead of chaining multiple OR conditions, we can use .isin().

Example: Selecting Companies from Brazil or Venezuela

latin_american_companies = f500[f500["country"].isin(["Brazil", "Venezuela"])]

This is much cleaner than writing multiple | conditions!

Filtering Numeric Ranges with .between()

If you’re filtering numeric values within a range, .between() is a more readable alternative to >= and <= comparisons.

Example: Finding Mid-Sized Companies by Revenue

mid_size_companies = f500[f500["revenues"].between(50000, 100000)]

This selects companies with revenue between $50B and $100B (inclusive).

Why Boolean Indexing Matters

  • Faster than loops – Filtering happens in a single, optimized operation.
  • More readable – Cleaner code that’s easier to understand.
  • More scalable – Works efficiently even on large datasets.

If you’re still writing loops to filter DataFrames, it’s time to level up with Boolean indexing!

Want to sharpen your pandas skills even more? Learn all about data selection and exploration in our lesson on Exploring Data with Pandas: Intermediate or enroll in the Junior Data Analyst path to build on your skills.

Happy coding, and keep experimenting!

From the Community

Fine-Tuning for Success: Neha breaks down hyperparameter tuning and why it’s crucial for better model performance. What’s your go-to tuning method?

Profit Optimization with SQL: Linky’s project skillfully combines SQL and Python to analyze customers and products—packed with clear insights and a structured narrative. A great SQL learning resource!

Winning Jeopardy with Data: Dimitar explores 200,000 Jeopardy questions in this function-based project, featuring clean code, well-structured insights, and engaging storytelling.

DQ Resources

Profitable App Profiles—A Data Analysis Project:  Learn to analyze mobile app market data and recommend profitable development strategies in this guided project. Gain hands-on experience with Python, data cleaning, and exploratory analysis. Learn now

Breaking Into a Data Career—Lessons From a CEO: In this video, Kishawna Peck, CEO of Womxn in Data Science, shares insights on navigating the data industry, building a strong portfolio, and breaking into a data career. Learn more

Build Your First Data Project: Learn how to start your first data project with this beginner-friendly guide, featuring step-by-step walkthroughs for Python and SQL projects. Read more

What We're Reading

Make Your Charts Look Glorious: Improve your Matplotlib charts with simple tweaks like better titles, annotations, and layout adjustments to create clearer, more polished visuals.

The Lady Tasting Tea & Fisher’s Exact Test: Discover how a simple tea-tasting experiment led to Fisher’s exact test, illustrating how small datasets can provide powerful statistical insights.

The Future of AI, LLMs, and Observability on Google Cloud: Google’s AI Director and Datadog’s VP of Engineering share key insights on AI, ML, and LLMs, covering upskilling, observability, and future trends.

Give 20%, Get $20: Time to Refer a Friend!

Give 20% Get $20

Now is the perfect time to share Dataquest with a friend. Gift a 20% discount, and for every friend who subscribes, earn a $20 bonus. Use your bonuses for digital gift cards, prepaid cards, or donate to charity. Your choice! Click here

High-fives from Vik, Celeste, Anna P, Anna S, Anishta, Bruno, Elena, Mike, Daniel, and Brayan.

2025-03-19

Python Project, Data Storytelling & DQ Resources

Investigate Fandango movie ratings with a hands-on Python project and learn what makes a data story. Plus, gain access to essential DQ resources. Read More
2025-03-12

PyTorch vs. TensorFlow

More AI developers are switching to PyTorch for its flexibility and ease of use. Learn why it’s gaining traction and build your first deep learning model to predict salaries. Read More
2025-03-05

Messy Column Names? Here’s How to Fix Them

Are messy column names slowing you down? Learn how to clean them up with pandas—remove spaces, standardize formatting, and make your dataset easy to work with. Read More

Learn faster and retain more.
Dataquest is the best way to learn