February 24, 2022

Tutorial: Filtering Pandas DataFrames

The Pandas library is a fast, powerful, and easy-to-use tool for working with data. It helps us cleanse, explore, analyze, and visualize data by providing game-changing capabilities. Having data as a Pandas DataFrame allows us to slice and dice data in various ways and filter the DataFrame's rows effortlessly. This tutorial will go over the primary methods for getting data out of a DataFrame using different filtering techniques.

What is a DataFrame?

Before discussing different filtering DataFrames techniques, let’s review a DataFrame. Technically a DataFrame builds upon the Pandas Series object that a Series is a one-dimensional array of indexed data. Intuitively a DataFrame is similar to a spreadsheet in many ways; it can have multiple columns with varying types of data and rows labeled with row indices.

Creating a DataFrame

First let’s create a dummy DataFrame containing the personal details of a company’s employees using the following snippet:

import pandas as pd
import numpy as np
from IPython.display import display

data = pd.DataFrame({'EmployeeName': ['Callen Dunkley', 'Sarah Rayner', 'Jeanette Sloan','Kaycee Acosta', 'Henri Conroy', 'Emma Peralta','Martin Butt', 'Alex Jensen', 'Kim Howarth', 'Jane Burnett'],
                    'Department': ['Accounting', 'Engineering', 'Engineering', 'HR', 'HR','HR','Data Science','Data Science', 'Accounting', 'Data Science'],
                    'HireDate': [2010, 2018, 2012, 2014, 2014, 2018, 2020, 2018, 2020, 2012],
                    'Gender': ['M','F','F','F','M','F','M','M','M','F'],
                    'DoB':['04/09/1982', '14/04/1981','06/05/1997','08/01/1986','10/10/1988','12/11/1992', '10/04/1991','16/07/1995','08/10/1992','11/10/1979'],
                    'Weight':[78,80,66,67,90,57,115,87,95,57],
                    'Height':[176,160,169,157,185,164,195,180,174,165]
                  })
display(data)

	EmployeeName	Department	HireDate	Gender	DoB	Weight	Height
0	Callen Dunkley	Accounting	2010	M	04/09/1982	78	176
1	Sarah Rayner	Engineering	2018	F	14/04/1981	80	160
2	Jeanette Sloan	Engineering	2012	F	06/05/1997	66	169
3	Kaycee Acosta	HR	2014	F	08/01/1986	67	157
4	Henri Conroy	HR	2014	M	10/10/1988	90	185
5	Emma Peralta	HR	2018	F	12/11/1992	57	164
6	Martin Butt	Data Science	2020	M	10/04/1991	115	195
7	Alex Jensen	Data Science	2018	M	16/07/1995	87	180
8	Kim Howarth	Accounting	2020	M	08/10/1992	95	174
9	Jane Burnett	Data Science	2012	F	11/10/1979	57	165

Run the code above. It creates a DataFrame with the above data.
Let’s take a look at the columns’ data types by running the following statement:

print(data.dtypes)

EmployeeName    object
Department      object
HireDate         int64
Gender          object
DoB             object
Weight           int64
Height           int64
dtype: object

The Pandas data type for string values is object that is why the data type of columns like EmployeeName or Department containing strings is object. The DataFrame's columns each have a specific data type; in other words, the DataFrame's columns don't share the same data type.
The other useful DataFrame property is .values, which returns an array of lists, each list representing a DataFrame’s row.

print(data.values)

[['Callen Dunkley' 'Accounting' 2010 'M' '04/09/1982' 78 176]
 ['Sarah Rayner' 'Engineering' 2018 'F' '14/04/1981' 80 160]
 ['Jeanette Sloan' 'Engineering' 2012 'F' '06/05/1997' 66 169]
 ['Kaycee Acosta' 'HR' 2014 'F' '08/01/1986' 67 157]
 ['Henri Conroy' 'HR' 2014 'M' '10/10/1988' 90 185]
 ['Emma Peralta' 'HR' 2018 'F' '12/11/1992' 57 164]
 ['Martin Butt' 'Data Science' 2020 'M' '10/04/1991' 115 195]
 ['Alex Jensen' 'Data Science' 2018 'M' '16/07/1995' 87 180]
 ['Kim Howarth' 'Accounting' 2020 'M' '08/10/1992' 95 174]
 ['Jane Burnett' 'Data Science' 2012 'F' '11/10/1979' 57 165]]

If you’re interested to know the number of rows and columns of the DataFrame, you can use the .shape property as follows:

print(data.shape)

(10, 7)

The tuple above represents the number of rows and columns, respectively.

Slicing and Dicing a DataFrame from Every Possible Angle

So far, we’ve created and inspected a DataFrame. As mentioned earlier, there are different ways for filtering a DataFrame. In this section, we will discuss the most prevalent and efficient methods. However, it would be better to look at how we can select specific columns, first.
To grab the entire of specific columns, we can use the following syntax:

DataFrame.ColumnName  # if the column name does not contain any spaces
OR 
DataFrame["ColumnName"]

For example, let’s grab the Department column of the DataFrame.

display(data.Department)

0      Accounting
1     Engineering
2     Engineering
3              HR
4              HR
5              HR
6    Data Science
7    Data Science
8      Accounting
9    Data Science
Name: Department, dtype: object

Running the statement above returns the Series containing the Department column’s data.

Also, we can select multiple columns by passing a list of column names, as follows:

display(data[['EmployeeName','Department','Gender']])

	EmployeeName	Department	Gender
0	Callen Dunkley	Accounting	M
1	Sarah Rayner	Engineering	F
2	Jeanette Sloan	Engineering	F
3	Kaycee Acosta	HR	F
4	Henri Conroy	HR	M
5	Emma Peralta	HR	F
6	Martin Butt	Data Science	M
7	Alex Jensen	Data Science	M
8	Kim Howarth	Accounting	M
9	Jane Burnett	Data Science	F

Running the statement above gives us a DataFrame that is a subset of the original DataFrame.

The `.loc` and `.iloc` Methods

We can use slicing techniques to extract specific rows from a DataFrame. The best way to slice a DataFrame and select particular columns is to use the .loc and .iloc methods. The twin methods create a subset of a DataFrame using label-based or integer-based indexing, respectively.

DataFrame.loc[row_indexer,column_indexer]
DataFrame.iloc[row_indexer,column_indexer]

The first position in both methods specifies the row indexer, and the second specifies the column indexer separated with a comma.
Before we want to practice working with the .loc and .iloc methods, let’s set the EmployeeName column as the index of the DataFrame. Doing this is not required but helps us discuss all the capabilities of these methods. To do so, use the set_index() method to make the EmployeeName column the index of the DataFrame:

data = data.set_index('EmployeeName')
display(data)

	Department	HireDate	Gender	DoB	Weight	Height
EmployeeName
Callen Dunkley	Accounting	2010	M	04/09/1982	78	176
Sarah Rayner	Engineering	2018	F	14/04/1981	80	160
Jeanette Sloan	Engineering	2012	F	06/05/1997	66	169
Kaycee Acosta	HR	2014	F	08/01/1986	67	157
Henri Conroy	HR	2014	M	10/10/1988	90	185
Emma Peralta	HR	2018	F	12/11/1992	57	164
Martin Butt	Data Science	2020	M	10/04/1991	115	195
Alex Jensen	Data Science	2018	M	16/07/1995	87	180
Kim Howarth	Accounting	2020	M	08/10/1992	95	174
Jane Burnett	Data Science	2012	F	11/10/1979	57	165

Now, instead of having integer-based indexes, we have label-based indexes, which makes it more apt to try all the capabilities of .loc and .iloc methods for slicing and dicing the DataFrame.

NOTE

The .loc method is label-based, so you have to specify rows and columns based on their labels. On the other hand, the iloc method is integer index-based, so you have to select rows and columns by their integer index.

The following statements return the identical results practically:

display(data.loc[['Henri Conroy', 'Kim Howarth']])
display(data.iloc[[4,8]])

	Department	HireDate	Gender	DoB	Weight	Height
EmployeeName
Henri Conroy	HR	2014	M	10/10/1988	90	185
Kim Howarth	Accounting	2020	M	08/10/1992	95	174

	Department	HireDate	Gender	DoB	Weight	Height
EmployeeName
Henri Conroy	HR	2014	M	10/10/1988	90	185
Kim Howarth	Accounting	2020	M	08/10/1992	95	174

In the above code, the .loc method gets a list of the employee names as row labels and returns a DataFrame containing the rows associated with the two employees.
The .iloc method does the same thing, but instead of getting the labeled indexes, it gets the integer indexes of the rows.
Now, let’s extract a sub-DataFrame that contains the name, department, and gender values of the first three employees using the .loc and .iloc methods:

display(data.iloc[:3,[0,2]])
display(data.loc['Callen Dunkley':'Jeanette Sloan',['Department','Gender']])

	Department	Gender
EmployeeName
Callen Dunkley	Accounting	M
Sarah Rayner	Engineering	F
Jeanette Sloan	Engineering	F

	Department	Gender
EmployeeName
Callen Dunkley	Accounting	M
Sarah Rayner	Engineering	F
Jeanette Sloan	Engineering	F

Running the statements above returns the identical results as follows:

EmployeeName     Department Gender                      
Callen Dunkley   Accounting      M
Sarah Rayner    Engineering      F
Jeanette Sloan  Engineering      F

The .loc and .iloc methods are pretty similar; the only difference is how referring to columns and rows in a DataFrame.
The following statements return all the rows of the DataFrame:

display(data.iloc[:,[0,2]])
display(data.loc[:,['Department','Gender']])

	Department	Gender
EmployeeName
Callen Dunkley	Accounting	M
Sarah Rayner	Engineering	F
Jeanette Sloan	Engineering	F
Kaycee Acosta	HR	F
Henri Conroy	HR	M
Emma Peralta	HR	F
Martin Butt	Data Science	M
Alex Jensen	Data Science	M
Kim Howarth	Accounting	M
Jane Burnett	Data Science	F

	Department	Gender
EmployeeName
Callen Dunkley	Accounting	M
Sarah Rayner	Engineering	F
Jeanette Sloan	Engineering	F
Kaycee Acosta	HR	F
Henri Conroy	HR	M
Emma Peralta	HR	F
Martin Butt	Data Science	M
Alex Jensen	Data Science	M
Kim Howarth	Accounting	M
Jane Burnett	Data Science	F

We’re able to do slicing for columns too. The statement below returns the Weight and Height columns of every second row of the DataFrame:

display(data.loc[::2,'Weight':'Height'])
display(data.iloc[::2,4:6])

	Weight	Height
EmployeeName
Callen Dunkley	78	176
Jeanette Sloan	66	169
Henri Conroy	90	185
Martin Butt	115	195
Kim Howarth	95	174

	Weight	Height
EmployeeName
Callen Dunkley	78	176
Jeanette Sloan	66	169
Henri Conroy	90	185
Martin Butt	115	195
Kim Howarth	95	174

We perform slicing on the DataFrame’s rows and columns in the code above. The syntax is similar to slicing a list in Python. We need to specify the start index, the stop index, and the step size.

EmployeeName    Weight  Height                  
Callen Dunkley      78     176
Jeanette Sloan      66     169
Henri Conroy        90     185
Martin Butt        115     195
Kim Howarth         95     174

Notice that in slicing with labels using the .loc method, both the start and the stop labels are included. However, in slicing using .iloc method the stop index is excluded.

Filtering a DataFrame in Pandas

So far, we’ve learned how to slice and dice a DataFrame, but how can we take data that meets specific criteria?
Using Boolean masks is the common method for filtering a DataFrame in Pandas. First let’s see what is a Boolean mask:

print(data.HireDate > 2015)

EmployeeName
Callen Dunkley    False
Sarah Rayner       True
Jeanette Sloan    False
Kaycee Acosta     False
Henri Conroy      False
Emma Peralta       True
Martin Butt        True
Alex Jensen        True
Kim Howarth        True
Jane Burnett      False
Name: HireDate, dtype: bool

By running the code above, we can see which entries in the HireDate column are greater than 2015. The Boolean values indicate whether the condition is met or not.
Now, we can use the boolean mask to get the subset of the DataFrame containing the
employees who are hired after 2015:

display(data[data.HireDate > 2015])

	Department	HireDate	Gender	DoB	Weight	Height
EmployeeName
Sarah Rayner	Engineering	2018	F	14/04/1981	80	160
Emma Peralta	HR	2018	F	12/11/1992	57	164
Martin Butt	Data Science	2020	M	10/04/1991	115	195
Alex Jensen	Data Science	2018	M	16/07/1995	87	180
Kim Howarth	Accounting	2020	M	08/10/1992	95	174

We also can use the .loc method to filter the rows along with returning certain columns, as follows:

display(data.loc[data.HireDate > 2015, ['Department', 'HireDate']])

	Department	HireDate
EmployeeName
Sarah Rayner	Engineering	2018
Emma Peralta	HR	2018
Martin Butt	Data Science	2020
Alex Jensen	Data Science	2018
Kim Howarth	Accounting	2020

We're able to combine two or more criteria using logical operators. Let's list those male employees working in Data Science team:

display(data.loc[(data.Gender == "M") & (data.Department=='Data Science'), 
                ['Department', 'Gender', 'HireDate']]
        )

	Department	Gender	HireDate
EmployeeName
Martin Butt	Data Science	M	2020
Alex Jensen	Data Science	M	2018

What if we want to get all employees with heights between 170 cm and 190 cm?

display(data.loc[data.Height.between(170,190), 
                ['Department', 'Gender', 'Height']]
        )

	Department	Gender	Height
EmployeeName
Callen Dunkley	Accounting	M	176
Henri Conroy	HR	M	185
Alex Jensen	Data Science	M	180
Kim Howarth	Accounting	M	174

In the between() method both ends of the range are inclusive.
The other helpful method is isin(), which creates a Boolean mask for values contained in a list of specified values. Let’s try the method to retrieve the employees who are working in either Human Resources or Accounting departments:

display(data.loc[data.Department.isin(['HR', 'Accounting']),['Department', 'Gender', 'HireDate']])

	Department	Gender	HireDate
EmployeeName
Callen Dunkley	Accounting	M	2010
Kaycee Acosta	HR	F	2014
Henri Conroy	HR	M	2014
Emma Peralta	HR	F	2018
Kim Howarth	Accounting	M	2020

To filter a DataFrame, we can use regular expressions to search for a pattern rather than the exact content. The following example shows how we can use a regular expression to get those employees whose last names end with either ‘tt’ or ‘th’:

display(data.loc[data.index.str.contains(r'tt$|th$'),
                ['Department', 'Gender']])

	Department	Gender
EmployeeName
Martin Butt	Data Science	M
Kim Howarth	Accounting	M
Jane Burnett	Data Science	F

Conclusion

In this tutorial, we discussed different ways of slicing and dicing a DataFrame and filtering rows with specific criteria. we addressed how to take subsets of data through selecting columns, slicing rows, and filtering based on conditions, methods or regular expressions.

Pandas

Tutorials

Tutorial: Filtering Pandas DataFrames

What is a DataFrame?

Creating a DataFrame

Slicing and Dicing a DataFrame from Every Possible Angle

The `.loc` and `.iloc` Methods

Filtering a DataFrame in Pandas

Conclusion

Tutorial: Add a Column to a Pandas DataFrame Based on an If-Else Condition

Python Subprocess: The Simple Beginner’s Tutorial

Tutorial: Filtering Pandas DataFrames

What is a DataFrame?

Creating a DataFrame

Slicing and Dicing a DataFrame from Every Possible Angle

The .loc and .iloc Methods

Filtering a DataFrame in Pandas

Conclusion

More learning resources

Tutorial: Add a Column to a Pandas DataFrame Based on an If-Else Condition

Python Subprocess: The Simple Beginner’s Tutorial

The `.loc` and `.iloc` Methods