Badge photo

ALL PREMIUM PLANS ON SALE – SAVE UP TO 60%

Sale Ends in
00
days
00
hours
00
mins
00
secs
February 24, 2022

Tutorial: Filtering Pandas DataFrames

The Pandas library is a fast, powerful, and easy-to-use tool for working with data. It helps us cleanse, explore, analyze, and visualize data by providing game-changing capabilities. Having data as a Pandas DataFrame allows us to slice and dice data in various ways and filter the DataFrame's rows effortlessly. This tutorial will go over the primary methods for getting data out of a DataFrame using different filtering techniques.

What is a DataFrame?

Before discussing different filtering DataFrames techniques, let’s review a DataFrame. Technically a DataFrame builds upon the Pandas Series object that a Series is a one-dimensional array of indexed data. Intuitively a DataFrame is similar to a spreadsheet in many ways; it can have multiple columns with varying types of data and rows labeled with row indices.

Creating a DataFrame

First let’s create a dummy DataFrame containing the personal details of a company’s employees using the following snippet:

import pandas as pd
import numpy as np
from IPython.display import display

data = pd.DataFrame({'EmployeeName': ['Callen Dunkley', 'Sarah Rayner', 'Jeanette Sloan','Kaycee Acosta', 'Henri Conroy', 'Emma Peralta','Martin Butt', 'Alex Jensen', 'Kim Howarth', 'Jane Burnett'],
                    'Department': ['Accounting', 'Engineering', 'Engineering', 'HR', 'HR','HR','Data Science','Data Science', 'Accounting', 'Data Science'],
                    'HireDate': [2010, 2018, 2012, 2014, 2014, 2018, 2020, 2018, 2020, 2012],
                    'Gender': ['M','F','F','F','M','F','M','M','M','F'],
                    'DoB':['04/09/1982', '14/04/1981','06/05/1997','08/01/1986','10/10/1988','12/11/1992', '10/04/1991','16/07/1995','08/10/1992','11/10/1979'],
                    'Weight':[78,80,66,67,90,57,115,87,95,57],
                    'Height':[176,160,169,157,185,164,195,180,174,165]
                  })
display(data)
EmployeeName Department HireDate Gender DoB Weight Height
0 Callen Dunkley Accounting 2010 M 04/09/1982 78 176
1 Sarah Rayner Engineering 2018 F 14/04/1981 80 160
2 Jeanette Sloan Engineering 2012 F 06/05/1997 66 169
3 Kaycee Acosta HR 2014 F 08/01/1986 67 157
4 Henri Conroy HR 2014 M 10/10/1988 90 185
5 Emma Peralta HR 2018 F 12/11/1992 57 164
6 Martin Butt Data Science 2020 M 10/04/1991 115 195
7 Alex Jensen Data Science 2018 M 16/07/1995 87 180
8 Kim Howarth Accounting 2020 M 08/10/1992 95 174
9 Jane Burnett Data Science 2012 F 11/10/1979 57 165

Run the code above. It creates a DataFrame with the above data.
Let’s take a look at the columns’ data types by running the following statement:

print(data.dtypes)
EmployeeName    object
Department      object
HireDate         int64
Gender          object
DoB             object
Weight           int64
Height           int64
dtype: object

The Pandas data type for string values is object that is why the data type of columns like EmployeeName or Department containing strings is object. The DataFrame's columns each have a specific data type; in other words, the DataFrame's columns don't share the same data type.
The other useful DataFrame property is .values, which returns an array of lists, each list representing a DataFrame’s row.

print(data.values)
[['Callen Dunkley' 'Accounting' 2010 'M' '04/09/1982' 78 176]
 ['Sarah Rayner' 'Engineering' 2018 'F' '14/04/1981' 80 160]
 ['Jeanette Sloan' 'Engineering' 2012 'F' '06/05/1997' 66 169]
 ['Kaycee Acosta' 'HR' 2014 'F' '08/01/1986' 67 157]
 ['Henri Conroy' 'HR' 2014 'M' '10/10/1988' 90 185]
 ['Emma Peralta' 'HR' 2018 'F' '12/11/1992' 57 164]
 ['Martin Butt' 'Data Science' 2020 'M' '10/04/1991' 115 195]
 ['Alex Jensen' 'Data Science' 2018 'M' '16/07/1995' 87 180]
 ['Kim Howarth' 'Accounting' 2020 'M' '08/10/1992' 95 174]
 ['Jane Burnett' 'Data Science' 2012 'F' '11/10/1979' 57 165]]

If you’re interested to know the number of rows and columns of the DataFrame, you can use the .shape property as follows:

print(data.shape)
(10, 7)

The tuple above represents the number of rows and columns, respectively.

Slicing and Dicing a DataFrame from Every Possible Angle

So far, we’ve created and inspected a DataFrame. As mentioned earlier, there are different ways for filtering a DataFrame. In this section, we will discuss the most prevalent and efficient methods. However, it would be better to look at how we can select specific columns, first.
To grab the entire of specific columns, we can use the following syntax:

DataFrame.ColumnName  # if the column name does not contain any spaces
OR 
DataFrame["ColumnName"]

For example, let’s grab the Department column of the DataFrame.

display(data.Department)
0      Accounting
1     Engineering
2     Engineering
3              HR
4              HR
5              HR
6    Data Science
7    Data Science
8      Accounting
9    Data Science
Name: Department, dtype: object

Running the statement above returns the Series containing the Department column’s data.

Also, we can select multiple columns by passing a list of column names, as follows:

display(data[['EmployeeName','Department','Gender']])
EmployeeName Department Gender
0 Callen Dunkley Accounting M
1 Sarah Rayner Engineering F
2 Jeanette Sloan Engineering F
3 Kaycee Acosta HR F
4 Henri Conroy HR M
5 Emma Peralta HR F
6 Martin Butt Data Science M
7 Alex Jensen Data Science M
8 Kim Howarth Accounting M
9 Jane Burnett Data Science F

Running the statement above gives us a DataFrame that is a subset of the original DataFrame.

The .loc and .iloc Methods

We can use slicing techniques to extract specific rows from a DataFrame. The best way to slice a DataFrame and select particular columns is to use the .loc and .iloc methods. The twin methods create a subset of a DataFrame using label-based or integer-based indexing, respectively.

DataFrame.loc[row_indexer,column_indexer]
DataFrame.iloc[row_indexer,column_indexer]

The first position in both methods specifies the row indexer, and the second specifies the column indexer separated with a comma.
Before we want to practice working with the .loc and .iloc methods, let’s set the EmployeeName column as the index of the DataFrame. Doing this is not required but helps us discuss all the capabilities of these methods. To do so, use the set_index() method to make the EmployeeName column the index of the DataFrame:

data = data.set_index('EmployeeName')
display(data)
Department HireDate Gender DoB Weight Height
EmployeeName
Callen Dunkley Accounting 2010 M 04/09/1982 78 176
Sarah Rayner Engineering 2018 F 14/04/1981 80 160
Jeanette Sloan Engineering 2012 F 06/05/1997 66 169
Kaycee Acosta HR 2014 F 08/01/1986 67 157
Henri Conroy HR 2014 M 10/10/1988 90 185
Emma Peralta HR 2018 F 12/11/1992 57 164
Martin Butt Data Science 2020 M 10/04/1991 115 195
Alex Jensen Data Science 2018 M 16/07/1995 87 180
Kim Howarth Accounting 2020 M 08/10/1992 95 174
Jane Burnett Data Science 2012 F 11/10/1979 57 165

Now, instead of having integer-based indexes, we have label-based indexes, which makes it more apt to try all the capabilities of .loc and .iloc methods for slicing and dicing the DataFrame.


NOTE

The .loc method is label-based, so you have to specify rows and columns based on their labels. On the other hand, the iloc method is integer index-based, so you have to select rows and columns by their integer index.


The following statements return the identical results practically:

display(data.loc[['Henri Conroy', 'Kim Howarth']])
display(data.iloc[[4,8]])
Department HireDate Gender DoB Weight Height
EmployeeName
Henri Conroy HR 2014 M 10/10/1988 90 185
Kim Howarth Accounting 2020 M 08/10/1992 95 174
Department HireDate Gender DoB Weight Height
EmployeeName
Henri Conroy HR 2014 M 10/10/1988 90 185
Kim Howarth Accounting 2020 M 08/10/1992 95 174

In the above code, the .loc method gets a list of the employee names as row labels and returns a DataFrame containing the rows associated with the two employees.
The .iloc method does the same thing, but instead of getting the labeled indexes, it gets the integer indexes of the rows.
Now, let’s extract a sub-DataFrame that contains the name, department, and gender values of the first three employees using the .loc and .iloc methods:

display(data.iloc[:3,[0,2]])
display(data.loc['Callen Dunkley':'Jeanette Sloan',['Department','Gender']])
Department Gender
EmployeeName
Callen Dunkley Accounting M
Sarah Rayner Engineering F
Jeanette Sloan Engineering F
Department Gender
EmployeeName
Callen Dunkley Accounting M
Sarah Rayner Engineering F
Jeanette Sloan Engineering F

Running the statements above returns the identical ​results as follows:

EmployeeName     Department Gender                      
Callen Dunkley   Accounting      M
Sarah Rayner    Engineering      F
Jeanette Sloan  Engineering      F

The .loc and .iloc methods are pretty similar; the only difference is how referring to columns and rows in a DataFrame.
The following statements return all the rows of the DataFrame:

display(data.iloc[:,[0,2]])
display(data.loc[:,['Department','Gender']])
Department Gender
EmployeeName
Callen Dunkley Accounting M
Sarah Rayner Engineering F
Jeanette Sloan Engineering F
Kaycee Acosta HR F
Henri Conroy HR M
Emma Peralta HR F
Martin Butt Data Science M
Alex Jensen Data Science M
Kim Howarth Accounting M
Jane Burnett Data Science F
Department Gender
EmployeeName
Callen Dunkley Accounting M
Sarah Rayner Engineering F
Jeanette Sloan Engineering F
Kaycee Acosta HR F
Henri Conroy HR M
Emma Peralta HR F
Martin Butt Data Science M
Alex Jensen Data Science M
Kim Howarth Accounting M
Jane Burnett Data Science F

We’re able to do slicing for columns too. The statement below returns the Weight and Height columns of every second row of the DataFrame:

display(data.loc[::2,'Weight':'Height'])
display(data.iloc[::2,4:6])
Weight Height
EmployeeName
Callen Dunkley 78 176
Jeanette Sloan 66 169
Henri Conroy 90 185
Martin Butt 115 195
Kim Howarth 95 174
Weight Height
EmployeeName
Callen Dunkley 78 176
Jeanette Sloan 66 169
Henri Conroy 90 185
Martin Butt 115 195
Kim Howarth 95 174

We perform slicing on the DataFrame’s rows and columns in the code above. The syntax is similar to slicing a list in Python. We need to specify the start index, the stop index, and the step size.

EmployeeName    Weight  Height                  
Callen Dunkley      78     176
Jeanette Sloan      66     169
Henri Conroy        90     185
Martin Butt        115     195
Kim Howarth         95     174

Notice that in slicing with labels using the .loc method, both the start and the stop labels are included. However, in slicing using .iloc method the stop index is excluded.

Filtering a DataFrame in Pandas

So far, we’ve learned how to slice and dice a DataFrame, but how can we take data that meets specific criteria?
Using Boolean masks is the common method for filtering a DataFrame in Pandas. First let’s see what is a Boolean mask:

print(data.HireDate > 2015)
EmployeeName
Callen Dunkley    False
Sarah Rayner       True
Jeanette Sloan    False
Kaycee Acosta     False
Henri Conroy      False
Emma Peralta       True
Martin Butt        True
Alex Jensen        True
Kim Howarth        True
Jane Burnett      False
Name: HireDate, dtype: bool

By running the code above, we can see which entries in the HireDate column are greater than 2015. The Boolean values indicate whether the condition is met or not.
Now, we can use the boolean mask to get the subset of the DataFrame containing the
employees who are hired after 2015:

display(data[data.HireDate > 2015])
Department HireDate Gender DoB Weight Height
EmployeeName
Sarah Rayner Engineering 2018 F 14/04/1981 80 160
Emma Peralta HR 2018 F 12/11/1992 57 164
Martin Butt Data Science 2020 M 10/04/1991 115 195
Alex Jensen Data Science 2018 M 16/07/1995 87 180
Kim Howarth Accounting 2020 M 08/10/1992 95 174

We also can use the .loc method to filter the rows along with returning certain columns, as follows:

display(data.loc[data.HireDate > 2015, ['Department', 'HireDate']])
Department HireDate
EmployeeName
Sarah Rayner Engineering 2018
Emma Peralta HR 2018
Martin Butt Data Science 2020
Alex Jensen Data Science 2018
Kim Howarth Accounting 2020

We're able to combine two or more criteria using logical operators. Let's list those male employees working in Data Science team:

display(data.loc[(data.Gender == "M") & (data.Department=='Data Science'), 
                ['Department', 'Gender', 'HireDate']]
        )
Department Gender HireDate
EmployeeName
Martin Butt Data Science M 2020
Alex Jensen Data Science M 2018

What if we want to get all employees with heights between 170 cm and 190 cm?

display(data.loc[data.Height.between(170,190), 
                ['Department', 'Gender', 'Height']]
        )
Department Gender Height
EmployeeName
Callen Dunkley Accounting M 176
Henri Conroy HR M 185
Alex Jensen Data Science M 180
Kim Howarth Accounting M 174

In the between() method both ends of the range are inclusive.
The other helpful method is isin(), which creates a Boolean mask for values contained in a list of specified values. Let’s try the method to retrieve the employees who are working in either Human Resources or Accounting departments:

display(data.loc[data.Department.isin(['HR', 'Accounting']),['Department', 'Gender', 'HireDate']])
Department Gender HireDate
EmployeeName
Callen Dunkley Accounting M 2010
Kaycee Acosta HR F 2014
Henri Conroy HR M 2014
Emma Peralta HR F 2018
Kim Howarth Accounting M 2020

To filter a DataFrame, we can use regular expressions to search for a pattern rather than the exact content. The following example shows how we can use a regular expression to get those employees whose last names end with either ‘tt’ or ‘th’:

display(data.loc[data.index.str.contains(r'tt$|th$'),
                ['Department', 'Gender']])
Department Gender
EmployeeName
Martin Butt Data Science M
Kim Howarth Accounting M
Jane Burnett Data Science F

Conclusion

In this tutorial, we discussed different ways of slicing and dicing a DataFrame and filtering rows with specific criteria. we addressed how to take subsets of data through selecting columns, slicing rows, and filtering based on conditions, methods or regular expressions.

Mehdi Lotfinejad

About the author

Mehdi Lotfinejad

Mehdi is a Senior Data Engineer and Team Lead at ADA. He is a professional trainer who loves writing data analytics tutorials.