Tutorial: Filtering Pandas DataFrames
The Pandas library is a fast, powerful, and easy-to-use tool for working with data. It helps us cleanse, explore, analyze, and visualize data by providing game-changing capabilities. Having data as a Pandas DataFrame allows us to slice and dice data in various ways and filter the DataFrame's rows effortlessly. This tutorial will go over the primary methods for getting data out of a DataFrame using different filtering techniques.
What is a DataFrame?
Before discussing different filtering DataFrames techniques, let’s review a DataFrame. Technically a DataFrame builds upon the Pandas Series object that a Series is a one-dimensional array of indexed data. Intuitively a DataFrame is similar to a spreadsheet in many ways; it can have multiple columns with varying types of data and rows labeled with row indices.
Creating a DataFrame
First let’s create a dummy DataFrame containing the personal details of a company’s employees using the following snippet:
import pandas as pd
import numpy as np
from IPython.display import display
data = pd.DataFrame({'EmployeeName': ['Callen Dunkley', 'Sarah Rayner', 'Jeanette Sloan','Kaycee Acosta', 'Henri Conroy', 'Emma Peralta','Martin Butt', 'Alex Jensen', 'Kim Howarth', 'Jane Burnett'],
                    'Department': ['Accounting', 'Engineering', 'Engineering', 'HR', 'HR','HR','Data Science','Data Science', 'Accounting', 'Data Science'],
                    'HireDate': [2010, 2018, 2012, 2014, 2014, 2018, 2020, 2018, 2020, 2012],
                    'Gender': ['M','F','F','F','M','F','M','M','M','F'],
                    'DoB':['04/09/1982', '14/04/1981','06/05/1997','08/01/1986','10/10/1988','12/11/1992', '10/04/1991','16/07/1995','08/10/1992','11/10/1979'],
                    'Weight':[78,80,66,67,90,57,115,87,95,57],
                    'Height':[176,160,169,157,185,164,195,180,174,165]
                  })
display(data)| EmployeeName | Department | HireDate | Gender | DoB | Weight | Height | |
|---|---|---|---|---|---|---|---|
| 0 | Callen Dunkley | Accounting | 2010 | M | 04/09/1982 | 78 | 176 | 
| 1 | Sarah Rayner | Engineering | 2018 | F | 14/04/1981 | 80 | 160 | 
| 2 | Jeanette Sloan | Engineering | 2012 | F | 06/05/1997 | 66 | 169 | 
| 3 | Kaycee Acosta | HR | 2014 | F | 08/01/1986 | 67 | 157 | 
| 4 | Henri Conroy | HR | 2014 | M | 10/10/1988 | 90 | 185 | 
| 5 | Emma Peralta | HR | 2018 | F | 12/11/1992 | 57 | 164 | 
| 6 | Martin Butt | Data Science | 2020 | M | 10/04/1991 | 115 | 195 | 
| 7 | Alex Jensen | Data Science | 2018 | M | 16/07/1995 | 87 | 180 | 
| 8 | Kim Howarth | Accounting | 2020 | M | 08/10/1992 | 95 | 174 | 
| 9 | Jane Burnett | Data Science | 2012 | F | 11/10/1979 | 57 | 165 | 
Run the code above. It creates a DataFrame with the above data.
Let’s take a look at the columns’ data types by running the following statement:
print(data.dtypes)EmployeeName    object
Department      object
HireDate         int64
Gender          object
DoB             object
Weight           int64
Height           int64
dtype: objectThe Pandas data type for string values is object that is why the data type of columns like EmployeeName or Department containing strings is object. The DataFrame's columns each have a specific data type; in other words, the DataFrame's columns don't share the same data type.
The other useful DataFrame property is .values, which returns an array of lists, each list representing a DataFrame’s row.
print(data.values)[['Callen Dunkley' 'Accounting' 2010 'M' '04/09/1982' 78 176]
 ['Sarah Rayner' 'Engineering' 2018 'F' '14/04/1981' 80 160]
 ['Jeanette Sloan' 'Engineering' 2012 'F' '06/05/1997' 66 169]
 ['Kaycee Acosta' 'HR' 2014 'F' '08/01/1986' 67 157]
 ['Henri Conroy' 'HR' 2014 'M' '10/10/1988' 90 185]
 ['Emma Peralta' 'HR' 2018 'F' '12/11/1992' 57 164]
 ['Martin Butt' 'Data Science' 2020 'M' '10/04/1991' 115 195]
 ['Alex Jensen' 'Data Science' 2018 'M' '16/07/1995' 87 180]
 ['Kim Howarth' 'Accounting' 2020 'M' '08/10/1992' 95 174]
 ['Jane Burnett' 'Data Science' 2012 'F' '11/10/1979' 57 165]]If you’re interested to know the number of rows and columns of the DataFrame, you can use the .shape property as follows:
print(data.shape)(10, 7)The tuple above represents the number of rows and columns, respectively.
Slicing and Dicing a DataFrame from Every Possible Angle
So far, we’ve created and inspected a DataFrame. As mentioned earlier, there are different ways for filtering a DataFrame. In this section, we will discuss the most prevalent and efficient methods. However, it would be better to look at how we can select specific columns, first.
To grab the entire of specific columns, we can use the following syntax:
DataFrame.ColumnName  # if the column name does not contain any spaces
OR 
DataFrame["ColumnName"]For example, let’s grab the Department column of the DataFrame.
display(data.Department)0      Accounting
1     Engineering
2     Engineering
3              HR
4              HR
5              HR
6    Data Science
7    Data Science
8      Accounting
9    Data Science
Name: Department, dtype: objectRunning the statement above returns the Series containing the Department column’s data.
Also, we can select multiple columns by passing a list of column names, as follows:
display(data[['EmployeeName','Department','Gender']])| EmployeeName | Department | Gender | |
|---|---|---|---|
| 0 | Callen Dunkley | Accounting | M | 
| 1 | Sarah Rayner | Engineering | F | 
| 2 | Jeanette Sloan | Engineering | F | 
| 3 | Kaycee Acosta | HR | F | 
| 4 | Henri Conroy | HR | M | 
| 5 | Emma Peralta | HR | F | 
| 6 | Martin Butt | Data Science | M | 
| 7 | Alex Jensen | Data Science | M | 
| 8 | Kim Howarth | Accounting | M | 
| 9 | Jane Burnett | Data Science | F | 
Running the statement above gives us a DataFrame that is a subset of the original DataFrame.
The .loc and .iloc Methods
We can use slicing techniques to extract specific rows from a DataFrame. The best way to slice a DataFrame and select particular columns is to use the .loc and .iloc methods. The twin methods create a subset of a DataFrame using label-based or integer-based indexing, respectively.
DataFrame.loc[row_indexer,column_indexer]
DataFrame.iloc[row_indexer,column_indexer]The first position in both methods specifies the row indexer, and the second specifies the column indexer separated with a comma.
Before we want to practice working with the .loc and .iloc methods, let’s set the EmployeeName column as the index of the DataFrame. Doing this is not required but helps us discuss all the capabilities of these methods. To do so, use the set_index() method to make the EmployeeName column the index of the DataFrame:
data = data.set_index('EmployeeName')
display(data)| Department | HireDate | Gender | DoB | Weight | Height | |
|---|---|---|---|---|---|---|
| EmployeeName | ||||||
| Callen Dunkley | Accounting | 2010 | M | 04/09/1982 | 78 | 176 | 
| Sarah Rayner | Engineering | 2018 | F | 14/04/1981 | 80 | 160 | 
| Jeanette Sloan | Engineering | 2012 | F | 06/05/1997 | 66 | 169 | 
| Kaycee Acosta | HR | 2014 | F | 08/01/1986 | 67 | 157 | 
| Henri Conroy | HR | 2014 | M | 10/10/1988 | 90 | 185 | 
| Emma Peralta | HR | 2018 | F | 12/11/1992 | 57 | 164 | 
| Martin Butt | Data Science | 2020 | M | 10/04/1991 | 115 | 195 | 
| Alex Jensen | Data Science | 2018 | M | 16/07/1995 | 87 | 180 | 
| Kim Howarth | Accounting | 2020 | M | 08/10/1992 | 95 | 174 | 
| Jane Burnett | Data Science | 2012 | F | 11/10/1979 | 57 | 165 | 
Now, instead of having integer-based indexes, we have label-based indexes, which makes it more apt to try all the capabilities of .loc and .iloc methods for slicing and dicing the DataFrame.
NOTE
The .loc method is label-based, so you have to specify rows and columns based on their labels. On the other hand, the iloc method is integer index-based, so you have to select rows and columns by their integer index.
The following statements return the identical results practically:
display(data.loc[['Henri Conroy', 'Kim Howarth']])
display(data.iloc[[4,8]])| Department | HireDate | Gender | DoB | Weight | Height | |
|---|---|---|---|---|---|---|
| EmployeeName | ||||||
| Henri Conroy | HR | 2014 | M | 10/10/1988 | 90 | 185 | 
| Kim Howarth | Accounting | 2020 | M | 08/10/1992 | 95 | 174 | 
| Department | HireDate | Gender | DoB | Weight | Height | |
|---|---|---|---|---|---|---|
| EmployeeName | ||||||
| Henri Conroy | HR | 2014 | M | 10/10/1988 | 90 | 185 | 
| Kim Howarth | Accounting | 2020 | M | 08/10/1992 | 95 | 174 | 
In the above code, the  .loc method gets a list of the employee names as row labels and returns a DataFrame containing the rows associated with the two employees.
The .iloc method does the same thing, but instead of getting the labeled indexes, it gets the integer indexes of the rows.
Now, let’s extract a sub-DataFrame that contains the name, department, and gender values of the first three employees using the .loc and .iloc methods:
display(data.iloc[:3,[0,2]])
display(data.loc['Callen Dunkley':'Jeanette Sloan',['Department','Gender']])| Department | Gender | |
|---|---|---|
| EmployeeName | ||
| Callen Dunkley | Accounting | M | 
| Sarah Rayner | Engineering | F | 
| Jeanette Sloan | Engineering | F | 
| Department | Gender | |
|---|---|---|
| EmployeeName | ||
| Callen Dunkley | Accounting | M | 
| Sarah Rayner | Engineering | F | 
| Jeanette Sloan | Engineering | F | 
Running the statements above returns the identical results as follows:
EmployeeName     Department Gender                      
Callen Dunkley   Accounting      M
Sarah Rayner    Engineering      F
Jeanette Sloan  Engineering      FThe .loc and .iloc methods are pretty similar; the only difference is how referring to columns and rows in a DataFrame.
The following statements return all the rows of the DataFrame:
display(data.iloc[:,[0,2]])
display(data.loc[:,['Department','Gender']])| Department | Gender | |
|---|---|---|
| EmployeeName | ||
| Callen Dunkley | Accounting | M | 
| Sarah Rayner | Engineering | F | 
| Jeanette Sloan | Engineering | F | 
| Kaycee Acosta | HR | F | 
| Henri Conroy | HR | M | 
| Emma Peralta | HR | F | 
| Martin Butt | Data Science | M | 
| Alex Jensen | Data Science | M | 
| Kim Howarth | Accounting | M | 
| Jane Burnett | Data Science | F | 
| Department | Gender | |
|---|---|---|
| EmployeeName | ||
| Callen Dunkley | Accounting | M | 
| Sarah Rayner | Engineering | F | 
| Jeanette Sloan | Engineering | F | 
| Kaycee Acosta | HR | F | 
| Henri Conroy | HR | M | 
| Emma Peralta | HR | F | 
| Martin Butt | Data Science | M | 
| Alex Jensen | Data Science | M | 
| Kim Howarth | Accounting | M | 
| Jane Burnett | Data Science | F | 
We’re able to do slicing for columns too. The statement below returns the Weight and Height columns of every second row of the DataFrame:
display(data.loc[::2,'Weight':'Height'])
display(data.iloc[::2,4:6])| Weight | Height | |
|---|---|---|
| EmployeeName | ||
| Callen Dunkley | 78 | 176 | 
| Jeanette Sloan | 66 | 169 | 
| Henri Conroy | 90 | 185 | 
| Martin Butt | 115 | 195 | 
| Kim Howarth | 95 | 174 | 
| Weight | Height | |
|---|---|---|
| EmployeeName | ||
| Callen Dunkley | 78 | 176 | 
| Jeanette Sloan | 66 | 169 | 
| Henri Conroy | 90 | 185 | 
| Martin Butt | 115 | 195 | 
| Kim Howarth | 95 | 174 | 
We perform slicing on the DataFrame’s rows and columns in the code above. The syntax is similar to slicing a list in Python. We need to specify the start index, the stop index, and the step size.
EmployeeName    Weight  Height                  
Callen Dunkley      78     176
Jeanette Sloan      66     169
Henri Conroy        90     185
Martin Butt        115     195
Kim Howarth         95     174Notice that in slicing with labels using the .loc method, both the start and the stop labels are included. However, in slicing using .iloc method the stop index is excluded.
Filtering a DataFrame in Pandas
So far, we’ve learned how to slice and dice a DataFrame, but how can we take data that meets specific criteria?
Using Boolean masks is the common method for filtering a DataFrame in Pandas. First let’s see what is a Boolean mask:
print(data.HireDate > 2015)EmployeeName
Callen Dunkley    False
Sarah Rayner       True
Jeanette Sloan    False
Kaycee Acosta     False
Henri Conroy      False
Emma Peralta       True
Martin Butt        True
Alex Jensen        True
Kim Howarth        True
Jane Burnett      False
Name: HireDate, dtype: boolBy running the code above, we can see which entries in the HireDate column are greater than 2015. The Boolean values indicate whether the condition is met or not.
Now, we can use the boolean mask to get the subset of the DataFrame containing the
employees who are hired after 2015:
display(data[data.HireDate > 2015])| Department | HireDate | Gender | DoB | Weight | Height | |
|---|---|---|---|---|---|---|
| EmployeeName | ||||||
| Sarah Rayner | Engineering | 2018 | F | 14/04/1981 | 80 | 160 | 
| Emma Peralta | HR | 2018 | F | 12/11/1992 | 57 | 164 | 
| Martin Butt | Data Science | 2020 | M | 10/04/1991 | 115 | 195 | 
| Alex Jensen | Data Science | 2018 | M | 16/07/1995 | 87 | 180 | 
| Kim Howarth | Accounting | 2020 | M | 08/10/1992 | 95 | 174 | 
We also can use the .loc method to filter the rows along with returning certain columns, as follows:
display(data.loc[data.HireDate > 2015, ['Department', 'HireDate']])| Department | HireDate | |
|---|---|---|
| EmployeeName | ||
| Sarah Rayner | Engineering | 2018 | 
| Emma Peralta | HR | 2018 | 
| Martin Butt | Data Science | 2020 | 
| Alex Jensen | Data Science | 2018 | 
| Kim Howarth | Accounting | 2020 | 
We're able to combine two or more criteria using logical operators. Let's list those male employees working in Data Science team:
display(data.loc[(data.Gender == "M") & (data.Department=='Data Science'), 
                ['Department', 'Gender', 'HireDate']]
        )| Department | Gender | HireDate | |
|---|---|---|---|
| EmployeeName | |||
| Martin Butt | Data Science | M | 2020 | 
| Alex Jensen | Data Science | M | 2018 | 
What if we want to get all employees with heights between 170 cm and 190 cm?
display(data.loc[data.Height.between(170,190), 
                ['Department', 'Gender', 'Height']]
        )| Department | Gender | Height | |
|---|---|---|---|
| EmployeeName | |||
| Callen Dunkley | Accounting | M | 176 | 
| Henri Conroy | HR | M | 185 | 
| Alex Jensen | Data Science | M | 180 | 
| Kim Howarth | Accounting | M | 174 | 
In the between() method both ends of the range are inclusive.
The other helpful method is isin(), which creates a Boolean mask for values contained in a list of specified values. Let’s try the method to retrieve the employees who are working in either Human Resources or Accounting departments:
display(data.loc[data.Department.isin(['HR', 'Accounting']),['Department', 'Gender', 'HireDate']])| Department | Gender | HireDate | |
|---|---|---|---|
| EmployeeName | |||
| Callen Dunkley | Accounting | M | 2010 | 
| Kaycee Acosta | HR | F | 2014 | 
| Henri Conroy | HR | M | 2014 | 
| Emma Peralta | HR | F | 2018 | 
| Kim Howarth | Accounting | M | 2020 | 
To filter a DataFrame, we can use regular expressions to search for a pattern rather than the exact content. The following example shows how we can use a regular expression to get those employees whose last names end with either ‘tt’ or ‘th’:
display(data.loc[data.index.str.contains(r'tt$|th$'),
                ['Department', 'Gender']])| Department | Gender | |
|---|---|---|
| EmployeeName | ||
| Martin Butt | Data Science | M | 
| Kim Howarth | Accounting | M | 
| Jane Burnett | Data Science | F | 
Conclusion
In this tutorial, we discussed different ways of slicing and dicing a DataFrame and filtering rows with specific criteria. we addressed how to take subsets of data through selecting columns, slicing rows, and filtering based on conditions, methods or regular expressions.
 
        