Tutorial: Filtering Pandas DataFrames
The Pandas library is a fast, powerful, and easy-to-use tool for working with data. It helps us cleanse, explore, analyze, and visualize data by providing game-changing capabilities. Having data as a Pandas DataFrame allows us to slice and dice data in various ways and filter the DataFrame's rows effortlessly. This tutorial will go over the primary methods for getting data out of a DataFrame using different filtering techniques.
What is a DataFrame?
Before discussing different filtering DataFrames techniques, let’s review a DataFrame. Technically a DataFrame builds upon the Pandas Series object that a Series is a one-dimensional array of indexed data. Intuitively a DataFrame is similar to a spreadsheet in many ways; it can have multiple columns with varying types of data and rows labeled with row indices.
Creating a DataFrame
First let’s create a dummy DataFrame containing the personal details of a company’s employees using the following snippet:
import pandas as pd
import numpy as np
from IPython.display import display
data = pd.DataFrame({'EmployeeName': ['Callen Dunkley', 'Sarah Rayner', 'Jeanette Sloan','Kaycee Acosta', 'Henri Conroy', 'Emma Peralta','Martin Butt', 'Alex Jensen', 'Kim Howarth', 'Jane Burnett'],
'Department': ['Accounting', 'Engineering', 'Engineering', 'HR', 'HR','HR','Data Science','Data Science', 'Accounting', 'Data Science'],
'HireDate': [2010, 2018, 2012, 2014, 2014, 2018, 2020, 2018, 2020, 2012],
'Gender': ['M','F','F','F','M','F','M','M','M','F'],
'DoB':['04/09/1982', '14/04/1981','06/05/1997','08/01/1986','10/10/1988','12/11/1992', '10/04/1991','16/07/1995','08/10/1992','11/10/1979'],
'Weight':[78,80,66,67,90,57,115,87,95,57],
'Height':[176,160,169,157,185,164,195,180,174,165]
})
display(data)
EmployeeName | Department | HireDate | Gender | DoB | Weight | Height | |
---|---|---|---|---|---|---|---|
0 | Callen Dunkley | Accounting | 2010 | M | 04/09/1982 | 78 | 176 |
1 | Sarah Rayner | Engineering | 2018 | F | 14/04/1981 | 80 | 160 |
2 | Jeanette Sloan | Engineering | 2012 | F | 06/05/1997 | 66 | 169 |
3 | Kaycee Acosta | HR | 2014 | F | 08/01/1986 | 67 | 157 |
4 | Henri Conroy | HR | 2014 | M | 10/10/1988 | 90 | 185 |
5 | Emma Peralta | HR | 2018 | F | 12/11/1992 | 57 | 164 |
6 | Martin Butt | Data Science | 2020 | M | 10/04/1991 | 115 | 195 |
7 | Alex Jensen | Data Science | 2018 | M | 16/07/1995 | 87 | 180 |
8 | Kim Howarth | Accounting | 2020 | M | 08/10/1992 | 95 | 174 |
9 | Jane Burnett | Data Science | 2012 | F | 11/10/1979 | 57 | 165 |
Run the code above. It creates a DataFrame with the above data.
Let’s take a look at the columns’ data types by running the following statement:
print(data.dtypes)
EmployeeName object
Department object
HireDate int64
Gender object
DoB object
Weight int64
Height int64
dtype: object
The Pandas data type for string values is object that is why the data type of columns like EmployeeName or Department containing strings is object. The DataFrame's columns each have a specific data type; in other words, the DataFrame's columns don't share the same data type.
The other useful DataFrame property is .values
, which returns an array of lists, each list representing a DataFrame’s row.
print(data.values)
[['Callen Dunkley' 'Accounting' 2010 'M' '04/09/1982' 78 176]
['Sarah Rayner' 'Engineering' 2018 'F' '14/04/1981' 80 160]
['Jeanette Sloan' 'Engineering' 2012 'F' '06/05/1997' 66 169]
['Kaycee Acosta' 'HR' 2014 'F' '08/01/1986' 67 157]
['Henri Conroy' 'HR' 2014 'M' '10/10/1988' 90 185]
['Emma Peralta' 'HR' 2018 'F' '12/11/1992' 57 164]
['Martin Butt' 'Data Science' 2020 'M' '10/04/1991' 115 195]
['Alex Jensen' 'Data Science' 2018 'M' '16/07/1995' 87 180]
['Kim Howarth' 'Accounting' 2020 'M' '08/10/1992' 95 174]
['Jane Burnett' 'Data Science' 2012 'F' '11/10/1979' 57 165]]
If you’re interested to know the number of rows and columns of the DataFrame, you can use the .shape
property as follows:
print(data.shape)
(10, 7)
The tuple above represents the number of rows and columns, respectively.
Slicing and Dicing a DataFrame from Every Possible Angle
So far, we’ve created and inspected a DataFrame. As mentioned earlier, there are different ways for filtering a DataFrame. In this section, we will discuss the most prevalent and efficient methods. However, it would be better to look at how we can select specific columns, first.
To grab the entire of specific columns, we can use the following syntax:
DataFrame.ColumnName # if the column name does not contain any spaces
OR
DataFrame["ColumnName"]
For example, let’s grab the Department column of the DataFrame.
display(data.Department)
0 Accounting
1 Engineering
2 Engineering
3 HR
4 HR
5 HR
6 Data Science
7 Data Science
8 Accounting
9 Data Science
Name: Department, dtype: object
Running the statement above returns the Series containing the Department column’s data.
Also, we can select multiple columns by passing a list of column names, as follows:
display(data[['EmployeeName','Department','Gender']])
EmployeeName | Department | Gender | |
---|---|---|---|
0 | Callen Dunkley | Accounting | M |
1 | Sarah Rayner | Engineering | F |
2 | Jeanette Sloan | Engineering | F |
3 | Kaycee Acosta | HR | F |
4 | Henri Conroy | HR | M |
5 | Emma Peralta | HR | F |
6 | Martin Butt | Data Science | M |
7 | Alex Jensen | Data Science | M |
8 | Kim Howarth | Accounting | M |
9 | Jane Burnett | Data Science | F |
Running the statement above gives us a DataFrame that is a subset of the original DataFrame.
The .loc
and .iloc
Methods
We can use slicing techniques to extract specific rows from a DataFrame. The best way to slice a DataFrame and select particular columns is to use the .loc
and .iloc
methods. The twin methods create a subset of a DataFrame using label-based or integer-based indexing, respectively.
DataFrame.loc[row_indexer,column_indexer]
DataFrame.iloc[row_indexer,column_indexer]
The first position in both methods specifies the row indexer, and the second specifies the column indexer separated with a comma.
Before we want to practice working with the .loc
and .iloc
methods, let’s set the EmployeeName column as the index of the DataFrame. Doing this is not required but helps us discuss all the capabilities of these methods. To do so, use the set_index()
method to make the EmployeeName column the index of the DataFrame:
data = data.set_index('EmployeeName')
display(data)
Department | HireDate | Gender | DoB | Weight | Height | |
---|---|---|---|---|---|---|
EmployeeName | ||||||
Callen Dunkley | Accounting | 2010 | M | 04/09/1982 | 78 | 176 |
Sarah Rayner | Engineering | 2018 | F | 14/04/1981 | 80 | 160 |
Jeanette Sloan | Engineering | 2012 | F | 06/05/1997 | 66 | 169 |
Kaycee Acosta | HR | 2014 | F | 08/01/1986 | 67 | 157 |
Henri Conroy | HR | 2014 | M | 10/10/1988 | 90 | 185 |
Emma Peralta | HR | 2018 | F | 12/11/1992 | 57 | 164 |
Martin Butt | Data Science | 2020 | M | 10/04/1991 | 115 | 195 |
Alex Jensen | Data Science | 2018 | M | 16/07/1995 | 87 | 180 |
Kim Howarth | Accounting | 2020 | M | 08/10/1992 | 95 | 174 |
Jane Burnett | Data Science | 2012 | F | 11/10/1979 | 57 | 165 |
Now, instead of having integer-based indexes, we have label-based indexes, which makes it more apt to try all the capabilities of .loc
and .iloc
methods for slicing and dicing the DataFrame.
NOTE
The .loc
method is label-based, so you have to specify rows and columns based on their labels. On the other hand, the iloc
method is integer index-based, so you have to select rows and columns by their integer index.
The following statements return the identical results practically:
display(data.loc[['Henri Conroy', 'Kim Howarth']])
display(data.iloc[[4,8]])
Department | HireDate | Gender | DoB | Weight | Height | |
---|---|---|---|---|---|---|
EmployeeName | ||||||
Henri Conroy | HR | 2014 | M | 10/10/1988 | 90 | 185 |
Kim Howarth | Accounting | 2020 | M | 08/10/1992 | 95 | 174 |
Department | HireDate | Gender | DoB | Weight | Height | |
---|---|---|---|---|---|---|
EmployeeName | ||||||
Henri Conroy | HR | 2014 | M | 10/10/1988 | 90 | 185 |
Kim Howarth | Accounting | 2020 | M | 08/10/1992 | 95 | 174 |
In the above code, the .loc
method gets a list of the employee names as row labels and returns a DataFrame containing the rows associated with the two employees.
The .iloc
method does the same thing, but instead of getting the labeled indexes, it gets the integer indexes of the rows.
Now, let’s extract a sub-DataFrame that contains the name, department, and gender values of the first three employees using the .loc
and .iloc
methods:
display(data.iloc[:3,[0,2]])
display(data.loc['Callen Dunkley':'Jeanette Sloan',['Department','Gender']])
Department | Gender | |
---|---|---|
EmployeeName | ||
Callen Dunkley | Accounting | M |
Sarah Rayner | Engineering | F |
Jeanette Sloan | Engineering | F |
Department | Gender | |
---|---|---|
EmployeeName | ||
Callen Dunkley | Accounting | M |
Sarah Rayner | Engineering | F |
Jeanette Sloan | Engineering | F |
Running the statements above returns the identical results as follows:
EmployeeName Department Gender
Callen Dunkley Accounting M
Sarah Rayner Engineering F
Jeanette Sloan Engineering F
The .loc
and .iloc
methods are pretty similar; the only difference is how referring to columns and rows in a DataFrame.
The following statements return all the rows of the DataFrame:
display(data.iloc[:,[0,2]])
display(data.loc[:,['Department','Gender']])
Department | Gender | |
---|---|---|
EmployeeName | ||
Callen Dunkley | Accounting | M |
Sarah Rayner | Engineering | F |
Jeanette Sloan | Engineering | F |
Kaycee Acosta | HR | F |
Henri Conroy | HR | M |
Emma Peralta | HR | F |
Martin Butt | Data Science | M |
Alex Jensen | Data Science | M |
Kim Howarth | Accounting | M |
Jane Burnett | Data Science | F |
Department | Gender | |
---|---|---|
EmployeeName | ||
Callen Dunkley | Accounting | M |
Sarah Rayner | Engineering | F |
Jeanette Sloan | Engineering | F |
Kaycee Acosta | HR | F |
Henri Conroy | HR | M |
Emma Peralta | HR | F |
Martin Butt | Data Science | M |
Alex Jensen | Data Science | M |
Kim Howarth | Accounting | M |
Jane Burnett | Data Science | F |
We’re able to do slicing for columns too. The statement below returns the Weight
and Height
columns of every second row of the DataFrame:
display(data.loc[::2,'Weight':'Height'])
display(data.iloc[::2,4:6])
Weight | Height | |
---|---|---|
EmployeeName | ||
Callen Dunkley | 78 | 176 |
Jeanette Sloan | 66 | 169 |
Henri Conroy | 90 | 185 |
Martin Butt | 115 | 195 |
Kim Howarth | 95 | 174 |
Weight | Height | |
---|---|---|
EmployeeName | ||
Callen Dunkley | 78 | 176 |
Jeanette Sloan | 66 | 169 |
Henri Conroy | 90 | 185 |
Martin Butt | 115 | 195 |
Kim Howarth | 95 | 174 |
We perform slicing on the DataFrame’s rows and columns in the code above. The syntax is similar to slicing a list in Python. We need to specify the start index, the stop index, and the step size.
EmployeeName Weight Height
Callen Dunkley 78 176
Jeanette Sloan 66 169
Henri Conroy 90 185
Martin Butt 115 195
Kim Howarth 95 174
Notice that in slicing with labels using the .loc
method, both the start and the stop labels are included. However, in slicing using .iloc
method the stop index is excluded.
Filtering a DataFrame in Pandas
So far, we’ve learned how to slice and dice a DataFrame, but how can we take data that meets specific criteria?
Using Boolean masks is the common method for filtering a DataFrame in Pandas. First let’s see what is a Boolean mask:
print(data.HireDate > 2015)
EmployeeName
Callen Dunkley False
Sarah Rayner True
Jeanette Sloan False
Kaycee Acosta False
Henri Conroy False
Emma Peralta True
Martin Butt True
Alex Jensen True
Kim Howarth True
Jane Burnett False
Name: HireDate, dtype: bool
By running the code above, we can see which entries in the HireDate
column are greater than 2015. The Boolean values indicate whether the condition is met or not.
Now, we can use the boolean mask to get the subset of the DataFrame containing the
employees who are hired after 2015:
display(data[data.HireDate > 2015])
Department | HireDate | Gender | DoB | Weight | Height | |
---|---|---|---|---|---|---|
EmployeeName | ||||||
Sarah Rayner | Engineering | 2018 | F | 14/04/1981 | 80 | 160 |
Emma Peralta | HR | 2018 | F | 12/11/1992 | 57 | 164 |
Martin Butt | Data Science | 2020 | M | 10/04/1991 | 115 | 195 |
Alex Jensen | Data Science | 2018 | M | 16/07/1995 | 87 | 180 |
Kim Howarth | Accounting | 2020 | M | 08/10/1992 | 95 | 174 |
We also can use the .loc
method to filter the rows along with returning certain columns, as follows:
display(data.loc[data.HireDate > 2015, ['Department', 'HireDate']])
Department | HireDate | |
---|---|---|
EmployeeName | ||
Sarah Rayner | Engineering | 2018 |
Emma Peralta | HR | 2018 |
Martin Butt | Data Science | 2020 |
Alex Jensen | Data Science | 2018 |
Kim Howarth | Accounting | 2020 |
We're able to combine two or more criteria using logical operators. Let's list those male employees working in Data Science team:
display(data.loc[(data.Gender == "M") & (data.Department=='Data Science'),
['Department', 'Gender', 'HireDate']]
)
Department | Gender | HireDate | |
---|---|---|---|
EmployeeName | |||
Martin Butt | Data Science | M | 2020 |
Alex Jensen | Data Science | M | 2018 |
What if we want to get all employees with heights between 170 cm and 190 cm?
display(data.loc[data.Height.between(170,190),
['Department', 'Gender', 'Height']]
)
Department | Gender | Height | |
---|---|---|---|
EmployeeName | |||
Callen Dunkley | Accounting | M | 176 |
Henri Conroy | HR | M | 185 |
Alex Jensen | Data Science | M | 180 |
Kim Howarth | Accounting | M | 174 |
In the between()
method both ends of the range are inclusive.
The other helpful method is isin()
, which creates a Boolean mask for values contained in a list of specified values. Let’s try the method to retrieve the employees who are working in either Human Resources or Accounting departments:
display(data.loc[data.Department.isin(['HR', 'Accounting']),['Department', 'Gender', 'HireDate']])
Department | Gender | HireDate | |
---|---|---|---|
EmployeeName | |||
Callen Dunkley | Accounting | M | 2010 |
Kaycee Acosta | HR | F | 2014 |
Henri Conroy | HR | M | 2014 |
Emma Peralta | HR | F | 2018 |
Kim Howarth | Accounting | M | 2020 |
To filter a DataFrame, we can use regular expressions to search for a pattern rather than the exact content. The following example shows how we can use a regular expression to get those employees whose last names end with either ‘tt’ or ‘th’:
display(data.loc[data.index.str.contains(r'tt$|th$'),
['Department', 'Gender']])
Department | Gender | |
---|---|---|
EmployeeName | ||
Martin Butt | Data Science | M |
Kim Howarth | Accounting | M |
Jane Burnett | Data Science | F |
Conclusion
In this tutorial, we discussed different ways of slicing and dicing a DataFrame and filtering rows with specific criteria. we addressed how to take subsets of data through selecting columns, slicing rows, and filtering based on conditions, methods or regular expressions.