Tutorial: Indexing DataFrames in Pandas
In this tutorial, we are going to discuss what indexing pandas dataframes means, why we need it, what kinds of dataframe indexing exist, and what syntax should be used for selecting different subsets.
What is Indexing Dataframes in Pandas?
Indexing a pandas dataframe means selecting particular subsets of data (such as rows, columns, or individual cells) from that dataframe. Pandas dataframes have an intrinsic tabular structure represented by rows and columns, where each row and column has a unique label (name) and position number (like a coordinate) inside the dataframe, and each data point is characterized by its location at the intersection of a specific row and column.
The row labels, collectively called the dataframe index, can be integers or strings; the column labels, called column names, are usually strings. As long as the dataframe index and the column names contain only unique values, we can use these labels to refer to particular rows, columns, or data points of the dataframe. Alternatively, we can describe each row, column, or data point by its position in the structure of the dataframe. The position numbers are integers starting from 0 for the first row or column and increasing by 1 for each subsequent row/column, so they can also be used as unique coordinates of particular dataframe elements (rows, columns, or data points). This ability to refer to dataframe elements by label or by position number is exactly what makes dataframe indexing possible.
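As a quick illustration of labels versus positions, here is a minimal sketch (the tiny dataframe and its labels are invented purely for this example; the loc and iloc indexers used here are explained in detail later in the tutorial):
import pandas as pd
# A tiny dataframe: row labels 'a' and 'b', column labels 'x' and 'y'
mini = pd.DataFrame({'x': [10, 20], 'y': [30, 40]}, index=['a', 'b'])
# The same data point can be reached either by its labels or by its position numbers
print(mini.loc['b', 'y'])   # label-based access -> 40
print(mini.iloc[1, 1])      # position-based access (row 1, column 1) -> 40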
A specific (and arguably the most common) case of pandas dataframe indexing is slicing. It is used to access a sequence of dataframe elements rather than individual dataframe elements.
Pandas dataframe indexing can be performed for various tasks: pulling a subset of data based on predefined criteria, reorganizing data, getting a sample of data, modifying the values of individual data points, and other kinds of data manipulation.
To select a subset from a dataframe, we use the indexing operator [], the attribute operator ., and the appropriate pandas indexing methods such as loc, iloc, at, iat, and some others.
Essentially, there are two main ways of indexing pandas dataframes: label-based and position-based (aka location-based or integer-based). Also, it is possible to apply boolean dataframe indexing based on predefined conditions, or even mix different types of dataframe indexing. Let's consider all these approaches in detail.
For our further experiments, we will create a fake dataframe:
import pandas as pd
df = pd.DataFrame({'col_1': list(range(1, 11)), 'col_2': list(range(11, 21)), 'col_3': list(range(21, 31)),
                   'col_4': list(range(31, 41)), 'col_5': list(range(41, 51)), 'col_6': list(range(51, 61)),
                   'col_7': list(range(61, 71)), 'col_8': list(range(71, 81)), 'col_9': list(range(81, 91))})
df
| | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 |
|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 11 | 21 | 31 | 41 | 51 | 61 | 71 | 81 |
1 | 2 | 12 | 22 | 32 | 42 | 52 | 62 | 72 | 82 |
2 | 3 | 13 | 23 | 33 | 43 | 53 | 63 | 73 | 83 |
3 | 4 | 14 | 24 | 34 | 44 | 54 | 64 | 74 | 84 |
4 | 5 | 15 | 25 | 35 | 45 | 55 | 65 | 75 | 85 |
5 | 6 | 16 | 26 | 36 | 46 | 56 | 66 | 76 | 86 |
6 | 7 | 17 | 27 | 37 | 47 | 57 | 67 | 77 | 87 |
7 | 8 | 18 | 28 | 38 | 48 | 58 | 68 | 78 | 88 |
8 | 9 | 19 | 29 | 39 | 49 | 59 | 69 | 79 | 89 |
9 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 |
Label-based Dataframe Indexing
As its name suggests, this approach implies selecting dataframe subsets based on the row and column labels. Let's explore four methods of label-based dataframe indexing: using the indexing operator [], the attribute operator ., the loc indexer, and the at indexer.
Using the Indexing Operator
If we need to select all data from one or multiple columns of a pandas dataframe, we can simply use the indexing operator []. To select all data from a single column, we pass the name of this column:
df['col_2']
0 11
1 12
2 13
3 14
4 15
5 16
6 17
7 18
8 19
9 20
Name: col_2, dtype: int64
The resulting object is a pandas Series. If, instead, we want a one-column dataframe as the output, we need to include a second pair of square brackets [[]]:
print(df[['col_2']])
type(df[['col_2']])
col_2
0 11
1 12
2 13
3 14
4 15
5 16
6 17
7 18
8 19
9 20
pandas.core.frame.DataFrame
It is also possible to select multiple columns from a pandas dataframe by passing a list of column names to the indexing operator. The result of this operation will always be a pandas dataframe:
df[['col_5', 'col_1', 'col_8']]
| | col_5 | col_1 | col_8 |
|---|---|---|---|
0 | 41 | 1 | 71 |
1 | 42 | 2 | 72 |
2 | 43 | 3 | 73 |
3 | 44 | 4 | 74 |
4 | 45 | 5 | 75 |
5 | 46 | 6 | 76 |
6 | 47 | 7 | 77 |
7 | 48 | 8 | 78 |
8 | 49 | 9 | 79 |
9 | 50 | 10 | 80 |
As we see, the columns appear in the output dataframe in the same order as in the list. This can be helpful when we want to reorganize the original data.
If at least one column name from the provided list is absent in the dataframe, a KeyError will be thrown:
df[['col_5', 'col_1', 'col_8', 'col_100']]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_2100/3615225278.py in <module>
----> 1 df[['col_5', 'col_1', 'col_8', 'col_100']]
~\anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
3462 if is_iterator(key):
3463 key = list(key)
-> 3464 indexer = self.loc._get_listlike_indexer(key, axis=1)[1]
3465
3466 # take() does not accept boolean indexers
~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis)
1312 keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
1313
-> 1314 self._validate_read_indexer(keyarr, indexer, axis)
1315
1316 if needs_i8_conversion(ax.dtype) or isinstance(
~\anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis)
1375
1376 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 1377 raise KeyError(f"{not_found} not in index")
1378
1379
KeyError: "['col_100'] not in index"
Using the Attribute Operator
To select only one column of a dataframe, we can access it directly by its name as an attribute:
df.col_3
0 21
1 22
2 23
3 24
4 25
5 26
6 27
7 28
8 29
9 30
Name: col_3, dtype: int64
The piece of code above is equivalent to df['col_3']:
df['col_3']
0 21
1 22
2 23
3 24
4 25
5 26
6 27
7 28
8 29
9 30
Name: col_3, dtype: int64
However, the approach of accessing a column as an attribute has quite a few drawbacks. It cannot be applied in the following cases:
- if the column name contains white spaces or punctuation marks (except for the underscore _),
- if the column name coincides with a pandas method name (e.g., "max", "first", "sum"),
- if the column name is not of string type (although using such column names is not a good practice in general),
- for selecting multiple columns,
- for creating a new column (trying to use the attribute access, in this case, will just create a new attribute instead of a new column).
Let's temporarily introduce some chaos into the column names of our dataframe and see what happens if we try to use attribute access:
df.columns = ['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col 6', 'col-7', 8, 'last']
df.columns
Index(['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col 6', 'col-7', 8,
'last'],
dtype='object')
Above, we changed the names of the last four columns. Now, let's see how using the attribute operator works on those columns:
# The column name contains a white space
df.col 6
File "C:\Users\Utente\AppData\Local\Temp/ipykernel_2100/3654995016.py", line 2
df.col 6
^
SyntaxError: invalid syntax
# The column name contains a punctuation mark (except for the underscore)
df.col-7
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_2100/1932640420.py in <module>
1 # The column name contains a punctuation mark
----> 2 df.col-7
~\anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5485 ):
5486 return self[name]
-> 5487 return object.__getattribute__(self, name)
5488
5489 def __setattr__(self, name: str, value) -> None:
AttributeError: 'DataFrame' object has no attribute 'col'
# The column name coincides with a pandas method name
df.last
<bound method NDFrame.last of col_1 col_2 col_3 col_4 col_5 col 6 col-7 8 last
0 1 11 21 31 41 51 61 71 81
1 2 12 22 32 42 52 62 72 82
2 3 13 23 33 43 53 63 73 83
3 4 14 24 34 44 54 64 74 84
4 5 15 25 35 45 55 65 75 85
5 6 16 26 36 46 56 66 76 86
6 7 17 27 37 47 57 67 77 87
7 8 18 28 38 48 58 68 78 88
8 9 19 29 39 49 59 69 79 89
9 10 20 30 40 50 60 70 80 90>
# The column name is not a string
df.8
File "C:\Users\Utente\AppData\Local\Temp/ipykernel_2100/2774159673.py", line 2
df.8
^
SyntaxError: invalid syntax
# An attempt to create a new column using the attribute access
df.new = 0
print(df.new) # an attribute was created
print(df['new']) # a column was not created
0
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
~\anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
~\anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'new'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_2100/2886494677.py in <module>
2 df.new = 0
3 print(df.new) # an attribute was created
----> 4 print(df['new']) # a column was not created
~\anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
3456 if self.columns.nlevels > 1:
3457 return self._getitem_multilevel(key)
-> 3458 indexer = self.columns.get_loc(key)
3459 if is_integer(indexer):
3460 indexer = [indexer]
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:
KeyError: 'new'
Note that in all the cases above, the syntax df[column_name] would work perfectly. In addition, using the same coding style throughout a project, including the way of dataframe indexing, improves the overall code readability; hence it makes sense to be consistent and stick to the more universal style (in our case – using square brackets []).
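For instance, with the temporary "chaotic" column names still in place, a quick sketch (these calls are illustrative and not part of the original walkthrough) shows that square brackets handle every name that broke the attribute access:
df['col 6']    # a name containing a white space
df['col-7']    # a name containing a punctuation mark
df[8]          # a non-string name
df['last']     # a name that clashes with the DataFrame.last method
# Bracket assignment also creates a real column (unlike df.new = 0 above);
# we drop it again right away so the dataframe stays unchanged for the rest of the tutorial
df['new_col'] = 0
df = df.drop(columns='new_col')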
Let's restore the original column names and move on to the next approach of label-based dataframe indexing:
df.columns = ['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_7', 'col_8', 'col_9']
df
| | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 |
|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 11 | 21 | 31 | 41 | 51 | 61 | 71 | 81 |
1 | 2 | 12 | 22 | 32 | 42 | 52 | 62 | 72 | 82 |
2 | 3 | 13 | 23 | 33 | 43 | 53 | 63 | 73 | 83 |
3 | 4 | 14 | 24 | 34 | 44 | 54 | 64 | 74 | 84 |
4 | 5 | 15 | 25 | 35 | 45 | 55 | 65 | 75 | 85 |
5 | 6 | 16 | 26 | 36 | 46 | 56 | 66 | 76 | 86 |
6 | 7 | 17 | 27 | 37 | 47 | 57 | 67 | 77 | 87 |
7 | 8 | 18 | 28 | 38 | 48 | 58 | 68 | 78 | 88 |
8 | 9 | 19 | 29 | 39 | 49 | 59 | 69 | 79 | 89 |
9 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 |
Using the loc Indexer
If we need to select not only columns but also rows (or only rows) from a dataframe, we can use the loc method, aka the loc indexer. This method implies using the indexing operator [] as well. This is the most common way of accessing dataframe rows and columns by label.
Before proceeding, let's take a look at what the current labels of our dataframe are. For this purpose, we will use the columns and index attributes:
print(df.columns)
print(df.index)
Index(['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_7', 'col_8',
'col_9'],
dtype='object')
RangeIndex(start=0, stop=10, step=1)
Note that the labels of the dataframe rows are represented by a specific type of object, RangeIndex, which, in our case, consists of ordered integers from 0 to 9. These integers are valid row labels, and they can be used when applying the loc indexer. For example, to extract the row with the label 0, which is also the first row of our dataframe, we can use the following syntax:
df.loc[0]
col_1 1
col_2 11
col_3 21
col_4 31
col_5 41
col_6 51
col_7 61
col_8 71
col_9 81
Name: 0, dtype: int64
In general, the syntax df.loc[row_label] is used to pull a specific row from a dataframe as a pandas Series object.
However, for our further experiments with the loc indexer, let's rename the row labels to something more meaningful and of a string data type:
df.index = ['row_1', 'row_2', 'row_3', 'row_4', 'row_5', 'row_6', 'row_7', 'row_8', 'row_9', 'row_10']
df
| | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 |
|---|---|---|---|---|---|---|---|---|---|
row_1 | 1 | 11 | 21 | 31 | 41 | 51 | 61 | 71 | 81 |
row_2 | 2 | 12 | 22 | 32 | 42 | 52 | 62 | 72 | 82 |
row_3 | 3 | 13 | 23 | 33 | 43 | 53 | 63 | 73 | 83 |
row_4 | 4 | 14 | 24 | 34 | 44 | 54 | 64 | 74 | 84 |
row_5 | 5 | 15 | 25 | 35 | 45 | 55 | 65 | 75 | 85 |
row_6 | 6 | 16 | 26 | 36 | 46 | 56 | 66 | 76 | 86 |
row_7 | 7 | 17 | 27 | 37 | 47 | 57 | 67 | 77 | 87 |
row_8 | 8 | 18 | 28 | 38 | 48 | 58 | 68 | 78 | 88 |
row_9 | 9 | 19 | 29 | 39 | 49 | 59 | 69 | 79 | 89 |
row_10 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 |
Let's extract the values of the first row of the dataframe again, this time using the loc indexer and the new row label:
df.loc['row_1']
col_1 1
col_2 11
col_3 21
col_4 31
col_5 41
col_6 51
col_7 61
col_8 71
col_9 81
Name: row_1, dtype: int64
If we want to access multiple disparate rows, not necessarily in the same order as in the original dataframe, we have to pass a list of the row labels:
df.loc[['row_6', 'row_2', 'row_9']]
| | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 |
|---|---|---|---|---|---|---|---|---|---|
row_6 | 6 | 16 | 26 | 36 | 46 | 56 | 66 | 76 | 86 |
row_2 | 2 | 12 | 22 | 32 | 42 | 52 | 62 | 72 | 82 |
row_9 | 9 | 19 | 29 | 39 | 49 | 59 | 69 | 79 | 89 |
The rows in the resulting dataframe appear in the same order as in the list.
If at least one label from the provided list is absent in the dataframe, a KeyError will be thrown:
df.loc[['row_6', 'row_2', 'row_100']]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_2100/3587512526.py in <module>
----> 1 df.loc[['row_6', 'row_2', 'row_100']]
~\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
929
930 maybe_callable = com.apply_if_callable(key, self.obj)
--> 931 return self._getitem_axis(maybe_callable, axis=axis)
932
933 def _is_scalar_access(self, key: tuple):
~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
1151 raise ValueError("Cannot index with multidimensional key")
1152
-> 1153 return self._getitem_iterable(key, axis=axis)
1154
1155 # nested tuple slicing
~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_iterable(self, key, axis)
1091
1092 # A collection of keys
-> 1093 keyarr, indexer = self._get_listlike_indexer(key, axis)
1094 return self.obj._reindex_with_indexers(
1095 {axis: [keyarr, indexer]}, copy=True, allow_dups=True
~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis)
1312 keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
1313
-> 1314 self._validate_read_indexer(keyarr, indexer, axis)
1315
1316 if needs_i8_conversion(ax.dtype) or isinstance(
~\anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis)
1375
1376 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 1377 raise KeyError(f"{not_found} not in index")
1378
1379
KeyError: "['row_100'] not in index"
Instead of selecting disparate rows, we may need to select multiple sequential rows from the original dataframe. In this case, we can apply slicing, i.e., specify the start and end row labels separated by a colon:
df.loc['row_7':'row_9']
| | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 |
|---|---|---|---|---|---|---|---|---|---|
row_7 | 7 | 17 | 27 | 37 | 47 | 57 | 67 | 77 | 87 |
row_8 | 8 | 18 | 28 | 38 | 48 | 58 | 68 | 78 | 88 |
row_9 | 9 | 19 | 29 | 39 | 49 | 59 | 69 | 79 | 89 |
Pay attention that with the loc indexer, both the start and stop bounds are inclusive, which is not a common slicing style for Python, where usually the stop bound is exclusive.
It is possible to leave one of the slicing ends open if we need all the rows up to and including a specific row (e.g., df.loc[:'row_4']), or starting from a specific row up to the end (e.g., df.loc['row_4':]):
# Selecting all the rows up to and including 'row_4'
df.loc[:'row_4']
| | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 |
|---|---|---|---|---|---|---|---|---|---|
row_1 | 1 | 11 | 21 | 31 | 41 | 51 | 61 | 71 | 81 |
row_2 | 2 | 12 | 22 | 32 | 42 | 52 | 62 | 72 | 82 |
row_3 | 3 | 13 | 23 | 33 | 43 | 53 | 63 | 73 | 83 |
row_4 | 4 | 14 | 24 | 34 | 44 | 54 | 64 | 74 | 84 |
In order to select multiple rows and columns (practically meaning a subset of data points at the intersection of those rows and columns), we pass two arguments to the loc indexer, separated by a comma: the necessary row labels and column labels. For both rows and columns, the argument can be a list of labels, a slice of labels (possibly open-ended), or a single label as a string. Some examples:
df.loc[['row_4', 'row_2'], ['col_5', 'col_2', 'col_9']]
| | col_5 | col_2 | col_9 |
|---|---|---|---|
row_4 | 44 | 14 | 84 |
row_2 | 42 | 12 | 82 |
df.loc[['row_4', 'row_2'], 'col_5':'col_7']
| | col_5 | col_6 | col_7 |
|---|---|---|---|
row_4 | 44 | 54 | 64 |
row_2 | 42 | 52 | 62 |
df.loc['row_4':'row_6', 'col_5':]
| | col_5 | col_6 | col_7 | col_8 | col_9 |
|---|---|---|---|---|---|
row_4 | 44 | 54 | 64 | 74 | 84 |
row_5 | 45 | 55 | 65 | 75 | 85 |
row_6 | 46 | 56 | 66 | 76 | 86 |
df.loc[:'row_4', 'col_5']
row_1 41
row_2 42
row_3 43
row_4 44
Name: col_5, dtype: int64
A particular case is when we need to explicitly get one value from a dataframe. For this purpose, we pass the necessary row and column labels as two arguments separated by a comma:
df.loc['row_6', 'col_3']
26
Using the at Indexer
For the last case from the previous section, i.e., selecting only one value from a dataframe, there is a faster method – the at indexer. The syntax is identical to that of the loc indexer, except that here we always use exactly two labels (the row and the column), separated by a comma:
df.at['row_6', 'col_3']
26
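The speed difference can be checked with a quick micro-benchmark. Below is a rough sketch using the standard timeit module on the dataframe defined above; the exact numbers depend on the machine and the pandas version:
from timeit import timeit
# Repeat the same scalar lookup many times with loc and with at
n = 100_000
loc_time = timeit(lambda: df.loc['row_6', 'col_3'], number=n)
at_time = timeit(lambda: df.at['row_6', 'col_3'], number=n)
print(f'loc: {loc_time:.3f} s, at: {at_time:.3f} s')  # at is typically several times faster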
Position-based Dataframe Indexing
Using this approach, aka location-based or integer-based, each dataframe element (row, column, or data point) is referred to by its position number rather than the label. The position numbers are integers starting from 0 for the first row or column (typical Python 0-based indexing) and increasing by 1 for each subsequent row/column.
The critical difference between label-based and position-based dataframe indexing lies in the way slicing works: position-based slicing is purely Python-style, i.e., the start bound of the range is inclusive while the stop bound is exclusive, whereas with label-based slicing the stop bound is inclusive.
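Here is a quick side-by-side sketch of that difference on our dataframe (the iloc indexer itself is covered in detail below):
# Label-based slice: both bounds are inclusive -> 'row_1', 'row_2', and 'row_3'
df.loc['row_1':'row_3']
# Position-based slice: the stop bound is exclusive -> positions 0, 1, 2, i.e., the same three rows
df.iloc[0:3]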
Using the Indexing Operator
To retrieve all data from multiple sequential rows of a pandas dataframe, we can simply use the indexing operator [] and a range of the necessary row positions (it can be an open-ended range):
df[3:6]
| | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 |
|---|---|---|---|---|---|---|---|---|---|
row_4 | 4 | 14 | 24 | 34 | 44 | 54 | 64 | 74 | 84 |
row_5 | 5 | 15 | 25 | 35 | 45 | 55 | 65 | 75 | 85 |
row_6 | 6 | 16 | 26 | 36 | 46 | 56 | 66 | 76 | 86 |
df[:3]
| | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 |
|---|---|---|---|---|---|---|---|---|---|
row_1 | 1 | 11 | 21 | 31 | 41 | 51 | 61 | 71 | 81 |
row_2 | 2 | 12 | 22 | 32 | 42 | 52 | 62 | 72 | 82 |
row_3 | 3 | 13 | 23 | 33 | 43 | 53 | 63 | 73 | 83 |
Note that the indexing operator doesn't work for selecting a single row.
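For example, in a quick sketch on our dataframe, a single integer passed to the indexing operator is treated as a column label rather than a row position, so a one-row selection has to be written as a slice:
# A single integer is interpreted as a column label, so this would raise KeyError: 3
# df[3]
# A slice of length one works and returns the row at position 3 ('row_4')
df[3:4]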
Using the iloc Indexer
This is the most common way of selecting dataframe rows and columns by their position numbers. The syntax of this method is very similar to that of the loc indexer and also implies using the indexing operator [].
To access a single row or multiple rows, we pass one argument to the iloc indexer representing the corresponding row position, a list of row positions (if the rows are disparate), or a range of row positions (if the rows are sequential):
# Selecting one row
df.iloc[3]
col_1 4
col_2 14
col_3 24
col_4 34
col_5 44
col_6 54
col_7 64
col_8 74
col_9 84
Name: row_4, dtype: int64
# Selecting disparate rows in the necessary order
df.iloc[[9, 8, 7]]
| | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 |
|---|---|---|---|---|---|---|---|---|---|
row_10 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 |
row_9 | 9 | 19 | 29 | 39 | 49 | 59 | 69 | 79 | 89 |
row_8 | 8 | 18 | 28 | 38 | 48 | 58 | 68 | 78 | 88 |
# Selecting a slice of sequential rows
df.iloc[3:6]
| | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 |
|---|---|---|---|---|---|---|---|---|---|
row_4 | 4 | 14 | 24 | 34 | 44 | 54 | 64 | 74 | 84 |
row_5 | 5 | 15 | 25 | 35 | 45 | 55 | 65 | 75 | 85 |
row_6 | 6 | 16 | 26 | 36 | 46 | 56 | 66 | 76 | 86 |
# Selecting a slice of sequential rows (an open-ended range)
df.iloc[:3]
| | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 |
|---|---|---|---|---|---|---|---|---|---|
row_1 | 1 | 11 | 21 | 31 | 41 | 51 | 61 | 71 | 81 |
row_2 | 2 | 12 | 22 | 32 | 42 | 52 | 62 | 72 | 82 |
row_3 | 3 | 13 | 23 | 33 | 43 | 53 | 63 | 73 | 83 |
Pay attention that in the last two pieces of code, the stop bound of the range is not included, as discussed earlier.
In all the other cases, to select a subset of data of any size (one column, multiple columns, multiple rows and columns together, individual data points), we pass two arguments to the iloc indexer: the necessary row position numbers and column position numbers. For both arguments, the potential types of values can be:
- an integer (to select a single row/column),
- a list of integers (multiple disparate rows/columns),
- a range of integers (multiple sequential rows/columns),
- an open-ended range of integers (multiple sequential rows/columns up to and excluding a specific position number, e.g., df.iloc[:4, 1], or starting from a specific number up to the end, e.g., df.iloc[4:, 1]),
- a colon (to select all the rows/columns).
Some examples:
# Selecting an individual data point
df.iloc[0, 0]
1
# Selecting one row and multiple disparate columns
df.iloc[0, [2, 8, 3]]
col_3 21
col_9 81
col_4 31
Name: row_1, dtype: int64
# Selecting multiple disparate rows and multiple sequential columns
df.iloc[[8, 1, 9], 5:9]
| | col_6 | col_7 | col_8 | col_9 |
|---|---|---|---|---|
row_9 | 59 | 69 | 79 | 89 |
row_2 | 52 | 62 | 72 | 82 |
row_10 | 60 | 70 | 80 | 90 |
# Selecting multiple sequential rows and multiple sequential columns (with open-ended ranges)
df.iloc[:3, 6:]
| | col_7 | col_8 | col_9 |
|---|---|---|---|
row_1 | 61 | 71 | 81 |
row_2 | 62 | 72 | 82 |
row_3 | 63 | 73 | 83 |
# Selecting all rows and multiple disparate columns
df.iloc[:, [1, 3, 7]]
| | col_2 | col_4 | col_8 |
|---|---|---|---|
row_1 | 11 | 31 | 71 |
row_2 | 12 | 32 | 72 |
row_3 | 13 | 33 | 73 |
row_4 | 14 | 34 | 74 |
row_5 | 15 | 35 | 75 |
row_6 | 16 | 36 | 76 |
row_7 | 17 | 37 | 77 |
row_8 | 18 | 38 | 78 |
row_9 | 19 | 39 | 79 |
row_10 | 20 | 40 | 80 |
Note that if we pass at least one position number out of bounds to the iloc indexer (whether it is a single integer for an entire row/column or one of the integers from a list), an IndexError will be thrown:
df.iloc[[1, 3, 100]]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_list_axis(self, key, axis)
1529 try:
-> 1530 return self.obj._take_with_is_copy(key, axis=axis)
1531 except IndexError as err:
~\anaconda3\lib\site-packages\pandas\core\generic.py in _take_with_is_copy(self, indices, axis)
3627 """
-> 3628 result = self.take(indices=indices, axis=axis)
3629 # Maybe set copy if we didn't actually change the index.
~\anaconda3\lib\site-packages\pandas\core\generic.py in take(self, indices, axis, is_copy, **kwargs)
3614
-> 3615 new_data = self._mgr.take(
3616 indices, axis=self._get_block_manager_axis(axis), verify=True
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in take(self, indexer, axis, verify)
861 n = self.shape[axis]
--> 862 indexer = maybe_convert_indices(indexer, n, verify=verify)
863
~\anaconda3\lib\site-packages\pandas\core\indexers.py in maybe_convert_indices(indices, n, verify)
291 if mask.any():
--> 292 raise IndexError("indices are out-of-bounds")
293 return indices
IndexError: indices are out-of-bounds
The above exception was the direct cause of the following exception:
IndexError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_2100/47693281.py in <module>
----> 1 df.iloc[[1, 3, 100]]
~\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
929
930 maybe_callable = com.apply_if_callable(key, self.obj)
--> 931 return self._getitem_axis(maybe_callable, axis=axis)
932
933 def _is_scalar_access(self, key: tuple):
~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
1555 # a list of integers
1556 elif is_list_like_indexer(key):
-> 1557 return self._get_list_axis(key, axis=axis)
1558
1559 # a single integer
~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_list_axis(self, key, axis)
1531 except IndexError as err:
1532 # re-raise with different error message
-> 1533 raise IndexError("positional indexers are out-of-bounds") from err
1534
1535 def _getitem_axis(self, key, axis: int):
IndexError: positional indexers are out-of-bounds
However, for a range of integers (a slice of sequential rows or columns), out-of-bounds position numbers are allowed:
df.iloc[1, 5:100]
col_6 52
col_7 62
col_8 72
col_9 82
Name: row_2, dtype: int64
This also works for the indexing operator discussed in the previous section:
df[7:1000]
| | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 |
|---|---|---|---|---|---|---|---|---|---|
row_8 | 8 | 18 | 28 | 38 | 48 | 58 | 68 | 78 | 88 |
row_9 | 9 | 19 | 29 | 39 | 49 | 59 | 69 | 79 | 89 |
row_10 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 |
Using the iat Indexer
To select only one value from a dataframe, we can use the iat indexer, which performs faster than iloc. The syntax is identical to that of the iloc indexer, except that here we always use exactly two integers (the row and column position numbers):
df.iat[1, 2]
22
Boolean Dataframe Indexing
Apart from label-based or position-based pandas dataframe indexing, it is possible to select a subset from a dataframe based on a certain condition:
df[df['col_2'] > 15]
| | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 |
|---|---|---|---|---|---|---|---|---|---|
row_6 | 6 | 16 | 26 | 36 | 46 | 56 | 66 | 76 | 86 |
row_7 | 7 | 17 | 27 | 37 | 47 | 57 | 67 | 77 | 87 |
row_8 | 8 | 18 | 28 | 38 | 48 | 58 | 68 | 78 | 88 |
row_9 | 9 | 19 | 29 | 39 | 49 | 59 | 69 | 79 | 89 |
row_10 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 |
The piece of code above returns all the rows of our dataframe where the values of the col_2 column are greater than 15. The condition applied (in our case – df['col_2'] > 15) is a boolean vector that has the same length as the dataframe index and checks if each row of the dataframe satisfies the defined criterion.
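To see the mask itself, we can evaluate just the condition:
df['col_2'] > 15
row_1     False
row_2     False
row_3     False
row_4     False
row_5     False
row_6      True
row_7      True
row_8      True
row_9      True
row_10     True
Name: col_2, dtype: bool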
We can also use any other comparison operators:
- == equals,
- != not equals,
- > greater than,
- < less than,
- >= greater than or equal to,
- <= less than or equal to.
It is possible to define a boolean condition for a string column too (the comparison operators == and != make sense in such cases).
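As a quick sketch of this (the col_str column below is hypothetical and added to a copy of the dataframe purely for illustration):
# A copy of the dataframe with an extra string column, used only for this example
df_str = df.copy()
df_str['col_str'] = ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a']
# Rows where the string value equals 'a'
df_str[df_str['col_str'] == 'a']
# Rows where the string value is anything but 'a'
df_str[df_str['col_str'] != 'a']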
Also, we can define several criteria on the same column or on multiple columns. The operators to be used for this purpose are & (and), | (or), and ~ (not). Each condition must be put in a separate pair of parentheses:
# Selecting all the rows of the dataframe where the value of col_2 is greater than 15 but not equal to 19
df[(df['col_2'] > 15) & (df['col_2'] != 19)]
| | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 |
|---|---|---|---|---|---|---|---|---|---|
row_6 | 6 | 16 | 26 | 36 | 46 | 56 | 66 | 76 | 86 |
row_7 | 7 | 17 | 27 | 37 | 47 | 57 | 67 | 77 | 87 |
row_8 | 8 | 18 | 28 | 38 | 48 | 58 | 68 | 78 | 88 |
row_10 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 |
# Selecting all the rows of the dataframe where the value of col_2 is greater than 15
# or the value of col_5 is equal to 42
df[(df['col_2'] > 15) | (df['col_5'] == 42)]
| | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 |
|---|---|---|---|---|---|---|---|---|---|
row_2 | 2 | 12 | 22 | 32 | 42 | 52 | 62 | 72 | 82 |
row_6 | 6 | 16 | 26 | 36 | 46 | 56 | 66 | 76 | 86 |
row_7 | 7 | 17 | 27 | 37 | 47 | 57 | 67 | 77 | 87 |
row_8 | 8 | 18 | 28 | 38 | 48 | 58 | 68 | 78 | 88 |
row_9 | 9 | 19 | 29 | 39 | 49 | 59 | 69 | 79 | 89 |
row_10 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 |
# Selecting all the rows of the dataframe where the value of col_2 is NOT greater than 15
df[~(df['col_2'] > 15)]
| | col_1 | col_2 | col_3 | col_4 | col_5 | col_6 | col_7 | col_8 | col_9 |
|---|---|---|---|---|---|---|---|---|---|
row_1 | 1 | 11 | 21 | 31 | 41 | 51 | 61 | 71 | 81 |
row_2 | 2 | 12 | 22 | 32 | 42 | 52 | 62 | 72 | 82 |
row_3 | 3 | 13 | 23 | 33 | 43 | 53 | 63 | 73 | 83 |
row_4 | 4 | 14 | 24 | 34 | 44 | 54 | 64 | 74 | 84 |
row_5 | 5 | 15 | 25 | 35 | 45 | 55 | 65 | 75 | 85 |
Combination of Dataframe Indexing Approaches
Finally, we can combine label-based, position-based, and boolean dataframe indexing approaches in various ways. For this purpose, we apply the loc indexer again and translate position numbers into labels using the index attribute for rows and the columns attribute for columns:
df.loc[df.index[[3, 4]], ['col_3', 'col_7']]
| | col_3 | col_7 |
|---|---|---|
row_4 | 24 | 64 |
row_5 | 25 | 65 |
df.loc['row_3':'row_6', df.columns[[0, 5]]]
| | col_1 | col_6 |
|---|---|---|
row_3 | 3 | 53 |
row_4 | 4 | 54 |
row_5 | 5 | 55 |
row_6 | 6 | 56 |
df.loc[df['col_4'] > 35, 'col_4':'col_7']
| | col_4 | col_5 | col_6 | col_7 |
|---|---|---|---|---|
row_6 | 36 | 46 | 56 | 66 |
row_7 | 37 | 47 | 57 | 67 |
row_8 | 38 | 48 | 58 | 68 |
row_9 | 39 | 49 | 59 | 69 |
row_10 | 40 | 50 | 60 | 70 |
Conclusion
To sum up, in this tutorial, we explored indexing dataframes in pandas in depth. We learned many things:
- What pandas dataframe indexing is
- The purposes of dataframe indexing
- What dataframe slicing is
- What row and column labels are
- How to check the current row and column labels of a dataframe
- What row and column position numbers are
- The main approaches for indexing dataframes in pandas
- The critical difference between label-based and position-based dataframe indexing
- Why using the attribute access to select a dataframe column is not preferable
- How boolean dataframe indexing works
- The most common methods related to each dataframe indexing approach, their syntax for selecting different types of subsets, and their limitations
- How to combine different indexing approaches