February 15, 2022

Tutorial: Indexing DataFrames in Pandas

In this tutorial, we are going to discuss what indexing pandas dataframes means, why we need it, what kinds of dataframe indexing exist, and what syntax should be used for selecting different subsets.

What is Indexing Dataframes in Pandas?

Indexing a pandas dataframe means selecting particular subsets of data (such as rows, columns, individual cells) from that dataframe. Pandas dataframes have an intrinsic tabular structure represented by rows and columns where each row and column has a unique label (name) and position number (like a coordinate) inside the dataframe, and each data point is characterized by its location at the intersection of a specific row and column.

The row labels, called dataframe index, can be integer numbers or string values, the column labels, called column names, are usually strings. Since both the dataframe index and column names contain only unique values, we can use these labels to refer to particular rows, columns, or data points of the dataframe. On the other hand, we can describe each row, column, or data point by its position in the structure of the dataframe. The position numbers are integers starting from 0 for the first row or column and increasing by 1 for each subsequent row/column, so they can also be used as unique coordinates of particular dataframe elements (rows, columns, or data points). This ability to refer dataframe elements by label or position number is exactly what makes dataframe indexing possible.

A specific (and virtually, the most common) case of pandas dataframe indexing is slicing. It is used to access a sequence of dataframe elements rather than individual dataframe elements.

Pandas dataframe indexing can be performed for various tasks: pulling a subset of data based on predefined criteria, reorganizing data, getting a sample of data, data manipulation, modifying values of data points, etc.

To select a subset from a dataframe, we use the indexing operator [], attribute operator ., and an appropriate method of pandas dataframe indexing such as loc, iloc, at, iat, and some others.

Essentially, there are two main ways of indexing pandas dataframes: label-based and position-based (aka location-based or integer-based). Also, it is possible to apply boolean dataframe indexing based on predefined conditions, or even mix different types of dataframe indexing. Let’s consider all these approaches in detail.

For our further experiments, we will create a fake dataframe:

import pandas as pd

df = pd.DataFrame({'col_1': list(range(1, 11)), 'col_2': list(range(11, 21)), 'col_3': list(range(21, 31)),
                   'col_4': list(range(31, 41)), 'col_5': list(range(41, 51)), 'col_6': list(range(51, 61)),
                   'col_7': list(range(61, 71)), 'col_8': list(range(71, 81)), 'col_9': list(range(81, 91))})
df
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
0 1 11 21 31 41 51 61 71 81
1 2 12 22 32 42 52 62 72 82
2 3 13 23 33 43 53 63 73 83
3 4 14 24 34 44 54 64 74 84
4 5 15 25 35 45 55 65 75 85
5 6 16 26 36 46 56 66 76 86
6 7 17 27 37 47 57 67 77 87
7 8 18 28 38 48 58 68 78 88
8 9 19 29 39 49 59 69 79 89
9 10 20 30 40 50 60 70 80 90

Label-based Dataframe Indexing

As its name suggests, this approach implies selecting dataframe subsets based on the row and column labels. Let’s explore four methods of label-based dataframe indexing: using the indexing operator [], attribute operator ., loc indexer, and at indexer.

Using the Indexing Operator

If we need to select all data from one or multiple columns of a pandas dataframe, we can simply use the indexing operator []. To select all data from a single column, we pass the name of this column:

df['col_2']
0    11
1    12
2    13
3    14
4    15
5    16
6    17
7    18
8    19
9    20
Name: col_2, dtype: int64

The resulting object is a pandas Series. If instead, we want a one-column dataframe as an output, we need to include the second pair of square brackets [[]]:

print(df[['col_2']])
type(df[['col_2']])
   col_2
0     11
1     12
2     13
3     14
4     15
5     16
6     17
7     18
8     19
9     20

pandas.core.frame.DataFrame

It is also possible to select multiple columns from a pandas dataframe passing a list of column names to the indexing operator. The result of this operation will be always a pandas dataframe:

df[['col_5', 'col_1', 'col_8']]
col_5 col_1 col_8
0 41 1 71
1 42 2 72
2 43 3 73
3 44 4 74
4 45 5 75
5 46 6 76
6 47 7 77
7 48 8 78
8 49 9 79
9 50 10 80

As we see, the columns appear in the output dataframe in the same order as in the list. This can be helpful when we want to reorganize the original data.

If at least one column name from the provided list is absent in the dataframe, a KeyError will be thrown:

df[['col_5', 'col_1', 'col_8', 'col_100']]
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

~\AppData\Local\Temp/ipykernel_2100/3615225278.py in 
----> 1 df[['col_5', 'col_1', 'col_8', 'col_100']]

~\anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   3462             if is_iterator(key):
   3463                 key = list(key)
-> 3464             indexer = self.loc._get_listlike_indexer(key, axis=1)[1]
   3465 
   3466         # take() does not accept boolean indexers

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis)
   1312             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1313 
-> 1314         self._validate_read_indexer(keyarr, indexer, axis)
   1315 
   1316         if needs_i8_conversion(ax.dtype) or isinstance(

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis)
   1375 
   1376             not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 1377             raise KeyError(f"{not_found} not in index")
   1378 
   1379 

KeyError: "['col_100'] not in index"

Using the Attribute Operator

To select only one column of a dataframe, we can access it directly by its name as an attribute:

df.col_3
0    21
1    22
2    23
3    24
4    25
5    26
6    27
7    28
8    29
9    30
Name: col_3, dtype: int64

The piece of code above is equivalent to df['col_3']:

df['col_3']
0    21
1    22
2    23
3    24
4    25
5    26
6    27
7    28
8    29
9    30
Name: col_3, dtype: int64

However, the approach of accessing a column as an attribute has quite a few drawbacks. It cannot be applied in the following cases:

  • if the column name contains white spaces or punctuation marks (except for the underscore _),
  • if the column name coincides with a pandas method name (e.g., "max", "first", "sum"),
  • if the column name is not of string type (although using such column names is not a good practice in general),
  • for selecting multiple columns,
  • for creating a new column (trying to use the attribute access, in this case, will just create a new attribute instead of a new column).

Let’s introduce temporarily some chaos in the column names of our dataframe and see what happens if we try to use the attribute access:

df.columns = ['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col 6', 'col-7', 8, 'last']
df.columns
Index(['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col 6', 'col-7', 8,
       'last'],
      dtype='object')

Above, we changed the names of the last four columns. Now, let’s see how using the attribute operator works on those columns:

# The column name contains a white space
df.col 6
  File "C:\Users\Utente\AppData\Local\Temp/ipykernel_2100/3654995016.py", line 2
    df.col 6
           ^
SyntaxError: invalid syntax
# The column name contains a punctuation mark (except for the underscore)
df.col-7
---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

~\AppData\Local\Temp/ipykernel_2100/1932640420.py in 
      1 # The column name contains a punctuation mark
----> 2 df.col-7

~\anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5485         ):
   5486             return self[name]
-> 5487         return object.__getattribute__(self, name)
   5488 
   5489     def __setattr__(self, name: str, value) -> None:

AttributeError: 'DataFrame' object has no attribute 'col'
# The column name coincides with a pandas method name
df.last
    <bound method NDFrame.last of    col_1  col_2  col_3  col_4  col_5  col 6  col-7   8  last
    0      1     11     21     31     41     51     61  71    81
    1      2     12     22     32     42     52     62  72    82
    2      3     13     23     33     43     53     63  73    83
    3      4     14     24     34     44     54     64  74    84
    4      5     15     25     35     45     55     65  75    85
    5      6     16     26     36     46     56     66  76    86
    6      7     17     27     37     47     57     67  77    87
    7      8     18     28     38     48     58     68  78    88
    8      9     19     29     39     49     59     69  79    89
    9     10     20     30     40     50     60     70  80    90>
# The column name is not a string 
df.8
  File "C:\Users\Utente\AppData\Local\Temp/ipykernel_2100/2774159673.py", line 2
    df.8
      ^
SyntaxError: invalid syntax
# An attempt to create a new column using the attribute access
df.new = 0
print(df.new)     # an attribute was created
print(df['new'])  # a column was not created
0

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

~\anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

~\anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'new'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)

~\AppData\Local\Temp/ipykernel_2100/2886494677.py in 
      2 df.new = 0
      3 print(df.new)     # an attribute was created
----> 4 print(df['new'])  # a column was not created

~\anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'new'

Note that in all the cases above, the syntax df[column_name] would work perfectly. In addition, using the same coding style throughout the project, including the way of dataframe indexing, improves the overall code readability, hence it makes sense to be consistent and stick to a more universal style (in our case – using square brackets []).

Let’s restore the original column names and move on to the next approach of label-based dataframe indexing:

df.columns = ['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_7', 'col_8', 'col_9']
df
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
0 1 11 21 31 41 51 61 71 81
1 2 12 22 32 42 52 62 72 82
2 3 13 23 33 43 53 63 73 83
3 4 14 24 34 44 54 64 74 84
4 5 15 25 35 45 55 65 75 85
5 6 16 26 36 46 56 66 76 86
6 7 17 27 37 47 57 67 77 87
7 8 18 28 38 48 58 68 78 88
8 9 19 29 39 49 59 69 79 89
9 10 20 30 40 50 60 70 80 90

Using the loc Indexer

If we need to select not only columns but also rows (or only rows) from a dataframe, we can use the loc method, aka loc indexer. This method implies using the indexing operator [] as well. This is the most common way of accessing dataframe rows and columns by label.

Before proceeding, let’s take a look at what the current labels of our dataframe are. For this purpose, we will use the attributes columns and index:

print(df.columns)
print(df.index)
Index(['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_7', 'col_8',
       'col_9'],
      dtype='object')
RangeIndex(start=0, stop=10, step=1)

Note that the labels of the dataframe rows are represented by a specific type of objects, RangeIndex, which, in our case, consists of ordered integers from 0 to 9. These integers are valid row labels, and they can be used when applying the loc indexer. For example, to extract the row with the label 0, which is also the first row of our dataframe, we can use the following syntax:

df.loc[0]
col_1     1
col_2    11
col_3    21
col_4    31
col_5    41
col_6    51
col_7    61
col_8    71
col_9    81
Name: 0, dtype: int64

In general, the syntax df.loc[row_label] is used to pull a specific row from a dataframe as a pandas Series object.

However, for our further experiments with the loc indexer, let’s rename the row labels into something more meaningful and of a string data type:

df.index = ['row_1', 'row_2', 'row_3', 'row_4', 'row_5', 'row_6', 'row_7', 'row_8', 'row_9', 'row_10']
df
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
row_1 1 11 21 31 41 51 61 71 81
row_2 2 12 22 32 42 52 62 72 82
row_3 3 13 23 33 43 53 63 73 83
row_4 4 14 24 34 44 54 64 74 84
row_5 5 15 25 35 45 55 65 75 85
row_6 6 16 26 36 46 56 66 76 86
row_7 7 17 27 37 47 57 67 77 87
row_8 8 18 28 38 48 58 68 78 88
row_9 9 19 29 39 49 59 69 79 89
row_10 10 20 30 40 50 60 70 80 90

Let’s extract again the values of the first row of the dataframe using the loc indexer and the new row label:

df.loc['row_1']
col_1     1
col_2    11
col_3    21
col_4    31
col_5    41
col_6    51
col_7    61
col_8    71
col_9    81
Name: row_1, dtype: int64

If we want to access multiple disparate rows, not necessarily in the same order as in the original dataframe, we have to pass a list of the row labels:

df.loc[['row_6', 'row_2', 'row_9']]
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
row_6 6 16 26 36 46 56 66 76 86
row_2 2 12 22 32 42 52 62 72 82
row_9 9 19 29 39 49 59 69 79 89

The rows in the resulting dataframe appear in the same order as in the list.

If at least one label from the provided list is absent in the dataframe, a KeyError will be thrown:

df.loc[['row_6', 'row_2', 'row_100']]
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

~\AppData\Local\Temp/ipykernel_2100/3587512526.py in 
----> 1 df.loc[['row_6', 'row_2', 'row_100']]

~\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
    929 
    930             maybe_callable = com.apply_if_callable(key, self.obj)
--> 931             return self._getitem_axis(maybe_callable, axis=axis)
    932 
    933     def _is_scalar_access(self, key: tuple):

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1151                     raise ValueError("Cannot index with multidimensional key")
   1152 
-> 1153                 return self._getitem_iterable(key, axis=axis)
   1154 
   1155             # nested tuple slicing

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_iterable(self, key, axis)
   1091 
   1092         # A collection of keys
-> 1093         keyarr, indexer = self._get_listlike_indexer(key, axis)
   1094         return self.obj._reindex_with_indexers(
   1095             {axis: [keyarr, indexer]}, copy=True, allow_dups=True

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis)
   1312             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1313 
-> 1314         self._validate_read_indexer(keyarr, indexer, axis)
   1315 
   1316         if needs_i8_conversion(ax.dtype) or isinstance(

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis)
   1375 
   1376             not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 1377             raise KeyError(f"{not_found} not in index")
   1378 
   1379 

KeyError: "['row_100'] not in index"

Instead of selecting disparate rows, we may need to select multiple sequential rows from the original dataframe. In this case, we can apply slicing, i.e., specify the start and end row labels separated by a colon:

df.loc['row_7':'row_9']
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
row_7 7 17 27 37 47 57 67 77 87
row_8 8 18 28 38 48 58 68 78 88
row_9 9 19 29 39 49 59 69 79 89

Pay attention that with the loc indexer, both the start and stop bounds are inclusive, which is not a common slicing style for Python, where usually the stop bound is exclusive.

It is possible to leave one of the slicing ends open, if we need all the rows up to and including a specific row (e.g., df.loc[:'row_4']), or starting from a specific row up to the end (e.g., df.loc['row_4':]):

# Selecting all the rows up to and including 'row_4'
df.loc[:'row_4']
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
row_1 1 11 21 31 41 51 61 71 81
row_2 2 12 22 32 42 52 62 72 82
row_3 3 13 23 33 43 53 63 73 83
row_4 4 14 24 34 44 54 64 74 84

In order to select multiple rows and columns, practically meaning a subset of data points of a dataframe at the intersection of those rows and columns, we pass to the loc indexer two arguments separated by a comma: the necessary row labels and column labels. Both for row and column labels, it can be a list of labels, a slice of labels (or an open-ending slice), or a single label as a string. Some examples:

df.loc[['row_4', 'row_2'], ['col_5', 'col_2', 'col_9']]
col_5 col_2 col_9
row_4 44 14 84
row_2 42 12 82
df.loc[['row_4', 'row_2'], 'col_5':'col_7']
col_5 col_6 col_7
row_4 44 54 64
row_2 42 52 62
df.loc['row_4':'row_6', 'col_5':]
col_5 col_6 col_7 col_8 col_9
row_4 44 54 64 74 84
row_5 45 55 65 75 85
row_6 46 56 66 76 86
df.loc[:'row_4', 'col_5']
row_1    41
row_2    42
row_3    43
row_4    44
Name: col_5, dtype: int64

A particular case is when we need to explicitly get one value from a dataframe. For this purpose, we pass the necessary row and column labels as two arguments separated by a comma:

df.loc['row_6', 'col_3']
26

Using the at Indexer

For the last case from the previous section, i.e., for selecting only one value from a dataframe, there is a faster method – using the at indexer. The syntax is identical to the one of the loc indexer, only that here we always use exactly two labels (for the row and column) separated by a comma:

df.at['row_6', 'col_3']
26

Position-based Dataframe Indexing

Using this approach, aka location-based or integer-based, each dataframe element (row, column, or data point) is referred to by its position number rather than the label. The position numbers are integers starting from 0 for the first row or column (typical Python 0-based indexing) and increasing by 1 for each subsequent row/column.

The critical difference between label-based and position-based dataframe indexing approaches is in the way of dataframe slicing: for the position-based indexing, it is purely Python-style, i.e, the start bound of the range is inclusive while the stop bound is exclusive. For the label-based indexing, the stop bound is inclusive.

Using the Indexing Operator

To retrieve all data from multiple sequential rows of a pandas dataframe, we can simply use the indexing operator [] and a range of the necessary row positions (it can be an open-ending range):

df[3:6]
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
row_4 4 14 24 34 44 54 64 74 84
row_5 5 15 25 35 45 55 65 75 85
row_6 6 16 26 36 46 56 66 76 86
df[:3]
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
row_1 1 11 21 31 41 51 61 71 81
row_2 2 12 22 32 42 52 62 72 82
row_3 3 13 23 33 43 53 63 73 83

Note that the indexing operator doesn’t work for selecting a single row.

Using the iloc Indexer

This is the most common way of selecting dataframe rows and columns by their position numbers. The syntax of this method is very similar to that of the loc indexer and also implies using the indexing operator [].

To access a single row or multiple rows, we pass one argument to the iloc indexer representing the corresponding row position, a list of row positions (if the rows are disparate), or a range of row positions (if the rows are sequential):

# Selecting one row 
df.iloc[3]
col_1     4
col_2    14
col_3    24
col_4    34
col_5    44
col_6    54
col_7    64
col_8    74
col_9    84
Name: row_4, dtype: int64
# Selecting disparate rows in the necessary order
df.iloc[[9, 8, 7]]
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
row_10 10 20 30 40 50 60 70 80 90
row_9 9 19 29 39 49 59 69 79 89
row_8 8 18 28 38 48 58 68 78 88
# Selecting a slice of sequential rows
df.iloc[3:6]
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
row_4 4 14 24 34 44 54 64 74 84
row_5 5 15 25 35 45 55 65 75 85
row_6 6 16 26 36 46 56 66 76 86
# Selecting a slice of sequential rows (an open-ending range)
df.iloc[:3]
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
row_1 1 11 21 31 41 51 61 71 81
row_2 2 12 22 32 42 52 62 72 82
row_3 3 13 23 33 43 53 63 73 83

Pay attention that in the last two pieces of code, the stop bound of the range is not included, as discussed earlier.

In all the other cases, to select a subset of data of any size (one column, multiple columns, multiple rows and columns together, individual data points), we pass two arguments to the iloc indexer: the necessary row position numbers and column position numbers. For both arguments, the potential types of values can be:

  • an integer (to select a single row/column),
  • a list of integers (multiple disparate rows/columns),
  • a range of integers (multiple sequential rows/columns),
  • an open-ending range of integers (multiple sequential rows/columns up to and excluding a specific position number (e.g., df.iloc[:4, 1]), or starting from a specific number up to the end (e.g., df.loc[4:, 1]),
  • a colon (to select all the rows/columns).

Some examples:

# Selecting an individual data point
df.iloc[0, 0]
1
# Selecting one row and multiple disparate columns 
df.iloc[0, [2, 8, 3]]
col_3    21
col_9    81
col_4    31
Name: row_1, dtype: int64
# Selecting multiple disparate rows and multiple sequential columns 
df.iloc[[8, 1, 9], 5:9]
col_6 col_7 col_8 col_9
row_9 59 69 79 89
row_2 52 62 72 82
row_10 60 70 80 90
# Selecting multiple sequential rows and multiple sequential columns (with open-ending ranges)
df.iloc[:3, 6:]
col_7 col_8 col_9
row_1 61 71 81
row_2 62 72 82
row_3 63 73 83
# Selecting all rows and multiple disparate columns 
df.iloc[:, [1, 3, 7]]
col_2 col_4 col_8
row_1 11 31 71
row_2 12 32 72
row_3 13 33 73
row_4 14 34 74
row_5 15 35 75
row_6 16 36 76
row_7 17 37 77
row_8 18 38 78
row_9 19 39 79
row_10 20 40 80

Note that if we pass at least one position number out of bounds to the iloc indexer (whether it is a single integer for an entire row/column or one of the integers from a list), an IndexError will be thrown:

df.iloc[[1, 3, 100]]
---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_list_axis(self, key, axis)
   1529         try:
-> 1530             return self.obj._take_with_is_copy(key, axis=axis)
   1531         except IndexError as err:

~\anaconda3\lib\site-packages\pandas\core\generic.py in _take_with_is_copy(self, indices, axis)
   3627         """
-> 3628         result = self.take(indices=indices, axis=axis)
   3629         # Maybe set copy if we didn't actually change the index.

~\anaconda3\lib\site-packages\pandas\core\generic.py in take(self, indices, axis, is_copy, **kwargs)
   3614 
-> 3615         new_data = self._mgr.take(
   3616             indices, axis=self._get_block_manager_axis(axis), verify=True

~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in take(self, indexer, axis, verify)
    861         n = self.shape[axis]
--> 862         indexer = maybe_convert_indices(indexer, n, verify=verify)
    863 

~\anaconda3\lib\site-packages\pandas\core\indexers.py in maybe_convert_indices(indices, n, verify)
    291         if mask.any():
--> 292             raise IndexError("indices are out-of-bounds")
    293     return indices

IndexError: indices are out-of-bounds

The above exception was the direct cause of the following exception:

IndexError                                Traceback (most recent call last)

~\AppData\Local\Temp/ipykernel_2100/47693281.py in 
----> 1 df.iloc[[1, 3, 100]]

~\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
    929 
    930             maybe_callable = com.apply_if_callable(key, self.obj)
--> 931             return self._getitem_axis(maybe_callable, axis=axis)
    932 
    933     def _is_scalar_access(self, key: tuple):

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1555         # a list of integers
   1556         elif is_list_like_indexer(key):
-> 1557             return self._get_list_axis(key, axis=axis)
   1558 
   1559         # a single integer

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_list_axis(self, key, axis)
   1531         except IndexError as err:
   1532             # re-raise with different error message
-> 1533             raise IndexError("positional indexers are out-of-bounds") from err
   1534 
   1535     def _getitem_axis(self, key, axis: int):

IndexError: positional indexers are out-of-bounds

However, for a range of integers (a slice of sequential rows or columns), the out-of-bounds position numbers are allowed:

df.iloc[1, 5:100]
col_6    52
col_7    62
col_8    72
col_9    82
Name: row_2, dtype: int64

This works also for the indexing operator discussed in the previous section:

df[7:1000]
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
row_8 8 18 28 38 48 58 68 78 88
row_9 9 19 29 39 49 59 69 79 89
row_10 10 20 30 40 50 60 70 80 90

Using the iat Indexer

To select only one value from a dataframe, we can use the iat indexer, which performs faster than iloc. The syntax is identical to the one of the iloc indexer, only that here we always use exactly two integers (the row and column position numbers):

df.iat[1, 2]
22

Boolean Dataframe Indexing

Apart from label-based or position-based pandas dataframe indexing, it is possible to select a subset from a dataframe based on a certain condition:

df[df['col_2'] > 15]
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
row_6 6 16 26 36 46 56 66 76 86
row_7 7 17 27 37 47 57 67 77 87
row_8 8 18 28 38 48 58 68 78 88
row_9 9 19 29 39 49 59 69 79 89
row_10 10 20 30 40 50 60 70 80 90

The piece of code above returns all the rows of our dataframe where the values of the col_2 column are greater than 15. The condition applied (in our case – df['col_2'] > 15) is a boolean vector that has the same length as the dataframe index and checks if each row of the dataframe satisfies the defined criterion.

We can use also any other comparison operators:

  • == equals,
  • != not equals,
  • > greater than,
  • < less than,
  • >= greater than or equal to,
  • <= less than or equal to.

It is possible to define a boolean condition for a string column too (the comparison operators == and != make sense in such cases).

Also, we can define several criteria on the same column or multiple columns. The operators to be used for this purpose are & (and), | (or), ~ (not). Each condition must be put in a separate pair of brackets:

# Selecting all the rows of the dataframe where the value of `col_2` is greater than 15 but not equal to 19
df[(df['col_2'] > 15) & (df['col_2'] != 19)]
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
row_6 6 16 26 36 46 56 66 76 86
row_7 7 17 27 37 47 57 67 77 87
row_8 8 18 28 38 48 58 68 78 88
row_10 10 20 30 40 50 60 70 80 90
# Selecting all the rows of the dataframe where the value of `col_2` is greater than 15 
# or the value of `col_5` is equal to 42
df[(df['col_2'] > 15) | (df['col_5'] == 42)]
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
row_2 2 12 22 32 42 52 62 72 82
row_6 6 16 26 36 46 56 66 76 86
row_7 7 17 27 37 47 57 67 77 87
row_8 8 18 28 38 48 58 68 78 88
row_9 9 19 29 39 49 59 69 79 89
row_10 10 20 30 40 50 60 70 80 90
# Selecting all the rows of the dataframe where the value of `col_2` is NOT greater than 15 
df[~(df['col_2'] > 15)]
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
row_1 1 11 21 31 41 51 61 71 81
row_2 2 12 22 32 42 52 62 72 82
row_3 3 13 23 33 43 53 63 73 83
row_4 4 14 24 34 44 54 64 74 84
row_5 5 15 25 35 45 55 65 75 85

Combination of Dataframe Indexing Approaches

Finally, we can combine label-based, position-based, and boolean dataframe indexing approaches in various ways. For this purpose, we should apply the loc indexer again and access the positional numbers using the index attribute for rows and columns attribute for columns:

df.loc[df.index[[3, 4]], ['col_3', 'col_7']]
col_3 col_7
row_4 24 64
row_5 25 65
df.loc['row_3':'row_6', df.columns[[0, 5]]]
col_1 col_6
row_3 3 53
row_4 4 54
row_5 5 55
row_6 6 56
df.loc[df['col_4'] > 35, 'col_4':'col_7']
col_4 col_5 col_6 col_7
row_6 36 46 56 66
row_7 37 47 57 67
row_8 38 48 58 68
row_9 39 49 59 69
row_10 40 50 60 70

Conclusion

To sum up, in this tutorial, we explored indexing dataframes in pandas in-depth. We learned many things:

  • What pandas dataframe indexing is
  • The purposes of dataframe indexing
  • What dataframe slicing is
  • What row and column labels are
  • How to check the current row and column labels of a dataframe
  • What row and column position numbers are
  • The main approaches for indexing dataframes in pandas
  • The critical difference between label-based and position-based dataframe indexing
  • Why using the attribute access to select a dataframe column is not preferable
  • How boolean dataframe indexing works
  • The most common methods related to each dataframe indexing approach, their syntax for selecting different types of subsets, and their limitations
  • How to combine different indexing approaches
Elena Kosourova

About the author

Elena Kosourova

Elena is a petroleum geologist and community manager at Dataquest. You can find her chatting online with data enthusiasts and writing tutorials on data science topics. Find her on LinkedIn.

Learn data skills for free

Headshot Headshot

Join 1M+ learners

Try free courses