February 15, 2022

Tutorial: Indexing DataFrames in Pandas

In this tutorial, we are going to discuss what indexing pandas dataframes means, why we need it, what kinds of dataframe indexing exist, and what syntax should be used for selecting different subsets.

What is Indexing Dataframes in Pandas?

Indexing a pandas dataframe means selecting particular subsets of data (such as rows, columns, individual cells) from that dataframe. Pandas dataframes have an intrinsic tabular structure represented by rows and columns where each row and column has a unique label (name) and position number (like a coordinate) inside the dataframe, and each data point is characterized by its location at the intersection of a specific row and column.

The row labels, called dataframe index, can be integer numbers or string values, the column labels, called column names, are usually strings. Since both the dataframe index and column names contain only unique values, we can use these labels to refer to particular rows, columns, or data points of the dataframe. On the other hand, we can describe each row, column, or data point by its position in the structure of the dataframe. The position numbers are integers starting from 0 for the first row or column and increasing by 1 for each subsequent row/column, so they can also be used as unique coordinates of particular dataframe elements (rows, columns, or data points). This ability to refer dataframe elements by label or position number is exactly what makes dataframe indexing possible.

A specific (and virtually, the most common) case of pandas dataframe indexing is slicing. It is used to access a sequence of dataframe elements rather than individual dataframe elements.

Pandas dataframe indexing can be performed for various tasks: pulling a subset of data based on predefined criteria, reorganizing data, getting a sample of data, data manipulation, modifying values of data points, etc.

To select a subset from a dataframe, we use the indexing operator [], attribute operator ., and an appropriate method of pandas dataframe indexing such as loc, iloc, at, iat, and some others.

Essentially, there are two main ways of indexing pandas dataframes: label-based and position-based (aka location-based or integer-based). Also, it is possible to apply boolean dataframe indexing based on predefined conditions, or even mix different types of dataframe indexing. Let's consider all these approaches in detail.

For our further experiments, we will create a fake dataframe:

import pandas as pd

df = pd.DataFrame({'col_1': list(range(1, 11)), 'col_2': list(range(11, 21)), 'col_3': list(range(21, 31)),
                   'col_4': list(range(31, 41)), 'col_5': list(range(41, 51)), 'col_6': list(range(51, 61)),
                   'col_7': list(range(61, 71)), 'col_8': list(range(71, 81)), 'col_9': list(range(81, 91))})
df

	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9
0	1	11	21	31	41	51	61	71	81
1	2	12	22	32	42	52	62	72	82
2	3	13	23	33	43	53	63	73	83
3	4	14	24	34	44	54	64	74	84
4	5	15	25	35	45	55	65	75	85
5	6	16	26	36	46	56	66	76	86
6	7	17	27	37	47	57	67	77	87
7	8	18	28	38	48	58	68	78	88
8	9	19	29	39	49	59	69	79	89
9	10	20	30	40	50	60	70	80	90

Label-based Dataframe Indexing

As its name suggests, this approach implies selecting dataframe subsets based on the row and column labels. Let's explore four methods of label-based dataframe indexing: using the indexing operator [], attribute operator ., loc indexer, and at indexer.

Using the Indexing Operator

If we need to select all data from one or multiple columns of a pandas dataframe, we can simply use the indexing operator []. To select all data from a single column, we pass the name of this column:

df['col_2']

0    11
1    12
2    13
3    14
4    15
5    16
6    17
7    18
8    19
9    20
Name: col_2, dtype: int64

The resulting object is a pandas Series. If instead, we want a one-column dataframe as an output, we need to include the second pair of square brackets [[]]:

print(df[['col_2']])
type(df[['col_2']])

   col_2
0     11
1     12
2     13
3     14
4     15
5     16
6     17
7     18
8     19
9     20

pandas.core.frame.DataFrame

It is also possible to select multiple columns from a pandas dataframe passing a list of column names to the indexing operator. The result of this operation will be always a pandas dataframe:

df[['col_5', 'col_1', 'col_8']]

	col_5	col_1	col_8
0	41	1	71
1	42	2	72
2	43	3	73
3	44	4	74
4	45	5	75
5	46	6	76
6	47	7	77
7	48	8	78
8	49	9	79
9	50	10	80

As we see, the columns appear in the output dataframe in the same order as in the list. This can be helpful when we want to reorganize the original data.

If at least one column name from the provided list is absent in the dataframe, a KeyError will be thrown:

df[['col_5', 'col_1', 'col_8', 'col_100']]

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

~\AppData\Local\Temp/ipykernel_2100/3615225278.py in 
----> 1 df[['col_5', 'col_1', 'col_8', 'col_100']]

~\anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   3462             if is_iterator(key):
   3463                 key = list(key)
-> 3464             indexer = self.loc._get_listlike_indexer(key, axis=1)[1]
   3465 
   3466         # take() does not accept boolean indexers

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis)
   1312             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1313 
-> 1314         self._validate_read_indexer(keyarr, indexer, axis)
   1315 
   1316         if needs_i8_conversion(ax.dtype) or isinstance(

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis)
   1375 
   1376             not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 1377             raise KeyError(f"{not_found} not in index")
   1378 
   1379 

KeyError: "['col_100'] not in index"

Using the Attribute Operator

To select only one column of a dataframe, we can access it directly by its name as an attribute:

df.col_3

0    21
1    22
2    23
3    24
4    25
5    26
6    27
7    28
8    29
9    30
Name: col_3, dtype: int64

The piece of code above is equivalent to df['col_3']:

df['col_3']

0    21
1    22
2    23
3    24
4    25
5    26
6    27
7    28
8    29
9    30
Name: col_3, dtype: int64

However, the approach of accessing a column as an attribute has quite a few drawbacks. It cannot be applied in the following cases:

if the column name contains white spaces or punctuation marks (except for the underscore _),
if the column name coincides with a pandas method name (e.g., "max", "first", "sum"),
if the column name is not of string type (although using such column names is not a good practice in general),
for selecting multiple columns,
for creating a new column (trying to use the attribute access, in this case, will just create a new attribute instead of a new column).

Let's introduce temporarily some chaos in the column names of our dataframe and see what happens if we try to use the attribute access:

df.columns = ['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col 6', 'col-7', 8, 'last']
df.columns

Index(['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col 6', 'col-7', 8,
       'last'],
      dtype='object')

Above, we changed the names of the last four columns. Now, let's see how using the attribute operator works on those columns:

# The column name contains a white space
df.col 6

  File "C:\Users\Utente\AppData\Local\Temp/ipykernel_2100/3654995016.py", line 2
    df.col 6
           ^
SyntaxError: invalid syntax

# The column name contains a punctuation mark (except for the underscore)
df.col-7

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

~\AppData\Local\Temp/ipykernel_2100/1932640420.py in 
      1 # The column name contains a punctuation mark
----> 2 df.col-7

~\anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5485         ):
   5486             return self[name]
-> 5487         return object.__getattribute__(self, name)
   5488 
   5489     def __setattr__(self, name: str, value) -> None:

AttributeError: 'DataFrame' object has no attribute 'col'

# The column name coincides with a pandas method name
df.last

    <bound method NDFrame.last of    col_1  col_2  col_3  col_4  col_5  col 6  col-7   8  last
    0      1     11     21     31     41     51     61  71    81
    1      2     12     22     32     42     52     62  72    82
    2      3     13     23     33     43     53     63  73    83
    3      4     14     24     34     44     54     64  74    84
    4      5     15     25     35     45     55     65  75    85
    5      6     16     26     36     46     56     66  76    86
    6      7     17     27     37     47     57     67  77    87
    7      8     18     28     38     48     58     68  78    88
    8      9     19     29     39     49     59     69  79    89
    9     10     20     30     40     50     60     70  80    90>

# The column name is not a string 
df.8

  File "C:\Users\Utente\AppData\Local\Temp/ipykernel_2100/2774159673.py", line 2
    df.8
      ^
SyntaxError: invalid syntax

# An attempt to create a new column using the attribute access
df.new = 0
print(df.new)     # an attribute was created
print(df['new'])  # a column was not created

0

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

~\anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

~\anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'new'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)

~\AppData\Local\Temp/ipykernel_2100/2886494677.py in 
      2 df.new = 0
      3 print(df.new)     # an attribute was created
----> 4 print(df['new'])  # a column was not created

~\anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'new'

Note that in all the cases above, the syntax df[column_name] would work perfectly. In addition, using the same coding style throughout the project, including the way of dataframe indexing, improves the overall code readability, hence it makes sense to be consistent and stick to a more universal style (in our case – using square brackets []).

Let's restore the original column names and move on to the next approach of label-based dataframe indexing:

df.columns = ['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_7', 'col_8', 'col_9']
df

	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9
0	1	11	21	31	41	51	61	71	81
1	2	12	22	32	42	52	62	72	82
2	3	13	23	33	43	53	63	73	83
3	4	14	24	34	44	54	64	74	84
4	5	15	25	35	45	55	65	75	85
5	6	16	26	36	46	56	66	76	86
6	7	17	27	37	47	57	67	77	87
7	8	18	28	38	48	58	68	78	88
8	9	19	29	39	49	59	69	79	89
9	10	20	30	40	50	60	70	80	90

Using the `loc` Indexer

If we need to select not only columns but also rows (or only rows) from a dataframe, we can use the loc method, aka loc indexer. This method implies using the indexing operator [] as well. This is the most common way of accessing dataframe rows and columns by label.

Before proceeding, let's take a look at what the current labels of our dataframe are. For this purpose, we will use the attributes columns and index:

print(df.columns)
print(df.index)

Index(['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6', 'col_7', 'col_8',
       'col_9'],
      dtype='object')
RangeIndex(start=0, stop=10, step=1)

Note that the labels of the dataframe rows are represented by a specific type of objects, RangeIndex, which, in our case, consists of ordered integers from 0 to 9. These integers are valid row labels, and they can be used when applying the loc indexer. For example, to extract the row with the label 0, which is also the first row of our dataframe, we can use the following syntax:

df.loc[0]

col_1     1
col_2    11
col_3    21
col_4    31
col_5    41
col_6    51
col_7    61
col_8    71
col_9    81
Name: 0, dtype: int64

In general, the syntax df.loc[row_label] is used to pull a specific row from a dataframe as a pandas Series object.

However, for our further experiments with the loc indexer, let's rename the row labels into something more meaningful and of a string data type:

df.index = ['row_1', 'row_2', 'row_3', 'row_4', 'row_5', 'row_6', 'row_7', 'row_8', 'row_9', 'row_10']
df

	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9
row_1	1	11	21	31	41	51	61	71	81
row_2	2	12	22	32	42	52	62	72	82
row_3	3	13	23	33	43	53	63	73	83
row_4	4	14	24	34	44	54	64	74	84
row_5	5	15	25	35	45	55	65	75	85
row_6	6	16	26	36	46	56	66	76	86
row_7	7	17	27	37	47	57	67	77	87
row_8	8	18	28	38	48	58	68	78	88
row_9	9	19	29	39	49	59	69	79	89
row_10	10	20	30	40	50	60	70	80	90

Let's extract again the values of the first row of the dataframe using the loc indexer and the new row label:

df.loc['row_1']

col_1     1
col_2    11
col_3    21
col_4    31
col_5    41
col_6    51
col_7    61
col_8    71
col_9    81
Name: row_1, dtype: int64

If we want to access multiple disparate rows, not necessarily in the same order as in the original dataframe, we have to pass a list of the row labels:

df.loc[['row_6', 'row_2', 'row_9']]

	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9
row_6	6	16	26	36	46	56	66	76	86
row_2	2	12	22	32	42	52	62	72	82
row_9	9	19	29	39	49	59	69	79	89

The rows in the resulting dataframe appear in the same order as in the list.

If at least one label from the provided list is absent in the dataframe, a KeyError will be thrown:

df.loc[['row_6', 'row_2', 'row_100']]

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

~\AppData\Local\Temp/ipykernel_2100/3587512526.py in 
----> 1 df.loc[['row_6', 'row_2', 'row_100']]

~\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
    929 
    930             maybe_callable = com.apply_if_callable(key, self.obj)
--> 931             return self._getitem_axis(maybe_callable, axis=axis)
    932 
    933     def _is_scalar_access(self, key: tuple):

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1151                     raise ValueError("Cannot index with multidimensional key")
   1152 
-> 1153                 return self._getitem_iterable(key, axis=axis)
   1154 
   1155             # nested tuple slicing

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_iterable(self, key, axis)
   1091 
   1092         # A collection of keys
-> 1093         keyarr, indexer = self._get_listlike_indexer(key, axis)
   1094         return self.obj._reindex_with_indexers(
   1095             {axis: [keyarr, indexer]}, copy=True, allow_dups=True

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis)
   1312             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1313 
-> 1314         self._validate_read_indexer(keyarr, indexer, axis)
   1315 
   1316         if needs_i8_conversion(ax.dtype) or isinstance(

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis)
   1375 
   1376             not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 1377             raise KeyError(f"{not_found} not in index")
   1378 
   1379 

KeyError: "['row_100'] not in index"

Instead of selecting disparate rows, we may need to select multiple sequential rows from the original dataframe. In this case, we can apply slicing, i.e., specify the start and end row labels separated by a colon:

df.loc['row_7':'row_9']

	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9
row_7	7	17	27	37	47	57	67	77	87
row_8	8	18	28	38	48	58	68	78	88
row_9	9	19	29	39	49	59	69	79	89

Pay attention that with the loc indexer, both the start and stop bounds are inclusive, which is not a common slicing style for Python, where usually the stop bound is exclusive.

It is possible to leave one of the slicing ends open, if we need all the rows up to and including a specific row (e.g., df.loc[:'row_4']), or starting from a specific row up to the end (e.g., df.loc['row_4':]):

# Selecting all the rows up to and including 'row_4'
df.loc[:'row_4']

	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9
row_1	1	11	21	31	41	51	61	71	81
row_2	2	12	22	32	42	52	62	72	82
row_3	3	13	23	33	43	53	63	73	83
row_4	4	14	24	34	44	54	64	74	84

In order to select multiple rows and columns, practically meaning a subset of data points of a dataframe at the intersection of those rows and columns, we pass to the loc indexer two arguments separated by a comma: the necessary row labels and column labels. Both for row and column labels, it can be a list of labels, a slice of labels (or an open-ending slice), or a single label as a string. Some examples:

df.loc[['row_4', 'row_2'], ['col_5', 'col_2', 'col_9']]

	col_5	col_2	col_9
row_4	44	14	84
row_2	42	12	82

df.loc[['row_4', 'row_2'], 'col_5':'col_7']

	col_5	col_6	col_7
row_4	44	54	64
row_2	42	52	62

df.loc['row_4':'row_6', 'col_5':]

	col_5	col_6	col_7	col_8	col_9
row_4	44	54	64	74	84
row_5	45	55	65	75	85
row_6	46	56	66	76	86

df.loc[:'row_4', 'col_5']

row_1    41
row_2    42
row_3    43
row_4    44
Name: col_5, dtype: int64

A particular case is when we need to explicitly get one value from a dataframe. For this purpose, we pass the necessary row and column labels as two arguments separated by a comma:

df.loc['row_6', 'col_3']

Using the `at` Indexer

For the last case from the previous section, i.e., for selecting only one value from a dataframe, there is a faster method – using the at indexer. The syntax is identical to the one of the loc indexer, only that here we always use exactly two labels (for the row and column) separated by a comma:

df.at['row_6', 'col_3']

Position-based Dataframe Indexing

Using this approach, aka location-based or integer-based, each dataframe element (row, column, or data point) is referred to by its position number rather than the label. The position numbers are integers starting from 0 for the first row or column (typical Python 0-based indexing) and increasing by 1 for each subsequent row/column.

The critical difference between label-based and position-based dataframe indexing approaches is in the way of dataframe slicing: for the position-based indexing, it is purely Python-style, i.e, the start bound of the range is inclusive while the stop bound is exclusive. For the label-based indexing, the stop bound is inclusive.

Using the Indexing Operator

To retrieve all data from multiple sequential rows of a pandas dataframe, we can simply use the indexing operator [] and a range of the necessary row positions (it can be an open-ending range):

df[3:6]

	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9
row_4	4	14	24	34	44	54	64	74	84
row_5	5	15	25	35	45	55	65	75	85
row_6	6	16	26	36	46	56	66	76	86

df[:3]

	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9
row_1	1	11	21	31	41	51	61	71	81
row_2	2	12	22	32	42	52	62	72	82
row_3	3	13	23	33	43	53	63	73	83

Note that the indexing operator doesn't work for selecting a single row.

Using the `iloc` Indexer

This is the most common way of selecting dataframe rows and columns by their position numbers. The syntax of this method is very similar to that of the loc indexer and also implies using the indexing operator [].

To access a single row or multiple rows, we pass one argument to the iloc indexer representing the corresponding row position, a list of row positions (if the rows are disparate), or a range of row positions (if the rows are sequential):

# Selecting one row 
df.iloc[3]

col_1     4
col_2    14
col_3    24
col_4    34
col_5    44
col_6    54
col_7    64
col_8    74
col_9    84
Name: row_4, dtype: int64

# Selecting disparate rows in the necessary order
df.iloc[[9, 8, 7]]

	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9
row_10	10	20	30	40	50	60	70	80	90
row_9	9	19	29	39	49	59	69	79	89
row_8	8	18	28	38	48	58	68	78	88

# Selecting a slice of sequential rows
df.iloc[3:6]

	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9
row_4	4	14	24	34	44	54	64	74	84
row_5	5	15	25	35	45	55	65	75	85
row_6	6	16	26	36	46	56	66	76	86

# Selecting a slice of sequential rows (an open-ending range)
df.iloc[:3]

	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9
row_1	1	11	21	31	41	51	61	71	81
row_2	2	12	22	32	42	52	62	72	82
row_3	3	13	23	33	43	53	63	73	83

Pay attention that in the last two pieces of code, the stop bound of the range is not included, as discussed earlier.

In all the other cases, to select a subset of data of any size (one column, multiple columns, multiple rows and columns together, individual data points), we pass two arguments to the iloc indexer: the necessary row position numbers and column position numbers. For both arguments, the potential types of values can be:

an integer (to select a single row/column),
a list of integers (multiple disparate rows/columns),
a range of integers (multiple sequential rows/columns),
an open-ending range of integers (multiple sequential rows/columns up to and excluding a specific position number (e.g., df.iloc[:4, 1]), or starting from a specific number up to the end (e.g., df.loc[4:, 1]),
a colon (to select all the rows/columns).

Some examples:

# Selecting an individual data point
df.iloc[0, 0]

# Selecting one row and multiple disparate columns 
df.iloc[0, [2, 8, 3]]

col_3    21
col_9    81
col_4    31
Name: row_1, dtype: int64

# Selecting multiple disparate rows and multiple sequential columns 
df.iloc[[8, 1, 9], 5:9]

	col_6	col_7	col_8	col_9
row_9	59	69	79	89
row_2	52	62	72	82
row_10	60	70	80	90

# Selecting multiple sequential rows and multiple sequential columns (with open-ending ranges)
df.iloc[:3, 6:]

	col_7	col_8	col_9
row_1	61	71	81
row_2	62	72	82
row_3	63	73	83

# Selecting all rows and multiple disparate columns 
df.iloc[:, [1, 3, 7]]

	col_2	col_4	col_8
row_1	11	31	71
row_2	12	32	72
row_3	13	33	73
row_4	14	34	74
row_5	15	35	75
row_6	16	36	76
row_7	17	37	77
row_8	18	38	78
row_9	19	39	79
row_10	20	40	80

Note that if we pass at least one position number out of bounds to the iloc indexer (whether it is a single integer for an entire row/column or one of the integers from a list), an IndexError will be thrown:

df.iloc[[1, 3, 100]]

---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_list_axis(self, key, axis)
   1529         try:
-> 1530             return self.obj._take_with_is_copy(key, axis=axis)
   1531         except IndexError as err:

~\anaconda3\lib\site-packages\pandas\core\generic.py in _take_with_is_copy(self, indices, axis)
   3627         """
-> 3628         result = self.take(indices=indices, axis=axis)
   3629         # Maybe set copy if we didn't actually change the index.

~\anaconda3\lib\site-packages\pandas\core\generic.py in take(self, indices, axis, is_copy, **kwargs)
   3614 
-> 3615         new_data = self._mgr.take(
   3616             indices, axis=self._get_block_manager_axis(axis), verify=True

~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in take(self, indexer, axis, verify)
    861         n = self.shape[axis]
--> 862         indexer = maybe_convert_indices(indexer, n, verify=verify)
    863 

~\anaconda3\lib\site-packages\pandas\core\indexers.py in maybe_convert_indices(indices, n, verify)
    291         if mask.any():
--> 292             raise IndexError("indices are out-of-bounds")
    293     return indices

IndexError: indices are out-of-bounds

The above exception was the direct cause of the following exception:

IndexError                                Traceback (most recent call last)

~\AppData\Local\Temp/ipykernel_2100/47693281.py in 
----> 1 df.iloc[[1, 3, 100]]

~\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
    929 
    930             maybe_callable = com.apply_if_callable(key, self.obj)
--> 931             return self._getitem_axis(maybe_callable, axis=axis)
    932 
    933     def _is_scalar_access(self, key: tuple):

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1555         # a list of integers
   1556         elif is_list_like_indexer(key):
-> 1557             return self._get_list_axis(key, axis=axis)
   1558 
   1559         # a single integer

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_list_axis(self, key, axis)
   1531         except IndexError as err:
   1532             # re-raise with different error message
-> 1533             raise IndexError("positional indexers are out-of-bounds") from err
   1534 
   1535     def _getitem_axis(self, key, axis: int):

IndexError: positional indexers are out-of-bounds

However, for a range of integers (a slice of sequential rows or columns), the out-of-bounds position numbers are allowed:

df.iloc[1, 5:100]

col_6    52
col_7    62
col_8    72
col_9    82
Name: row_2, dtype: int64

This works also for the indexing operator discussed in the previous section:

df[7:1000]

	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9
row_8	8	18	28	38	48	58	68	78	88
row_9	9	19	29	39	49	59	69	79	89
row_10	10	20	30	40	50	60	70	80	90

Using the `iat` Indexer

To select only one value from a dataframe, we can use the iat indexer, which performs faster than iloc. The syntax is identical to the one of the iloc indexer, only that here we always use exactly two integers (the row and column position numbers):

df.iat[1, 2]

Boolean Dataframe Indexing

Apart from label-based or position-based pandas dataframe indexing, it is possible to select a subset from a dataframe based on a certain condition:

df[df['col_2'] > 15]

	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9
row_6	6	16	26	36	46	56	66	76	86
row_7	7	17	27	37	47	57	67	77	87
row_8	8	18	28	38	48	58	68	78	88
row_9	9	19	29	39	49	59	69	79	89
row_10	10	20	30	40	50	60	70	80	90

The piece of code above returns all the rows of our dataframe where the values of the col_2 column are greater than 15. The condition applied (in our case – df['col_2'] > 15) is a boolean vector that has the same length as the dataframe index and checks if each row of the dataframe satisfies the defined criterion.

We can use also any other comparison operators:

== equals,
!= not equals,
> greater than,
< less than,
>= greater than or equal to,
<= less than or equal to.

It is possible to define a boolean condition for a string column too (the comparison operators == and != make sense in such cases).

Also, we can define several criteria on the same column or multiple columns. The operators to be used for this purpose are & (and), | (or), ~ (not). Each condition must be put in a separate pair of brackets:

# Selecting all the rows of the dataframe where the value of col_2 is greater than 15 but not equal to 19
df[(df['col_2'] > 15) & (df['col_2'] != 19)]

	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9
row_6	6	16	26	36	46	56	66	76	86
row_7	7	17	27	37	47	57	67	77	87
row_8	8	18	28	38	48	58	68	78	88
row_10	10	20	30	40	50	60	70	80	90

# Selecting all the rows of the dataframe where the value of col_2 is greater than 15 
# or the value of col_5 is equal to 42
df[(df['col_2'] > 15) | (df['col_5'] == 42)]

	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9
row_2	2	12	22	32	42	52	62	72	82
row_6	6	16	26	36	46	56	66	76	86
row_7	7	17	27	37	47	57	67	77	87
row_8	8	18	28	38	48	58	68	78	88
row_9	9	19	29	39	49	59	69	79	89
row_10	10	20	30	40	50	60	70	80	90

# Selecting all the rows of the dataframe where the value of col_2 is NOT greater than 15 
df[~(df['col_2'] > 15)]

	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9
row_1	1	11	21	31	41	51	61	71	81
row_2	2	12	22	32	42	52	62	72	82
row_3	3	13	23	33	43	53	63	73	83
row_4	4	14	24	34	44	54	64	74	84
row_5	5	15	25	35	45	55	65	75	85

Combination of Dataframe Indexing Approaches

Finally, we can combine label-based, position-based, and boolean dataframe indexing approaches in various ways. For this purpose, we should apply the loc indexer again and access the positional numbers using the index attribute for rows and columns attribute for columns:

df.loc[df.index[[3, 4]], ['col_3', 'col_7']]

	col_3	col_7
row_4	24	64
row_5	25	65

df.loc['row_3':'row_6', df.columns[[0, 5]]]

	col_1	col_6
row_3	3	53
row_4	4	54
row_5	5	55
row_6	6	56

df.loc[df['col_4'] > 35, 'col_4':'col_7']

	col_4	col_5	col_6	col_7
row_6	36	46	56	66
row_7	37	47	57	67
row_8	38	48	58	68
row_9	39	49	59	69
row_10	40	50	60	70

Conclusion

To sum up, in this tutorial, we explored indexing dataframes in pandas in-depth. We learned many things:

What pandas dataframe indexing is
The purposes of dataframe indexing
What dataframe slicing is
What row and column labels are
How to check the current row and column labels of a dataframe
What row and column position numbers are
The main approaches for indexing dataframes in pandas
The critical difference between label-based and position-based dataframe indexing
Why using the attribute access to select a dataframe column is not preferable
How boolean dataframe indexing works
The most common methods related to each dataframe indexing approach, their syntax for selecting different types of subsets, and their limitations
How to combine different indexing approaches

Pandas

Tutorials

Tutorial: Indexing DataFrames in Pandas

What is Indexing Dataframes in Pandas?

Label-based Dataframe Indexing

Using the Indexing Operator

Using the Attribute Operator

Using the `loc` Indexer

Using the `at` Indexer

Position-based Dataframe Indexing

Using the Indexing Operator

Using the `iloc` Indexer

Using the `iat` Indexer

Boolean Dataframe Indexing

Combination of Dataframe Indexing Approaches

Conclusion

NLP Project Part 1: Scraping the Web to Gather Data

Python Cheat Sheet for Data Science: Intermediate

	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9
0	1	11	21	31	41	51	61	71	81
1	2	12	22	32	42	52	62	72	82
2	3	13	23	33	43	53	63	73	83
3	4	14	24	34	44	54	64	74	84
4	5	15	25	35	45	55	65	75	85
5	6	16	26	36	46	56	66	76	86
6	7	17	27	37	47	57	67	77	87
7	8	18	28	38	48	58	68	78	88
8	9	19	29	39	49	59	69	79	89
9	10	20	30	40	50	60	70	80	90

	col_5	col_1	col_8
0	41	1	71
1	42	2	72
2	43	3	73
3	44	4	74
4	45	5	75
5	46	6	76
6	47	7	77
7	48	8	78
8	49	9	79
9	50	10	80

	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9
0	1	11	21	31	41	51	61	71	81
1	2	12	22	32	42	52	62	72	82
2	3	13	23	33	43	53	63	73	83
3	4	14	24	34	44	54	64	74	84
4	5	15	25	35	45	55	65	75	85
5	6	16	26	36	46	56	66	76	86
6	7	17	27	37	47	57	67	77	87
7	8	18	28	38	48	58	68	78	88
8	9	19	29	39	49	59	69	79	89
9	10	20	30	40	50	60	70	80	90

	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9
0	1	11	21	31	41	51	61	71	81
1	2	12	22	32	42	52	62	72	82
2	3	13	23	33	43	53	63	73	83
3	4	14	24	34	44	54	64	74	84
4	5	15	25	35	45	55	65	75	85
5	6	16	26	36	46	56	66	76	86
6	7	17	27	37	47	57	67	77	87
7	8	18	28	38	48	58	68	78	88
8	9	19	29	39	49	59	69	79	89
9	10	20	30	40	50	60	70	80	90

	col_5	col_1	col_8
0	41	1	71
1	42	2	72
2	43	3	73
3	44	4	74
4	45	5	75
5	46	6	76
6	47	7	77
7	48	8	78
8	49	9	79
9	50	10	80

	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9
0	1	11	21	31	41	51	61	71	81
1	2	12	22	32	42	52	62	72	82
2	3	13	23	33	43	53	63	73	83
3	4	14	24	34	44	54	64	74	84
4	5	15	25	35	45	55	65	75	85
5	6	16	26	36	46	56	66	76	86
6	7	17	27	37	47	57	67	77	87
7	8	18	28	38	48	58	68	78	88
8	9	19	29	39	49	59	69	79	89
9	10	20	30	40	50	60	70	80	90

Tutorial: Indexing DataFrames in Pandas

What is Indexing Dataframes in Pandas?

Label-based Dataframe Indexing

Using the Indexing Operator

Using the Attribute Operator

Using the loc Indexer

Using the at Indexer

Position-based Dataframe Indexing

Using the Indexing Operator

Using the iloc Indexer

Using the iat Indexer

Boolean Dataframe Indexing

Combination of Dataframe Indexing Approaches

Conclusion

More learning resources

NLP Project Part 1: Scraping the Web to Gather Data

Python Cheat Sheet for Data Science: Intermediate

Using the `loc` Indexer

Using the `at` Indexer

Using the `iloc` Indexer

Using the `iat` Indexer

	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9
0	1	11	21	31	41	51	61	71	81
1	2	12	22	32	42	52	62	72	82
2	3	13	23	33	43	53	63	73	83
3	4	14	24	34	44	54	64	74	84
4	5	15	25	35	45	55	65	75	85
5	6	16	26	36	46	56	66	76	86
6	7	17	27	37	47	57	67	77	87
7	8	18	28	38	48	58	68	78	88
8	9	19	29	39	49	59	69	79	89
9	10	20	30	40	50	60	70	80	90

	col_5	col_1	col_8
0	41	1	71
1	42	2	72
2	43	3	73
3	44	4	74
4	45	5	75
5	46	6	76
6	47	7	77
7	48	8	78
8	49	9	79
9	50	10	80

	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9
0	1	11	21	31	41	51	61	71	81
1	2	12	22	32	42	52	62	72	82
2	3	13	23	33	43	53	63	73	83
3	4	14	24	34	44	54	64	74	84
4	5	15	25	35	45	55	65	75	85
5	6	16	26	36	46	56	66	76	86
6	7	17	27	37	47	57	67	77	87
7	8	18	28	38	48	58	68	78	88
8	9	19	29	39	49	59	69	79	89
9	10	20	30	40	50	60	70	80	90