Grouping Data: A Step-by-Step Tutorial to GroupBy in Pandas
In this tutorial, we will explore how to create a GroupBy object in pandas library of Python and how this object works. We will take a detailed look at each step of a grouping process, what methods can be applied to a GroupBy object, and what information we can extract from it.
The 3 Steps of a Groupby Process
Any groupby process involves some combination of the following 3 steps:
- Splitting the original object into groups based on the defined criteria.
- Applying a function to each group.
- Combining the results.
Let's explore this split-apply-combine chain step-by-step with an example from a Kaggle Nobel Prize Dataset:
import pandas as pd
import numpy as np
pd.set_option('max_columns', None)
df = pd.read_csv('complete.csv')
df = df[['awardYear', 'category', 'prizeAmount', 'prizeAmountAdjusted', 'name', 'gender', 'birth_continent']]
df.head()
awardYear | category | prizeAmount | prizeAmountAdjusted | name | gender | birth_continent | |
---|---|---|---|---|---|---|---|
0 | 2001 | Economic Sciences | 10000000 | 12295082 | A. Michael Spence | male | North America |
1 | 1975 | Physics | 630000 | 3404179 | Aage N. Bohr | male | Europe |
2 | 2004 | Chemistry | 10000000 | 11762861 | Aaron Ciechanover | male | Asia |
3 | 1982 | Chemistry | 1150000 | 3102518 | Aaron Klug | male | Europe |
4 | 1979 | Physics | 800000 | 2988048 | Abdus Salam | male | Asia |
Splitting the Original Object into Groups
At this stage, we call the pandas DataFrame.groupby()
function. We use it to split the data into groups based on predefined criteria, along rows (by default, axis=0
), or columns (axis=1
). In other words, this function maps the labels to the names of the groups.
For example, in our case, we can group the data on Nobel prizes by prize category:
grouped = df.groupby('category')
It is also possible to use more than one column to perform data grouping, passing a list of columns. Let's group our data first by prize category and then, inside each of the created groups, we will apply additional grouping based on the year of the prize award:
grouped_category_year = df.groupby(['category', 'awardYear'])
Now, if we try to print one of the two GroupBy objects that we created, we will actually see no groups:
print(grouped)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000026083789DF0>
It is important to note that creating a GroupBy object only checks if we have passed a correct mapping; it doesn't really perform any of the operations of the split-apply-combine chain until we explicitly use some method on this object or extract some of its attributes.
To briefly inspect the resulting GroupBy object and check how exactly the groups were split, we can extract from it the groups
or indices
attribute. Both of them return a dictionary where the keys are the created groups and the values are the lists of axis labels (for the groups
attribute) or indices (for the indices
attribute) of the instances of each group in the original DataFrame:
grouped.indices
{'Chemistry': array([ 2, 3, 7, 9, 10, 11, 13, 14, 15, 17, 19, 39, 62,
64, 66, 71, 75, 80, 81, 86, 92, 104, 107, 112, 129, 135,
153, 169, 175, 178, 181, 188, 197, 199, 203, 210, 215, 223, 227,
239, 247, 249, 258, 264, 265, 268, 272, 274, 280, 282, 284, 289,
296, 298, 310, 311, 317, 318, 337, 341, 343, 348, 352, 357, 362,
365, 366, 372, 374, 384, 394, 395, 396, 415, 416, 419, 434, 440,
442, 444, 446, 448, 450, 455, 456, 459, 461, 463, 465, 469, 475,
504, 505, 508, 518, 522, 523, 524, 539, 549, 558, 559, 563, 567,
571, 572, 585, 591, 596, 599, 627, 630, 632, 641, 643, 644, 648,
659, 661, 666, 667, 668, 671, 673, 679, 681, 686, 713, 715, 717,
719, 720, 722, 723, 725, 726, 729, 732, 738, 742, 744, 746, 751,
756, 759, 763, 766, 773, 776, 798, 810, 813, 814, 817, 827, 828,
829, 832, 839, 848, 853, 855, 862, 866, 880, 885, 886, 888, 889,
892, 894, 897, 902, 904, 914, 915, 920, 921, 922, 940, 941, 943,
946, 947], dtype=int64),
'Economic Sciences': array([ 0, 5, 45, 46, 58, 90, 96, 139, 140, 145, 152, 156, 157,
180, 187, 193, 207, 219, 231, 232, 246, 250, 269, 279, 283, 295,
305, 324, 346, 369, 418, 422, 425, 426, 430, 432, 438, 458, 467,
476, 485, 510, 525, 527, 537, 538, 546, 580, 594, 595, 605, 611,
636, 637, 657, 669, 670, 678, 700, 708, 716, 724, 734, 737, 739,
745, 747, 749, 750, 753, 758, 767, 800, 805, 854, 856, 860, 864,
871, 882, 896, 912, 916, 924], dtype=int64),
'Literature': array([ 21, 31, 40, 49, 52, 98, 100, 101, 102, 111, 115, 142, 149,
159, 170, 177, 201, 202, 220, 221, 233, 235, 237, 253, 257, 259,
275, 277, 278, 286, 312, 315, 316, 321, 326, 333, 345, 347, 350,
355, 359, 364, 370, 373, 385, 397, 400, 403, 406, 411, 435, 439,
441, 454, 468, 479, 480, 482, 483, 492, 501, 506, 511, 516, 556,
569, 581, 602, 604, 606, 613, 614, 618, 631, 633, 635, 640, 652,
653, 655, 656, 665, 675, 683, 699, 761, 765, 771, 774, 777, 779,
780, 784, 786, 788, 796, 799, 803, 836, 840, 842, 850, 861, 867,
868, 878, 881, 883, 910, 917, 919, 927, 928, 929, 930, 936],
dtype=int64),
'Peace': array([ 6, 12, 16, 25, 26, 27, 34, 36, 44, 47, 48, 54, 61,
65, 72, 78, 79, 82, 95, 99, 116, 119, 120, 126, 137, 146,
151, 166, 167, 171, 200, 204, 205, 206, 209, 213, 225, 236, 240,
244, 255, 260, 266, 267, 270, 287, 303, 320, 329, 356, 360, 361,
377, 386, 387, 388, 389, 390, 391, 392, 393, 433, 447, 449, 471,
477, 481, 489, 491, 500, 512, 514, 517, 528, 529, 530, 533, 534,
540, 542, 544, 545, 547, 553, 555, 560, 562, 574, 578, 590, 593,
603, 607, 608, 609, 612, 615, 616, 617, 619, 620, 628, 634, 639,
642, 664, 677, 688, 697, 703, 705, 710, 727, 736, 787, 793, 795,
806, 823, 846, 847, 852, 865, 875, 876, 877, 895, 926, 934, 935,
937, 944, 948, 949], dtype=int64),
'Physics': array([ 1, 4, 8, 20, 23, 24, 30, 32, 38, 51, 59, 60, 67,
68, 69, 70, 74, 84, 89, 97, 103, 105, 108, 109, 114, 117,
118, 122, 125, 127, 128, 130, 133, 141, 143, 144, 155, 162, 163,
164, 165, 168, 173, 174, 176, 179, 183, 195, 212, 214, 216, 222,
224, 228, 230, 234, 238, 241, 243, 251, 256, 263, 271, 276, 291,
292, 297, 301, 306, 307, 308, 323, 327, 328, 330, 335, 336, 338,
349, 351, 353, 354, 363, 367, 375, 376, 378, 381, 382, 398, 399,
402, 404, 405, 408, 410, 412, 413, 420, 421, 424, 428, 429, 436,
445, 451, 453, 457, 460, 462, 470, 472, 487, 495, 498, 499, 509,
513, 515, 521, 526, 532, 535, 536, 541, 548, 550, 552, 557, 561,
564, 565, 566, 573, 576, 577, 579, 583, 586, 588, 592, 601, 610,
621, 622, 623, 629, 647, 650, 651, 654, 658, 674, 676, 682, 684,
690, 691, 693, 694, 695, 696, 698, 702, 707, 711, 714, 721, 730,
731, 735, 743, 752, 755, 770, 772, 775, 781, 785, 790, 792, 797,
801, 802, 808, 822, 833, 834, 835, 844, 851, 870, 872, 879, 884,
887, 890, 893, 900, 901, 903, 905, 907, 908, 909, 913, 925, 931,
932, 933, 938, 942, 945], dtype=int64),
'Physiology or Medicine': array([ 18, 22, 28, 29, 33, 35, 37, 41, 42, 43, 50, 53, 55,
56, 57, 63, 73, 76, 77, 83, 85, 87, 88, 91, 93, 94,
106, 110, 113, 121, 123, 124, 131, 132, 134, 136, 138, 147, 148,
150, 154, 158, 160, 161, 172, 182, 184, 185, 186, 189, 190, 191,
192, 194, 196, 198, 208, 211, 217, 218, 226, 229, 242, 245, 248,
252, 254, 261, 262, 273, 281, 285, 288, 290, 293, 294, 299, 300,
302, 304, 309, 313, 314, 319, 322, 325, 331, 332, 334, 339, 340,
342, 344, 358, 368, 371, 379, 380, 383, 401, 407, 409, 414, 417,
423, 427, 431, 437, 443, 452, 464, 466, 473, 474, 478, 484, 486,
488, 490, 493, 494, 496, 497, 502, 503, 507, 519, 520, 531, 543,
551, 554, 568, 570, 575, 582, 584, 587, 589, 597, 598, 600, 624,
625, 626, 638, 645, 646, 649, 660, 662, 663, 672, 680, 685, 687,
689, 692, 701, 704, 706, 709, 712, 718, 728, 733, 740, 741, 748,
754, 757, 760, 762, 764, 768, 769, 778, 782, 783, 789, 791, 794,
804, 807, 809, 811, 812, 815, 816, 818, 819, 820, 821, 824, 825,
826, 830, 831, 837, 838, 841, 843, 845, 849, 857, 858, 859, 863,
869, 873, 874, 891, 898, 899, 906, 911, 918, 923, 939], dtype=int64)}
To find the number of groups in a GroupBy object, we can extract from it the ngroups
attribute or call the len
function of the Python standard library:
print(grouped.ngroups)
print(len(grouped))
6
6
If we need to visualize all or some entries of each group, we can iterate over the GroupBy object:
for name, entries in grouped:
print(f'First 2 entries for the "{name}" category:')
print(30*'-')
print(entries.head(2), '\n\n')
First 2 entries for the "Chemistry" category:
------------------------------
awardYear category prizeAmount prizeAmountAdjusted name \
2 2004 Chemistry 10000000 11762861 Aaron Ciechanover
3 1982 Chemistry 1150000 3102518 Aaron Klug
gender birth_continent
2 male Asia
3 male Europe
First 2 entries for the "Economic Sciences" category:
------------------------------
awardYear category prizeAmount prizeAmountAdjusted \
0 2001 Economic Sciences 10000000 12295082
5 2019 Economic Sciences 9000000 9000000
name gender birth_continent
0 A. Michael Spence male North America
5 Abhijit Banerjee male Asia
First 2 entries for the "Literature" category:
------------------------------
awardYear category prizeAmount prizeAmountAdjusted \
21 1957 Literature 208629 2697789
31 1970 Literature 400000 3177966
name gender birth_continent
21 Albert Camus male Africa
31 Alexandr Solzhenitsyn male Europe
First 2 entries for the "Peace" category:
------------------------------
awardYear category prizeAmount prizeAmountAdjusted \
6 2019 Peace 9000000 9000000
12 1980 Peace 880000 2889667
name gender birth_continent
6 Abiy Ahmed Ali male Africa
12 Adolfo Pérez Esquivel male South America
First 2 entries for the "Physics" category:
------------------------------
awardYear category prizeAmount prizeAmountAdjusted name gender \
1 1975 Physics 630000 3404179 Aage N. Bohr male
4 1979 Physics 800000 2988048 Abdus Salam male
birth_continent
1 Europe
4 Asia
First 2 entries for the "Physiology or Medicine" category:
------------------------------
awardYear category prizeAmount prizeAmountAdjusted \
18 1963 Physiology or Medicine 265000 2839286
22 1974 Physiology or Medicine 550000 3263449
name gender birth_continent
18 Alan Hodgkin male Europe
22 Albert Claude male Europe
If instead, we want to select a single group in a form of DataFrame, we should use the method get_group()
on the GroupBy object:
grouped.get_group('Economic Sciences')
awardYear | category | prizeAmount | prizeAmountAdjusted | name | gender | birth_continent | |
---|---|---|---|---|---|---|---|
0 | 2001 | Economic Sciences | 10000000 | 12295082 | A. Michael Spence | male | North America |
5 | 2019 | Economic Sciences | 9000000 | 9000000 | Abhijit Banerjee | male | Asia |
45 | 2012 | Economic Sciences | 8000000 | 8361204 | Alvin E. Roth | male | North America |
46 | 1998 | Economic Sciences | 7600000 | 9713701 | Amartya Sen | male | Asia |
58 | 2015 | Economic Sciences | 8000000 | 8384572 | Angus Deaton | male | Europe |
... | ... | ... | ... | ... | ... | ... | ... |
882 | 2002 | Economic Sciences | 10000000 | 12034660 | Vernon L. Smith | male | North America |
896 | 1973 | Economic Sciences | 510000 | 3331882 | Wassily Leontief | male | Europe |
912 | 2018 | Economic Sciences | 9000000 | 9000000 | William D. Nordhaus | male | North America |
916 | 1990 | Economic Sciences | 4000000 | 6329114 | William F. Sharpe | male | North America |
924 | 1996 | Economic Sciences | 7400000 | 9490424 | William Vickrey | male | North America |
84 rows × 7 columns
Applying a Function by Group
After splitting the original data and (optionally) inspecting the resulting groups, we can perform one of the following operations or their combination (not necessarily in the given order) on each group:
- Aggregation: calculating a summary statistic for each group (e.g., group sizes, means, medians, or sums) and outputting a single number for many data points.
- Transformation: conducting some operations by group, such as calculating the z-score for each group.
- Filtration: rejecting some groups based on a predefined condition, such as group size, mean, median, or sum. This can also include filtering out particular rows from each group.
Aggregation
To aggregate the data of a GroupBy object (i.e., to calculate a summary statistic by group), we can use the agg()
method on the object:
# Showing only 1 decimal for all float numbers
pd.options.display.float_format = '{:.1f}'.format
grouped.agg(np.mean)
awardYear | prizeAmount | prizeAmountAdjusted | |
---|---|---|---|
category | |||
Chemistry | 1972.3 | 3629279.4 | 6257868.1 |
Economic Sciences | 1996.1 | 6105845.2 | 7837779.2 |
Literature | 1960.9 | 2493811.2 | 5598256.3 |
Peace | 1964.5 | 3124879.2 | 6163906.9 |
Physics | 1971.1 | 3407938.6 | 6086978.2 |
Physiology or Medicine | 1970.4 | 3072972.9 | 5738300.7 |
The code above produces a DataFrame with the group names as its new index and the mean values for each numeric column by group.
Instead of using the agg()
method, we can apply the corresponding pandas method directly on a GroupBy object. The most common methods are mean()
, median()
, mode()
, sum()
, size()
, count()
, min()
, max()
, std()
, var()
(computes the variance of each group), describe()
(outputs descriptive statistics by group), and nunique()
(gives the number of unique values in each group).
grouped.sum()
awardYear | prizeAmount | prizeAmountAdjusted | |
---|---|---|---|
category | |||
Chemistry | 362912 | 667787418 | 1151447726 |
Economic Sciences | 167674 | 512891000 | 658373449 |
Literature | 227468 | 289282102 | 649397731 |
Peace | 263248 | 418733807 | 825963521 |
Physics | 419837 | 725890928 | 1296526352 |
Physiology or Medicine | 431508 | 672981066 | 1256687857 |
Usually, we are interested only in the statistics for some particular column or columns, so we need to specify it (or them). In the example above, we definitely don't want to sum all the years. Instead, we may want to sum the prize values by prize category. For this purpose, we can select the prizeAmountAdjusted
column of the GroupBy object, just like we select a column of a DataFrame, and apply the sum()
function on it:
grouped['prizeAmountAdjusted'].sum()
category
Chemistry 1151447726
Economic Sciences 658373449
Literature 649397731
Peace 825963521
Physics 1296526352
Physiology or Medicine 1256687857
Name: prizeAmountAdjusted, dtype: int64
For the piece of code above (as well as for some of the next examples), we can use an equivalent syntax applying the function on a GroupBy object before selecting the necessary column: grouped.sum()['prizeAmountAdjusted']
. However, the previous syntax is preferable since it performs better, especially on large datasets.
If we need to aggregate data for two or more columns, we use double square brackets:
grouped[['prizeAmount', 'prizeAmountAdjusted']].sum()
prizeAmount | prizeAmountAdjusted | |
---|---|---|
category | ||
Chemistry | 667787418 | 1151447726 |
Economic Sciences | 512891000 | 658373449 |
Literature | 289282102 | 649397731 |
Peace | 418733807 | 825963521 |
Physics | 725890928 | 1296526352 |
Physiology or Medicine | 672981066 | 1256687857 |
It is possible to apply several functions at once to one or more columns of a GroupBy object. For this purpose, we again need the agg()
method and a list of the functions of interest:
grouped[['prizeAmount', 'prizeAmountAdjusted']].agg([np.sum, np.mean, np.std])
prizeAmount | prizeAmountAdjusted | |||||
---|---|---|---|---|---|---|
sum | mean | std | sum | mean | std | |
category | ||||||
Chemistry | 667787418 | 3629279.4 | 4070588.4 | 1151447726 | 6257868.1 | 3276027.2 |
Economic Sciences | 512891000 | 6105845.2 | 3787630.1 | 658373449 | 7837779.2 | 3313153.2 |
Literature | 289282102 | 2493811.2 | 3653734.0 | 649397731 | 5598256.3 | 3029512.1 |
Peace | 418733807 | 3124879.2 | 3934390.9 | 825963521 | 6163906.9 | 3189886.1 |
Physics | 725890928 | 3407938.6 | 4013073.0 | 1296526352 | 6086978.2 | 3294268.5 |
Physiology or Medicine | 672981066 | 3072972.9 | 3898539.3 | 1256687857 | 5738300.7 | 3241781.0 |
Also, we can consider applying different aggregation functions to different columns of a GroupBy object by passing a dictionary:
grouped.agg({'prizeAmount': [np.sum, np.size], 'prizeAmountAdjusted': np.mean})
prizeAmount | prizeAmountAdjusted | ||
---|---|---|---|
sum | size | mean | |
category | |||
Chemistry | 667787418 | 184 | 6257868.1 |
Economic Sciences | 512891000 | 84 | 7837779.2 |
Literature | 289282102 | 116 | 5598256.3 |
Peace | 418733807 | 134 | 6163906.9 |
Physics | 725890928 | 213 | 6086978.2 |
Physiology or Medicine | 672981066 | 219 | 5738300.7 |
Transformation
Unlike aggregation methods (and also unlike filtration ones, as we will see soon), transformation methods return a new DataFrame with the same shape and indexing as the original one but with transformed individual values. Here it is important to note that the transformation must not modify any values in the original DataFrame, meaning that such operations cannot be performed in place.
The most common pandas method to transform the data of a GroupBy object is transform()
. For example, it can be helpful for computing the z-score for each group:
grouped[['prizeAmount', 'prizeAmountAdjusted']].transform(lambda x: (x - x.mean()) / x.std())
prizeAmount | prizeAmountAdjusted | |
---|---|---|
0 | 1.0 | 1.3 |
1 | -0.7 | -0.8 |
2 | 1.6 | 1.7 |
3 | -0.6 | -1.0 |
4 | -0.6 | -0.9 |
... | ... | ... |
945 | -0.7 | -0.8 |
946 | -0.8 | -1.1 |
947 | -0.9 | 0.3 |
948 | -0.5 | -1.0 |
949 | -0.7 | -1.0 |
950 rows × 2 columns
With transformation methods, we can also replace missing data with the group mean, median, mode, or any other value:
grouped['gender'].transform(lambda x: x.fillna(x.mode()[0]))
0 male
1 male
2 male
3 male
4 male
...
945 male
946 male
947 female
948 male
949 male
Name: gender, Length: 950, dtype: object
There are some other pandas methods that we can use to transform the data of a GroupBy object: bfill()
, ffill()
, diff()
, pct_change()
, rank()
, shift()
, quantile()
, etc.
Filtration
Filtration methods discard the groups or particular rows from each group based on a predefined condition and return a subset of the original data. For example, we may want to keep only the values of a certain column from all the groups where the group mean for that column is greater than a predefined value. In the case of our DataFrame, let's filter out all the groups with a group mean for the prizeAmountAdjusted
column smaller than 7,000,000, and keep only this column in the output:
grouped['prizeAmountAdjusted'].filter(lambda x: x.mean() > 7000000)
0 12295082
5 9000000
45 8361204
46 9713701
58 8384572
...
882 12034660
896 3331882
912 9000000
916 6329114
924 9490424
Name: prizeAmountAdjusted, Length: 84, dtype: int64
Another example is filtering out the groups with more than a certain number of elements:
grouped['prizeAmountAdjusted'].filter(lambda x: len(x) < 100)
0 12295082
5 9000000
45 8361204
46 9713701
58 8384572
...
882 12034660
896 3331882
912 9000000
916 6329114
924 9490424
Name: prizeAmountAdjusted, Length: 84, dtype: int64
In both the operations above, we used the filter()
method passing a lambda function as an argument. Such a function, applied to the entire group, returns True
or False
based on the result of comparison of that group with predifined statistical condition. In other words, the function inside the filter()
method decides which groups to keep in the new dataframe.
Along with filtering out the whole groups, it is also possible to discard certain rows from each group. Some helpful methods here are first()
, last()
, and nth()
. Applying one of them on a GroupBy object returns the first/last/nth entry of each group correspondingly:
grouped.last()
awardYear | prizeAmount | prizeAmountAdjusted | name | gender | birth_continent | |
---|---|---|---|---|---|---|
category | ||||||
Chemistry | 1911 | 140695 | 7327865 | Marie Curie | female | Europe |
Economic Sciences | 1996 | 7400000 | 9490424 | William Vickrey | male | North America |
Literature | 1968 | 350000 | 3052326 | Yasunari Kawabata | male | Asia |
Peace | 1963 | 265000 | 2839286 | International Committee of the Red Cross | male | Asia |
Physics | 1972 | 480000 | 3345725 | John Bardeen | male | North America |
Physiology or Medicine | 2016 | 8000000 | 8301051 | Yoshinori Ohsumi | male | Asia |
For the nth()
method, we have to pass the integer representing the index of entry we want to return for each group:
grouped.nth(1)
awardYear | prizeAmount | prizeAmountAdjusted | name | gender | birth_continent | |
---|---|---|---|---|---|---|
category | ||||||
Chemistry | 1982 | 1150000 | 3102518 | Aaron Klug | male | Europe |
Economic Sciences | 2019 | 9000000 | 9000000 | Abhijit Banerjee | male | Asia |
Literature | 1970 | 400000 | 3177966 | Alexandr Solzhenitsyn | male | Europe |
Peace | 1980 | 880000 | 2889667 | Adolfo Pérez Esquivel | male | South America |
Physics | 1979 | 800000 | 2988048 | Abdus Salam | male | Asia |
Physiology or Medicine | 1974 | 550000 | 3263449 | Albert Claude | male | Europe |
The piece of code above gathers the second entries of all the groups, keeping in mind the 0-indexing in Python.
Two more methods for filtering out the rows in each group are head()
and tail()
, returning the first/last n rows (5, by default) of each group correspondingly:
grouped.head(3)
awardYear | category | prizeAmount | prizeAmountAdjusted | name | gender | birth_continent | |
---|---|---|---|---|---|---|---|
0 | 2001 | Economic Sciences | 10000000 | 12295082 | A. Michael Spence | male | North America |
1 | 1975 | Physics | 630000 | 3404179 | Aage N. Bohr | male | Europe |
2 | 2004 | Chemistry | 10000000 | 11762861 | Aaron Ciechanover | male | Asia |
3 | 1982 | Chemistry | 1150000 | 3102518 | Aaron Klug | male | Europe |
4 | 1979 | Physics | 800000 | 2988048 | Abdus Salam | male | Asia |
5 | 2019 | Economic Sciences | 9000000 | 9000000 | Abhijit Banerjee | male | Asia |
6 | 2019 | Peace | 9000000 | 9000000 | Abiy Ahmed Ali | male | Africa |
7 | 2009 | Chemistry | 10000000 | 10958504 | Ada E. Yonath | female | Asia |
8 | 2011 | Physics | 10000000 | 10545557 | Adam G. Riess | male | North America |
12 | 1980 | Peace | 880000 | 2889667 | Adolfo Pérez Esquivel | male | South America |
16 | 2007 | Peace | 10000000 | 11301989 | Al Gore | male | North America |
18 | 1963 | Physiology or Medicine | 265000 | 2839286 | Alan Hodgkin | male | Europe |
21 | 1957 | Literature | 208629 | 2697789 | Albert Camus | male | Africa |
22 | 1974 | Physiology or Medicine | 550000 | 3263449 | Albert Claude | male | Europe |
28 | 1937 | Physiology or Medicine | 158463 | 4716161 | Albert Szent-Györgyi | male | Europe |
31 | 1970 | Literature | 400000 | 3177966 | Alexandr Solzhenitsyn | male | Europe |
40 | 2013 | Literature | 8000000 | 8365867 | Alice Munro | female | North America |
45 | 2012 | Economic Sciences | 8000000 | 8361204 | Alvin E. Roth | male | North America |
Combining the Results
The last stage of the split-apply-combine chain — combining the results — is performed under the hood by pandas. It consists of taking the outputs of all of the operations conducted on a GroupBy object and uniting them back together, producing a new data structure, such as a Series or DataFrame. Assigning this data structure to a variable, we can use it to solve other tasks.
Conclusion
In this tutorial, we covered many aspects of using the pandas groupby function and working with the resulting object. In particular, we learned the following:
- The steps a grouping process includes
- How the split-apply-combine chain works, step-by-step
- How to create a GroupBy object
- How to briefly inspect a GroupBy object
- The attributes of a GroupBy object
- The operations that can be applied on a GroupBy object
- How to calculate summary statistics by group and what methods are available for this purpose
- How to apply several functions at once to one or more columns of a GroupBy object
- How to apply different aggregation functions to different columns of a GroupBy object
- How and why to transform the values from the original DataFrame
- How to filter the groups of a GroupBy object or particular rows of each group
- How pandas combines the results of a grouping process
- The data structures a grouping process produces
With this information, you're ready to start applying your skills in pandas groupby.