January 31, 2022

Tutorial: How to Create and Use a Pandas DataFrame

How to Create and Use a Pandas DataFrame

When it comes to exploring data with Python, DataFrames make analyzing and manipulating data for analysis easy. This article will look at some of the ins and outs when it comes to working with DataFrames.

Python is a powerful tool when it comes to working with data. Qualities like its scalability and variety of libraries for data analysis and data science applications make it versatile. However, what’s often under-appreciated-but-highly-valuable about Python is the ease with which we can manipulate data with flexible data structures. One of these structures is a DataFrame.

What’s a DataFrame?

To start, it’s important to know that there are a variety of different structures that data can take. For the majority of cases, most data are in tabular form (i.e., data structured into rows representing a single entry). You are likely already familiar with this if you’ve ever worked with an Excel spreadsheet or a SQL table. Aggregates of each of these rows, that represent a given data entry, and their properties are formed into a two-dimensional structure where titled columns consist of values of the same property.

These structures have several unique qualities:

  1. Rows — representing a singular data entry point
  2. Columns — corresponding to a grouping relating to a singular quality of each given data point that are usually titled
  3. Index — a unique identifier for each data entry

Create Pandas DataFrame

Although these can have different names depending on the programming language or application tool being used, in Python, we call these structures DataFrames. The principal library used in working with these structures is Pandas.

How do you make a DataFrame?

When it comes to creating a DataFrame, you can either import it from an external file or create it yourself in Python.

Method 1 — Import Data from a File

In the real world, a dataset is often read into Python via an external source that curated it. We can find these datasets in multiple types of files, but we most commonly find them in the form of comma separated value files (CSVs). Fortunately, in the Pandas library, it has a function that works to convert the data in this format into a DataFrame called pandas.read_csv(). The only major argument that it requires is a pathway that outlines where the file exists.

One pathway may be from the web (i.e., from an API or a GitHub repository)

import pandas as pd
import numpy as np 
import datetime as dt
# While not necessarily the case, you’ll often need to load the numpy library when working with the pandas library 
url = 'https://raw.githubusercontent.com/Vibe1990/Netflix-Project/main/netflix_title.csv'
netflix = pd.read_csv(url)
#  When providing the URL pathway, this will need to be in the form of a raw string, otherwise it will result in an error
netflix
show_id type title director cast country date_added release_year rating duration listed_in description
0 s1 TV Show 3% NaN João Miguel, Bianca Comparato, Michel Gomes, R… Brazil 14-Aug-20 2020 TV-MA 4 Seasons International TV Shows, TV Dramas, TV Sci-Fi &… In a future where the elite inhabit an island …
1 s2 Movie 7:19 Jorge Michel Grau Demián Bichir, Héctor Bonilla, Oscar Serrano, … Mexico 23-Dec-16 2016 TV-MA 93 min Dramas, International Movies After a devastating earthquake hits Mexico Cit…
2 s3 Movie 23:59 Gilbert Chan Tedd Chan, Stella Chung, Henley Hii, Lawrence … Singapore 20-Dec-18 2011 R 78 min Horror Movies, International Movies When an army recruit is found dead, his fellow…
3 s4 Movie 9 Shane Acker Elijah Wood, John C. Reilly, Jennifer Connelly… United States 16-Nov-17 2009 PG-13 80 min Action & Adventure, Independent Movies, Sci-Fi… In a postapocalyptic world, rag-doll robots hi…
4 s5 Movie 21 Robert Luketic Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar… United States 01-Jan-20 2008 PG-13 123 min Dramas A brilliant group of students become card-coun…

Alternatively, if a file were stored on your computer in a working directory, then the path would adjust accordingly. In this process, we could use either the relative or full path to specify the pathway to retrieve a given file because the function can decipher the difference between the two without an issue.

# Assuming you've set up your notebook to have the desired working directory set

netflix = pd.read_csv(“netflix_title.csv”)
netflix
show_id type title director cast country date_added release_year rating duration listed_in description
0 s1 TV Show 3% NaN João Miguel, Bianca Comparato, Michel Gomes, R… Brazil 14-Aug-20 2020 TV-MA 4 Seasons International TV Shows, TV Dramas, TV Sci-Fi &… In a future where the elite inhabit an island …
1 s2 Movie 7:19 Jorge Michel Grau Demián Bichir, Héctor Bonilla, Oscar Serrano, … Mexico 23-Dec-16 2016 TV-MA 93 min Dramas, International Movies After a devastating earthquake hits Mexico Cit…
2 s3 Movie 23:59 Gilbert Chan Tedd Chan, Stella Chung, Henley Hii, Lawrence … Singapore 20-Dec-18 2011 R 78 min Horror Movies, International Movies When an army recruit is found dead, his fellow…
3 s4 Movie 9 Shane Acker Elijah Wood, John C. Reilly, Jennifer Connelly… United States 16-Nov-17 2009 PG 80 min Action & Adventure, Independent Movies, Sci-Fi… In a postapocalyptic world, rag-doll robots hi…
4 s5 Movie 21 Robert Luketic Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar… United States 01-Jan-20 2008 PG 123 min Dramas A brilliant group of students become card-coun…
7782 s7783 Movie Zozo Josef Fares Imad Creidi, Antoinette Turk, Elias Gergi, Car… Sweden, Czech Republic, United Kingdom, Denmar… 19-Oct-20 2005 TV-MA 99 min Dramas, International Movies When Lebanon’s Civil War deprives Zozo of his …
7783 s7784 Movie Zubaan Mozez Singh Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan… India 02-Mar-19 2015 TV-14 111 min Dramas, International Movies, Music & Musicals A scrappy but poor boy worms his way into a ty…
7784 s7785 Movie Zulu Man in Japan NaN Nasty C NaN 25-Sep-20 2019 TV-MA 44 min Documentaries, International Movies, Music & M… In this documentary, South African rapper Nast…
7785 s7786 TV Show Zumbo’s Just Desserts NaN Adriano Zumbo, Rachel Khoo Australia 31-Oct-20 2019 TV-PG 1 Season International TV Shows, Reality TV Dessert wizard Adriano Zumbo looks for the nex…
7786 s7787 Movie ZZ TOP: THAT LITTLE OL’ BAND FROM TEXAS Sam Dunn NaN United Kingdom, Canada, United States 01-Mar-20 2019 TV-MA 90 min Documentaries, Music & Musicals This documentary delves into the mystique behi…

7787 rows × 12 columns

netflix.head(5)
show_id type title director cast country date_added release_year rating duration listed_in description
0 s1 TV Show 3% NaN João Miguel, Bianca Comparato, Michel Gomes, R… Brazil 14-Aug-20 2020 TV-MA 4 Seasons International TV Shows, TV Dramas, TV Sci-Fi &… In a future where the elite inhabit an island …
1 s2 Movie 7:19 Jorge Michel Grau Demián Bichir, Héctor Bonilla, Oscar Serrano, … Mexico 23-Dec-16 2016 TV-MA 93 min Dramas, International Movies After a devastating earthquake hits Mexico Cit…
2 s3 Movie 23:59 Gilbert Chan Tedd Chan, Stella Chung, Henley Hii, Lawrence … Singapore 20-Dec-18 2011 R 78 min Horror Movies, International Movies When an army recruit is found dead, his fellow…
3 s4 Movie 9 Shane Acker Elijah Wood, John C. Reilly, Jennifer Connelly… United States 16-Nov-17 2009 PG-13 80 min Action & Adventure, Independent Movies, Sci-Fi… In a postapocalyptic world, rag-doll robots hi…
4 s5 Movie 21 Robert Luketic Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar… United States 01-Jan-20 2008 PG-13 123 min Dramas A brilliant group of students become card-coun…

Although CSV files are the most common, there are a number of different functions that are available in Pandas to read in files of various types into a DataFrame that operates with the same general process:

METHOD 2 – Creating DataFrames Yourself

While not the most common method of creating a DataFrame, you can certainly create a data frame yourself by inputting data. We can accomplish this with the pandas.DataFrame() function, which takes its data input argument and converts it into a DataFrame. The pandas.DataFrame function is quite robust in that it can take in a variety of different data inputs:

  • Nothing – this will make an empty DataFrame that you can populate with data later
    # Creating an empty DataFrame
    data = pd.DataFrame()
    print(data)
    Empty DataFrame
    Columns: []
    Index: []
  • A dictionary of ndarrays / lists
    # DataFrame for Pawnee City Hall
    
    Pawnee_city_hall = {
        "Personnel": ["Leslie Knope", "Ron Swanson", "Ann Perkins", "Tom Haverford", "Mark Brendanawicz", "April Ludgate", "Andy Dwyer", "Ben Wyatt", "Chris Traeger","Jerry Gergich", "Donna Meagle", "Craig Middlebrooks"],
        "Position":["Deputy Director", "Director", "Health Representative", "Administrator", "City Planner", "Assistant - Director", "Assistant - Deputy Director", "Deputy City Manager", "City Manger", "Administrator", "Office Manger", "Assistant Office Manager"]
    }
    
    Pawnee_city_hall = pd.DataFrame(Pawnee_city_hall)
    Pawnee_city_hall
    Personnel Position
    0 Leslie Knope Deputy Director
    1 Ron Swanson Director
    2 Ann Perkins Health Representative
    3 Tom Haverford Administrator
    4 Mark Brendanawicz City Planner
    5 April Ludgate Assistant – Director
    6 Andy Dwyer Assistant – Deputy Director
    7 Ben Wyatt Deputy City Manager
    8 Chris Traeger City Manger
    9 Jerry Gergich Administrator
    10 Donna Meagle Office Manger
    11 Craig Middlebrooks Assistant Office Manager
  • A dictionary of series (a one-dimensional array of data with an axis label)
    # Another way to create a DataFrame for Pawnee City Hall
    personnel = pd.Series(["Leslie Knope", "Ron Swanson", "Ann Perkins", "Tom Haverford", "Mark Brendanawicz", "April Ludgate", "Andy Dwyer", "Ben Wyatt", "Chris Traeger","Jerry Gergich", "Donna Meagle", "Craig Middlebrooks"])
    position = pd.Series(["Deputy Director", "Director", "Health Representative", "Administrator", "City Planner", "Assistant - Director", "Assistant - Deputy Director", "Deputy City Manager", "City Manger", "Administrator", "Office Manger", "Assistant Office Manager"])
    
    Pawnee_city_hall = pd.DataFrame({"Names":personnel, "Job":position})
    Pawnee_city_hall
    Names Job
    0 Leslie Knope Deputy Director
    1 Ron Swanson Director
    2 Ann Perkins Health Representative
    3 Tom Haverford Administrator
    4 Mark Brendanawicz City Planner
    5 April Ludgate Assistant – Director
    6 Andy Dwyer Assistant – Deputy Director
    7 Ben Wyatt Deputy City Manager
    8 Chris Traeger City Manger
    9 Jerry Gergich Administrator
    10 Donna Meagle Office Manger
    11 Craig Middlebrooks Assistant Office Manager
  • A list of lists
    # Another way to create a DataFrame for Pawnee City Hall
    
    data = [
        ["Leslie Knope", "Deputy Director"], 
        ["Ron Swanson", "Director"], 
        ["Tom Haverford", "Administrator"], 
        ["April Ludgate", "Assistant-Director"], 
        ["Donna Meagle", "Office Manager"],
        ["Andy Dwyer", "Assistant - Deputy Director"], 
        ['Jerry Gergich', "Administrator"]]
    
    Pawnee_city_hall = pd.DataFrame(data, columns = ["Names", "Position"])
    Pawnee_city_hall
    Names Position
    0 Leslie Knope Deputy Director
    1 Ron Swanson Director
    2 Tom Haverford Administrator
    3 April Ludgate Assistant-Director
    4 Donna Meagle Office Manager
    5 Andy Dwyer Assistant – Deputy Director
    6 Jerry Gergich Administrator
  • A list of dictionaries
    # DATAFRAME FOR RAPTORS
    
    raptors = [{"Player": "Pascal Siakim", "PPG": 23.7, "College":"University of New Mexico", "is_starting_five": True},
               {"Player": "Fred VanVleet", "PPG": 20.1, "College":"Wichita State", "is_starting_five": True},
               {"Player": "Scottie Barnes", "PPG": 15.1, "College":"Florida State", "is_starting_five": True},
               {"Player": "Chris Boucher", "PPG": 8.7, "College":"Oregon", "is_starting_five": False}
              ]
    
    raptors = pd.DataFrame(raptors, index = ['Power Forward', "Point Guard", "Small Forward", "Center"])
    raptors
    Player PPG College is_starting_five is_starting_five
    Power Forward Pascal Siakim 23.7 University of New Mexico True True
    Point Guard Fred VanVleet 20.1 Whicita State True True
    Small Forward Scottie Barnes 15.1 Florida State True True
    Center Chris Boucher 8.7 Oregon False False

    NOTE: the pandas.DataFrame function also has the index and column argument that’s used to name the row index and column titles respectively.

Although the same data types are used in the examples above (strings), DataFrames can consist of a variety of different data types, such as integers, floats, lists, datetimes, Booleans, list, etc.

# DATAFRAME FOR LAKERS

lakers = {
"player" : ['Lebron James', 
              'Russell Westbrook', 'Anthony Davis', 'Dwight Howard', 'Avery Bradley', 'DeAndre Jordan', 
              'Carmelo Anthony', 'Austin Reaves', 'Kent Bazemore', 'Malik Monk', 'Stanley Johnson', 
              'Trevor Ariza', 'Wayne Ellington', 'Talen Horton-Tucker'],
"PPG": [28.7, 19.3, 23.3, 5, 6.9, 4.5, 13.4, 5.4, 4.1, 12.1, 6.2, 3.5, 6.4, 10.9],
"in_starting_lineup": [True, True, True, False, True, False, False, False, False, True, False, False, False, False]
}

lakers = pd.DataFrame(lakers)
lakers
player PPG in_starting_lineup
0 Lebron James 28.7 True
1 Russell Westbrook 19.3 True
2 Anthony Davis 23.3 True
3 Dwight Howard 5.0 False
4 Avery Bradley 6.9 True
5 DeAndre Jordan 4.5 False
6 Carmelo Anthony 13.4 False
7 Austin Reaves 5.4 False
8 Kent Bazemore 4.1 False
9 Malik Monk 12.1 True
10 Stanley Johnson 6.2 False
11 Trevor Ariza 3.5 False
12 Wayne Ellington 6.4 False
13 Talen Horton-Tucker 10.9 False

Exploring a DataFrame

Since Python is an object-oriented programming language, creating a DataFrame means creating an object of the DataFrame class. This also means that there are a number of different attributes that we can explore and methods that we can apply to the DataFrame. While we use these more often in situations when we aren’t familiar with the dataset (say from importing it from somewhere), they are nonetheless useful.

Whenever a dataset is loaded into Python as a DataFrame, it’s best to look at its structure. There are a number of different attributes that can provide that info:

  • DataFrame.shape — returns a tuple indicating the number of rows and columns of the DataFrame
  • DataFrame.size — returns an integer value of the number of datapoints in the dataset
    print(netflix.shape)
    print(netflix.size)
    (7787, 12)
    93444

If you were to explore the axes of the DataFrame, you may do so by having an array return the listed columns and index via DataFrame.columns and DataFrame.index. On the other hand, it may be useful to look at the different types of data that makes up the dataset. In these situations, the DataFrame.dtypes is used.

print(netflix.columns)
print("")
print(netflix.index)
print("")
print(netflix.dtypes)
Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')
RangeIndex(start=0, stop=7787, step=1)
show_id object
type object
title object
director object
cast object
country object
date_added object
release_year int64
rating object
duration object
listed_in object
description object
dtype: object

Manipulating a DataFrame

Now that we know what DataFrame is, it’s time to do some real work! Principally, this involves manipulating it as part of the data cleaning and data wrangling process, just prior to the actual analysis. Now there are a number of basic operations that should be in everyone’s repertoire — the first one is being able to access and isolate a given segment of a DataFrame.

To segment a DataFrame, we use either the DataFrame.loc attribute or DataFrame.iloc attribute where the input dictates which rows or columns are extracted ([rows: columns]). If a column needs to be isolated, then the process would be to use square brackets with the name of the given column.

print(netflix['title'])
print("")
print(netflix.loc[0])
0 3%
1 7:19
2 23:59
3 9
4 21
7782 Zozo
7783 Zubaan
7784 Zulu Man in Japan
7785 Zumbo’s Just Desserts
7786 ZZ TOP: THAT LITTLE OL’ BAND FROM TEXAS
Name: title, Length: 7787, dtype: object
show_id s1
type TV Show
title 3%
director NaN
cast João Miguel, Bianca Comparato, Michel Gomes, R…
country Brazil
date_added 14-Aug-20
release_year 2020
rating TV-MA
duration 4 Seasons
listed_in International TV Shows, TV Dramas, TV Sci-Fi &…
description In a future where the elite inhabit an island …
Name: 0, dtype: object

In the event that we need to extract multiple rows or columns, we use the slice method, which involves using a “:” that indicates a continuous range with the end range being exclusive (i.e., not included) or by inputting criteria within square brackets in a similar manner as indexing with Boolean with NumPy.

# Select the first two row of the raptors DataFrame 
raptors.iloc[0:2]
Player PPG College is_starting_five
Power Forward Pascal Siakim 23.7 University of New Mexico True
Point Guard Fred VanVleet 20.1 Whicita State True
# Select the last three columns of the raptors DataFrame
raptors.iloc[:, 1:4]
PPG College is_starting_five
Power Forward 23.7 University of New Mexico True
Point Guard 20.1 Whicita State True
Small Forward 15.1 Florida State True
Center 8.7 Oregon False
# Select players that averaged more than 15 PPG on the Laker DataFrame
lakers[lakers["PPG"] > 15]
player PPG in_starting_lineup
0 Lebron James 28.7 True
1 Russell Westbrook 19.3 True
2 Anthony Davis 23.3 True

While the above examples are simplistic, it’s possible to make it more powerful and sophisticated with the use of operators such as AND (&), OR (|), NOT EQUAL TO (!=) or EQUAL TO (==). To set this up, let’s create a new DataFrame containing information about current UFC champions:

# DATAFRAME FOR UFC CHAMPS

ufc_champs = {
    "names" : ['Francis Ngannou', "Glover Teixeira", 
             "Israel Adesanya", "Kamaru Usman", 
             "Charles Oliveira", "Alex Volkanowski", 
             'Aljamain Sterling', "Brandon Morano", 
             "Julianna Pena", "Valentina Shevchenko", "Rose Namajunas"],
    "nicknames" : ['The Predator', None, "The Last Stylebender", 
                 "The Nigerian Nightmare", "Da Bronx", "The Great",
                "Funk Master", "The Assassin Baby", "The Venezulean Vixen",
                "The Bullet", "Thug"],
    "wins" : [16,33,21,19,31,23,20,19,11,22,11],
    "losses" : [3,7,1,1,8,1,3,5,4,3,4],
    "weightclass" : ['Heavyweight', "Light Heavyweight", "Middleweight",
                   'Welterweight', "Lightweight", "Featherweight", 
                   'Bantamweight', "Flyweight", "Bantamweight", "Flyweight", "Strawweight"]
}

ufc_champs = pd.DataFrame(ufc_champs)
ufc_champs
names nicknames wins losses weightclass country
0 Francis Ngannou The Predator 16 3 Heavyweight Cameroon
1 Glover Teixeira None 33 7 Light Heavyweight Brazil
2 Israel Adesanya The Last Stylebender 21 1 Middleweight New Zealand
3 Kamaru Usman The Nigerian Nightmare 19 1 Welterweight USA
4 Charles Oliveira Da Bronx 31 8 Lightweight Brazil
5 Alex Volkanowski The Great 23 1 Featherweight Australia
6 Aljamain Sterling Funk Master 20 3 Bantamweight USA
7 Brandon Morano The Assassin Baby 19 5 Flyweight Mexico
8 Julianna Pena The Venezulean Vixen 11 4 Bantamweight USA
9 Valentina Shevchenko The Bullet 22 3 Flyweight Kyrgyzstan
10 Rose Namajunas Thug 11 4 Strawweight USA
# Select champion that came from Brazil + is the Light Heavyweight Champ
ufc_champs[(ufc_champs['country'] == "Brazil") & (ufc_champs['weightclass'] == "Light Heavyweight")]
names nicknames wins losses weightclass country
1 Glover Teixeira None 33 7 Light Heavyweight Brazil
# Select champion(s) if home country is Brazil OR has at Less than 2 losses 
ufc_champs[(ufc_champs['country'] == "Brazil") | (ufc_champs['losses'] < 2)]
names nicknames wins losses weightclass country
1 Glover Teixeira None 33 7 Light Heavyweight Brazil
2 Israel Adesanya The Last Stylebender 21 1 Middleweight New Zealand
3 Kamaru Usman The Nigerian Nightmare 19 1 Welterweight USA
4 Charles Oliveira Da Bronx 31 8 Lightweight Brazil
5 Alex Volkanowski The Great 23 1 Featherweight Australia
# Select chamption with less than 20 wins AND home country is not USA
print(ufc_champs[(ufc_champs['wins'] < 20) & (ufc_champs["country"] != "USA")])
names nicknames wins losses weightclass country
0 Francis Ngannou The Predator 16 3 Heavyweight Cameroon
7 Brandon Morano The Assassin Baby 19 5 Flyweight Mexico

Aside from filtering out a DataFrame or segmenting it, it’s also possible to use the DataFrame.iloc() and DataFrame.loc() attributes in changing a particular value. Let’s look at an example of how we can do this with real data. Suppose that the Lakers make a change in the starting five lineup where we replace Anthony Davis with DeAndre Jordan. We can do that with DataFrame.iloc() like this:

# The Lakers made a change in the starting 5 lineup where we replace Anthony Davis with DeAndre Jordan

lakers.iloc[2,2] = False
lakers.loc[5,"in_starting_lineup"] = True
lakers
player PPG in_starting_lineup
0 Lebron James 28.7 True
1 Russell Westbrook 19.3 True
2 Anthony Davis 23.3 False
3 Dwight Howard 5.0 False
4 Avery Bradley 6.9 True
5 DeAndre Jordan 4.5 True
6 Carmelo Anthony 13.4 False
7 Austin Reaves 5.4 False
8 Kent Bazemore 4.1 False
9 Malik Monk 12.1 True
10 Stanley Johnson 6.2 False
11 Trevor Ariza 3.5 False
12 Wayne Ellington 6.4 False
13 Talen Horton-Tucker 10.9 False

In some situations, it may be necessary to insert or delete data from a DataFrame. We can insert or delete a row using the append() and drop() methods. With the append method, you would use a panda.Series object that matches the dimensions of a DataFrame as the argument for the function. For the drop method, all that we need to state is the index/column label within the DataFrame that we need to drop. Both methods contain the axis argument that specifies whether to add or drop a row or column.

Recall that we created a DataFrame consisting of UFC champions that held the title at the end of 2021 with their monikers and win-loss record:

# Say we create a DataFrame consisting of UFC champions that held the title at the end of 2021 with their monikers and win-loss record 

ufc_champs = {
    "names": ['Francis Ngannou', "Glover Teixeira", 
             "Israel Adesanya", "Kamaru Usman", 
             "Charles Oliveira", "Alex Volkanowski", 
             'Aljamain Sterling', "Brandon Morano", 
             "Julianna Pena", "Valentina Shevchenko", "Rose Namajunas"],
    "nicknames": ['The Predator', None, "The Last Stylebender", 
                 "The Nigerian Nightmare", "Da Bronx", "The Great",
                "Funk Master", "The Assassin Baby", "The Venezulean Vixen",
                "The Bullet", "Thug"],
    "wins": [16,33,21,19,31,23,20,19,11,22,11],
    "losses": [3,7,1,1,8,1,3,5,4,3,4],
    "weightclass": ['Heavyweight', "Light Heavyweight", "Middleweight",
                   'Welterweight', "Lightweight", "Featherweight", 
                   'Bantamweight', "Flyweight", "Bantamweight", "Flyweight", "Strawweight"]
}

ufc_champs = pd.DataFrame(ufc_champs)
ufc_champs

# In making this DataFrame, we forgot to include the Women’s Featherweight champ 
ufc_champs.append(pd.Series(data = ["Amanda Nunes", "Lioness", 21, 5, "Featherweight"], index = ufc_champs.columns, name = 17))
names nicknames wins losses weightclass
0 Francis Ngannou The Predator 16 3 Heavyweight
1 Glover Teixeira None 33 7 Light Heavyweight
2 Israel Adesanya The Last Stylebender 21 1 Middleweight
3 Kamaru Usman The Nigerian Nightmare 19 1 Welterweight
4 Charles Oliveira Da Bronx 31 8 Lightweight
5 Alex Volkanowski The Great 23 1 Featherweight
6 Aljamain Sterling Funk Master 20 3 Bantamweight
7 Brandon Morano The Assassin Baby 19 5 Flyweight
8 Julianna Pena The Venezulean Vixen 11 4 Bantamweight
9 Valentina Shevchenko The Bullet 22 3 Flyweight
10 Rose Namajunas Thug 11 4 Strawweight
17 Amanda Nunes Lioness 21 5 Featherweight
ufc_champs.drop([0], axis = 0) # drops first row
names nicknames wins losses weightclass
1 Glover Teixeira None 33 7 Light Heavyweight
2 Israel Adesanya The Last Stylebender 21 1 Middleweight
3 Kamaru Usman The Nigerian Nightmare 19 1 Welterweight
4 Charles Oliveira Da Bronx 31 8 Lightweight
5 Alex Volkanowski The Great 23 1 Featherweight
6 Aljamain Sterling Funk Master 20 3 Bantamweight
7 Brandon Morano The Assassin Baby 19 5 Flyweight
8 Julianna Pena The Venezulean Vixen 11 4 Bantamweight
9 Valentina Shevchenko The Bullet 22 3 Flyweight
10 Rose Namajunas Thug 11 4 Strawweight

In the case for adding a column, the process would be similar to that of adding an item into a dictionary.

weight_limit = [265,205,185,170,155,145,135,125,135,125,115]
ufc_champs['weight_limit'] = weight_limit
ufc_champs
names nicknames wins losses weightclass weight_limit
0 Francis Ngannou The Predator 16 3 Heavyweight 265
1 Glover Teixeira None 33 7 Light Heavyweight 205
2 Israel Adesanya The Last Stylebender 21 1 Middleweight 185
3 Kamaru Usman The Nigerian Nightmare 19 1 Welterweight 170
4 Charles Oliveira Da Bronx 31 8 Lightweight 155
5 Alex Volkanowski The Great 23 1 Featherweight 145
6 Aljamain Sterling Funk Master 20 3 Bantamweight 135
7 Brandon Morano The Assassin Baby 19 5 Flyweight 125
8 Julianna Pena The Venezulean Vixen 11 4 Bantamweight 135
9 Valentina Shevchenko The Bullet 22 3 Flyweight 125
10 Rose Namajunas Thug 11 4 Strawweight 115

Sometimes with datasets, the labels used in identifying a column may not accurately describe its property. To change these labels, we can use the DataFrame.rename() method, which takes in the index argument (specifies the labels for the index in a dictionary-like format), columns argument (specifies the labels for the columns in a dictionary-like format), and the in-place argument that determines whether to return a new DataFrame.

# Let’s say we need to rename the columns of the UFC champ DataFrame by capitalizing the labels

ufc_champs.rename(columns = {'names': 'Names', 
                             'nicknames': 'Nicknames', 
                             'wins': 'Wins',  
                             'losses': 'Losses', 
                             'weightclass': 'Weight Class', 
                             'weight_limit': 'Weight Limit'})
Names Nicknames Wins Losses Weight Class Weight Limit
0 Francis Ngannou The Predator 16 3 Heavyweight 265
1 Glover Teixeira None 33 7 Light Heavyweight 205
2 Israel Adesanya The Last Stylebender 21 1 Middleweight 185
3 Kamaru Usman The Nigerian Nightmare 19 1 Welterweight 170
4 Charles Oliveira Da Bronx 31 8 Lightweight 155
5 Alex Volkanowski The Great 23 1 Featherweight 145
6 Aljamain Sterling Funk Master 20 3 Bantamweight 135
7 Brandon Morano The Assassin Baby 19 5 Flyweight 125
8 Julianna Pena The Venezulean Vixen 11 4 Bantamweight 135
9 Valentina Shevchenko The Bullet 22 3 Flyweight 125
10 Rose Namajunas Thug 11 4 Strawweight 115

Lastly, there may be some cases when we need to reshape the current makeup of the dataset to make it suitable for data analysis. While it’s certainly possible to manually remake another DataFrame, it’ll be easier to transform it. In Pandas, there are three different transformation functions that we can use to reshape the DataFrame:

Method 1 — Pivoting

This transformation is essentially taking a longer-format DataFrame and making it broader. Often this is a result of having a unique identifier repeated along multiple rows for each subsequent entry. One method to derive a newly formatted DataFrame is by using DataFrame.pivot. This method requires defining which of the data columns will be used as the new index and index as well as values for the DataFrame.

# Say we are creating a DataFrame that maps out the voting for the season MVP of the 2021-2022 NBA season from 5 different sports journalist/reporter/pundits in sports media

mvp_vote_2022 = pd.DataFrame({
    "Voter": ['Kenny Smith', 'Kenny Smith', 'Kenny Smith', 'Kenny Smith', 'Kenny Smith',
               "Charles Barkley", "Charles Barkley", "Charles Barkley", "Charles Barkley", "Charles Barkley",
               "Ernie Johnson", "Ernie Johnson", "Ernie Johnson", "Ernie Johnson", "Ernie Johnson",
               "Michael Wilbon", "Michael Wilbon", "Michael Wilbon", "Michael Wilbon", "Michael Wilbon",
               "Doris Burke", "Doris Burke", "Doris Burke", "Doris Burke", "Doris Burke"
              ], 
    "Player": ['Steph Curry', 'Lebron James', 'Chris Paul', 'Kevin Durant', 'Giannis Antetokounmpo',
                'Steph Curry', 'Nikola Jokic', 'Giannis Antetokounmpo', 'Chris Paul', 'DeMar DeRozan',
                'Steph Curry', 'Giannis Antetokounmpo', 'Kevin Durant', 'Nikola Jokic', 'Joel Embid',
                'Kevin Durant', 'Giannis Antetokounmpo', 'Nikola Jokic', 'Steph Curry', 'DeMar DeRozan',
                'Kevin Durant', 'Giannis Antetokounmpo', 'Steph Curry', 'Nikola Jokic', 'DeMar DeRozan',],

    "Placing": [1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5] 
})

# Say if we need to switch the index to indicate voters as the index values and columns with the MVP placings from each voter

mvp_vote_2022.pivot(index = 'Voter', columns = 'Placing')
Player
Placing 1 2 3 4 5
Voter
Charles Barkley Steph Curry Nikola Jokic Giannis Antetokounmpo Chris Paul DeMar DeRozan
Doris Burke Kevin Durant Giannis Antetokounmpo Steph Curry Nikola Jokic DeMar DeRozan
Ernie Johnson Steph Curry Giannis Antetokounmpo Kevin Durant Nikola Jokic Joel Embid
Kenny Smith Steph Curry Lebron James Chris Paul Kevin Durant Giannis Antetokounmpo
Michael Wilbon Kevin Durant Giannis Antetokounmpo Nikola Jokic Steph Curry DeMar DeRozan

The DataFrame.pivot method does not allow rows with duplicate values for a given column. If this uniqueness isn’t guaranteed, an alternative approach would be to use the DataFrame.pivot_table() function instead. This approach requires arguments used to specify the index, column, and values. Uniquely, this function also has an additional argument “aggfunc” (default is numpy.mean), which passes a function to aggregate the values of a DataFrame.

# Create a new DataFrame that shows the median placing for NBA MVP 
mvp_vote_2022.pivot_table(index = 'Player', values = 'Placing', aggfunc = np.median).sort_values(by= 'Placing') 

# the .sort_values() method is used to arrange the DataFrame by some existing variable in either ascending or descending order
Placing
Player
Steph Curry 1.0
Giannis Antetokounmpo 2.0
Kevin Durant 2.0
Lebron James 2.0
Chris Paul 3.5
Nikola Jokic 3.5
DeMar DeRozan 5.0
Joel Embid 5.0

Method 2 — Stacking/Unstacking

Sometimes a DataFrame may have multiple indices that’ll look something like this:

array = [["Month 1",  "Month 2", "Month  3", "Month 4"], 
        ['Squat', "Squat", "Squat", "Squat"]]

array2 = [['Arnold', "Arnold","Larry", "Larry"], 
          ["Before", "After", "Before", "After"]]

values = [[225, 245, 275, 315], [335, 355, 365, 405], [315, 365, 435, 455], [495, 405, 545, 585]]

values = np.array(values)

effect_of_ped = pd.DataFrame(data = values, index = array2, columns = array)
effect_of_ped
Month 1 Month 2 Month 3 Month 4
Squat Squat Squat Squat
Arnold Before 225 245 275 315
After 335 355 365 405
Larry Before 315 365 435 455
After 495 405 545 585

It’s often difficult to make sense of the data or address it for analysis. So, functions such as stack() or unstack() make it possible to make it longer or broader, respectively.

effect_of_ped.stack()
Month 3 Month 1 Month 2 Month 4
Arnold Before Squat 275 225 245 315
After Squat 365 335 355 405
Larry Before Squat 435 315 365 455
After Squat 545 495 405 585
effect_of_ped.unstack()
Month 1 Month 2 Month 3 Month 4
Squat Squat Squat Squat
After Before After Before After Before After Before
Arnold 335 225 355 245 365 275 405 315
Larry 495 315 405 365 545 435 585 455

Method 3 — Melting

Known as unpivoting a DataFrame, this works by essentially converting a wide-format DataFrame to a long-format. This usually occurs when more than one column works as an identifier for a given analysis. In order to transform the DataFrame to a longer format, we’ll need to use the DataFrame.melt() function, which requires establishing which columns are to be used as the identifier variable and the columns to “unpivot” to correspond to the values for said identifier.

data = pd.DataFrame({"Name": ["Pankaj", "Lisa", "David"], "ID": [1, 2, 3], "Role": ["CEO", "Editor", "Author"]})
data

# Let's say the identifier is ID and the values would be Name + Role

pd.melt(data, id_vars = ["ID"], value_vars = ['Name', 'Role'])
ID variable value
0 1 Name Pankaj
1 2 Name Lisa
2 3 Name David
3 1 Role CEO
4 2 Role Editor
5 3 Role Author

So far, we’ve only scratched the surface of DataFrames. There are many more functions and methods that can operate on these data structures within Python to gain deeper insights into your data. You can find these in the Pandas DataFrame reference guide. However, a great place to start is with the Pandas and NumPy Fundamentals course on Dataquest. Once you know the fundamentals, progress to working with data in Python in some of the other courses in the Data Analyst career path.

 

 

Dataquest

About the author

Dataquest

Dataquest teaches through challenging exercises and projects instead of video lectures. It's the most effective way to learn the skills you need to build your data career.

Learn data skills for free

Headshot Headshot

Join 1M+ learners

Try free courses