Tutorial: How to Create and Use a Pandas DataFrame
When it comes to exploring data with Python, DataFrames make analyzing and manipulating data for analysis easy. This article will look at some of the ins and outs when it comes to working with DataFrames.
Python is a powerful tool when it comes to working with data. Qualities like its scalability and variety of libraries for data analysis and data science applications make it versatile. However, what’s often under-appreciated-but-highly-valuable about Python is the ease with which we can manipulate data with flexible data structures. One of these structures is a DataFrame.
What’s a DataFrame?
To start, it’s important to know that there are a variety of different structures that data can take. For the majority of cases, most data are in tabular form (i.e., data structured into rows representing a single entry). You are likely already familiar with this if you’ve ever worked with an Excel spreadsheet or a SQL table. Aggregates of each of these rows, that represent a given data entry, and their properties are formed into a two-dimensional structure where titled columns consist of values of the same property.
These structures have several unique qualities:
- Rows — representing a singular data entry point
- Columns — corresponding to a grouping relating to a singular quality of each given data point that are usually titled
- Index — a unique identifier for each data entry
Although these can have different names depending on the programming language or application tool being used, in Python, we call these structures DataFrames. The principal library used in working with these structures is Pandas.
How do you make a DataFrame?
When it comes to creating a DataFrame, you can either import it from an external file or create it yourself in Python.
Method 1 — Import Data from a File
In the real world, a dataset is often read into Python via an external source that curated it. We can find these datasets in multiple types of files, but we most commonly find them in the form of comma separated value files (CSVs). Fortunately, in the Pandas library, it has a function that works to convert the data in this format into a DataFrame called pandas.read_csv(). The only major argument that it requires is a pathway that outlines where the file exists.
One pathway may be from the web (i.e., from an API or a GitHub repository)
import pandas as pd
import numpy as np
import datetime as dt
# While not necessarily the case, you’ll often need to load the numpy library when working with the pandas library
url = 'https://raw.githubusercontent.com/Vibe1990/Netflix-Project/main/netflix_title.csv'
netflix = pd.read_csv(url)
# When providing the URL pathway, this will need to be in the form of a raw string, otherwise it will result in an error
netflix
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | s1 | TV Show | 3 | NaN | João Miguel, Bianca Comparato, Michel Gomes, R... | Brazil | 14-Aug-20 | 2020 | TV-MA | 4 Seasons | International TV Shows, TV Dramas, TV Sci-Fi &... | In a future where the elite inhabit an island ... |
1 | s2 | Movie | 7:19 | Jorge Michel Grau | Demián Bichir, Héctor Bonilla, Oscar Serrano, ... | Mexico | 23-Dec-16 | 2016 | TV-MA | 93 min | Dramas, International Movies | After a devastating earthquake hits Mexico Cit... |
2 | s3 | Movie | 23:59 | Gilbert Chan | Tedd Chan, Stella Chung, Henley Hii, Lawrence ... | Singapore | 20-Dec-18 | 2011 | R | 78 min | Horror Movies, International Movies | When an army recruit is found dead, his fellow... |
3 | s4 | Movie | 9 | Shane Acker | Elijah Wood, John C. Reilly, Jennifer Connelly... | United States | 16-Nov-17 | 2009 | PG-13 | 80 min | Action & Adventure, Independent Movies, Sci-Fi... | In a postapocalyptic world, rag-doll robots hi... |
4 | s5 | Movie | 21 | Robert Luketic | Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar... | United States | 01-Jan-20 | 2008 | PG-13 | 123 min | Dramas | A brilliant group of students become card-coun... |
Alternatively, if a file were stored on your computer in a working directory, then the path would adjust accordingly. In this process, we could use either the relative or full path to specify the pathway to retrieve a given file because the function can decipher the difference between the two without an issue.
# Assuming you've set up your notebook to have the desired working directory set
netflix = pd.read_csv(“netflix_title.csv”)
netflix
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | s1 | TV Show | 3 | NaN | João Miguel, Bianca Comparato, Michel Gomes, R... | Brazil | 14-Aug-20 | 2020 | TV-MA | 4 Seasons | International TV Shows, TV Dramas, TV Sci-Fi &... | In a future where the elite inhabit an island ... |
1 | s2 | Movie | 7:19 | Jorge Michel Grau | Demián Bichir, Héctor Bonilla, Oscar Serrano, ... | Mexico | 23-Dec-16 | 2016 | TV-MA | 93 min | Dramas, International Movies | After a devastating earthquake hits Mexico Cit... |
2 | s3 | Movie | 23:59 | Gilbert Chan | Tedd Chan, Stella Chung, Henley Hii, Lawrence ... | Singapore | 20-Dec-18 | 2011 | R | 78 min | Horror Movies, International Movies | When an army recruit is found dead, his fellow... |
3 | s4 | Movie | 9 | Shane Acker | Elijah Wood, John C. Reilly, Jennifer Connelly... | United States | 16-Nov-17 | 2009 | PG | 80 min | Action & Adventure, Independent Movies, Sci-Fi... | In a postapocalyptic world, rag-doll robots hi... |
4 | s5 | Movie | 21 | Robert Luketic | Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar... | United States | 01-Jan-20 | 2008 | PG | 123 min | Dramas | A brilliant group of students become card-coun... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7782 | s7783 | Movie | Zozo | Josef Fares | Imad Creidi, Antoinette Turk, Elias Gergi, Car... | Sweden, Czech Republic, United Kingdom, Denmar... | 19-Oct-20 | 2005 | TV-MA | 99 min | Dramas, International Movies | When Lebanon's Civil War deprives Zozo of his ... |
7783 | s7784 | Movie | Zubaan | Mozez Singh | Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan... | India | 02-Mar-19 | 2015 | TV-14 | 111 min | Dramas, International Movies, Music & Musicals | A scrappy but poor boy worms his way into a ty... |
7784 | s7785 | Movie | Zulu Man in Japan | NaN | Nasty C | NaN | 25-Sep-20 | 2019 | TV-MA | 44 min | Documentaries, International Movies, Music & M... | In this documentary, South African rapper Nast... |
7785 | s7786 | TV Show | Zumbo's Just Desserts | NaN | Adriano Zumbo, Rachel Khoo | Australia | 31-Oct-20 | 2019 | TV-PG | 1 Season | International TV Shows, Reality TV | Dessert wizard Adriano Zumbo looks for the nex... |
7786 | s7787 | Movie | ZZ TOP: THAT LITTLE OL' BAND FROM TEXAS | Sam Dunn | NaN | United Kingdom, Canada, United States | 01-Mar-20 | 2019 | TV-MA | 90 min | Documentaries, Music & Musicals | This documentary delves into the mystique behi... |
7787 rows × 12 columns
netflix.head(5)
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | s1 | TV Show | 3 | NaN | João Miguel, Bianca Comparato, Michel Gomes, R... | Brazil | 14-Aug-20 | 2020 | TV-MA | 4 Seasons | International TV Shows, TV Dramas, TV Sci-Fi &... | In a future where the elite inhabit an island ... |
1 | s2 | Movie | 7:19 | Jorge Michel Grau | Demián Bichir, Héctor Bonilla, Oscar Serrano, ... | Mexico | 23-Dec-16 | 2016 | TV-MA | 93 min | Dramas, International Movies | After a devastating earthquake hits Mexico Cit... |
2 | s3 | Movie | 23:59 | Gilbert Chan | Tedd Chan, Stella Chung, Henley Hii, Lawrence ... | Singapore | 20-Dec-18 | 2011 | R | 78 min | Horror Movies, International Movies | When an army recruit is found dead, his fellow... |
3 | s4 | Movie | 9 | Shane Acker | Elijah Wood, John C. Reilly, Jennifer Connelly... | United States | 16-Nov-17 | 2009 | PG-13 | 80 min | Action & Adventure, Independent Movies, Sci-Fi... | In a postapocalyptic world, rag-doll robots hi... |
4 | s5 | Movie | 21 | Robert Luketic | Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar... | United States | 01-Jan-20 | 2008 | PG-13 | 123 min | Dramas | A brilliant group of students become card-coun... |
Although CSV files are the most common, there are a number of different functions that are available in Pandas to read in files of various types into a DataFrame that operates with the same general process:
File type | Function in Pandas |
---|---|
JASON | read_json() |
HTML | read_html() |
XML | read_xml() |
SQL | read_sql() |
Excel | read_excel() |
METHOD 2 – Creating DataFrames Yourself
While not the most common method of creating a DataFrame, you can certainly create a data frame yourself by inputting data. We can accomplish this with the pandas.DataFrame() function, which takes its data input argument and converts it into a DataFrame. The pandas.DataFrame
function is quite robust in that it can take in a variety of different data inputs:
- Nothing – this will make an empty DataFrame that you can populate with data later
# Creating an empty DataFrame data = pd.DataFrame() print(data)
Empty DataFrame Columns: [] Index: []
- A dictionary of ndarrays / lists
# DataFrame for Pawnee City Hall Pawnee_city_hall = { "Personnel": ["Leslie Knope", "Ron Swanson", "Ann Perkins", "Tom Haverford", "Mark Brendanawicz", "April Ludgate", "Andy Dwyer", "Ben Wyatt", "Chris Traeger","Jerry Gergich", "Donna Meagle", "Craig Middlebrooks"], "Position":["Deputy Director", "Director", "Health Representative", "Administrator", "City Planner", "Assistant - Director", "Assistant - Deputy Director", "Deputy City Manager", "City Manger", "Administrator", "Office Manger", "Assistant Office Manager"] } Pawnee_city_hall = pd.DataFrame(Pawnee_city_hall) Pawnee_city_hall
Personnel Position 0 Leslie Knope Deputy Director 1 Ron Swanson Director 2 Ann Perkins Health Representative 3 Tom Haverford Administrator 4 Mark Brendanawicz City Planner 5 April Ludgate Assistant - Director 6 Andy Dwyer Assistant - Deputy Director 7 Ben Wyatt Deputy City Manager 8 Chris Traeger City Manger 9 Jerry Gergich Administrator 10 Donna Meagle Office Manger 11 Craig Middlebrooks Assistant Office Manager
- A dictionary of series (a one-dimensional array of data with an axis label)
# Another way to create a DataFrame for Pawnee City Hall personnel = pd.Series(["Leslie Knope", "Ron Swanson", "Ann Perkins", "Tom Haverford", "Mark Brendanawicz", "April Ludgate", "Andy Dwyer", "Ben Wyatt", "Chris Traeger","Jerry Gergich", "Donna Meagle", "Craig Middlebrooks"]) position = pd.Series(["Deputy Director", "Director", "Health Representative", "Administrator", "City Planner", "Assistant - Director", "Assistant - Deputy Director", "Deputy City Manager", "City Manger", "Administrator", "Office Manger", "Assistant Office Manager"]) Pawnee_city_hall = pd.DataFrame({"Names":personnel, "Job":position}) Pawnee_city_hall
Names Job 0 Leslie Knope Deputy Director 1 Ron Swanson Director 2 Ann Perkins Health Representative 3 Tom Haverford Administrator 4 Mark Brendanawicz City Planner 5 April Ludgate Assistant - Director 6 Andy Dwyer Assistant - Deputy Director 7 Ben Wyatt Deputy City Manager 8 Chris Traeger City Manger 9 Jerry Gergich Administrator 10 Donna Meagle Office Manger 11 Craig Middlebrooks Assistant Office Manager - A list of lists
# Another way to create a DataFrame for Pawnee City Hall data = [ ["Leslie Knope", "Deputy Director"], ["Ron Swanson", "Director"], ["Tom Haverford", "Administrator"], ["April Ludgate", "Assistant-Director"], ["Donna Meagle", "Office Manager"], ["Andy Dwyer", "Assistant - Deputy Director"], ['Jerry Gergich', "Administrator"]] Pawnee_city_hall = pd.DataFrame(data, columns = ["Names", "Position"]) Pawnee_city_hall
Names Position 0 Leslie Knope Deputy Director 1 Ron Swanson Director 2 Tom Haverford Administrator 3 April Ludgate Assistant-Director 4 Donna Meagle Office Manager 5 Andy Dwyer Assistant - Deputy Director 6 Jerry Gergich Administrator
- A list of dictionaries
# DATAFRAME FOR RAPTORS raptors = [{"Player": "Pascal Siakim", "PPG": 23.7, "College":"University of New Mexico", "is_starting_five": True}, {"Player": "Fred VanVleet", "PPG": 20.1, "College":"Wichita State", "is_starting_five": True}, {"Player": "Scottie Barnes", "PPG": 15.1, "College":"Florida State", "is_starting_five": True}, {"Player": "Chris Boucher", "PPG": 8.7, "College":"Oregon", "is_starting_five": False} ] raptors = pd.DataFrame(raptors, index = ['Power Forward', "Point Guard", "Small Forward", "Center"]) raptors
Player PPG College is_starting_five is_starting_five Power Forward Pascal Siakim 23.7 University of New Mexico True True Point Guard Fred VanVleet 20.1 Whicita State True True Small Forward Scottie Barnes 15.1 Florida State True True Center Chris Boucher 8.7 Oregon False False NOTE: the
pandas.DataFrame
function also has the index and column argument that’s used to name the row index and column titles respectively.
Although the same data types are used in the examples above (strings), DataFrames can consist of a variety of different data types, such as integers, floats, lists, datetimes, Booleans, list, etc.
# DATAFRAME FOR LAKERS
lakers = {
"player" : ['Lebron James',
'Russell Westbrook', 'Anthony Davis', 'Dwight Howard', 'Avery Bradley', 'DeAndre Jordan',
'Carmelo Anthony', 'Austin Reaves', 'Kent Bazemore', 'Malik Monk', 'Stanley Johnson',
'Trevor Ariza', 'Wayne Ellington', 'Talen Horton-Tucker'],
"PPG": [28.7, 19.3, 23.3, 5, 6.9, 4.5, 13.4, 5.4, 4.1, 12.1, 6.2, 3.5, 6.4, 10.9],
"in_starting_lineup": [True, True, True, False, True, False, False, False, False, True, False, False, False, False]
}
lakers = pd.DataFrame(lakers)
lakers
player | PPG | in_starting_lineup | |
---|---|---|---|
0 | Lebron James | 28.7 | True |
1 | Russell Westbrook | 19.3 | True |
2 | Anthony Davis | 23.3 | True |
3 | Dwight Howard | 5.0 | False |
4 | Avery Bradley | 6.9 | True |
5 | DeAndre Jordan | 4.5 | False |
6 | Carmelo Anthony | 13.4 | False |
7 | Austin Reaves | 5.4 | False |
8 | Kent Bazemore | 4.1 | False |
9 | Malik Monk | 12.1 | True |
10 | Stanley Johnson | 6.2 | False |
11 | Trevor Ariza | 3.5 | False |
12 | Wayne Ellington | 6.4 | False |
13 | Talen Horton-Tucker | 10.9 | False |
Exploring a DataFrame
Since Python is an object-oriented programming language, creating a DataFrame means creating an object of the DataFrame class. This also means that there are a number of different attributes that we can explore and methods that we can apply to the DataFrame. While we use these more often in situations when we aren’t familiar with the dataset (say from importing it from somewhere), they are nonetheless useful.
Whenever a dataset is loaded into Python as a DataFrame, it’s best to look at its structure. There are a number of different attributes that can provide that info:
- DataFrame.shape — returns a tuple indicating the number of rows and columns of the DataFrame
- DataFrame.size — returns an integer value of the number of datapoints in the dataset
print(netflix.shape) print(netflix.size)
(7787, 12) 93444
If you were to explore the axes of the DataFrame, you may do so by having an array return the listed columns and index via DataFrame.columns and DataFrame.index. On the other hand, it may be useful to look at the different types of data that makes up the dataset. In these situations, the DataFrame.dtypes is used.
print(netflix.columns)
print("")
print(netflix.index)
print("")
print(netflix.dtypes)
Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
'release_year', 'rating', 'duration', 'listed_in', 'description'],
dtype='object')
RangeIndex(start=0, stop=7787, step=1)
show_id | object |
type | object |
title | object |
director | object |
cast | object |
country | object |
date_added | object |
release_year | int64 |
rating | object |
duration | object |
listed_in | object |
description | object |
dtype: object |
Manipulating a DataFrame
Now that we know what DataFrame is, it’s time to do some real work! Principally, this involves manipulating it as part of the data cleaning and data wrangling process, just prior to the actual analysis. Now there are a number of basic operations that should be in everyone’s repertoire — the first one is being able to access and isolate a given segment of a DataFrame.
To segment a DataFrame, we use either the DataFrame.loc attribute or DataFrame.iloc attribute where the input dictates which rows or columns are extracted ([rows: columns]). If a column needs to be isolated, then the process would be to use square brackets with the name of the given column.
print(netflix['title'])
print("")
print(netflix.loc[0])
0 | 3 |
---|---|
1 | 7:19 |
2 | 23:59 |
3 | 9 |
4 | 21 |
... | |
7782 | Zozo |
7783 | Zubaan |
7784 | Zulu Man in Japan |
7785 | Zumbo's Just Desserts |
7786 | ZZ TOP: THAT LITTLE OL' BAND FROM TEXAS |
Name: title, Length: 7787, dtype: object |
show_id | s1 |
---|---|
type | TV Show |
title | 3 |
director | NaN |
cast | João Miguel, Bianca Comparato, Michel Gomes, R... |
country | Brazil |
date_added | 14-Aug-20 |
release_year | 2020 |
rating | TV-MA |
duration | 4 Seasons |
listed_in | International TV Shows, TV Dramas, TV Sci-Fi &... |
description | In a future where the elite inhabit an island ... |
Name: 0, dtype: object |
In the event that we need to extract multiple rows or columns, we use the slice method, which involves using a “:” that indicates a continuous range with the end range being exclusive (i.e., not included) or by inputting criteria within square brackets in a similar manner as indexing with Boolean with NumPy.
# Select the first two row of the raptors DataFrame
raptors.iloc[0:2]
Player | PPG | College | is_starting_five | |
---|---|---|---|---|
Power Forward | Pascal Siakim | 23.7 | University of New Mexico | True |
Point Guard | Fred VanVleet | 20.1 | Whicita State | True |
# Select the last three columns of the raptors DataFrame
raptors.iloc[:, 1:4]
PPG | College | is_starting_five | |
---|---|---|---|
Power Forward | 23.7 | University of New Mexico | True |
Point Guard | 20.1 | Whicita State | True |
Small Forward | 15.1 | Florida State | True |
Center | 8.7 | Oregon | False |
# Select players that averaged more than 15 PPG on the Laker DataFrame
lakers[lakers["PPG"] > 15]
player | PPG | in_starting_lineup | |
---|---|---|---|
0 | Lebron James | 28.7 | True |
1 | Russell Westbrook | 19.3 | True |
2 | Anthony Davis | 23.3 | True |
While the above examples are simplistic, it's possible to make it more powerful and sophisticated with the use of operators such as AND (&), OR (|), NOT EQUAL TO (!=) or EQUAL TO (==). To set this up, let’s create a new DataFrame containing information about current UFC champions:
# DATAFRAME FOR UFC CHAMPS
ufc_champs = {
"names" : ['Francis Ngannou', "Glover Teixeira",
"Israel Adesanya", "Kamaru Usman",
"Charles Oliveira", "Alex Volkanowski",
'Aljamain Sterling', "Brandon Morano",
"Julianna Pena", "Valentina Shevchenko", "Rose Namajunas"],
"nicknames" : ['The Predator', None, "The Last Stylebender",
"The Nigerian Nightmare", "Da Bronx", "The Great",
"Funk Master", "The Assassin Baby", "The Venezulean Vixen",
"The Bullet", "Thug"],
"wins" : [16,33,21,19,31,23,20,19,11,22,11],
"losses" : [3,7,1,1,8,1,3,5,4,3,4],
"weightclass" : ['Heavyweight', "Light Heavyweight", "Middleweight",
'Welterweight', "Lightweight", "Featherweight",
'Bantamweight', "Flyweight", "Bantamweight", "Flyweight", "Strawweight"]
}
ufc_champs = pd.DataFrame(ufc_champs)
ufc_champs
names | nicknames | wins | losses | weightclass | country | |
---|---|---|---|---|---|---|
0 | Francis Ngannou | The Predator | 16 | 3 | Heavyweight | Cameroon |
1 | Glover Teixeira | None | 33 | 7 | Light Heavyweight | Brazil |
2 | Israel Adesanya | The Last Stylebender | 21 | 1 | Middleweight | New Zealand |
3 | Kamaru Usman | The Nigerian Nightmare | 19 | 1 | Welterweight | USA |
4 | Charles Oliveira | Da Bronx | 31 | 8 | Lightweight | Brazil |
5 | Alex Volkanowski | The Great | 23 | 1 | Featherweight | Australia |
6 | Aljamain Sterling | Funk Master | 20 | 3 | Bantamweight | USA |
7 | Brandon Morano | The Assassin Baby | 19 | 5 | Flyweight | Mexico |
8 | Julianna Pena | The Venezulean Vixen | 11 | 4 | Bantamweight | USA |
9 | Valentina Shevchenko | The Bullet | 22 | 3 | Flyweight | Kyrgyzstan |
10 | Rose Namajunas | Thug | 11 | 4 | Strawweight | USA |
# Select champion that came from Brazil + is the Light Heavyweight Champ
ufc_champs[(ufc_champs['country'] == "Brazil") & (ufc_champs['weightclass'] == "Light Heavyweight")]
names | nicknames | wins | losses | weightclass | country | |
---|---|---|---|---|---|---|
1 | Glover Teixeira | None | 33 | 7 | Light Heavyweight | Brazil |
# Select champion(s) if home country is Brazil OR has at Less than 2 losses
ufc_champs[(ufc_champs['country'] == "Brazil") | (ufc_champs['losses'] < 2)]
names | nicknames | wins | losses | weightclass | country | |
---|---|---|---|---|---|---|
1 | Glover Teixeira | None | 33 | 7 | Light Heavyweight | Brazil |
2 | Israel Adesanya | The Last Stylebender | 21 | 1 | Middleweight | New Zealand |
3 | Kamaru Usman | The Nigerian Nightmare | 19 | 1 | Welterweight | USA |
4 | Charles Oliveira | Da Bronx | 31 | 8 | Lightweight | Brazil |
5 | Alex Volkanowski | The Great | 23 | 1 | Featherweight | Australia |
# Select chamption with less than 20 wins AND home country is not USA
print(ufc_champs[(ufc_champs['wins'] < 20) & (ufc_champs["country"] != "USA")])
names | nicknames | wins | losses | weightclass | country | |
---|---|---|---|---|---|---|
0 | Francis Ngannou | The Predator | 16 | 3 | Heavyweight | Cameroon |
7 | Brandon Morano | The Assassin Baby | 19 | 5 | Flyweight | Mexico |
Aside from filtering out a DataFrame or segmenting it, it's also possible to use the DataFrame.iloc()
and DataFrame.loc()
attributes in changing a particular value. Let’s look at an example of how we can do this with real data. Suppose that the Lakers make a change in the starting five lineup where we replace Anthony Davis with DeAndre Jordan. We can do that with DataFrame.iloc()
like this:
# The Lakers made a change in the starting 5 lineup where we replace Anthony Davis with DeAndre Jordan
lakers.iloc[2,2] = False
lakers.loc[5,"in_starting_lineup"] = True
lakers
player | PPG | in_starting_lineup | |
---|---|---|---|
0 | Lebron James | 28.7 | True |
1 | Russell Westbrook | 19.3 | True |
2 | Anthony Davis | 23.3 | False |
3 | Dwight Howard | 5.0 | False |
4 | Avery Bradley | 6.9 | True |
5 | DeAndre Jordan | 4.5 | True |
6 | Carmelo Anthony | 13.4 | False |
7 | Austin Reaves | 5.4 | False |
8 | Kent Bazemore | 4.1 | False |
9 | Malik Monk | 12.1 | True |
10 | Stanley Johnson | 6.2 | False |
11 | Trevor Ariza | 3.5 | False |
12 | Wayne Ellington | 6.4 | False |
13 | Talen Horton-Tucker | 10.9 | False |
In some situations, it may be necessary to insert or delete data from a DataFrame. We can insert or delete a row using the append()
and drop()
methods. With the append method, you would use a panda.Series object that matches the dimensions of a DataFrame as the argument for the function. For the drop method, all that we need to state is the index/column label within the DataFrame that we need to drop. Both methods contain the axis argument that specifies whether to add or drop a row or column.
Recall that we created a DataFrame consisting of UFC champions that held the title at the end of 2021 with their monikers and win-loss record:
# Say we create a DataFrame consisting of UFC champions that held the title at the end of 2021 with their monikers and win-loss record
ufc_champs = {
"names": ['Francis Ngannou', "Glover Teixeira",
"Israel Adesanya", "Kamaru Usman",
"Charles Oliveira", "Alex Volkanowski",
'Aljamain Sterling', "Brandon Morano",
"Julianna Pena", "Valentina Shevchenko", "Rose Namajunas"],
"nicknames": ['The Predator', None, "The Last Stylebender",
"The Nigerian Nightmare", "Da Bronx", "The Great",
"Funk Master", "The Assassin Baby", "The Venezulean Vixen",
"The Bullet", "Thug"],
"wins": [16,33,21,19,31,23,20,19,11,22,11],
"losses": [3,7,1,1,8,1,3,5,4,3,4],
"weightclass": ['Heavyweight', "Light Heavyweight", "Middleweight",
'Welterweight', "Lightweight", "Featherweight",
'Bantamweight', "Flyweight", "Bantamweight", "Flyweight", "Strawweight"]
}
ufc_champs = pd.DataFrame(ufc_champs)
ufc_champs
# In making this DataFrame, we forgot to include the Women’s Featherweight champ
ufc_champs.append(pd.Series(data = ["Amanda Nunes", "Lioness", 21, 5, "Featherweight"], index = ufc_champs.columns, name = 17))
names | nicknames | wins | losses | weightclass | |
---|---|---|---|---|---|
0 | Francis Ngannou | The Predator | 16 | 3 | Heavyweight |
1 | Glover Teixeira | None | 33 | 7 | Light Heavyweight |
2 | Israel Adesanya | The Last Stylebender | 21 | 1 | Middleweight |
3 | Kamaru Usman | The Nigerian Nightmare | 19 | 1 | Welterweight |
4 | Charles Oliveira | Da Bronx | 31 | 8 | Lightweight |
5 | Alex Volkanowski | The Great | 23 | 1 | Featherweight |
6 | Aljamain Sterling | Funk Master | 20 | 3 | Bantamweight |
7 | Brandon Morano | The Assassin Baby | 19 | 5 | Flyweight |
8 | Julianna Pena | The Venezulean Vixen | 11 | 4 | Bantamweight |
9 | Valentina Shevchenko | The Bullet | 22 | 3 | Flyweight |
10 | Rose Namajunas | Thug | 11 | 4 | Strawweight |
17 | Amanda Nunes | Lioness | 21 | 5 | Featherweight |
ufc_champs.drop([0], axis = 0) # drops first row
names | nicknames | wins | losses | weightclass | |
---|---|---|---|---|---|
1 | Glover Teixeira | None | 33 | 7 | Light Heavyweight |
2 | Israel Adesanya | The Last Stylebender | 21 | 1 | Middleweight |
3 | Kamaru Usman | The Nigerian Nightmare | 19 | 1 | Welterweight |
4 | Charles Oliveira | Da Bronx | 31 | 8 | Lightweight |
5 | Alex Volkanowski | The Great | 23 | 1 | Featherweight |
6 | Aljamain Sterling | Funk Master | 20 | 3 | Bantamweight |
7 | Brandon Morano | The Assassin Baby | 19 | 5 | Flyweight |
8 | Julianna Pena | The Venezulean Vixen | 11 | 4 | Bantamweight |
9 | Valentina Shevchenko | The Bullet | 22 | 3 | Flyweight |
10 | Rose Namajunas | Thug | 11 | 4 | Strawweight |
In the case for adding a column, the process would be similar to that of adding an item into a dictionary.
weight_limit = [265,205,185,170,155,145,135,125,135,125,115]
ufc_champs['weight_limit'] = weight_limit
ufc_champs
names | nicknames | wins | losses | weightclass | weight_limit | |
---|---|---|---|---|---|---|
0 | Francis Ngannou | The Predator | 16 | 3 | Heavyweight | 265 |
1 | Glover Teixeira | None | 33 | 7 | Light Heavyweight | 205 |
2 | Israel Adesanya | The Last Stylebender | 21 | 1 | Middleweight | 185 |
3 | Kamaru Usman | The Nigerian Nightmare | 19 | 1 | Welterweight | 170 |
4 | Charles Oliveira | Da Bronx | 31 | 8 | Lightweight | 155 |
5 | Alex Volkanowski | The Great | 23 | 1 | Featherweight | 145 |
6 | Aljamain Sterling | Funk Master | 20 | 3 | Bantamweight | 135 |
7 | Brandon Morano | The Assassin Baby | 19 | 5 | Flyweight | 125 |
8 | Julianna Pena | The Venezulean Vixen | 11 | 4 | Bantamweight | 135 |
9 | Valentina Shevchenko | The Bullet | 22 | 3 | Flyweight | 125 |
10 | Rose Namajunas | Thug | 11 | 4 | Strawweight | 115 |
Sometimes with datasets, the labels used in identifying a column may not accurately describe its property. To change these labels, we can use the DataFrame.rename()
method, which takes in the index argument (specifies the labels for the index in a dictionary-like format), columns argument (specifies the labels for the columns in a dictionary-like format), and the in-place argument that determines whether to return a new DataFrame.
# Let’s say we need to rename the columns of the UFC champ DataFrame by capitalizing the labels
ufc_champs.rename(columns = {'names': 'Names',
'nicknames': 'Nicknames',
'wins': 'Wins',
'losses': 'Losses',
'weightclass': 'Weight Class',
'weight_limit': 'Weight Limit'})
Names | Nicknames | Wins | Losses | Weight Class | Weight Limit | |
---|---|---|---|---|---|---|
0 | Francis Ngannou | The Predator | 16 | 3 | Heavyweight | 265 |
1 | Glover Teixeira | None | 33 | 7 | Light Heavyweight | 205 |
2 | Israel Adesanya | The Last Stylebender | 21 | 1 | Middleweight | 185 |
3 | Kamaru Usman | The Nigerian Nightmare | 19 | 1 | Welterweight | 170 |
4 | Charles Oliveira | Da Bronx | 31 | 8 | Lightweight | 155 |
5 | Alex Volkanowski | The Great | 23 | 1 | Featherweight | 145 |
6 | Aljamain Sterling | Funk Master | 20 | 3 | Bantamweight | 135 |
7 | Brandon Morano | The Assassin Baby | 19 | 5 | Flyweight | 125 |
8 | Julianna Pena | The Venezulean Vixen | 11 | 4 | Bantamweight | 135 |
9 | Valentina Shevchenko | The Bullet | 22 | 3 | Flyweight | 125 |
10 | Rose Namajunas | Thug | 11 | 4 | Strawweight | 115 |
Lastly, there may be some cases when we need to reshape the current makeup of the dataset to make it suitable for data analysis. While it's certainly possible to manually remake another DataFrame, it’ll be easier to transform it. In Pandas, there are three different transformation functions that we can use to reshape the DataFrame:
Method 1 — Pivoting
This transformation is essentially taking a longer-format DataFrame and making it broader. Often this is a result of having a unique identifier repeated along multiple rows for each subsequent entry. One method to derive a newly formatted DataFrame is by using DataFrame.pivot. This method requires defining which of the data columns will be used as the new index and index as well as values for the DataFrame.
# Say we are creating a DataFrame that maps out the voting for the season MVP of the 2021-2022 NBA season from 5 different sports journalist/reporter/pundits in sports media
mvp_vote_2022 = pd.DataFrame({
"Voter": ['Kenny Smith', 'Kenny Smith', 'Kenny Smith', 'Kenny Smith', 'Kenny Smith',
"Charles Barkley", "Charles Barkley", "Charles Barkley", "Charles Barkley", "Charles Barkley",
"Ernie Johnson", "Ernie Johnson", "Ernie Johnson", "Ernie Johnson", "Ernie Johnson",
"Michael Wilbon", "Michael Wilbon", "Michael Wilbon", "Michael Wilbon", "Michael Wilbon",
"Doris Burke", "Doris Burke", "Doris Burke", "Doris Burke", "Doris Burke"
],
"Player": ['Steph Curry', 'Lebron James', 'Chris Paul', 'Kevin Durant', 'Giannis Antetokounmpo',
'Steph Curry', 'Nikola Jokic', 'Giannis Antetokounmpo', 'Chris Paul', 'DeMar DeRozan',
'Steph Curry', 'Giannis Antetokounmpo', 'Kevin Durant', 'Nikola Jokic', 'Joel Embid',
'Kevin Durant', 'Giannis Antetokounmpo', 'Nikola Jokic', 'Steph Curry', 'DeMar DeRozan',
'Kevin Durant', 'Giannis Antetokounmpo', 'Steph Curry', 'Nikola Jokic', 'DeMar DeRozan',],
"Placing": [1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5]
})
# Say if we need to switch the index to indicate voters as the index values and columns with the MVP placings from each voter
mvp_vote_2022.pivot(index = 'Voter', columns = 'Placing')
Player | |||||
---|---|---|---|---|---|
Placing | 1 | 2 | 3 | 4 | 5 |
Voter | |||||
Charles Barkley | Steph Curry | Nikola Jokic | Giannis Antetokounmpo | Chris Paul | DeMar DeRozan |
Doris Burke | Kevin Durant | Giannis Antetokounmpo | Steph Curry | Nikola Jokic | DeMar DeRozan |
Ernie Johnson | Steph Curry | Giannis Antetokounmpo | Kevin Durant | Nikola Jokic | Joel Embid |
Kenny Smith | Steph Curry | Lebron James | Chris Paul | Kevin Durant | Giannis Antetokounmpo |
Michael Wilbon | Kevin Durant | Giannis Antetokounmpo | Nikola Jokic | Steph Curry | DeMar DeRozan |
The DataFrame.pivot
method does not allow rows with duplicate values for a given column. If this uniqueness isn’t guaranteed, an alternative approach would be to use the DataFrame.pivot_table()
function instead. This approach requires arguments used to specify the index, column, and values. Uniquely, this function also has an additional argument “aggfunc” (default is numpy.mean), which passes a function to aggregate the values of a DataFrame.
# Create a new DataFrame that shows the median placing for NBA MVP
mvp_vote_2022.pivot_table(index = 'Player', values = 'Placing', aggfunc = np.median).sort_values(by= 'Placing')
# the .sort_values() method is used to arrange the DataFrame by some existing variable in either ascending or descending order
Placing | |
---|---|
Player | |
Steph Curry | 1.0 |
Giannis Antetokounmpo | 2.0 |
Kevin Durant | 2.0 |
Lebron James | 2.0 |
Chris Paul | 3.5 |
Nikola Jokic | 3.5 |
DeMar DeRozan | 5.0 |
Joel Embid | 5.0 |
Method 2 — Stacking/Unstacking
Sometimes a DataFrame may have multiple indices that’ll look something like this:
array = [["Month 1", "Month 2", "Month 3", "Month 4"],
['Squat', "Squat", "Squat", "Squat"]]
array2 = [['Arnold', "Arnold","Larry", "Larry"],
["Before", "After", "Before", "After"]]
values = [[225, 245, 275, 315], [335, 355, 365, 405], [315, 365, 435, 455], [495, 405, 545, 585]]
values = np.array(values)
effect_of_ped = pd.DataFrame(data = values, index = array2, columns = array)
effect_of_ped
Month 1 | Month 2 | Month 3 | Month 4 | ||
---|---|---|---|---|---|
Squat | Squat | Squat | Squat | ||
Arnold | Before | 225 | 245 | 275 | 315 |
After | 335 | 355 | 365 | 405 | |
Larry | Before | 315 | 365 | 435 | 455 |
After | 495 | 405 | 545 | 585 |
It's often difficult to make sense of the data or address it for analysis. So, functions such as stack()
or unstack()
make it possible to make it longer or broader, respectively.
effect_of_ped.stack()
Month 3 | Month 1 | Month 2 | Month 4 | |||
---|---|---|---|---|---|---|
Arnold | Before | Squat | 275 | 225 | 245 | 315 |
After | Squat | 365 | 335 | 355 | 405 | |
Larry | Before | Squat | 435 | 315 | 365 | 455 |
After | Squat | 545 | 495 | 405 | 585 |
effect_of_ped.unstack()
Month 1 | Month 2 | Month 3 | Month 4 | |||||
---|---|---|---|---|---|---|---|---|
Squat | Squat | Squat | Squat | |||||
After | Before | After | Before | After | Before | After | Before | |
Arnold | 335 | 225 | 355 | 245 | 365 | 275 | 405 | 315 |
Larry | 495 | 315 | 405 | 365 | 545 | 435 | 585 | 455 |
Method 3 — Melting
Known as unpivoting a DataFrame, this works by essentially converting a wide-format DataFrame to a long-format. This usually occurs when more than one column works as an identifier for a given analysis. In order to transform the DataFrame to a longer format, we’ll need to use the DataFrame.melt()
function, which requires establishing which columns are to be used as the identifier variable and the columns to “unpivot” to correspond to the values for said identifier.
data = pd.DataFrame({"Name": ["Pankaj", "Lisa", "David"], "ID": [1, 2, 3], "Role": ["CEO", "Editor", "Author"]})
data
# Let's say the identifier is ID and the values would be Name + Role
pd.melt(data, id_vars = ["ID"], value_vars = ['Name', 'Role'])
ID | variable | value | |
---|---|---|---|
0 | 1 | Name | Pankaj |
1 | 2 | Name | Lisa |
2 | 3 | Name | David |
3 | 1 | Role | CEO |
4 | 2 | Role | Editor |
5 | 3 | Role | Author |
So far, we’ve only scratched the surface of DataFrames. There are many more functions and methods that can operate on these data structures within Python to gain deeper insights into your data. You can find these in the Pandas DataFrame reference guide. However, a great place to start is with the Pandas and NumPy Fundamentals course on Dataquest. Once you know the fundamentals, progress to working with data in Python in some of the other courses in the Data Analyst career path.