January 31, 2022

Tutorial: How to Create and Use a Pandas DataFrame

When it comes to exploring data with Python, DataFrames make analyzing and manipulating data for analysis easy. This article will look at some of the ins and outs when it comes to working with DataFrames.

Python is a powerful tool when it comes to working with data. Qualities like its scalability and variety of libraries for data analysis and data science applications make it versatile. However, what’s often under-appreciated-but-highly-valuable about Python is the ease with which we can manipulate data with flexible data structures. One of these structures is a DataFrame.

What’s a DataFrame?

To start, it’s important to know that there are a variety of different structures that data can take. For the majority of cases, most data are in tabular form (i.e., data structured into rows representing a single entry). You are likely already familiar with this if you’ve ever worked with an Excel spreadsheet or a SQL table. Aggregates of each of these rows, that represent a given data entry, and their properties are formed into a two-dimensional structure where titled columns consist of values of the same property.

These structures have several unique qualities:

Rows — representing a singular data entry point
Columns — corresponding to a grouping relating to a singular quality of each given data point that are usually titled
Index — a unique identifier for each data entry

Although these can have different names depending on the programming language or application tool being used, in Python, we call these structures DataFrames. The principal library used in working with these structures is Pandas.

How do you make a DataFrame?

When it comes to creating a DataFrame, you can either import it from an external file or create it yourself in Python.

Method 1 — Import Data from a File

In the real world, a dataset is often read into Python via an external source that curated it. We can find these datasets in multiple types of files, but we most commonly find them in the form of comma separated value files (CSVs). Fortunately, in the Pandas library, it has a function that works to convert the data in this format into a DataFrame called pandas.read_csv(). The only major argument that it requires is a pathway that outlines where the file exists.

One pathway may be from the web (i.e., from an API or a GitHub repository)

import pandas as pd
import numpy as np 
import datetime as dt

# While not necessarily the case, you’ll often need to load the numpy library when working with the pandas library 
url = 'https://raw.githubusercontent.com/Vibe1990/Netflix-Project/main/netflix_title.csv'
netflix = pd.read_csv(url)
#  When providing the URL pathway, this will need to be in the form of a raw string, otherwise it will result in an error
netflix

	show_id	type	title	director	cast	country	date_added	release_year	rating	duration	listed_in	description
0	s1	TV Show	3	NaN	João Miguel, Bianca Comparato, Michel Gomes, R...	Brazil	14-Aug-20	2020	TV-MA	4 Seasons	International TV Shows, TV Dramas, TV Sci-Fi &...	In a future where the elite inhabit an island ...
1	s2	Movie	7:19	Jorge Michel Grau	Demián Bichir, Héctor Bonilla, Oscar Serrano, ...	Mexico	23-Dec-16	2016	TV-MA	93 min	Dramas, International Movies	After a devastating earthquake hits Mexico Cit...
2	s3	Movie	23:59	Gilbert Chan	Tedd Chan, Stella Chung, Henley Hii, Lawrence ...	Singapore	20-Dec-18	2011	R	78 min	Horror Movies, International Movies	When an army recruit is found dead, his fellow...
3	s4	Movie	9	Shane Acker	Elijah Wood, John C. Reilly, Jennifer Connelly...	United States	16-Nov-17	2009	PG-13	80 min	Action & Adventure, Independent Movies, Sci-Fi...	In a postapocalyptic world, rag-doll robots hi...
4	s5	Movie	21	Robert Luketic	Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...	United States	01-Jan-20	2008	PG-13	123 min	Dramas	A brilliant group of students become card-coun...

Alternatively, if a file were stored on your computer in a working directory, then the path would adjust accordingly. In this process, we could use either the relative or full path to specify the pathway to retrieve a given file because the function can decipher the difference between the two without an issue.

# Assuming you've set up your notebook to have the desired working directory set

netflix = pd.read_csv(“netflix_title.csv”)
netflix

	show_id	type	title	director	cast	country	date_added	release_year	rating	duration	listed_in	description
0	s1	TV Show	3	NaN	João Miguel, Bianca Comparato, Michel Gomes, R...	Brazil	14-Aug-20	2020	TV-MA	4 Seasons	International TV Shows, TV Dramas, TV Sci-Fi &...	In a future where the elite inhabit an island ...
1	s2	Movie	7:19	Jorge Michel Grau	Demián Bichir, Héctor Bonilla, Oscar Serrano, ...	Mexico	23-Dec-16	2016	TV-MA	93 min	Dramas, International Movies	After a devastating earthquake hits Mexico Cit...
2	s3	Movie	23:59	Gilbert Chan	Tedd Chan, Stella Chung, Henley Hii, Lawrence ...	Singapore	20-Dec-18	2011	R	78 min	Horror Movies, International Movies	When an army recruit is found dead, his fellow...
3	s4	Movie	9	Shane Acker	Elijah Wood, John C. Reilly, Jennifer Connelly...	United States	16-Nov-17	2009	PG	80 min	Action & Adventure, Independent Movies, Sci-Fi...	In a postapocalyptic world, rag-doll robots hi...
4	s5	Movie	21	Robert Luketic	Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...	United States	01-Jan-20	2008	PG	123 min	Dramas	A brilliant group of students become card-coun...
...	...	...	...	...	...	...	...	...	...	...	...	...
7782	s7783	Movie	Zozo	Josef Fares	Imad Creidi, Antoinette Turk, Elias Gergi, Car...	Sweden, Czech Republic, United Kingdom, Denmar...	19-Oct-20	2005	TV-MA	99 min	Dramas, International Movies	When Lebanon's Civil War deprives Zozo of his ...
7783	s7784	Movie	Zubaan	Mozez Singh	Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...	India	02-Mar-19	2015	TV-14	111 min	Dramas, International Movies, Music & Musicals	A scrappy but poor boy worms his way into a ty...
7784	s7785	Movie	Zulu Man in Japan	NaN	Nasty C	NaN	25-Sep-20	2019	TV-MA	44 min	Documentaries, International Movies, Music & M...	In this documentary, South African rapper Nast...
7785	s7786	TV Show	Zumbo's Just Desserts	NaN	Adriano Zumbo, Rachel Khoo	Australia	31-Oct-20	2019	TV-PG	1 Season	International TV Shows, Reality TV	Dessert wizard Adriano Zumbo looks for the nex...
7786	s7787	Movie	ZZ TOP: THAT LITTLE OL' BAND FROM TEXAS	Sam Dunn	NaN	United Kingdom, Canada, United States	01-Mar-20	2019	TV-MA	90 min	Documentaries, Music & Musicals	This documentary delves into the mystique behi...

7787 rows × 12 columns

netflix.head(5)

show_id	type	title	director	cast	country	date_added	release_year	rating	duration	listed_in	description
0	s1	TV Show	3	NaN	João Miguel, Bianca Comparato, Michel Gomes, R...	Brazil	14-Aug-20	2020	TV-MA	4 Seasons	International TV Shows, TV Dramas, TV Sci-Fi &...	In a future where the elite inhabit an island ...
1	s2	Movie	7:19	Jorge Michel Grau	Demián Bichir, Héctor Bonilla, Oscar Serrano, ...	Mexico	23-Dec-16	2016	TV-MA	93 min	Dramas, International Movies	After a devastating earthquake hits Mexico Cit...
2	s3	Movie	23:59	Gilbert Chan	Tedd Chan, Stella Chung, Henley Hii, Lawrence ...	Singapore	20-Dec-18	2011	R	78 min	Horror Movies, International Movies	When an army recruit is found dead, his fellow...
3	s4	Movie	9	Shane Acker	Elijah Wood, John C. Reilly, Jennifer Connelly...	United States	16-Nov-17	2009	PG-13	80 min	Action & Adventure, Independent Movies, Sci-Fi...	In a postapocalyptic world, rag-doll robots hi...
4	s5	Movie	21	Robert Luketic	Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...	United States	01-Jan-20	2008	PG-13	123 min	Dramas	A brilliant group of students become card-coun...

Although CSV files are the most common, there are a number of different functions that are available in Pandas to read in files of various types into a DataFrame that operates with the same general process:

File type	Function in Pandas
JASON	read_json()
HTML	read_html()
XML	read_xml()
SQL	read_sql()
Excel	read_excel()

METHOD 2 – Creating DataFrames Yourself

While not the most common method of creating a DataFrame, you can certainly create a data frame yourself by inputting data. We can accomplish this with the pandas.DataFrame() function, which takes its data input argument and converts it into a DataFrame. The pandas.DataFrame function is quite robust in that it can take in a variety of different data inputs:

Nothing – this will make an empty DataFrame that you can populate with data later

# Creating an empty DataFrame
data = pd.DataFrame()
print(data)

Empty DataFrame
Columns: []
Index: []

A dictionary of ndarrays / lists

# DataFrame for Pawnee City Hall

Pawnee_city_hall = {
    "Personnel": ["Leslie Knope", "Ron Swanson", "Ann Perkins", "Tom Haverford", "Mark Brendanawicz", "April Ludgate", "Andy Dwyer", "Ben Wyatt", "Chris Traeger","Jerry Gergich", "Donna Meagle", "Craig Middlebrooks"],
    "Position":["Deputy Director", "Director", "Health Representative", "Administrator", "City Planner", "Assistant - Director", "Assistant - Deputy Director", "Deputy City Manager", "City Manger", "Administrator", "Office Manger", "Assistant Office Manager"]
}

Pawnee_city_hall = pd.DataFrame(Pawnee_city_hall)
Pawnee_city_hall

	Personnel	Position
0	Leslie Knope	Deputy Director
1	Ron Swanson	Director
2	Ann Perkins	Health Representative
3	Tom Haverford	Administrator
4	Mark Brendanawicz	City Planner
5	April Ludgate	Assistant - Director
6	Andy Dwyer	Assistant - Deputy Director
7	Ben Wyatt	Deputy City Manager
8	Chris Traeger	City Manger
9	Jerry Gergich	Administrator
10	Donna Meagle	Office Manger
11	Craig Middlebrooks	Assistant Office Manager

A dictionary of series (a one-dimensional array of data with an axis label)

# Another way to create a DataFrame for Pawnee City Hall
personnel = pd.Series(["Leslie Knope", "Ron Swanson", "Ann Perkins", "Tom Haverford", "Mark Brendanawicz", "April Ludgate", "Andy Dwyer", "Ben Wyatt", "Chris Traeger","Jerry Gergich", "Donna Meagle", "Craig Middlebrooks"])
position = pd.Series(["Deputy Director", "Director", "Health Representative", "Administrator", "City Planner", "Assistant - Director", "Assistant - Deputy Director", "Deputy City Manager", "City Manger", "Administrator", "Office Manger", "Assistant Office Manager"])

Pawnee_city_hall = pd.DataFrame({"Names":personnel, "Job":position})
Pawnee_city_hall

	Names	Job
0	Leslie Knope	Deputy Director
1	Ron Swanson	Director
2	Ann Perkins	Health Representative
3	Tom Haverford	Administrator
4	Mark Brendanawicz	City Planner
5	April Ludgate	Assistant - Director
6	Andy Dwyer	Assistant - Deputy Director
7	Ben Wyatt	Deputy City Manager
8	Chris Traeger	City Manger
9	Jerry Gergich	Administrator
10	Donna Meagle	Office Manger
11	Craig Middlebrooks	Assistant Office Manager

A list of lists

# Another way to create a DataFrame for Pawnee City Hall

data = [
    ["Leslie Knope", "Deputy Director"], 
    ["Ron Swanson", "Director"], 
    ["Tom Haverford", "Administrator"], 
    ["April Ludgate", "Assistant-Director"], 
    ["Donna Meagle", "Office Manager"],
    ["Andy Dwyer", "Assistant - Deputy Director"], 
    ['Jerry Gergich', "Administrator"]]

Pawnee_city_hall = pd.DataFrame(data, columns = ["Names", "Position"])
Pawnee_city_hall

	Names	Position
0	Leslie Knope	Deputy Director
1	Ron Swanson	Director
2	Tom Haverford	Administrator
3	April Ludgate	Assistant-Director
4	Donna Meagle	Office Manager
5	Andy Dwyer	Assistant - Deputy Director
6	Jerry Gergich	Administrator

A list of dictionaries

# DATAFRAME FOR RAPTORS

raptors = [{"Player": "Pascal Siakim", "PPG": 23.7, "College":"University of New Mexico", "is_starting_five": True},
           {"Player": "Fred VanVleet", "PPG": 20.1, "College":"Wichita State", "is_starting_five": True},
           {"Player": "Scottie Barnes", "PPG": 15.1, "College":"Florida State", "is_starting_five": True},
           {"Player": "Chris Boucher", "PPG": 8.7, "College":"Oregon", "is_starting_five": False}
          ]

raptors = pd.DataFrame(raptors, index = ['Power Forward', "Point Guard", "Small Forward", "Center"])
raptors

	Player	PPG	College	is_starting_five	is_starting_five
Power Forward	Pascal Siakim	23.7	University of New Mexico	True	True
Point Guard	Fred VanVleet	20.1	Whicita State	True	True
Small Forward	Scottie Barnes	15.1	Florida State	True	True
Center	Chris Boucher	8.7	Oregon	False	False

NOTE: the pandas.DataFrame function also has the index and column argument that’s used to name the row index and column titles respectively.

Although the same data types are used in the examples above (strings), DataFrames can consist of a variety of different data types, such as integers, floats, lists, datetimes, Booleans, list, etc.

# DATAFRAME FOR LAKERS

lakers = {
"player" : ['Lebron James', 
              'Russell Westbrook', 'Anthony Davis', 'Dwight Howard', 'Avery Bradley', 'DeAndre Jordan', 
              'Carmelo Anthony', 'Austin Reaves', 'Kent Bazemore', 'Malik Monk', 'Stanley Johnson', 
              'Trevor Ariza', 'Wayne Ellington', 'Talen Horton-Tucker'],
"PPG": [28.7, 19.3, 23.3, 5, 6.9, 4.5, 13.4, 5.4, 4.1, 12.1, 6.2, 3.5, 6.4, 10.9],
"in_starting_lineup": [True, True, True, False, True, False, False, False, False, True, False, False, False, False]
}

lakers = pd.DataFrame(lakers)
lakers

	player	PPG	in_starting_lineup
0	Lebron James	28.7	True
1	Russell Westbrook	19.3	True
2	Anthony Davis	23.3	True
3	Dwight Howard	5.0	False
4	Avery Bradley	6.9	True
5	DeAndre Jordan	4.5	False
6	Carmelo Anthony	13.4	False
7	Austin Reaves	5.4	False
8	Kent Bazemore	4.1	False
9	Malik Monk	12.1	True
10	Stanley Johnson	6.2	False
11	Trevor Ariza	3.5	False
12	Wayne Ellington	6.4	False
13	Talen Horton-Tucker	10.9	False

Exploring a DataFrame

Since Python is an object-oriented programming language, creating a DataFrame means creating an object of the DataFrame class. This also means that there are a number of different attributes that we can explore and methods that we can apply to the DataFrame. While we use these more often in situations when we aren’t familiar with the dataset (say from importing it from somewhere), they are nonetheless useful.

Whenever a dataset is loaded into Python as a DataFrame, it’s best to look at its structure. There are a number of different attributes that can provide that info:

DataFrame.shape — returns a tuple indicating the number of rows and columns of the DataFrame
DataFrame.size — returns an integer value of the number of datapoints in the dataset
```
print(netflix.shape)
print(netflix.size)
```
```
(7787, 12)
93444
```

If you were to explore the axes of the DataFrame, you may do so by having an array return the listed columns and index via DataFrame.columns and DataFrame.index. On the other hand, it may be useful to look at the different types of data that makes up the dataset. In these situations, the DataFrame.dtypes is used.

print(netflix.columns)
print("")
print(netflix.index)
print("")
print(netflix.dtypes)

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')
RangeIndex(start=0, stop=7787, step=1)

show_id	object
type	object
title	object
director	object
cast	object
country	object
date_added	object
release_year	int64
rating	object
duration	object
listed_in	object
description	object
dtype: object

Manipulating a DataFrame

Now that we know what DataFrame is, it’s time to do some real work! Principally, this involves manipulating it as part of the data cleaning and data wrangling process, just prior to the actual analysis. Now there are a number of basic operations that should be in everyone’s repertoire — the first one is being able to access and isolate a given segment of a DataFrame.

To segment a DataFrame, we use either the DataFrame.loc attribute or DataFrame.iloc attribute where the input dictates which rows or columns are extracted ([rows: columns]). If a column needs to be isolated, then the process would be to use square brackets with the name of the given column.

print(netflix['title'])
print("")
print(netflix.loc[0])

0	3
1	7:19
2	23:59
3	9
4	21
...
7782	Zozo
7783	Zubaan
7784	Zulu Man in Japan
7785	Zumbo's Just Desserts
7786	ZZ TOP: THAT LITTLE OL' BAND FROM TEXAS
Name: title, Length: 7787, dtype: object

show_id	s1
type	TV Show
title	3
director	NaN
cast	João Miguel, Bianca Comparato, Michel Gomes, R...
country	Brazil
date_added	14-Aug-20
release_year	2020
rating	TV-MA
duration	4 Seasons
listed_in	International TV Shows, TV Dramas, TV Sci-Fi &...
description	In a future where the elite inhabit an island ...
Name: 0, dtype: object

In the event that we need to extract multiple rows or columns, we use the slice method, which involves using a “:” that indicates a continuous range with the end range being exclusive (i.e., not included) or by inputting criteria within square brackets in a similar manner as indexing with Boolean with NumPy.

# Select the first two row of the raptors DataFrame 
raptors.iloc[0:2]

	Player	PPG	College	is_starting_five
Power Forward	Pascal Siakim	23.7	University of New Mexico	True
Point Guard	Fred VanVleet	20.1	Whicita State	True

# Select the last three columns of the raptors DataFrame
raptors.iloc[:, 1:4]

	PPG	College	is_starting_five
Power Forward	23.7	University of New Mexico	True
Point Guard	20.1	Whicita State	True
Small Forward	15.1	Florida State	True
Center	8.7	Oregon	False

# Select players that averaged more than 15 PPG on the Laker DataFrame
lakers[lakers["PPG"] > 15]

	player	PPG	in_starting_lineup
0	Lebron James	28.7	True
1	Russell Westbrook	19.3	True
2	Anthony Davis	23.3	True

While the above examples are simplistic, it's possible to make it more powerful and sophisticated with the use of operators such as AND (&), OR (|), NOT EQUAL TO (!=) or EQUAL TO (==). To set this up, let’s create a new DataFrame containing information about current UFC champions:

# DATAFRAME FOR UFC CHAMPS

ufc_champs = {
    "names" : ['Francis Ngannou', "Glover Teixeira", 
             "Israel Adesanya", "Kamaru Usman", 
             "Charles Oliveira", "Alex Volkanowski", 
             'Aljamain Sterling', "Brandon Morano", 
             "Julianna Pena", "Valentina Shevchenko", "Rose Namajunas"],
    "nicknames" : ['The Predator', None, "The Last Stylebender", 
                 "The Nigerian Nightmare", "Da Bronx", "The Great",
                "Funk Master", "The Assassin Baby", "The Venezulean Vixen",
                "The Bullet", "Thug"],
    "wins" : [16,33,21,19,31,23,20,19,11,22,11],
    "losses" : [3,7,1,1,8,1,3,5,4,3,4],
    "weightclass" : ['Heavyweight', "Light Heavyweight", "Middleweight",
                   'Welterweight', "Lightweight", "Featherweight", 
                   'Bantamweight', "Flyweight", "Bantamweight", "Flyweight", "Strawweight"]
}

ufc_champs = pd.DataFrame(ufc_champs)
ufc_champs

	names	nicknames	wins	losses	weightclass	country
0	Francis Ngannou	The Predator	16	3	Heavyweight	Cameroon
1	Glover Teixeira	None	33	7	Light Heavyweight	Brazil
2	Israel Adesanya	The Last Stylebender	21	1	Middleweight	New Zealand
3	Kamaru Usman	The Nigerian Nightmare	19	1	Welterweight	USA
4	Charles Oliveira	Da Bronx	31	8	Lightweight	Brazil
5	Alex Volkanowski	The Great	23	1	Featherweight	Australia
6	Aljamain Sterling	Funk Master	20	3	Bantamweight	USA
7	Brandon Morano	The Assassin Baby	19	5	Flyweight	Mexico
8	Julianna Pena	The Venezulean Vixen	11	4	Bantamweight	USA
9	Valentina Shevchenko	The Bullet	22	3	Flyweight	Kyrgyzstan
10	Rose Namajunas	Thug	11	4	Strawweight	USA

# Select champion that came from Brazil + is the Light Heavyweight Champ
ufc_champs[(ufc_champs['country'] == "Brazil") & (ufc_champs['weightclass'] == "Light Heavyweight")]

	names	nicknames	wins	losses	weightclass	country
1	Glover Teixeira	None	33	7	Light Heavyweight	Brazil

# Select champion(s) if home country is Brazil OR has at Less than 2 losses 
ufc_champs[(ufc_champs['country'] == "Brazil") | (ufc_champs['losses'] < 2)]

	names	nicknames	wins	losses	weightclass	country
1	Glover Teixeira	None	33	7	Light Heavyweight	Brazil
2	Israel Adesanya	The Last Stylebender	21	1	Middleweight	New Zealand
3	Kamaru Usman	The Nigerian Nightmare	19	1	Welterweight	USA
4	Charles Oliveira	Da Bronx	31	8	Lightweight	Brazil
5	Alex Volkanowski	The Great	23	1	Featherweight	Australia

# Select chamption with less than 20 wins AND home country is not USA
print(ufc_champs[(ufc_champs['wins'] < 20) & (ufc_champs["country"] != "USA")])

	names	nicknames	wins	losses	weightclass	country
0	Francis Ngannou	The Predator	16	3	Heavyweight	Cameroon
7	Brandon Morano	The Assassin Baby	19	5	Flyweight	Mexico

Aside from filtering out a DataFrame or segmenting it, it's also possible to use the DataFrame.iloc() and DataFrame.loc() attributes in changing a particular value. Let’s look at an example of how we can do this with real data. Suppose that the Lakers make a change in the starting five lineup where we replace Anthony Davis with DeAndre Jordan. We can do that with DataFrame.iloc() like this:

# The Lakers made a change in the starting 5 lineup where we replace Anthony Davis with DeAndre Jordan

lakers.iloc[2,2] = False
lakers.loc[5,"in_starting_lineup"] = True
lakers

	player	PPG	in_starting_lineup
0	Lebron James	28.7	True
1	Russell Westbrook	19.3	True
2	Anthony Davis	23.3	False
3	Dwight Howard	5.0	False
4	Avery Bradley	6.9	True
5	DeAndre Jordan	4.5	True
6	Carmelo Anthony	13.4	False
7	Austin Reaves	5.4	False
8	Kent Bazemore	4.1	False
9	Malik Monk	12.1	True
10	Stanley Johnson	6.2	False
11	Trevor Ariza	3.5	False
12	Wayne Ellington	6.4	False
13	Talen Horton-Tucker	10.9	False

In some situations, it may be necessary to insert or delete data from a DataFrame. We can insert or delete a row using the append() and drop() methods. With the append method, you would use a panda.Series object that matches the dimensions of a DataFrame as the argument for the function. For the drop method, all that we need to state is the index/column label within the DataFrame that we need to drop. Both methods contain the axis argument that specifies whether to add or drop a row or column.

Recall that we created a DataFrame consisting of UFC champions that held the title at the end of 2021 with their monikers and win-loss record:

# Say we create a DataFrame consisting of UFC champions that held the title at the end of 2021 with their monikers and win-loss record 

ufc_champs = {
    "names": ['Francis Ngannou', "Glover Teixeira", 
             "Israel Adesanya", "Kamaru Usman", 
             "Charles Oliveira", "Alex Volkanowski", 
             'Aljamain Sterling', "Brandon Morano", 
             "Julianna Pena", "Valentina Shevchenko", "Rose Namajunas"],
    "nicknames": ['The Predator', None, "The Last Stylebender", 
                 "The Nigerian Nightmare", "Da Bronx", "The Great",
                "Funk Master", "The Assassin Baby", "The Venezulean Vixen",
                "The Bullet", "Thug"],
    "wins": [16,33,21,19,31,23,20,19,11,22,11],
    "losses": [3,7,1,1,8,1,3,5,4,3,4],
    "weightclass": ['Heavyweight', "Light Heavyweight", "Middleweight",
                   'Welterweight', "Lightweight", "Featherweight", 
                   'Bantamweight', "Flyweight", "Bantamweight", "Flyweight", "Strawweight"]
}

ufc_champs = pd.DataFrame(ufc_champs)
ufc_champs

# In making this DataFrame, we forgot to include the Women’s Featherweight champ 
ufc_champs.append(pd.Series(data = ["Amanda Nunes", "Lioness", 21, 5, "Featherweight"], index = ufc_champs.columns, name = 17))

	names	nicknames	wins	losses	weightclass
0	Francis Ngannou	The Predator	16	3	Heavyweight
1	Glover Teixeira	None	33	7	Light Heavyweight
2	Israel Adesanya	The Last Stylebender	21	1	Middleweight
3	Kamaru Usman	The Nigerian Nightmare	19	1	Welterweight
4	Charles Oliveira	Da Bronx	31	8	Lightweight
5	Alex Volkanowski	The Great	23	1	Featherweight
6	Aljamain Sterling	Funk Master	20	3	Bantamweight
7	Brandon Morano	The Assassin Baby	19	5	Flyweight
8	Julianna Pena	The Venezulean Vixen	11	4	Bantamweight
9	Valentina Shevchenko	The Bullet	22	3	Flyweight
10	Rose Namajunas	Thug	11	4	Strawweight
17	Amanda Nunes	Lioness	21	5	Featherweight

ufc_champs.drop([0], axis = 0) # drops first row

	names	nicknames	wins	losses	weightclass
1	Glover Teixeira	None	33	7	Light Heavyweight
2	Israel Adesanya	The Last Stylebender	21	1	Middleweight
3	Kamaru Usman	The Nigerian Nightmare	19	1	Welterweight
4	Charles Oliveira	Da Bronx	31	8	Lightweight
5	Alex Volkanowski	The Great	23	1	Featherweight
6	Aljamain Sterling	Funk Master	20	3	Bantamweight
7	Brandon Morano	The Assassin Baby	19	5	Flyweight
8	Julianna Pena	The Venezulean Vixen	11	4	Bantamweight
9	Valentina Shevchenko	The Bullet	22	3	Flyweight
10	Rose Namajunas	Thug	11	4	Strawweight

In the case for adding a column, the process would be similar to that of adding an item into a dictionary.

weight_limit = [265,205,185,170,155,145,135,125,135,125,115]
ufc_champs['weight_limit'] = weight_limit
ufc_champs

	names	nicknames	wins	losses	weightclass	weight_limit
0	Francis Ngannou	The Predator	16	3	Heavyweight	265
1	Glover Teixeira	None	33	7	Light Heavyweight	205
2	Israel Adesanya	The Last Stylebender	21	1	Middleweight	185
3	Kamaru Usman	The Nigerian Nightmare	19	1	Welterweight	170
4	Charles Oliveira	Da Bronx	31	8	Lightweight	155
5	Alex Volkanowski	The Great	23	1	Featherweight	145
6	Aljamain Sterling	Funk Master	20	3	Bantamweight	135
7	Brandon Morano	The Assassin Baby	19	5	Flyweight	125
8	Julianna Pena	The Venezulean Vixen	11	4	Bantamweight	135
9	Valentina Shevchenko	The Bullet	22	3	Flyweight	125
10	Rose Namajunas	Thug	11	4	Strawweight	115

Sometimes with datasets, the labels used in identifying a column may not accurately describe its property. To change these labels, we can use the DataFrame.rename() method, which takes in the index argument (specifies the labels for the index in a dictionary-like format), columns argument (specifies the labels for the columns in a dictionary-like format), and the in-place argument that determines whether to return a new DataFrame.

# Let’s say we need to rename the columns of the UFC champ DataFrame by capitalizing the labels

ufc_champs.rename(columns = {'names': 'Names', 
                             'nicknames': 'Nicknames', 
                             'wins': 'Wins',  
                             'losses': 'Losses', 
                             'weightclass': 'Weight Class', 
                             'weight_limit': 'Weight Limit'})

	Names	Nicknames	Wins	Losses	Weight Class	Weight Limit
0	Francis Ngannou	The Predator	16	3	Heavyweight	265
1	Glover Teixeira	None	33	7	Light Heavyweight	205
2	Israel Adesanya	The Last Stylebender	21	1	Middleweight	185
3	Kamaru Usman	The Nigerian Nightmare	19	1	Welterweight	170
4	Charles Oliveira	Da Bronx	31	8	Lightweight	155
5	Alex Volkanowski	The Great	23	1	Featherweight	145
6	Aljamain Sterling	Funk Master	20	3	Bantamweight	135
7	Brandon Morano	The Assassin Baby	19	5	Flyweight	125
8	Julianna Pena	The Venezulean Vixen	11	4	Bantamweight	135
9	Valentina Shevchenko	The Bullet	22	3	Flyweight	125
10	Rose Namajunas	Thug	11	4	Strawweight	115

Lastly, there may be some cases when we need to reshape the current makeup of the dataset to make it suitable for data analysis. While it's certainly possible to manually remake another DataFrame, it’ll be easier to transform it. In Pandas, there are three different transformation functions that we can use to reshape the DataFrame:

Method 1 — Pivoting

This transformation is essentially taking a longer-format DataFrame and making it broader. Often this is a result of having a unique identifier repeated along multiple rows for each subsequent entry. One method to derive a newly formatted DataFrame is by using DataFrame.pivot. This method requires defining which of the data columns will be used as the new index and index as well as values for the DataFrame.

# Say we are creating a DataFrame that maps out the voting for the season MVP of the 2021-2022 NBA season from 5 different sports journalist/reporter/pundits in sports media

mvp_vote_2022 = pd.DataFrame({
    "Voter": ['Kenny Smith', 'Kenny Smith', 'Kenny Smith', 'Kenny Smith', 'Kenny Smith',
               "Charles Barkley", "Charles Barkley", "Charles Barkley", "Charles Barkley", "Charles Barkley",
               "Ernie Johnson", "Ernie Johnson", "Ernie Johnson", "Ernie Johnson", "Ernie Johnson",
               "Michael Wilbon", "Michael Wilbon", "Michael Wilbon", "Michael Wilbon", "Michael Wilbon",
               "Doris Burke", "Doris Burke", "Doris Burke", "Doris Burke", "Doris Burke"
              ], 
    "Player": ['Steph Curry', 'Lebron James', 'Chris Paul', 'Kevin Durant', 'Giannis Antetokounmpo',
                'Steph Curry', 'Nikola Jokic', 'Giannis Antetokounmpo', 'Chris Paul', 'DeMar DeRozan',
                'Steph Curry', 'Giannis Antetokounmpo', 'Kevin Durant', 'Nikola Jokic', 'Joel Embid',
                'Kevin Durant', 'Giannis Antetokounmpo', 'Nikola Jokic', 'Steph Curry', 'DeMar DeRozan',
                'Kevin Durant', 'Giannis Antetokounmpo', 'Steph Curry', 'Nikola Jokic', 'DeMar DeRozan',],

    "Placing": [1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5] 
})

# Say if we need to switch the index to indicate voters as the index values and columns with the MVP placings from each voter

mvp_vote_2022.pivot(index = 'Voter', columns = 'Placing')

Player
Placing	1	2	3	4	5
Voter
Charles Barkley	Steph Curry	Nikola Jokic	Giannis Antetokounmpo	Chris Paul	DeMar DeRozan
Doris Burke	Kevin Durant	Giannis Antetokounmpo	Steph Curry	Nikola Jokic	DeMar DeRozan
Ernie Johnson	Steph Curry	Giannis Antetokounmpo	Kevin Durant	Nikola Jokic	Joel Embid
Kenny Smith	Steph Curry	Lebron James	Chris Paul	Kevin Durant	Giannis Antetokounmpo
Michael Wilbon	Kevin Durant	Giannis Antetokounmpo	Nikola Jokic	Steph Curry	DeMar DeRozan

The DataFrame.pivot method does not allow rows with duplicate values for a given column. If this uniqueness isn’t guaranteed, an alternative approach would be to use the DataFrame.pivot_table() function instead. This approach requires arguments used to specify the index, column, and values. Uniquely, this function also has an additional argument “aggfunc” (default is numpy.mean), which passes a function to aggregate the values of a DataFrame.

# Create a new DataFrame that shows the median placing for NBA MVP 
mvp_vote_2022.pivot_table(index = 'Player', values = 'Placing', aggfunc = np.median).sort_values(by= 'Placing') 

# the .sort_values() method is used to arrange the DataFrame by some existing variable in either ascending or descending order

Placing
Player
Steph Curry	1.0
Giannis Antetokounmpo	2.0
Kevin Durant	2.0
Lebron James	2.0
Chris Paul	3.5
Nikola Jokic	3.5
DeMar DeRozan	5.0
Joel Embid	5.0

Method 2 — Stacking/Unstacking

Sometimes a DataFrame may have multiple indices that’ll look something like this:

array = [["Month 1",  "Month 2", "Month  3", "Month 4"], 
        ['Squat', "Squat", "Squat", "Squat"]]

array2 = [['Arnold', "Arnold","Larry", "Larry"], 
          ["Before", "After", "Before", "After"]]

values = [[225, 245, 275, 315], [335, 355, 365, 405], [315, 365, 435, 455], [495, 405, 545, 585]]

values = np.array(values)

effect_of_ped = pd.DataFrame(data = values, index = array2, columns = array)
effect_of_ped

		Month 1	Month 2	Month 3	Month 4
		Squat	Squat	Squat	Squat
Arnold	Before	225	245	275	315
Arnold	After	335	355	365	405
Larry	Before	315	365	435	455
Larry	After	495	405	545	585

It's often difficult to make sense of the data or address it for analysis. So, functions such as stack() or unstack() make it possible to make it longer or broader, respectively.

effect_of_ped.stack()

			Month 3	Month 1	Month 2	Month 4
Arnold	Before	Squat	275	225	245	315
Arnold	After	Squat	365	335	355	405
Larry	Before	Squat	435	315	365	455
Larry	After	Squat	545	495	405	585

effect_of_ped.unstack()

	Month 1		Month 2		Month 3		Month 4
	Squat		Squat		Squat		Squat
	After	Before	After	Before	After	Before	After	Before
Arnold	335	225	355	245	365	275	405	315
Larry	495	315	405	365	545	435	585	455

Method 3 — Melting

Known as unpivoting a DataFrame, this works by essentially converting a wide-format DataFrame to a long-format. This usually occurs when more than one column works as an identifier for a given analysis. In order to transform the DataFrame to a longer format, we’ll need to use the DataFrame.melt() function, which requires establishing which columns are to be used as the identifier variable and the columns to “unpivot” to correspond to the values for said identifier.

data = pd.DataFrame({"Name": ["Pankaj", "Lisa", "David"], "ID": [1, 2, 3], "Role": ["CEO", "Editor", "Author"]})
data

# Let's say the identifier is ID and the values would be Name + Role

pd.melt(data, id_vars = ["ID"], value_vars = ['Name', 'Role'])

	ID	variable	value
0	1	Name	Pankaj
1	2	Name	Lisa
2	3	Name	David
3	1	Role	CEO
4	2	Role	Editor
5	3	Role	Author

So far, we’ve only scratched the surface of DataFrames. There are many more functions and methods that can operate on these data structures within Python to gain deeper insights into your data. You can find these in the Pandas DataFrame reference guide. However, a great place to start is with the Pandas and NumPy Fundamentals course on Dataquest. Once you know the fundamentals, progress to working with data in Python in some of the other courses in the Data Analyst career path.

Tutorials

Tutorial: How to Create and Use a Pandas DataFrame

What’s a DataFrame?

How do you make a DataFrame?

Method 1 — Import Data from a File

METHOD 2 – Creating DataFrames Yourself

Exploring a DataFrame

Manipulating a DataFrame

Method 1 — Pivoting

Method 2 — Stacking/Unstacking

Method 3 — Melting

Python for Loop: A Beginner’s Tutorial

Essential Tools Every Python User Should Know

Tutorial: How to Create and Use a Pandas DataFrame

What’s a DataFrame?

How do you make a DataFrame?

Method 1 — Import Data from a File

METHOD 2 – Creating DataFrames Yourself

Exploring a DataFrame

Manipulating a DataFrame

Method 1 — Pivoting

Method 2 — Stacking/Unstacking

Method 3 — Melting

More learning resources

Python for Loop: A Beginner’s Tutorial

Essential Tools Every Python User Should Know