How to Easily Remove Duplicates from a Python List (2023)
A Python list is an ordered, zero-indexed, and mutable collection of objects. We create it by placing objects of similar or different data types inside square brackets separated by commas. For a refresher on how a Python List works, kindly refer to this DataQuest tutorial.
Removing duplicates from a list is an important data preprocessing step — for example, identifying the unique customers who purchased from a gift shop in the past month for promotional offers. Several customers may have bought gifts from the shop more than once in the past month, and their names will appear as many times as they visited the shop.
In this tutorial, we'll learn the different methods for removing duplicates from a Python List.
1. Using the del keyword
We use the del keyword to delete objects from a list by their index position. This method suits small lists with few duplicate elements. For example, a class of six students was asked about their favorite programming language, and their responses were saved in the students list. Several students preferred the same programming language, so the students list contains duplicates, which we'll remove using the del keyword.
students = ['Python', 'R', 'C#', 'Python', 'R', 'Java']
# Remove the Python duplicate with its index number: 3
del students[3]
print(students)
['Python', 'R', 'C#', 'R', 'Java']
# Remove the R duplicate with its index number: 3
del students[3]
print(students)
['Python', 'R', 'C#', 'Java']
We successfully removed the duplicates from the list. But why did we use index 3 twice? The len of the original students list is 6. A Python list is zero-indexed: the first element has an index of 0, and the last element has an index of 5. The duplicate 'Python' has an index of 3. After we delete the duplicate 'Python', the len of the students list is reduced by 1, and every element after the deleted one shifts down by one position. This is why the index of the duplicate 'R' changes from 4 to 3. The disadvantage of this method is that we have to keep track of the duplicates' indices, which keep changing. This would be difficult for a very large list.
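One way to sidestep the shifting indices, sketched below, is to first collect the positions of every repeated element and then delete from the end of the list backwards, so that each del leaves the earlier positions untouched. The names seen and duplicate_indices are illustrative and not part of the original example.

```python
students = ['Python', 'R', 'C#', 'Python', 'R', 'Java']

# Collect the index of every element that has already appeared earlier
seen = set()
duplicate_indices = []
for index, language in enumerate(students):
    if language in seen:
        duplicate_indices.append(index)
    else:
        seen.add(language)

# Delete from the highest index down so the remaining indices stay valid
for index in reversed(duplicate_indices):
    del students[index]

print(duplicate_indices)
print(students)
```

Deleting in reverse means the indices recorded in the first pass never need adjusting.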
Next, we'll remove duplicates from a list more efficiently using a for-loop.
2. Using for-loop
We use a for-loop to iterate over an iterable: for example, a Python list. For a refresher on how a for-loop works, kindly refer to this for-loop tutorial on the DataQuest blog.
To remove duplicates using a for-loop, first create a new empty list. Then iterate over the elements in the list containing duplicates and append only the first occurrence of each element to the new list. The code below shows how to use a for-loop to remove duplicates from the students list.
# Using for-loop
students = ['Python', 'R', 'C#', 'Python', 'R', 'Java']
new_list = []
for one_student_choice in students:
    if one_student_choice not in new_list:
        new_list.append(one_student_choice)
print(new_list)
['Python', 'R', 'C#', 'Java']
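Voilà! We successfully removed the duplicates without having to keep track of the elements' indices. Because nothing is tracked by hand, this method also works on a large list, although the not in check rescans new_list for every element, so it slows down as the number of unique values grows. However, this required a lot of code. There should be a simpler way to do this. Any guesses?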
List comprehension! We'll simplify the above code using list comprehension in the next example:
# Using list comprehension
new_list = []
[new_list.append(item) for item in students if item not in new_list]
print(new_list)
['Python', 'R', 'C#', 'Java']
We got the job done with fewer lines of code, though note that the comprehension is used here only for its side effect: the list it builds (full of None values) is discarded. We can combine a for-loop with the enumerate and zip functions to write more exotic code to remove duplicates. The idea behind such code is the same as in the examples shown above.
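For long lists, the repeated not in new_list check can become the bottleneck, because it scans new_list each time. A minimal sketch of one common workaround, assuming every element is hashable, keeps a set alongside the list purely for fast membership tests. The function name remove_duplicates is hypothetical.

```python
def remove_duplicates(items):
    """Return a new list with the first occurrence of each element, in order."""
    seen = set()   # fast membership checks
    unique = []    # preserves first-seen order
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


students = ['Python', 'R', 'C#', 'Python', 'R', 'Java']
print(remove_duplicates(students))  # ['Python', 'R', 'C#', 'Java']
```

The original list is left unchanged; the function returns a new list.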
Next, we will see how to remove duplicates from a list without writing a loop ourselves, using a set.
3. Using set
Sets in Python are unordered collections of unique elements. By their nature, duplicates aren't allowed. Therefore, converting a list into a set removes the duplicates. Changing the set into a list yields a new list without duplicates.
The following example shows how to remove duplicates from the students list using set.
# Removing duplicates by first changing to a set and then back to a list
new_list = list(set(students))
print(new_list)
['R', 'Python', 'Java', 'C#']
Notice that the order of the elements in the list is different from our previous examples. This is because a set doesn't preserve order, so the order you see on your machine may also differ from the one shown here. Next, we'll see how to remove duplicates from a list using a dictionary.
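If the original order matters, one option (a sketch, not the only approach) is to sort the set by each element's first position in the original list. It relies on list.index, which returns the index of the first match. Because list.index scans the list on every call, this is best kept for short lists.

```python
students = ['Python', 'R', 'C#', 'Python', 'R', 'Java']

# set() drops the duplicates; sorting by first position restores the order
ordered_unique = sorted(set(students), key=students.index)
print(ordered_unique)  # ['Python', 'R', 'C#', 'Java']
```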
4. Using dict
A Python dictionary is a collection of key-value pairs with the requirement that keys must be unique. So, we'll remove the duplicates in a Python list if we can make the elements of the list the keys of a dictionary. We cannot convert the simple students list directly to a dictionary because a dictionary is created from key-value pairs. We get the following error if we try to convert the students list to a dictionary:
# We get ValueError when we try to convert a simple list into a dictionary
print(dict(students))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-43bfe4b3db83> in <module>
1 # We get ValueError when we try to convert a simple list into a dictionary
2
----> 3 dict(students)
ValueError: dictionary update sequence element #0 has length 6; 2 is required
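The error message says that element #0, the string 'Python', has length 6 where a key-value pair of length 2 is required. The short sketch below illustrates this with made-up inputs: dict() accepts any iterable whose items each unpack into exactly two parts, which is why a list of two-character strings happens to convert while a list of longer strings does not.

```python
# Each item must unpack into exactly two parts: a key and a value
pairs = dict([('Python', None), ('R', None)])
print(pairs)  # {'Python': None, 'R': None}

# Two-character strings unpack into (first character, second character)
two_chars = dict(['ab', 'cd'])
print(two_chars)  # {'a': 'b', 'c': 'd'}

# A six-character string cannot be unpacked into a pair
try:
    dict(['Python'])
except ValueError as error:
    message = str(error)
    print(message)
```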
However, we can create a dictionary from a list of tuples, after which we'll get the unique keys of the dictionary and convert them into a list. A concise way of getting the list of tuples from the students list is using the map function:
# Convert the students list into a list of tuples
list_of_tuples = list(map(lambda x: (x, None), students))
print(list_of_tuples, end='\n\n')
[('Python', None), ('R', None), ('C#', None), ('Python', None), ('R', None), ('Java', None)]
In the above code block, every element in the students list is passed through the lambda function to create a tuple, (element, None). When the list of tuples is changed into a dictionary, the first element in each tuple is the key and the second element is the value. The unique keys are then retrieved from the dictionary with the keys() method and changed into a list:
# Convert list of tuples into a dictionary
dict_students = dict(list_of_tuples)
print('The resulting dictionary from the list of tuples:')
print(dict_students, end='\n\n')
The resulting dictionary from the list of tuples:
{'Python': None, 'R': None, 'C#': None, 'Java': None}
# Get the unique keys from the dictionary and convert into a list
new_list = list(dict_students.keys())
print('The new list without duplicates:')
print(new_list, end='\n\n')
The new list without duplicates:
['Python', 'R', 'C#', 'Java']
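The same dictionary can also be built without map and lambda. The sketch below uses a dict comprehension, which some readers find easier to follow. It relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 onward.

```python
students = ['Python', 'R', 'C#', 'Python', 'R', 'Java']

# Each element becomes a key; a repeated key simply overwrites the same entry
dict_students = {language: None for language in students}
new_list = list(dict_students.keys())
print(new_list)  # ['Python', 'R', 'C#', 'Java']
```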
The dict.fromkeys() method does the work of both steps in one go: it builds a dictionary directly from the list, using each element as a key and None as the default value. We can then get the unique dictionary keys and convert them to a list. Earlier, we used the dict.keys() method to get the unique keys from the dictionary before converting to a list. This isn't really necessary. By default, operations on a dictionary like iteration and converting to a list use the dictionary keys.
# Using the dict.fromkeys() method to get a dictionary from a list
new_dict_students = dict.fromkeys(students)
print('The resulting dictionary from the dict.fromkeys():')
print(new_dict_students, end='\n\n')
print('The new list without duplicates using dict.fromkeys():')
print(list(new_dict_students), end='\n\n')
The resulting dictionary from the dict.fromkeys():
{'Python': None, 'R': None, 'C#': None, 'Java': None}
The new list without duplicates using dict.fromkeys():
['Python', 'R', 'C#', 'Java']
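Like set elements, dictionary keys must be hashable, so dict.fromkeys() raises a TypeError for a list of lists. A possible workaround, shown here with a made-up pairs list, is to convert each inner list to a tuple first.

```python
pairs = [['Python', 3], ['R', 4], ['Python', 3]]

try:
    dict.fromkeys(pairs)
    failed = False
except TypeError:
    # Lists are mutable and therefore unhashable
    failed = True

# Tuples are hashable, so they can serve as dictionary keys
unique_pairs = [list(pair) for pair in dict.fromkeys(map(tuple, pairs))]
print(unique_pairs)  # [['Python', 3], ['R', 4]]
```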
5. Using Counter and FreqDist
We can remove duplicates from a Python list using dictionary subclasses like Counter and FreqDist. Both Counter and FreqDist work in the same way. They are collections wherein the unique elements are the dictionary keys and the counts of their occurrences are the values. As in a dictionary, the list without duplicates comes from the dictionary keys.
# Using the dict subclass FreqDist
from nltk.probability import FreqDist
freq_dict = FreqDist(students)
print('The tabulate key-value pairs from FreqDist:')
freq_dict.tabulate()
print('\nThe new list without duplicates:')
print(list(freq_dict), end='\n\n')
The tabulate key-value pairs from FreqDist:
Python R C# Java
2 2 1 1
The new list without duplicates:
['Python', 'R', 'C#', 'Java']
# Using the dict subclass Counter
from collections import Counter
counter_dict = Counter(students)
print(counter_dict, end='\n\n')
print('The new list without duplicates:')
print(list(counter_dict), end='\n\n')
Counter({'Python': 2, 'R': 2, 'C#': 1, 'Java': 1})
The new list without duplicates:
['Python', 'R', 'C#', 'Java']
6. Using pd.unique and np.unique
Both pd.unique and np.unique take a list with duplicates and return an array of the unique elements in the list. The resulting arrays are then converted to lists. While np.unique sorts the unique elements in ascending order, pd.unique maintains the order in which the elements first appear in the list.
import numpy as np
import pandas as pd
print('The new list without duplicates using np.unique():')
print(list(np.unique(students)), end='\n\n')
print('\nThe new list without duplicates using pd.unique():')
print(list(pd.unique(students)), end='\n\n')
The new list without duplicates using np.unique():
['C#', 'Java', 'Python', 'R']
The new list without duplicates using pd.unique():
['Python', 'R', 'C#', 'Java']
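When NumPy or pandas is not installed, the two orderings can be approximated with the standard library alone. This is only a sketch of the ordering behavior for a list of strings, not a replacement for either function: sorted(set(...)) gives ascending order as np.unique does, and dict.fromkeys() keeps first-seen order as pd.unique does.

```python
students = ['Python', 'R', 'C#', 'Python', 'R', 'Java']

sorted_unique = sorted(set(students))              # ascending order
first_seen_unique = list(dict.fromkeys(students))  # order of first appearance

print(sorted_unique)      # ['C#', 'Java', 'Python', 'R']
print(first_seen_unique)  # ['Python', 'R', 'C#', 'Java']
```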
Application: Gift Shop Revisited
In this section, we'll revisit our gift shop illustration. The gift shop is in a neighborhood of 50 people. An average of 10 people purchase from the shop every day, and the shop is open 10 days a month. You received a list of lists containing the names of the customers who purchased from the shop in the previous month, and your task is to get the names of the unique customers for a promotional offer.
# Install the names package
!pip install names
# Get package to generate names
import names
import random
random.seed(43)
# Generate names for 50 people in the neighbourhood
names_neighbourhood = [names.get_full_name() for _ in range(50)]
# Import the function that randomly selects names
from random import choices
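# Customers do not have equal probabilities of purchasing from the shop
# Note: random.choices expects non-negative weights; the original range is kept so the output below is unchanged
weights = [random.randint(-20, 20) for _ in range(50)]
# Randomly generate 10 customers per day that purchased from the store, for 10 days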
customers_month = [choices(names_neighbourhood, weights=weights, k=10) for _ in range(10)]
In the above code block, we randomly generated the customers who purchased from the store in the past month as a list of lists. We want to get the unique customers who purchased from the store during the month, so we'll create the get_customer_names function for this task.
We have included the optional input and output data types. Python is a dynamically typed programming language and handles types implicitly at runtime. However, it's useful to show the input and output data types for complicated data structures. Alternatively, we can describe the input and output of the function with a docstring. customers_purchases: List[List[str]] tells us that the customers_purchases parameter is a list that contains lists of strings, and -> List[Tuple[str, int]] tells us that the function returns a list that contains tuples with a string and an integer.
from collections import Counter
from typing import List, Tuple
def get_customer_names(customers_purchases: List[List[str]]) -> List[Tuple[str, int]]:
    # Get a single list of all the customers' names from the list of lists
    full_list_month = []
    for a_day_purchase in customers_purchases:
        full_list_month.extend(a_day_purchase)
    # Count each customer's purchases, ordered from most to least frequent
    return Counter(full_list_month).most_common()
customers_list_tuples = get_customer_names(customers_month)
customers_list_tuples
[('Nicole Moore', 14),
('Diane Paredes', 13),
('Mathew Jacobs', 11),
('Katherine Piazza', 10),
('Alvin Cloud', 8),
('Robert Mcadams', 8),
('Roger Lee', 8),
('Becky Hubert', 7),
('Paul Frisch', 7),
('Danielle Mccormick', 5),
('Donna Salvato', 3),
('Sally Thompson', 2),
('Franklin Copeland', 2),
('Linda Sample', 2)]
The names of the customers were randomly generated, so yours may be different if you don't use the same seed value. The names of the unique customers in customers_list_tuples come from first converting the list of tuples to a dictionary and then converting the dictionary keys to a list:
# Unique customers for the previous month
list(dict(customers_list_tuples))
['Nicole Moore',
'Diane Paredes',
'Mathew Jacobs',
'Katherine Piazza',
'Alvin Cloud',
'Robert Mcadams',
'Roger Lee',
'Becky Hubert',
'Paul Frisch',
'Danielle Mccormick',
'Donna Salvato',
'Sally Thompson',
'Franklin Copeland',
'Linda Sample']
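If the purchase counts are not needed, the unique names can be obtained more directly. The sketch below is one possible variant, with a hypothetical function name get_unique_customers and a small made-up purchase log in place of the generated data. It flattens the daily lists with itertools.chain.from_iterable and documents the input and output in a docstring.

```python
from itertools import chain
from typing import List


def get_unique_customers(customers_purchases: List[List[str]]) -> List[str]:
    """Return each customer's name once, in order of first purchase.

    customers_purchases: one inner list of customer names per trading day.
    """
    all_names = chain.from_iterable(customers_purchases)
    return list(dict.fromkeys(all_names))


# Made-up example data: three days of purchases
sample_month = [['Ada', 'Bo'], ['Bo', 'Cy'], ['Ada', 'Di']]
print(get_unique_customers(sample_month))  # ['Ada', 'Bo', 'Cy', 'Di']
```

Unlike the Counter-based version, this returns names in order of first appearance rather than by purchase frequency.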
Conclusion
In this tutorial, we learned how to remove duplicates from a Python list. We learned how to remove duplicates with the del keyword for small lists. For larger lists, we saw that the for-loop and list comprehension methods were more efficient than using the del keyword. Furthermore, we learned that the values of a set and the keys of a dictionary are unique, which makes them suitable for removing duplicates from a list. Finally, we learned that dictionary subclasses remove duplicates from a list in much the same way as a dictionary, and we saw the NumPy and pandas methods for getting unique elements from a list.