September 12, 2022

How to Easily Remove Duplicates from a Python List (2022)


A Python list is an ordered, zero-indexed, and mutable collection of objects. We create it by placing objects of similar or different data types inside square brackets, separated by commas. For a refresher on how a Python list works, refer to this DataQuest tutorial.

Removing duplicates from a list is an important data preprocessing step — for example, identifying the unique customers who purchased from a gift shop in the past month for promotional offers. Several customers may have bought gifts from the shop more than once in the past month, and their names will appear as many times as they visited the shop.

In this tutorial, we’ll learn the different methods for removing duplicates from a Python List.


1. Using the del keyword

We use the del keyword to delete objects from a list by their index position. This method is practical when the list is small and there aren’t many duplicate elements. For example, six students in a class were asked about their favorite programming language, and their responses were saved in the students list. Several students preferred the same programming language, so we have duplicates in the students list that we’ll remove using the del keyword.

students = ['Python', 'R', 'C#', 'Python', 'R', 'Java']

# Remove the `Python` duplicate with its index number: 3
del students[3]

print(students)
    ['Python', 'R', 'C#', 'R', 'Java']

# Remove the `R` duplicate with its new index number: 3
del students[3]

print(students)
    ['Python', 'R', 'C#', 'Java']

We successfully removed the duplicates from the list. But why did we use index 3 twice? The original students list has a length of 6. A Python list is zero-indexed, so the first element has an index of 0 and the last element has an index of 5. The duplicate 'Python' has an index of 3. After we delete it, the length of the students list drops by 1, and every element after the deleted one shifts one position to the left. This is why the index of the duplicate 'R' changes from 4 to 3. The disadvantage of this method is that we have to keep track of the duplicates’ indices, which keep changing. This would be difficult for a very large list.
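
One way to sidestep the shifting indices described above is to delete from the highest index down. The following is a minimal sketch, assuming we already know the duplicates' positions in the original list; the duplicate_indices name is introduced here for illustration only.

```python
students = ['Python', 'R', 'C#', 'Python', 'R', 'Java']

# Indices of the duplicates in the ORIGINAL list (found by inspection)
duplicate_indices = [3, 4]

# Delete from the highest index down, so that a deletion never shifts
# the position of an item that is still waiting to be deleted
for index in sorted(duplicate_indices, reverse=True):
    del students[index]

print(students)
```

Because only elements to the right of a deleted item move, the original index numbers stay valid for every remaining deletion.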

Next, we’ll remove duplicates from a list more efficiently using a for-loop.


2. Using for-loop

We use a for-loop to iterate over an iterable: for example, a Python list. For a refresher on how a for-loop works, refer to this for-loop tutorial on the DataQuest blog.

To remove duplicates using a for-loop, first you create a new empty list. Then, you iterate over the elements in the list containing duplicates and append only the first occurrence of each element to the new list. The code below shows how to use a for-loop to remove duplicates from the students list.

# Using for-loop
students = ['Python', 'R', 'C#', 'Python', 'R', 'Java']

new_list = []

for one_student_choice in students:
    if one_student_choice not in new_list:
        new_list.append(one_student_choice)

print(new_list)
    ['Python', 'R', 'C#', 'Java']

Voilà! We successfully removed the duplicates without having to keep track of the elements’ indices. This method scales to lists that are too large to clean up by hand, although each not in check scans new_list, so it slows down as the number of unique elements grows. It also required several lines of code. There should be a simpler way to do this. Any guesses?

List comprehension! We’ll simplify the above code using list comprehension in the next example:

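# Using list comprehension
new_list = []

# The comprehension is used only for its side effect (appending to new_list);
# the list of None values that it builds is discarded
[new_list.append(item) for item in students if item not in new_list]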

print(new_list)
    ['Python', 'R', 'C#', 'Java']

We got the job done with fewer lines of code. We can also combine a for-loop with the enumerate and zip functions to write more exotic code that removes duplicates. The idea behind how such code works is the same as in the examples shown above.
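
As an illustration of the enumerate idea mentioned above, here is one possible sketch rather than a canonical recipe. It also shows a variant that tracks the items already seen in a set, which is an addition to the approaches in this tutorial and assumes the elements are hashable. The names unique_enumerate, seen, and unique_seen are made up for this example.

```python
students = ['Python', 'R', 'C#', 'Python', 'R', 'Java']

# enumerate gives each item together with its index; keep an item
# only if it does not appear anywhere earlier in the list
unique_enumerate = [
    item for index, item in enumerate(students)
    if item not in students[:index]
]

# Variant: remember the items already seen in a set, because membership
# checks on a set are fast on average, unlike scanning a list
seen = set()
unique_seen = []
for item in students:
    if item not in seen:
        seen.add(item)
        unique_seen.append(item)

print(unique_enumerate)
print(unique_seen)
```

Both versions keep the first occurrence of each element and preserve the original order.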

Next, we will see how to remove duplicates from a list without iterating using a set.


3. Using set

Sets in Python are unordered collections of unique elements. By their nature, duplicates aren’t allowed. Therefore, converting a list into a set removes the duplicates. Changing the set into a list yields a new list without duplicates.

The following example shows how to remove duplicates from the students list using set.

# Removing duplicates by first changing to a set and then back to a list
new_list = list(set(students))

print(new_list)
    ['R', 'Python', 'Java', 'C#']

Notice that the order of the elements in the list is different from our previous examples. This is because a set doesn’t preserve order. Next, we’ll see how to remove duplicates from a list using a dictionary.
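
Before moving on: if the original order matters, the set result can be sorted by each element's first position in the original list. This is a small sketch of that idea, not the only way to do it; unordered_unique and ordered_unique are illustrative names.

```python
students = ['Python', 'R', 'C#', 'Python', 'R', 'Java']

# Converting to a set removes duplicates but loses the order
unordered_unique = set(students)

# Sort the unique elements by the index of their first occurrence
# in the original list to restore the original order
ordered_unique = sorted(unordered_unique, key=students.index)

print(ordered_unique)
```

Note that students.index scans the list on every call, so this is convenient for short lists but not the fastest choice for long ones.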


4. Using dict

A Python dictionary is a collection of key-value pairs with the requirement that keys must be unique. So, we’ll remove the duplicates in a Python list if we can make the elements of the list the keys of a dictionary. We cannot convert the simple students list to a dictionary directly because a dictionary is created from key-value pairs. We get the following error if we try to convert the students list to a dictionary:

# We get ValueError when we try to convert a simple list into a dictionary

print(dict(students))
    ---------------------------------------------------------------------------

 ValueError                                Traceback (most recent call last)

<ipython-input-6-43bfe4b3db83> in <module>
1 # We get ValueError when we try to convert a simple list into a dictionary
2 
----> 3 dict(students)

ValueError: dictionary update sequence element #0 has length 6; 2 is required
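
The message says that element #0 has length 6 because dict() tried to unpack the six-character string 'Python' into a key and a value. The short sketch below reproduces the error and shows, purely as an illustration, that strings of exactly two characters do unpack into a pair. The names error_message and pairs are made up for this example.

```python
students = ['Python', 'R', 'C#', 'Python', 'R', 'Java']

# dict() expects each element to be a pair (key, value);
# 'Python' has six characters, so it cannot be unpacked into two parts
try:
    dict(students)
except ValueError as error:
    error_message = str(error)

print(error_message)

# Each two-character string unpacks into a key and a value
pairs = dict(['ab', 'cd'])
print(pairs)
```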

However, we can create a dictionary from a list of tuples, after which we’ll get the unique keys of the dictionary and convert them into a list. A concise way of getting the list of tuples from the students list is using the map function:

# Convert `students` list into a list of tuples
list_of_tuples = list(map(lambda x: (x, None), students))

print(list_of_tuples, end='\n\n')
    [('Python', None), ('R', None), ('C#', None), ('Python', None), ('R', None), ('Java', None)]

In the above code block, every element in the students list is passed through the lambda function to create a tuple, (element, None). When the list of tuples is changed into a dictionary, the first element in each tuple is the key and the second element is the value. The unique keys are then retrieved from the dictionary with the keys() method and changed into a list:

# Convert list of tuples into a dictionary
dict_students = dict(list_of_tuples)

print('The resulting dictionary from the list of tuples:')
print(dict_students, end='\n\n')
    The resulting dictionary from the list of tuples:
    {'Python': None, 'R': None, 'C#': None, 'Java': None}

# Get the unique keys from the dictionary and convert into a list
new_list = list(dict_students.keys())

print('The new list without duplicates:')
print(new_list, end='\n\n')
    The new list without duplicates:
    ['Python', 'R', 'C#', 'Java']
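
The map and lambda steps above can also be written as a dictionary comprehension. This is an alternative sketch of the same idea, not a replacement for the code above; the loop variable language is an illustrative name.

```python
students = ['Python', 'R', 'C#', 'Python', 'R', 'Java']

# Build the dictionary directly: each element becomes a key with the value None.
# A repeated key simply overwrites the earlier entry, so duplicates disappear
dict_students = {language: None for language in students}

# Get the unique keys from the dictionary and convert them into a list
new_list = list(dict_students.keys())

print(new_list)
```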

The dict.fromkeys() method does this in one go: it creates a dictionary whose keys are the elements of the list, with every value set to None by default. We can then get the unique dictionary keys and convert them to a list. In the previous example, we used the dict.keys() method to get the unique keys before converting them to a list. This isn’t really necessary, because operations on a dictionary like iteration and conversion to a list use the dictionary keys by default.

# Using dict.fromkeys() methods to get a dictionary from a list
new_dict_students = dict.fromkeys(students)

print('The resulting dictionary from the dict.fromkeys():')
print(new_dict_students, end='\n\n')

print('The new list without duplicates using dict.fromkeys():')
print(list(new_dict_students), end='\n\n')
    The resulting dictionary from the dict.fromkeys():
    {'Python': None, 'R': None, 'C#': None, 'Java': None}

    The new list without duplicates using dict.fromkeys():
    ['Python', 'R', 'C#', 'Java']
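
The dict.fromkeys() approach is short enough to wrap in a small reusable helper. The function name remove_duplicates is hypothetical and only used for this sketch. It relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 onward, and it assumes the list elements are hashable.

```python
def remove_duplicates(items):
    """Return a new list with duplicates removed, keeping first occurrences in order."""
    # dict.fromkeys keeps one key per distinct element, in insertion order
    return list(dict.fromkeys(items))


students = ['Python', 'R', 'C#', 'Python', 'R', 'Java']
unique_students = remove_duplicates(students)
unique_numbers = remove_duplicates([3, 1, 3, 2, 1])

print(unique_students)
print(unique_numbers)
```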

5. Using Counter and FreqDist

We can remove duplicates from a Python list using dictionary subclasses like Counter and FreqDist. Both Counter and FreqDist work in the same way. They are collections in which the unique elements are the dictionary keys and the counts of their occurrences are the values. As with a dictionary, the list without duplicates comes from the dictionary keys.

# Using the dict subclass FreqDist
from nltk.probability import FreqDist

freq_dict = FreqDist(students)
print('The tabulate key-value pairs from FreqDist:')
freq_dict.tabulate()

print('\nThe new list without duplicates:')
print(list(freq_dict), end='\n\n')
    The tabulate key-value pairs from FreqDist:
    Python      R     C#   Java 
         2      2      1      1 

    The new list without duplicates:
    ['Python', 'R', 'C#', 'Java']

# Using the dict subclass Counter
from collections import Counter

counter_dict = Counter(students)
print(counter_dict, end='\n\n')

print('The new list without duplicates:')
print(list(counter_dict), end='\n\n')
    Counter({'Python': 2, 'R': 2, 'C#': 1, 'Java': 1})

    The new list without duplicates:
    ['Python', 'R', 'C#', 'Java']
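
Because a Counter also stores how often each element occurs, it can tell us which elements were duplicated in the first place. The sketch below is an illustrative extension of the example above; the names duplicated and appeared_once are made up for it.

```python
from collections import Counter

students = ['Python', 'R', 'C#', 'Python', 'R', 'Java']
counter_dict = Counter(students)

# Elements that occur more than once in the original list
duplicated = [language for language, count in counter_dict.items() if count > 1]

# Elements that occur exactly once
appeared_once = [language for language, count in counter_dict.items() if count == 1]

print(duplicated)
print(appeared_once)
```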

6. Using pd.unique and np.unique

Both pd.unique and np.unique take a list with duplicates and return an array of the unique elements in the list. We then convert the resulting arrays to lists. While np.unique sorts the unique elements in ascending order, pd.unique maintains the order in which the elements first appear in the list.

import numpy as np
import pandas as pd

print('The new list without duplicates using np.unique():')
# .tolist() converts the NumPy array into a list of plain Python strings
print(np.unique(students).tolist(), end='\n\n')

print('\nThe new list without duplicates using pd.unique():')
print(list(pd.unique(students)), end='\n\n')
    The new list without duplicates using np.unique():
    ['C#', 'Java', 'Python', 'R']

    The new list without duplicates using pd.unique():
    ['Python', 'R', 'C#', 'Java']
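
For readers who don't have NumPy or pandas installed, the two orderings shown above can be mimicked with the standard library. This is only a rough analogy for small lists of strings, not a description of how either library works internally; sorted_unique and ordered_unique are illustrative names.

```python
students = ['Python', 'R', 'C#', 'Python', 'R', 'Java']

# Sorted unique elements, similar in result to np.unique
sorted_unique = sorted(set(students))

# Unique elements in order of first appearance, similar in result to pd.unique
ordered_unique = list(dict.fromkeys(students))

print(sorted_unique)
print(ordered_unique)
```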

Application: Gift Shop Revisited

In this section, we’ll revisit our gift shop illustration. The gift shop is in a neighborhood of 50 people. An average of 10 people purchase from the shop every day, and the shop is open 10 days a month. You received a list of lists containing the names of the customers who purchased from the shop in the previous month, and your task is to get the names of the unique customers for a promotional offer.

# Install the `names` package
!pip install names
# Get package to generate names
import names
import random

random.seed(43)

# Generate names for 50 people in the neighbourhood
names_neighbourhood = [names.get_full_name() for _ in range(50)]

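# Import the function that randomly selects names
from random import choices

# Customers do not have equal probabilities of purchasing from the shop
# Note: random.choices assumes non-negative weights, so a range such as
# randint(0, 20) would be safer; it is left unchanged here because
# changing it would change the names sampled below
weights = [random.randint(-20, 20) for _ in range(50)]

# Randomly generate 10 customers per day who purchased from the store, for 10 days
customers_month = [choices(names_neighbourhood, weights=weights, k=10) for _ in range(10)]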

In the above code block, we randomly generated the customers who purchased from the store in the past month as a list of lists, with one inner list per day. We want to get the unique customers who purchased from the store in that month, so we’ll create the get_customer_names function for this task.

We have included the optional input and output data types. Python is a dynamically typed programming language and implicitly handles this during runtime. However, it’s useful to show the input and output data types for complicated data structures. Alternatively, we can describe the input and output of the function with a docstring. customers_purchases: List[List[str]] tells us that the customers_purchases parameter is a list that contains lists with strings, and the -> List[Tuple[str, int]] tells us that the function returns a list that contains tuples with a string and an integer.

from typing import List, Tuple

def get_customer_names(customers_purchases: List[List[str]]) -> List[Tuple[str, int]]:

    # Get a single list of all the customers' names from the list of lists
    full_list_month = []
    for a_day_purchase in customers_purchases:
        full_list_month.extend(a_day_purchase)

    return Counter(full_list_month).most_common()

customers_list_tuples = get_customer_names(customers_month)
customers_list_tuples
    [('Nicole Moore', 14),
     ('Diane Paredes', 13),
     ('Mathew Jacobs', 11),
     ('Katherine Piazza', 10),
     ('Alvin Cloud', 8),
     ('Robert Mcadams', 8),
     ('Roger Lee', 8),
     ('Becky Hubert', 7),
     ('Paul Frisch', 7),
     ('Danielle Mccormick', 5),
     ('Donna Salvato', 3),
     ('Sally Thompson', 2),
     ('Franklin Copeland', 2),
     ('Linda Sample', 2)]

The names of the customers were randomly generated, so yours may be different if you don’t use the same seed value. The names of the unique customers in customers_list_tuples come from first converting the list of tuples to a dictionary and then converting the dictionary keys to a list:

# Unique customers for the previous month

list(dict(customers_list_tuples))
    ['Nicole Moore',
     'Diane Paredes',
     'Mathew Jacobs',
     'Katherine Piazza',
     'Alvin Cloud',
     'Robert Mcadams',
     'Roger Lee',
     'Becky Hubert',
     'Paul Frisch',
     'Danielle Mccormick',
     'Donna Salvato',
     'Sally Thompson',
     'Franklin Copeland',
     'Linda Sample']
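
If the purchase counts aren't needed, the same task can be sketched without Counter by flattening the list of lists and passing it to dict.fromkeys(). The function name get_unique_customers and the sample_month data below are hypothetical and used only for this example. The docstring illustrates the alternative to type hints mentioned earlier.

```python
from itertools import chain
from typing import List


def get_unique_customers(customers_purchases: List[List[str]]) -> List[str]:
    """Return each customer's name once, in order of first purchase.

    customers_purchases: a list with one inner list of names per day.
    """
    # Flatten the daily lists into a single stream of names
    all_purchases = chain.from_iterable(customers_purchases)
    # dict.fromkeys keeps one key per distinct name, in insertion order
    return list(dict.fromkeys(all_purchases))


# Made-up purchases for three days
sample_month = [['Ada', 'Ben', 'Ada'], ['Cy', 'Ben'], ['Ada', 'Dee']]
unique_customers = get_unique_customers(sample_month)

print(unique_customers)
```

Unlike the Counter version, which orders customers by number of purchases, this sketch orders them by when they first appear in the data.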

Conclusion

In this tutorial, we learned how to remove duplicates from a Python List. We learned how to remove duplicates with the del keyword for small lists. For larger lists, we saw that using for-loop and list comprehension methods were more efficient than using the del keyword. Furthermore, we learned that the values of a set and the keys of a dictionary are unique, which makes them suitable for removing duplicates from a list. Finally, we learned that dictionary subclasses remove duplicates from a list in much the same way as a dictionary, and we saw the NumPy and pandas methods for getting unique elements from a list.

Aghogho Monorien

About the author

Aghogho is an engineer and aspiring Quant working on the applications of artificial intelligence in finance.
