The Dataquest Download

Level up your data and AI skills, one newsletter at a time.

Each week, the Dataquest Download brings the latest behind-the-scenes developments at Dataquest directly to your inbox. Discover our top tutorial of the week to boost your data skills, get the scoop on any course changes, and pick up a useful tip to apply in your projects. We also spotlight standout projects from our students and share their personal learning journeys.

Hello, Dataquesters!

In this edition, we bring you expert insights, hands-on projects, and essential data skills to boost your learning journey. From cleaning messy column names in pandas to analyzing Kaggle’s data science survey, there’s plenty to explore!

Concept of the Week: Clean up messy column names in pandas effortlessly—no more frustrating errors! Learn more

From the Community: Check out Jim’s heart disease classification model, Aryan’s challenge to build 52 ML projects, and practical SQL aggregation tips. Join the discussion

New Resource: Building & Presenting Your Data Portfolio—a guide to creating a portfolio that stands out to employers. Learn more

You load a dataset, eager to explore it, but something feels off. The column names are inconsistent—some have spaces, others have weird capitalization, and a few even contain special characters. You try to reference a column and get an error. Annoying, right?

Messy column names are a common frustration, but pandas makes it easy to clean them up. Let’s go over a few simple techniques to make your column names readable, standardized, and easy to work with.

The Dataset: Laptops.csv

For this article, we’ll be working with a dataset containing information about 1,300 laptop computers. If you’d like to follow along and run the code yourself, you can download the dataset here: laptops.csv.

Let’s first load the dataset and inspect the column names:

import pandas as pd

# Load dataset
df = pd.read_csv("laptops.csv",encoding="latin1")

# Display the column names
print(df.columns)

Output:

Index([' Manufacturer','Model Name','Category',
        'Screen Size','Screen','CPU','RAM',         ' Storage','GPU','Operating System',
         'Operating System Version','Weight','Price (Euros)'],
dtype='object')

What’s Wrong with Our Column Names?

From the output above, here’s what you might notice:

Spaces instead of underscores (Operating System instead of operating_system)
Inconsistent capitalization (Manufacturer vs. model name)
Hidden leading or trailing whitespace (" Storage" instead of "Storage")
Special characters (Price (Euros) instead of price_euros)

These small inconsistencies can slow you down, especially when writing queries or referencing columns in your analysis. We will now rectify these issues, one step at a time.

How to Clean Up Column Names with pandas

We’ll go through a series of steps to clean our column names. After each step, we’ll display the updated column names to see our progress.

Step 1: Remove Leading and Trailing Whitespace

Column names sometimes contain extra spaces at the beginning or end, which can lead to hard-to-spot errors when referencing them. Removing these unnecessary spaces ensures consistency and prevents issues when selecting columns.

# Strip whitespace from column names
df.columns = df.columns.str.strip()
print(df.columns)

Output:

Index(['Manufacturer','Model Name','Category',
        'Screen Size','Screen','CPU','RAM',
        'Storage','GPU','Operating System',
        'Operating System Version','Weight','Price (Euros)'], dtype='object')

Notice that leading and trailing spaces have been removed from column names like ‘ Storage’ and ‘ Manufacturer’, making them easier to reference.

Step 2: Standardize Formatting (Lowercase & Underscores)

Column names should be easy to reference in code. A common convention is to use lowercase letters and replace spaces with underscores. This makes accessing and manipulating columns more predictable and less error-prone.

df.columns = df.columns.str.lower().str.replace(" ", "_")
print(df.columns)

Output:

Index(['manufacturer','model_name','category',
       'screen_size','screen', 'cpu','ram','storage',
       'gpu','operating_system', 'operating_system_version',
       'weight','price_(euros)'], dtype='object')

All column names are now lowercase, and spaces have been replaced with underscores, following a standard naming convention. However, note that special characters are still present, such as parentheses in ‘price_(euros)’, which we’ll address next.

Step 3: Remove Special Characters

Some column names may include special characters like parentheses, currency symbols, or other non-alphanumeric characters. These can cause errors when performing operations like merging or querying data. Removing them ensures compatibility across different processing tools.

df.columns = df.columns.str.replace("[()€$]", "", regex=True)
print(df.columns)

Output:

Index(['manufacturer','model_name','category',
       'screen_size','screen', 'cpu','ram',
        'storage','gpu','operating_system', 
        'operating_system_version','weight','price_euros'], 
dtype='object')

Special characters like parentheses and currency symbols have been removed. ‘price_(euros)’ is now ‘price_euros’, making it more user-friendly and easier to reference in code.

Step 4: Apply All Fixes at Once

Instead of applying each fix manually, we can create a function that performs all the cleaning steps in one go. This approach is useful when working with multiple datasets that require the same cleaning steps, making the process more efficient and reusable.

def clean_column_names(df):
    df.columns = (df.columns
               .str.strip()
               .str.lower()
               .str.replace(" ", "_")
               .str.replace("[()€$]", "",regex=True))
    return df

# Clean the column names
df = clean_column_names(df)
print(df.columns)

Output:

Index(['manufacturer','model_name','category',
        'screen_size','screen', 'cpu','ram','storage',
        'gpu','operating_system', 'operating_system_version',
        'weight','price_euros'],
dtype='object')

After applying all fixes in one function, we now have fully cleaned column names that are easy to reference and work within any analysis or transformation.

Now your dataset is much easier to work with!

Why This Matters

Cleaning column names may seem like a small step, but it saves time and prevents frustrating errors. When you can quickly reference your columns, your data analysis flows much more smoothly.

Want to learn more about cleaning messy data? Check out the full Data Cleaning Basics lesson. If you’re serious about data analysis, explore the full Junior Data Analyst career path.

From the Community

Classifying Heart Disease: Jim’s machine-learning project showcases in-depth variable analysis and strong visualizations, achieving high-quality results with room for improvement.

52 Projects in 52 Weeks: Aryan is challenging himself to build a machine-learning project every week for a year—got unique ideas? Join the discussion and collaborate!

Multiple Columns from One Table in SQL: Raisa explains how to use conditional aggregation in SQL and when it’s a better choice than subqueries.

Calculating Accuracy for a Specific Class: Neha breaks down per-class accuracy in classification models, with both theory and a practical example.

DQ Resources

Building and Presenting Your Data Portfolio [NEW]: Learn how to create a strong data portfolio that showcases your skills, highlights impactful projects, and stands out to employers. Learn more

Profitable App Profiles—A Data Analysis Project: Learn to analyze mobile app market data and recommend profitable development strategies in this guided project. Gain hands-on experience with Python, data cleaning, and exploratory analysis. Learn more

Breaking Into a Data Career—Lessons From a CEO: In this video, Kishawna Peck, CEO of Womxn in Data Science, shares insights on navigating the data industry, building a strong portfolio, and breaking into a data career. Learn more

What We're Reading

A New Validation Technique for Better Forecasts: A new method could improve predictions in weather forecasting, climate research, public health, and ecological management.

OpenAI Struggles with Real-World Coding: OpenAI tested its latest AI models on freelance coding tasks, revealing surprising performance gaps in real-world software development.

Give 20%, Get $20: Time to Refer a Friend!

Give 20% Get $20

Now is the perfect time to share Dataquest with a friend. Gift a 20% discount, and for every friend who subscribes, earn a $20 bonus. Use your bonuses for digital gift cards, prepaid cards, or donate to charity. Your choice! Click here

High-fives from Vik, Celeste, Anna P, Anna S, Anishta, Bruno, Elena, Mike, Daniel, and Brayan.

Join Dataquest today!

2025-07-16

Learn faster and retain more.
Dataquest is the best way to learn

Take a free Course

The Dataquest Download

Level up your data and AI skills, one newsletter at a time.

The Dataset: Laptops.csv

Output:

What’s Wrong with Our Column Names?

How to Clean Up Column Names with pandas

Step 1: Remove Leading and Trailing Whitespace

Output:

Step 2: Standardize Formatting (Lowercase & Underscores)

Output:

Step 3: Remove Special Characters

Output:

Step 4: Apply All Fixes at Once

Output:

Why This Matters

From the Community

DQ Resources

What We're Reading

Give 20%, Get $20: Time to Refer a Friend!

Build Production-Ready Workflows with Airflow and ECS

Use SQL or Python? With PySpark, You Don’t Have to Choose

Learn to Set Up PostgreSQL with Docker (No Installation Needed)

Learn faster and retain more.
Dataquest is the best way to learn

The Dataquest Download

Level up your data and AI skills, one newsletter at a time.

The Dataset: Laptops.csv

Output:

What’s Wrong with Our Column Names?

How to Clean Up Column Names with pandas

Step 1: Remove Leading and Trailing Whitespace

Output:

Step 2: Standardize Formatting (Lowercase & Underscores)

Output:

Step 3: Remove Special Characters

Output:

Step 4: Apply All Fixes at Once

Output:

Why This Matters

From the Community

DQ Resources

What We're Reading

Give 20%, Get $20: Time to Refer a Friend!

Build Production-Ready Workflows with Airflow and ECS

Use SQL or Python? With PySpark, You Don’t Have to Choose

Learn to Set Up PostgreSQL with Docker (No Installation Needed)

Learn faster and retain more.Dataquest is the best way to learn

Learn faster and retain more.
Dataquest is the best way to learn