March Madness Challenge – Compete, Learn, and Win – Register Now
The Dataquest Download
Level up your data and AI skills, one newsletter at a time.
Hello, Dataquesters!
In this edition, we bring you expert insights, hands-on projects, and essential data skills to boost your learning journey. From cleaning messy column names in pandas to analyzing Kaggle’s data science survey, there’s plenty to explore!
Concept of the Week: Clean up messy column names in pandas effortlessly—no more frustrating errors! Learn more
From the Community: Check out Jim’s heart disease classification model, Aryan’s challenge to build 52 ML projects, and practical SQL aggregation tips. Join the discussion
New Resource: Building & Presenting Your Data Portfolio—a guide to creating a portfolio that stands out to employers. Learn more
You load a dataset, eager to explore it, but something feels off. The column names are inconsistent—some have spaces, others have weird capitalization, and a few even contain special characters. You try to reference a column and get an error. Annoying, right?
Messy column names are a common frustration, but pandas makes it easy to clean them up. Let’s go over a few simple techniques to make your column names readable, standardized, and easy to work with.
The Dataset: Laptops.csv
For this article, we’ll be working with a dataset containing information about 1,300 laptop computers. If you’d like to follow along and run the code yourself, you can download the dataset here: laptops.csv.
Let’s first load the dataset and inspect the column names:
import pandas as pd # Load dataset df = pd.read_csv("laptops.csv",
encoding="latin1") # Display the column names print(df.columns)
Output:
Index([' Manufacturer',
'Model Name',
'Category',
'Screen Size',
'Screen',
'CPU',
'RAM',
' Storage',
'GPU',
'Operating System', 'Operating System Version',
'Weight',
'Price (Euros)'], dtype='object')
What’s Wrong with Our Column Names?
From the output above, here’s what you might notice:
- Spaces instead of underscores (
Operating System
instead ofoperating_system
) - Inconsistent capitalization (
Manufacturer
vs.model name
) - Hidden leading or trailing whitespace (
" Storage"
instead of"Storage"
) - Special characters (
Price (Euros)
instead ofprice_euros
)
These small inconsistencies can slow you down, especially when writing queries or referencing columns in your analysis. We will now rectify these issues, one step at a time.
How to Clean Up Column Names with pandas
We’ll go through a series of steps to clean our column names. After each step, we’ll display the updated column names to see our progress.
Step 1: Remove Leading and Trailing Whitespace
Column names sometimes contain extra spaces at the beginning or end, which can lead to hard-to-spot errors when referencing them. Removing these unnecessary spaces ensures consistency and prevents issues when selecting columns.
# Strip whitespace from column names
df.columns = df.columns.str.strip()
print(df.columns)
Output:
Index(['Manufacturer',
'Model Name',
'Category',
'Screen Size',
'Screen',
'CPU',
'RAM',
'Storage',
'GPU',
'Operating System', 'Operating System Version',
'Weight',
'Price (Euros)'],
dtype='object')
Notice that leading and trailing spaces have been removed from column names like ‘ Storage’ and ‘ Manufacturer’, making them easier to reference.
Step 2: Standardize Formatting (Lowercase & Underscores)
Column names should be easy to reference in code. A common convention is to use lowercase letters and replace spaces with underscores. This makes accessing and manipulating columns more predictable and less error-prone.
df.columns = df.columns.str.lower().str.replace(" ", "_")
print(df.columns)
Output:
Index(['manufacturer',
'model_name',
'category',
'screen_size',
'screen', 'cpu',
'ram',
'storage',
'gpu',
'operating_system', 'operating_system_version',
'weight',
'price_(euros)'],
dtype='object')
All column names are now lowercase, and spaces have been replaced with underscores, following a standard naming convention. However, note that special characters are still present, such as parentheses in ‘price_(euros)’, which we’ll address next.
Step 3: Remove Special Characters
Some column names may include special characters like parentheses, currency symbols, or other non-alphanumeric characters. These can cause errors when performing operations like merging or querying data. Removing them ensures compatibility across different processing tools.
df.columns = df.columns.str.replace("[()€$]", "", regex=True)
print(df.columns)
Output:
Index(['manufacturer',
'model_name',
'category',
'screen_size',
'screen', 'cpu',
'ram',
'storage',
'gpu',
'operating_system', 'operating_system_version',
'weight',
'price_euros'],
dtype='object')
Special characters like parentheses and currency symbols have been removed. ‘price_(euros)’ is now ‘price_euros’, making it more user-friendly and easier to reference in code.
Step 4: Apply All Fixes at Once
Instead of applying each fix manually, we can create a function that performs all the cleaning steps in one go. This approach is useful when working with multiple datasets that require the same cleaning steps, making the process more efficient and reusable.
def clean_column_names(df): df.columns = (df.columns .str.strip() .str.lower() .str.replace(" ", "_") .str.replace("[()€$]", "",
regex=True)) return df # Clean the column names df = clean_column_names(df) print(df.columns)
Output:
Index(['manufacturer',
'model_name',
'category',
'screen_size',
'screen', 'cpu',
'ram',
'storage',
'gpu',
'operating_system', 'operating_system_version',
'weight',
'price_euros'], dtype='object')
After applying all fixes in one function, we now have fully cleaned column names that are easy to reference and work within any analysis or transformation.
Now your dataset is much easier to work with!
Why This Matters
Cleaning column names may seem like a small step, but it saves time and prevents frustrating errors. When you can quickly reference your columns, your data analysis flows much more smoothly.
Want to learn more about cleaning messy data? Check out the full Data Cleaning Basics lesson. If you’re serious about data analysis, explore the full Junior Data Analyst career path.
From the Community
Classifying Heart Disease: Jim’s machine-learning project showcases in-depth variable analysis and strong visualizations, achieving high-quality results with room for improvement.
52 Projects in 52 Weeks: Aryan is challenging himself to build a machine-learning project every week for a year—got unique ideas? Join the discussion and collaborate!
Multiple Columns from One Table in SQL: Raisa explains how to use conditional aggregation in SQL and when it’s a better choice than subqueries.
Calculating Accuracy for a Specific Class: Neha breaks down per-class accuracy in classification models, with both theory and a practical example.
DQ Resources
Building and Presenting Your Data Portfolio [NEW]: Learn how to create a strong data portfolio that showcases your skills, highlights impactful projects, and stands out to employers. Learn more
Profitable App Profiles—A Data Analysis Project: Learn to analyze mobile app market data and recommend profitable development strategies in this guided project. Gain hands-on experience with Python, data cleaning, and exploratory analysis. Learn more
Breaking Into a Data Career—Lessons From a CEO: In this video, Kishawna Peck, CEO of Womxn in Data Science, shares insights on navigating the data industry, building a strong portfolio, and breaking into a data career. Learn more
What We're Reading
A New Validation Technique for Better Forecasts: A new method could improve predictions in weather forecasting, climate research, public health, and ecological management.
OpenAI Struggles with Real-World Coding: OpenAI tested its latest AI models on freelance coding tasks, revealing surprising performance gaps in real-world software development.
Give 20%, Get $20: Time to Refer a Friend!
Give 20% Get $20
Now is the perfect time to share Dataquest with a friend. Gift a 20% discount, and for every friend who subscribes, earn a $20 bonus. Use your bonuses for digital gift cards, prepaid cards, or donate to charity. Your choice! Click here
High-fives from Vik, Celeste, Anna P, Anna S, Anishta, Bruno, Elena, Mike, Daniel, and Brayan.