The Dataquest Download

Level up your data and AI skills, one newsletter at a time.

Each week, the Dataquest Download brings the latest behind-the-scenes developments at Dataquest directly to your inbox. Discover our top tutorial of the week to boost your data skills, get the scoop on any course changes, and pick up a useful tip to apply in your projects. We also spotlight standout projects from our students and share their personal learning journeys.

Hello, Dataquesters!

Here’s what we have in store for you in this edition:

  • Concept of the Week: Learn how to explore datasets efficiently using pandas methods like info(), describe(), and value_counts(). Learn more

  • From the Community: AI’s impact on human professionals—does it replace or enhance our skills? Alina shares her perspective. Join the discussion

  • New Resource: In this video, Kishawna Peck, CEO of Womxn in Data Science, shares insights on navigating the data industry and building a strong portfolio. Learn more

If you ever decide to work with a dataset that doesn’t come with an instruction manual, you’re going to have questions. What does the data look like? Is it messy? What data is missing? Is it mostly numbers or text, and what kinds of statistics can we extract from it? Thankfully, pandas can help answer these questions in seconds with just a few commands. Think of it as your data detective toolkit, helping you make sense of everything before getting into deeper analysis.

Loading the Dataset

We’ll be working with data from the 2017 Fortune Global 500 list, which ranks the world’s largest companies by revenue. Once you’ve downloaded the dataset, here’s how to load it into pandas:

import pandas as pd

f500 = pd.read_csv("f500.csv", index_col=0)
f500.index.name = None
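If you don't have the file handy yet, here's a minimal sketch of what those two lines do, using a tiny in-memory CSV instead of f500.csv. The three rows reuse values from this issue's examples, but the "company" header for the first column is an assumption made for illustration; the real file may label it differently.

```python
import io
import pandas as pd

# A tiny stand-in for f500.csv; the "company" header is a hypothetical label.
csv_text = """company,rank,revenues
Walmart,1,485873
State Grid,2,315199
Sinopec Group,3,267518
"""

# index_col=0 uses the first column (the company names) as the row labels.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(sample.index.name)  # the index keeps the header of that first column

# Clearing the index name removes the extra label from the printed output.
sample.index.name = None
print(sample)
```

With the index set this way, you can look up a row by company name, for example sample.loc["Walmart"].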

This dataset includes columns like:

  • rank: Global rank of the company
  • revenues: Revenue for the year (in millions of U.S. dollars)
  • industry: Industry in which the company operates
  • ceo: Name of the CEO

Now that the data is loaded, let’s look at the methods that will help us explore and get to know it.

Exploring DataFrames: Your First Look at the Data

While some of these methods might also work on Series objects, this section focuses on DataFrame methods to help you get an overview of your entire dataset. We’ll take a look at Series methods after going over these useful DataFrame techniques.

1. .head(): Take a Peek

The first step is often just seeing what the data looks like. The DataFrame.head() method returns the first few rows (5 by default), giving you an immediate sense of what’s inside. If you want to see more (or fewer) rows, just pass a number as an argument to the method call, like f500.head(10) or f500.head(3).

f500.head()

Output:

              rank  revenues  revenue_change  profits    ...
Walmart          1    485873             0.8  13643.0    ...
State Grid       2    315199            -4.4   9571.3    ...
Sinopec Group    3    267518            -9.1   1257.9    ...
China National   4    262573           -12.3   1867.5    ...
Toyota Motor     5    254694             7.7  16899.3    ...

This is like peeking into a box of chocolates to see what flavors you’ve got before you start ripping into them. If you notice anything odd, you can investigate further.
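To see how the row-count argument behaves, here's a small sketch on a made-up eight-row DataFrame (the toy data is hypothetical and stands in for f500):

```python
import pandas as pd

# Hypothetical toy data: eight rows with a single "rank" column.
toy = pd.DataFrame({"rank": range(1, 9)})

first_five = toy.head()   # no argument: the first 5 rows
first_two = toy.head(2)   # pass a number to get that many rows instead

print(first_five)
print(first_two)
```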

2. .info(): Structure and Data Health

What’s the overall structure of the data? Which columns have missing values? What data types are you working with? The DataFrame.info() method is your one-stop shop for this information.

f500.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Index: 500 entries, Walmart to LG Electronics
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   rank                     500 non-null    int64
 1   revenues                 500 non-null    int64
 2   revenue_change           498 non-null    float64
 3   profits                  499 non-null    float64
    ...

This quick overview helps you spot potential problems, like missing values or incorrect data types (e.g., a numeric column mistakenly stored as text).
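Here's a rough sketch of what acting on that overview might look like. The toy DataFrame below is hypothetical: its revenues column was deliberately stored as text and one profits value is missing, so it is not a sample of the real dataset.

```python
import pandas as pd

# Hypothetical toy data: "revenues" is stored as text, and one profit is missing.
toy = pd.DataFrame(
    {
        "revenues": ["100", "200", "300"],
        "profits": [10.5, None, 3.2],
    },
    index=["Company A", "Company B", "Company C"],
)

# revenues is listed with a text dtype, and profits has only 2 non-null values.
toy.info()

# Count the missing values in each column.
missing = toy.isna().sum()
print(missing)

# Convert the text column to numbers so numeric methods work on it.
toy["revenues"] = pd.to_numeric(toy["revenues"])
print(toy.dtypes)
```

Running info() again after the conversion is a quick way to confirm the fix took effect.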

3. .describe(): Numeric Summaries

The DataFrame.describe() method summarizes numeric columns by default, providing statistics like mean, min, max, and standard deviation.

f500.describe()

Output:

             rank      revenues  revenue_change      profits    ...
count  500.000000    500.000000      498.000000   499.000000    ...
mean   250.500000   55416.35800        4.538353    3055.2032    ...
std    144.481833  45725.478963       28.549067  5171.981071    ...
min      1.000000  21609.000000      -67.300000  -13038.0000    ...
25%    125.750000  29003.000000       -5.900000   556.950000    ...
50%    250.500000  40236.000000        0.550000   1761.60000    ...
75%    375.250000  63926.750000        6.975000   3954.00000    ...
max    500.000000  485873.00000      442.300000  45687.00000    ...

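This is incredibly useful for identifying outliers or understanding the general scale of your data. For example, the revenues column has a standard deviation of about 45,725 against a mean of about 55,416, and its maximum (485,873) sits far above the 75th percentile (about 63,927). That tells you company sizes vary widely, with a few very large companies stretching the top of the range.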
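Because describe() returns a DataFrame, you can also pull individual statistics out of it by label. Here's a minimal sketch on made-up numbers (the five revenue values are hypothetical, with one deliberately large value to mimic an outlier):

```python
import pandas as pd

# Hypothetical revenues with one very large value.
toy = pd.DataFrame({"revenues": [100, 120, 110, 130, 1000]})

summary = toy.describe()
print(summary)

# Look up single statistics by row label and column name.
median = summary.loc["50%", "revenues"]
mean = summary.loc["mean", "revenues"]
maximum = summary.loc["max", "revenues"]
print(median, mean, maximum)
```

Here the mean (292) is well above the median (120), which is a common hint that a few large values are pulling the average up.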

Exploring Series: Zooming In on Columns

Sometimes you don’t need to explore the whole dataset—you just want to focus on a single column. That’s where Series exploration comes in handy.

4. .describe() for Object Columns

Did you know that .describe() can summarize text-based (or object) columns as well? It provides stats like the number of unique values, the most frequently occurring value (top), and its count (freq).

f500["industry"].describe()

Output:

count                                500
unique                                58
top        Banks: Commercial and Savings
freq                                  51
Name: industry, dtype: object

In this example, you can see that the most common industry among the top 500 companies is Banking, with 51 companies in that category. This quick summary gives you a sense of the distribution without needing to write complex queries.
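If you'd like to use these values in code rather than just read them, the result of .describe() on a text column is itself a Series you can index by label. A small sketch, using hypothetical industry labels rather than the real dataset:

```python
import pandas as pd

# Hypothetical industry labels, not taken from the real dataset.
industry = pd.Series(
    ["Banks", "Retail", "Banks", "Energy", "Banks"],
    name="industry",
)

summary = industry.describe()
print(summary)

# Individual entries can be read by their labels.
most_common = summary["top"]
how_often = summary["freq"]
print(most_common, how_often)
```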

Example: Exploring the country Column

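Let’s say you want to explore which countries have the most companies on the list. The Series.value_counts() method counts how often each unique value appears and sorts the results from most to least frequent: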

f500["country"].value_counts().head()

Output:

USA       132
China      73
Japan      52
Germany    30
France     29
Name: country, dtype: int64

As expected, the United States tops the list, followed by China and Japan. This insight could prompt further investigation into regional patterns in company revenue or profits.
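Counts are handy, but sometimes shares are easier to compare. value_counts() accepts normalize=True to return proportions instead of raw counts. Here's a minimal sketch on a hypothetical six-row column (not the real f500 data):

```python
import pandas as pd

# Hypothetical country labels for six made-up companies.
country = pd.Series(
    ["USA", "China", "USA", "Japan", "USA", "China"],
    name="country",
)

counts = country.value_counts()                # raw counts, most frequent first
shares = country.value_counts(normalize=True)  # the same, as fractions of the total

print(counts)
print(shares)
```

In this toy example, USA appears in 3 of 6 rows, so its share is 0.5.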

Why pandas Is So Fast

Here’s a fun fact: pandas is built on top of NumPy, which is designed to handle large datasets efficiently. This means pandas can apply vectorized operations—performing computations on entire Series or DataFrames without needing loops.

For example, if we wanted to calculate the change in company ranks:

f500["previous_rank"] - f500["rank"]

This subtraction happens all at once, thanks to vectorized operations. This is a big reason why pandas is so powerful when working with large datasets.
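To make the contrast concrete, here's a small sketch that computes the same rank change twice on hypothetical data: once with an explicit Python loop and once as a single vectorized expression. The company labels and ranks are made up for illustration.

```python
import pandas as pd

# Hypothetical ranks for three made-up companies.
toy = pd.DataFrame(
    {"rank": [1, 2, 3], "previous_rank": [1, 4, 2]},
    index=["Company A", "Company B", "Company C"],
)

# Loop version: one Python-level subtraction per row.
loop_result = []
for label in toy.index:
    loop_result.append(toy.loc[label, "previous_rank"] - toy.loc[label, "rank"])

# Vectorized version: one expression over the whole columns.
rank_change = toy["previous_rank"] - toy["rank"]

print(rank_change)
```

Both approaches give the same numbers, but the vectorized version is shorter and avoids looping in Python, a difference that tends to matter more as the dataset grows.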

Understanding Axis Behavior

The axis parameter is a key option in many pandas methods, specifying the direction in which operations are applied. Here’s a quick summary:

  • axis=0: Apply the operation down the rows (column-wise)
  • axis=1: Apply the operation across the columns (row-wise)

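For example, to sum the values in each numeric column (numeric_only=True skips text columns like ceo and industry):

f500.sum(axis=0, numeric_only=True)

To sum the numeric values across each row:

f500.sum(axis=1, numeric_only=True)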

Keep this distinction in mind—it’ll save you from head-scratching moments when debugging your code!
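A tiny all-numeric example can help the two directions stick. The DataFrame below is hypothetical (the column names q1 and q2 are made up), and it's small enough to check by hand:

```python
import pandas as pd

# Hypothetical toy data: two numeric columns, three labeled rows.
toy = pd.DataFrame(
    {"q1": [1, 2, 3], "q2": [10, 20, 30]},
    index=["Company A", "Company B", "Company C"],
)

# axis=0 moves down the rows, giving one total per column.
col_totals = toy.sum(axis=0)

# axis=1 moves across the columns, giving one total per row.
row_totals = toy.sum(axis=1)

print(col_totals)
print(row_totals)
```

Notice that the result of axis=0 is labeled by column name, while the result of axis=1 is labeled by row index.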

Final Thoughts: Practice Makes Perfect

These exploration methods are just the tip of the iceberg, but they’re enough to help you quickly understand the shape and structure of your data. Try them out on other datasets and see what insights you can uncover!

Want to learn more about pandas? Check out our full pandas fundamentals lesson or enroll in the Junior Data Analyst path to build on your skills.

Happy coding, and keep experimenting!

From the Community

Backpack Price Prediction on Kaggle: Neha put two machine learning models to the test in a Kaggle competition, showcasing top-tier data visualization, insightful storytelling, and well-documented code.

Evaluating Numerical Expressions in Python: Ramsey’s project dives into functional programming with clear, step-by-step documentation—great for Python learners!

Superstore Data Analysis in Power BI: Dan mastered Power BI while breaking down his process into clear, actionable insights. A must-see for Power BI learners!

Generative AI Survey: Participate in an anonymous research survey exploring generative AI’s use and sustainability impact.

Why AI Won’t Replace Human Professionals: Alina continues the discussion on AI, arguing that it enhances human capabilities rather than replacing knowledge work.

DQ Resources

Breaking Into a Data Career – Lessons From a CEO [New]: In this video, Kishawna Peck, CEO of Womxn in Data Science, shares insights on navigating the data industry, building a strong portfolio, and breaking into a data career. Learn more

Build Your First Data Project: Learn how to start your first data project with this beginner-friendly guide, featuring step-by-step walkthroughs for Python and SQL projects. Read more

Matrix Algebra for Data Science: Understand why matrix algebra is a key math skill for data science, with a practical breakdown of concepts you actually need—no advanced math degree required. Learn more

What We're Reading

Why Python Still Leads in 2025: Explore why Python remains the top choice for AI, data science, and web development in 2025, and how mastering it can unlock high-paying career opportunities.

What Is Creative Coding: Simon Willison recaps how 2024 redefined AI, with cutting-edge models breaking new benchmarks and even running on personal laptops.

Do People Actually Hate Coldplay: A data-driven look at Coldplay’s cultural reputation, exploring whether their popularity aligns with public perception.

Top Data Trends for 2025: AI is reshaping the data landscape, driving both consolidation and massive expansion. This article highlights key themes defining data in 2025.

Give 20%, Get $20: Time to Refer a Friend!

Now is the perfect time to share Dataquest with a friend. Gift a 20% discount, and for every friend who subscribes, earn a $20 bonus. Use your bonuses for digital gift cards, prepaid cards, or donate to charity. Your choice! Click here

High-fives from Vik, Celeste, Anna P, Anna S, Anishta, Bruno, Elena, Mike, Daniel, and Brayan.
