Blog – Dataquest (https://www.dataquest.io): Acquire the skills you need to start and advance your data science career.

Data Analyst Skills – 8 Skills You Need to Get a Job
https://www.dataquest.io/blog/data-analyst-skills/
Sun, 18 Apr 2021 14:55:36 +0000

What are 5 real-world tasks that cover most of the skills someone needs to be hired as a data analyst?


What is a Data Analyst?

A data analyst is someone who uses technical skills to analyze data and report insights.

On a typical day, a data analyst might use SQL skills to pull data from a company database, use programming skills to analyze that data, and then use communication skills to report their results to a larger audience.

It's a fulfilling job that pays well. Being a data analyst also provides experience that can be beneficial for stepping into more advanced roles like data scientist.

How to Become a Data Analyst

  1. Learn the technical skills (SQL and some data analysis with Python or R)
  2. Learn the fundamentals of statistics
  3. Build data analysis projects that showcase your hard and soft skills

So you've decided you want to be a data analyst. Or maybe your goal is to be a data scientist, but you know many entry-level jobs are analyst roles. In either case, you're going to need to master data analyst skills to get you where you want to go.

But what are those skills? What are the things you need to know? In this article, you'll learn the eight key skills you'll need to get a job as a data analyst.

What Skills Does a Data Analyst Need?

We'll be focusing on skills and not on tools (like Python, R, SQL, Excel, Tableau, etc.). Our focus will be what you'll need to do as a data analyst, not how you do those things.

Tools — the how — will vary depending on the exact role, the company that hires you, and the industry you end up working in. You can take the data analyst skills from this article and apply them using the tools that you're learning with, or that suit the industry you're looking to break into.

The research for this article grew out of the planning for our Dataquest Data Analyst paths. To make sure we teach the right mix of skills, we did a lot of research to understand what data analysts really do.

This research included interviews with data analysts, data scientists, and recruiters/hiring managers for data roles. We also conducted a review of existing research on the topic.

1: Data Cleaning and Preparation

Research shows that data cleaning and preparation accounts for around 80% of the work of data professionals. This makes it perhaps the key skill for anyone serious about getting a job in data.

Commonly, a data analyst will need to retrieve data from one or more sources and prepare the data so it is ready for numerical and categorical analysis. Data cleaning also involves handling missing and inconsistent data that may affect your analysis.

Data cleaning isn’t always considered “sexy”, but preparing data can actually be a lot of fun when treated as a problem-solving exercise. In any case, it's where most data projects start, so it's a key skill you'll need if you're going to become a data analyst.
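
To make this concrete, here's a minimal sketch (in Python with pandas, though the same ideas apply in R or SQL) of the kind of cleanup a small, hypothetical dataset might need. The column names and values are invented for illustration:

```python
import pandas as pd

# Hypothetical raw data with the kinds of problems analysts see every day:
# missing values, inconsistent labels, and numbers stored as text.
raw = pd.DataFrame({
    "region": ["North", "north ", "South", None, "South"],
    "revenue": ["1200", "950", None, "1100", "not available"],
})

clean = raw.copy()
# Standardize inconsistent category labels.
clean["region"] = clean["region"].str.strip().str.title()
# Coerce revenue to numeric; anything unparseable becomes NaN.
clean["revenue"] = pd.to_numeric(clean["revenue"], errors="coerce")
# Decide how to handle missing data: drop rows missing a region,
# and fill missing revenue with the median as a simple placeholder.
clean = clean.dropna(subset=["region"])
clean["revenue"] = clean["revenue"].fillna(clean["revenue"].median())

print(clean)
```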

2: Data Analysis and Exploration

It might sound funny to list “data analysis” in a list of required data analyst skills. But analysis itself is a specific skill that needs to be mastered.

At its core, data analysis means taking a business question or need and turning it into a data question. Then, you'll need to transform and analyze data to extract an answer to that question.

Another form of data analysis is exploration. Data exploration means looking for interesting trends or relationships in the data that could bring value to a business.

Exploration might be guided by an original business question, but it also might be relatively unguided. By looking for patterns and blips in the data, you may stumble across an opportunity for the business to decrease costs or increase growth!
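
As a rough illustration, here's what a first pass at exploration might look like in pandas, using a made-up orders table. In practice you'd pull real data and follow up on whatever patterns stand out:

```python
import pandas as pd

# Hypothetical order data; in practice this would come from a database or CSV.
orders = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"],
    "channel": ["web", "store", "web", "store", "web", "store"],
    "revenue": [1200, 800, 1500, 700, 1800, 650],
})

# Summary statistics give a quick feel for the data.
print(orders["revenue"].describe())

# Grouping and aggregating surfaces trends worth a closer look,
# e.g. web revenue growing while store revenue shrinks.
by_channel = orders.groupby(["month", "channel"])["revenue"].sum().unstack()
print(by_channel)
```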

3: Statistical Knowledge

A strong foundation in probability and statistics is an important data analyst skill. This knowledge will help guide your analysis and exploration and help you understand the data that you're working with.

Additionally, understanding stats will help you make sure your analysis is valid and will help you avoid common fallacies and logical errors.

The exact level of statistical knowledge required will vary depending on the demands of your particular role and the data you're working with. For example, if your company relies on probabilistic analysis, you'll need a much more rigorous understanding of those areas than you would otherwise.
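
For example, a common statistical task for an analyst is checking whether a difference between two groups is real or just noise. The sketch below uses SciPy's two-sample t-test on simulated A/B test data; the numbers are invented, and the right test always depends on your data and question:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical A/B test data: order values for two versions of a checkout page.
group_a = rng.normal(loc=52.0, scale=8.0, size=200)
group_b = rng.normal(loc=54.0, scale=8.0, size=200)

# A two-sample t-test asks whether the observed difference in means
# is larger than we'd expect from random variation alone.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value (commonly < 0.05) suggests the difference is unlikely
# to be due to chance alone, but significance thresholds depend on context.
```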

4: Creating Data Visualizations

Data visualizations make trends and patterns in data easier to understand. Humans are visual creatures, and most people aren’t going to be able to get meaningful insight by looking at a giant spreadsheet of numbers. As a data analyst, you'll need to be able to create plots and charts to help communicate your data and findings visually.

This means creating clean, visually compelling charts that will help others understand the data. It also means avoiding things that are either difficult to interpret (like pie charts) or can be misleading (like manipulating axis values).

Visualizations can also be an important part of data exploration. Sometimes there are things you can see in a chart that stay hidden when you just look at the raw numbers.

Data with the same statistics can produce radically different plots (source)

It's very rare to find a data role that doesn't require data visualization, making it a key data analyst skill.
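
To give a flavor of what this looks like in practice, here's a small matplotlib sketch that plots made-up monthly signup numbers and starts the y-axis at zero to avoid the misleading-axes problem mentioned above:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly signups; a line chart makes the trend obvious
# in a way a table of numbers does not.
signups = pd.Series(
    [310, 340, 365, 420, 410, 480],
    index=["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
)

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(signups.index, signups.values, marker="o")
ax.set_title("Monthly signups")
ax.set_ylabel("New users")
ax.set_ylim(bottom=0)  # start the y-axis at zero to avoid exaggerating the trend
fig.tight_layout()
fig.savefig("signups.png")
```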

5: Creating Dashboards and/or Reports

As a data analyst, you'll need to empower others within your organization to use data to make key decisions. By building dashboards and reports, you’ll be giving others access to important data by removing technical barriers.

This might take the form of a simple chart and table with date filters, all the way up to a large dashboard containing hundreds of data points that are interactive and update automatically.

Job requirements can vary a lot from position to position, but almost every data analyst job is going to involve producing reports on your findings and/or building dashboards to showcase them.
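
Full dashboards usually involve BI tools or frameworks like Tableau, Power BI, or Shiny, but even a few lines of pandas can turn raw data into a shareable report. The sketch below, with invented sales figures, pivots the data and writes it out as an HTML table a stakeholder could open in a browser:

```python
import pandas as pd

# Hypothetical sales data that a stakeholder wants summarized regularly.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "revenue": [12000, 9500, 13400, 10100],
})

# Pivot into a readable summary table and write it out as a simple HTML report
# that non-technical colleagues can open in a browser.
summary = sales.pivot_table(index="region", columns="quarter",
                            values="revenue", aggfunc="sum")
summary.to_html("quarterly_report.html")
```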

6: Writing and Communication Skills

The ability to communicate in multiple formats is a key data analyst skill. Writing, speaking, explaining, listening: strong communication skills across all of these areas will help you succeed.

Communication is key in collaborating with your colleagues. For example, in a kickoff meeting with business stakeholders, careful listening skills are needed to understand the analyses they require. Similarly, during your project, you may need to be able to explain a complex topic to non-technical teammates.

Written communication is also incredibly important — you'll almost certainly need to write up your analysis and recommendations. 

Being clear, direct, and easily understood is a skill that will advance your career in data. It may be a “soft” skill, but don’t underestimate it — the best analytical skills in the world won’t be worth much unless you can explain what they mean and convince your colleagues to act on your findings.

7: Domain Knowledge

Domain knowledge is understanding things that are specific to the particular industry and company that you work for. For example, if you're working for a company with an online store, you might need to understand the nuances of e-commerce. In contrast, if you're analyzing data about mechanical systems, you might need to understand those systems and how they work.

Domain knowledge changes from industry to industry, so you may find yourself needing to research and learn quickly. No matter where you work, if you don't understand what you're analyzing it's going to be difficult to do it effectively, making domain knowledge a key data analyst skill.

This is certainly something that you can learn on the job, but if you know a specific industry or area you’d like to work in, building as much understanding as you can up front will make you a more attractive job applicant and a more effective employee once you do get the job.

8: Problem-Solving

As a data analyst, you're going to run up against problems, bugs, and roadblocks every day. Being able to problem-solve your way out of them is a key skill.

You might need to research a quirk of some software or coding language that you're using. Your company might have resource constraints that force you to be innovative in how you approach a problem. The data you're using might be incomplete. Or you might need to perform some “good enough” analysis to meet a looming deadline.

Whatever the circumstances, strong problem-solving skills are going to be an incredible asset for any data analyst.

Other Data Analyst Skills

The exact definition of “data analyst” varies a lot depending on whom you ask, so it's possible not all of these skills will be required for every data analyst job. 

Similarly, there may be skills some companies will require that aren't on this list. Our focus here was to find the set of skills that most data analyst roles require in order to build the very best data analyst learning paths for our students.

Data Analyst Job Breakdown

So far in this article, we've looked at critical skills for data analysts from a broad perspective. Now, let's dig a little deeper into some of the specifics.

If you're looking for a job as a data analyst, what kinds of things will you need on your resume? And how much can you expect to get paid if you get the job? Let's take a look at some of the specifics.

Data Analyst Qualifications

Generally speaking, employers will expect data analysts to have a bachelor's degree in something, and a degree in a quantitative/STEM field may help. However, a degree is not required. Data analysts are in high demand, and employers are concerned primarily with an applicant's actual skills — if you have the right skills and the projects to prove it, you can get a data analyst job without a degree.

People often ask whether some kind of data science certificate is required or helpful for getting jobs in data. The answer is no. Employers are primarily concerned with skills, and when we spoke to dozens of people who hire in this field, not a single one of them mentioned wanting to see certificates.

Certificate programs can be helpful if they teach you necessary skills, but employers aren't going to be scanning your resume looking for a data analyst certificate. Nor are they likely to care much about any certificates you've earned. They'll be looking for proof of actual skills.

Data Analyst Requirements

Above, we've talked about the skills data analysts need, and we've explained why you probably don't need any paper qualifications to become a data analyst.

What you do need, though, is proof of the skills you have. Simply listing that you know SQL and Python on your resume is not enough, even if those are the job requirements listed in the job posting. You need to prove you have those skills.

The easiest way to do this is with prior work experience, of course — if an employer can see you've already worked as a data analyst, and call up your old boss for confirmation, you're in good shape.

But if you're reading this article, you probably don't have prior experience in the field. In that case, what you need is a portfolio of data analysis projects that potential employers can peruse. Having an active Github account with relevant projects (and linking to this account from your resume) is probably the quickest and easiest way to set up a portfolio.

The projects you showcase should be your best, and they should demonstrate that you have the skills listed in this article. Use a format like a Jupyter Notebook or R Markdown document to showcase your code along with written explanations and charts that a non-technical hiring manager or recruiter can understand. (Remember, you need to be showcasing your communication skills in addition to the technical skills you used to do the analysis).

The more relevant you can make these projects to the companies where you're applying, the better your chances will be of getting a call back.

Data Analyst Salaries

According to Indeed.com as of April 6, 2021, the average data analyst in the United States earns a salary of $72,945, plus a yearly bonus of $2,500. 

Experienced data analysts at top companies can make significantly more, however. Senior data analysts at companies such as Facebook and Target reported salaries of around $130,000 as of April 2021.

Sample Data Analyst Jobs

We looked at some open "Data Analyst" jobs (as of April 2021) and pulled together lists of the technical and non-technical skills they require. As you can see, the requirements do vary a bit from company to company.

FAANG Company:

  • Excel skills
  • SQL skills
  • Data visualization
  • Communication (heavily emphasized)

Major Insurance Company:

  • SQL skills
  • Communication
  • Python experience (preferred but not required)

Major Political Organization:

  • Excel skills
  • SQL skills
  • Communication

Major Car Company:

  • SQL skills
  • Python skills
  • Web-based data visualization
  • Communication

Popular Social Media Platform:

  • Background in statistics
  • SQL skills (expert)
  • Python or R (familiar)
  • Communication

Major University System:

  • SQL skills
  • Scripting ability in some language (SQL, Python, PowerShell, etc.)
  • Communication

That's just a tiny sample of what's available. And of course, we've simplified these job postings, boiling them down to just the most essential listed skills.

Even though there are differences, it's clear the same skills are required for many of these jobs:

  • Data analysis and data visualization (although the specific tools required vary from job to job). Two other skills we listed above, data cleaning and building dashboards, also fall under these headings.
  • Communication — every job posting mentioned this.
  • Statistics — most of the postings mentioned preferring someone with at least some knowledge of statistics, although many didn't list it as a requirement.
  • Problem solving or analytical thinking skills were also mentioned frequently.

Becoming a Data Analyst, Step by Step

If you're serious about becoming a data analyst and you're starting from scratch, don't worry! You can do this. It helps to take a step-by-step approach.

  1. Learn the basics of programming in R or Python. At Dataquest, our data analyst learning paths in Python and R are designed to help you start from scratch, even if you've never written a line of code before.
  2. Start building projects. As early as possible, start putting together data projects. These early projects will help you solidify the skills you're learning and keep you motivated. You should keep building projects of increasing difficulty and complexity as you work through the later steps here (Dataquest's learning paths have built-in guided projects to help with this).
  3. Learn SQL (and other technical skills). Different data analyst jobs will have different specific requirements, but almost any analyst job will require some SQL skills. We've written a bit about why SQL skills are critical, so don't skip that, but there are other technical skills that can make your life easier, too. A typical workflow pairs SQL with Python or R, as in the sketch after this list. At Dataquest, our data analyst learning paths will take you through all of these skills in a logical sequence, so each skill builds on the previous one and you don't have to worry about what to learn next.
  4. Share your work and engage with the community. This will help you learn, collaborate, and start building a "personal brand" as a data analyst. Sharing your work can feel intimidating, but you never know what kinds of job offers can come from the right person happening to come across a cool project you've shared.
  5. Push your boundaries. Once you've mastered the basics, be sure you keep pushing with your projects so that you're learning new skills. Don't fall into the trap of doing similar projects over and over again because you're comfortable doing them. Try to include at least one thing you've never done before in each new project, or go back to old projects and try to improve them or add complexity.
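
As a taste of the SQL-plus-programming workflow from step 3, here's a self-contained sketch that builds a tiny SQLite database in memory, queries it with SQL, and hands the result to pandas. At a real job, the connection and table names would point at your company's database:

```python
import sqlite3
import pandas as pd

# Build a tiny in-memory database so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Ana", 120.0), ("Ben", 80.0), ("Ana", 60.0)],
)

# The day-to-day pattern: pull an aggregated result set with SQL,
# then continue the analysis in pandas.
query = """
    SELECT customer, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer
    ORDER BY total_spent DESC
"""
top_customers = pd.read_sql_query(query, conn)
print(top_customers)
```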

You Can Become a Data Analyst!

In this article, we’ve covered what you need to learn to become a data analyst. If you want to learn the how, and build the technical skill set you need to successfully get a data analyst job, check out our interactive online data analysis courses.

 Without ever leaving your web browser, you’ll write real code to analyze real-world data as you learn interactively using our proven approach.

Learn, then code

At Dataquest, our vision is to become the world's first option for learning data skills. In order to achieve that, we craft our curriculum to teach students the skills they need to get jobs in data.

Specifically, the data analyst skills we’ve covered in this article are the basis for our two data analyst learning paths: Data Analyst in Python and Data Analyst in R.

You can start both paths for free and start your journey to being a data analyst today!

Learn R the Right Way in 5 Steps
https://www.dataquest.io/blog/learn-r-for-data-science/
Wed, 14 Apr 2021 08:00:00 +0000

R is an increasingly popular language for data analysis and data science. Here's how you can learn R and be sure it sticks so you can get the career you want.

R is an increasingly popular programming language, particularly in the world of data analysis and data science. You may have even heard people say that it's easy to learn R! But easy is relative. Learning R can be a frustrating challenge if you’re not sure how to approach it.

If you’ve struggled to learn R or another programming language in the past, you’re definitely not alone. And it’s not a failure on your part, or some inherent problem with the language.

Usually, it’s the result of a mismatch between what’s motivating you to learn and how you’re actually learning.

This mismatch causes big problems when you’re learning any programming language, because it takes you straight to a place we like to call the cliff of boring.

What is the cliff of boring? It’s the mountain of boring coding syntax and dry practice problems you’re generally asked to work through before you can get to the good stuff — the stuff you actually want to do.

The cliff of boring is a metaphor, but it really can feel like you're looking at this sometimes.

Nobody signs up to learn a programming language because they love syntax. Yet many learning resources, from textbooks to online courses, are written with the idea that students need to master all of the key areas of R syntax before they can do any real work with it.

This is the process that causes new learners to drop off in droves:

  1. You get excited about learning a programming language because you want to do something with it.
  2. You try to start learning and are immediately led to this huge wall of complicated, boring stuff.
  3. You struggle through some of the boring stuff with no idea how it relates to the thing you actually want to do.

Is it any wonder that many people quit when this is the default learning experience?

Don't misunderstand me — there’s no way around learning syntax, in R or any other programming language.

But there is a way to avoid the cliff of boring.

It’s a shame that so many students drop off at the cliff, because R is absolutely worth learning! In fact, R has some big advantages over other languages for anyone who’s interested in learning data science:

  • The R tidyverse ecosystem makes all sorts of everyday data science tasks very straightforward.
  • Data visualization in R can be both simple and very powerful.
  • R was built to perform statistical computing.
  • The online R community is one of the friendliest and most inclusive of all programming communities.
  • The RStudio integrated development environment (IDE) is a powerful tool for programming with R because all of your code, results, and visualizations are together in one place. With RStudio Cloud, you can use RStudio from your web browser without installing anything locally.

And of course, learning R can be great for your career. Data science is a fast-growing field with high average salaries (check out how much your salary could increase).

And tons of companies and organizations use R for data science work! Here's a very short sample of some of the companies using R (from Hired.com as of April 2021):

  • SpaceX
  • Google
  • Starbucks
  • Fitbit
  • Kraft Heinz
  • Hulu
  • Amazon
  • iRobot
  • Ubisoft
  • Allstate
  • Twitch
  • AT&T
  • Salesforce
  • Pfizer
  • General Motors
  • Northrop Grumman
  • Ralph Lauren
  • Goldman Sachs

This list is just the tip of the iceberg — thousands and thousands of companies all across the globe hire people with R skills, and R is very in demand in academia and government, as well. Even from this short list, it's clear that someone with R skills could work in almost any industry they wanted.

Big tech, finance, video games, big pharma, insurance, fashion — every industry needs people who can work with data, and that means that every industry has use for R programming skills.

So how can you get them?

Step 1. Find Your Motivation for Learning R

Before you crack a textbook, sign up for a learning platform, or click play on your first tutorial video, spend some time to really think about why you want to learn R, and what you’d like to do with it.

  • What data are you interested in working with? 
  • What projects would you enjoy building? 
  • What questions do you want to answer?

Find something that motivates you in the process. This will help you define your end goal, and it will help you get to that end goal without boredom.

Try to go deeper than “becoming a data scientist.” There are all kinds of data scientists who work on a huge variety of problems and projects. Are you interested in analyzing language? Predicting the stock market? Digging deep into sports statistics? What’s the thing you want to do with your new skills that’s going to keep you motivated as you work to learn R?

Pick one or two things that interest you and that you’re willing to stick with. Gear your learning towards them and build projects with your interests in mind.

Figuring out what motivates you will help you figure out an end goal, and a path that gets you there without boredom. You don’t have to figure out an exact project, just a general area you’re interested in as you prepare to learn R.

Pick an area you’re interested in, such as:

  • Data Science / Data Analysis
  • Data visualization
  • Predictive modeling / machine learning
  • Statistics
  • Reproducible reports
  • Dashboard reports

Create three-dimensional data visualizations in R with rayshader

Step 2. Learn the Basic Syntax

Unfortunately, there’s no way to completely avoid this step. Syntax in a programming language is even more important than syntax in a human language. If someone says “I’m the store going to,” their English-language syntax is wrong, but you can probably still understand what they mean. Unfortunately, computers are far less forgiving when they interpret your code.

However, learning syntax is boring, so your goal must be to spend as little time as possible doing syntax learning. Instead, learn as much of the syntax as you can while working on real-world problems that interest you so that there’s something to keep you motivated even though the syntax itself isn’t all that exciting.

Here are some resources for learning the basics of R:

  • Codecademy — does a good job of teaching basic syntax.
  • Dataquest: Introduction to R Programming — We built Dataquest to help data science students avoid the cliff of boring by integrating real-world data and real data science problems right off the bat. We think learning the syntax in the context of working on real problems makes it more interesting, and our interactive platform challenges you to really apply what you’re learning, checking your work as you go.
  • R for Data Science — One of the most useful resources for learning R and tidyverse tools. Available in print from O’Reilly or for free online.
  • R Style Guide — This shouldn’t be your primary learning resource but it can be a helpful reference.
  • RStudio Education - RStudio is the most popular integrated development environment (IDE) for programming with R. Their education page for beginners contains useful resources including tutorials, books, and webinars.
  • RStudio Cloud Primers - Start coding in R without installing any software with cloud-based tutorials from RStudio.

The quicker you can get to working on projects, the faster you will learn R. You can always refer to a variety of resources for learning and double-checking syntax if you get stuck later. But your goal should be to spend a couple of weeks on this phase, at most.

The RStudio Cheatsheets are also great reference guides for R syntax.

Step 3. Work on Structured Projects

Once you’ve got enough syntax under your belt, you’re ready to move on to structured projects more independently. Projects are a great way to learn, because they let you apply what you’ve already learned while generally also challenging you to learn new things and solve problems as you go. Plus, building projects will help you put together a portfolio you can show to future employers later down the line.

You probably don’t want to dive into totally unique projects just yet. You’ll get stuck a lot, and the process could be frustrating. Instead look for structured projects until you can build up a bit more experience and raise your comfort level.

If you choose to learn R with Dataquest, this is built right into our curriculum — nearly every one of our data science courses ends with a guided project that challenges you to synthesize and apply what you’re learning. These projects provide some structure, so you’re not totally on your own, but they’re more open-ended than regular course content to allow you to experiment, synthesize your skills in new ways, and make mistakes.

If you’re not studying with Dataquest, there are plenty of other structured projects out there for you to work on. Let’s look at some good resources for projects in each area:

Data science / Data analysis

  • Dataquest — Teaches you R and data science interactively. You analyze a series of interesting datasets ranging from CIA documents to WNBA player stats.
  • R for Data Science - by Hadley Wickham and Garrett Grolemund is an excellent R resource with motivating and challenging exercises. 
  • TidyTuesday - A semi-structured, weekly social data project in R where budding R practitioners clean, wrangle, tidy, and plot a new dataset posted every Tuesday. Results are shared on Twitter using the hashtag #tidytuesday.

Data visualization

  • ggplot2 - One of the most popular tools for data visualization in R is the ggplot2 package. The Data visualisation chapter from R for Data Science is a great place to learn the basics of data visualization with ggplot2. The chapter on Graphics for communication is a great resource for making graphics look more professional.
  • rayshader - build two-dimensional and three-dimensional maps in R with the rayshader package. You can also transform graphics developed with ggplot2 into 3D with rayshader.

Predictive modeling / machine learning

Statistics

Reproducible reports

Dashboard reports

  • Shiny Dashboard Tutorials - make dashboards in R with Shiny using these tutorials from RStudio.
  • Shiny Gallery - check out this gallery from RStudio for some Shiny dashboard inspiration and examples.

Step 4. Build Projects on Your Own

Once you’ve finished some structured projects, you’re probably ready to move on to the next stage of learning R: doing your own unique data science projects. It’s hard to know how much you’ve really learned until you step out and try to do something by yourself. Working on unique projects that interest you will give you a great idea not only of how far you’ve come but also of what you might want to learn next.

And although you’ll be building your own project, you won’t be working alone. You’ll still be referring to resources for help and learning new techniques and approaches as you work. With R in particular, you may find that there’s a package dedicated to helping with the exact sort of project you’re working on, so taking on a new project sometimes also means you’re learning a new R package.

What do you do if you get stuck? Do what the pros do, and ask for help! Here are some great resources for finding help with your R projects:

  • StackOverflow — Whatever your question is, it has probably been asked here before, and if it hasn’t, you can ask it yourself. You can find questions tagged with R here.
  • Google — Believe it or not, this is probably the most commonly-used tool of every experienced programmer. When you encounter an error that you don’t understand, a quick Google search of the error message will often point you towards the answer.
  • Twitter — It may be surprising, but Twitter is an excellent resource for getting help on R-related issues. It's also a great source of R-related news and updates from the world's leading R practitioners. The R community on Twitter is centralized around the #rstats hashtag.
  • Dataquest’s Learning Community — With a free student account you can join our learning community and ask technical questions that your fellow students or Dataquest’s data scientists can answer.

What sorts of projects should you build? As with the structured projects, these projects should be guided by the answers you came up with in step 1. Work on projects and problems that interest you. If you’re interested in climate change, for example, find some climate data to work with and start digging around for insights.

It’s best to start small rather than trying to take on a gigantic project that will never get finished. If what interests you most is a huge project, try to break it down into smaller pieces and tackle them one at a time.

Here are some ideas for projects that you can consider:

  • Expand on one of the structured projects you built before to add new features or deeper analysis.
  • Go to meetups or hook up with other R coders online and join a project that’s already underway.
  • Find an open-source package to contribute to (R has tons of great open source packages!)
  • Find an interesting project someone else made with R on Github and try to extend or expand on it. Or, find a project someone else made in another language and try to recreate it using R.
  • Read the news and look for interesting stories that might have available data you could dig into for a project.
  • Check out our list of free data sets for data science projects and see what available data inspires you to start building!

Here are some more project ideas in the topic areas that we've discussed:

Data science / Data analysis

  • A script to automate data entry.
  • A tool to scrape data from the web.

Data Visualization

  • A map that visualizes election polling by state, or region.
  • A collection of plots that depict the real-estate sale or rental trends in your area.

Predictive modeling / machine learning

  • An algorithm that predicts the weather where you live.
  • A tool that predicts the stock market.
  • An algorithm that automatically summarizes news articles.

Statistics

  • A model that predicts the cost of Uber trips in your area.

Reproducible reports

  • An R Markdown report of Covid-19 trends in your area that can be updated when new data becomes available.
  • A summary report of performance data for your favorite sports team.

Dashboard reports

  • A map of the live locations of buses in your area.
  • A stock market summary.
  • A Covid-19 tracker, like this one.
  • A summary of your personal spending habits.

Think of the projects like a series of steps — each one should set the bar a little higher, and be a little more challenging than the one before.

Step 5. Ramp Up the Difficulty

Working on projects is great, but if you want to learn R then you need to ensure that you keep learning. You can do a lot with just data visualization, for example, but that doesn’t mean you should build 20 projects in a row that only use your data visualization skills. Each project should be a little tougher and a little more complex than the previous one. Each project should challenge you to learn something you didn’t know before.

If you’re not sure exactly how to do that, here are some questions you can ask yourself to apply more complexity and difficulty to any project you’re considering:

  • Can you teach a novice how to make this project by (for example) writing a tutorial? Trying to teach something to someone else will quickly show you how well you really understand it, and it can be surprisingly challenging!
  • Can you scale up your project so that it can handle more data? A lot more data?
  • Can you improve its performance? Could it run faster?
  • Can you improve the visualization? Can you make it clearer? Can you make it interactive?
  • Can you make it predictive?

Never Stop Learning R

Learning a programming language is kind of like learning a second spoken language — you will reach a point of comfort and fluency, but you’ll never really be done learning. Even experienced data scientists who’ve been working with R for years are still learning new things, because the language itself is evolving, and new packages make new things possible all the time.

It’s important to stay curious and keep learning, but don’t forget to look back and appreciate how far you’ve come from time to time, too.

Learning R is definitely a challenge even if you take this approach. But if you can find the right motivation and keep yourself engaged with cool projects, I think anybody can reach a high level of proficiency.

We hope this guide is useful to you on your journey. If you have any other resources to suggest, please let us know!

And if you’re looking for a learning platform that integrates these lessons directly into the curriculum, you’re in luck, because we built one. Our Data Analyst in R path is an interactive course sequence that’s designed to take anyone from total beginner to job-qualified in R and SQL.

And all of our lessons are designed to keep you engaged by challenging you to solve data science problems using real-world data.

Common R Questions:


Is it hard to learn R?

Learning R can certainly be challenging, and you're likely to have frustrating moments. Staying motivated to keep learning is one of the biggest challenges.

However, if you take the step-by-step approach we've outlined here, you should find that it's easy to power through frustrating moments, because you'll be working on projects that genuinely interest you.

Can you learn R for free?

There are lots of free R learning resources out there — here at Dataquest, we have a bunch of free R tutorials and our interactive data science learning platform, which teaches R, is free to sign up for and includes many free missions.

The internet is full of free R learning resources! The downside to learning for free is that to learn what you want, you'll probably need to patch together a bunch of different free resources. You'll spend extra time researching what you need to learn next, and then finding free resources that teach it. Platforms that cost money may offer better teaching methods (like the interactive, in-browser coding Dataquest offers), and they also save you the time of having to find and build your own curriculum.

Can you learn R from scratch (with no coding experience)?

Yes. At Dataquest, we've had many learners start with no coding experience and go on to get jobs as data analysts, data scientists, and data engineers. R is a great language for programming beginners to learn, and you don't need any prior experience with code to pick it up. 

Nowadays, R is easier to learn than ever thanks to the tidyverse collection of packages. The tidyverse is a collection of powerful tools for accessing, cleaning, manipulating, analyzing, and visualizing data with R. This Dataquest tutorial provides a great introduction to the tidyverse.

How long does it take to learn R?

Learning a programming language is a bit like learning a spoken language — you're never really done, because programming languages evolve and there's always more to learn! However, you can get to a point of being able to write simple-but-functional R code pretty quickly.

How long it takes to get to job-ready depends on your goals, the job you're looking for, and how much time you can dedicate to study. But for some context, Dataquest learners we surveyed in 2020 reported reaching their learning goals in less than a year — many in less than six months — with less than ten hours of study per week.

Do you need an R certification to find work?

We've written about certificates in depth, but the short answer is: probably not. Different companies and industries have different standards, but in data science, certificates don't carry much weight. Employers care about the skills you have — being able to show them a GitHub full of great R code is much more important than being able to show them a certificate.

Is R a good language to learn in 2021?

Yes. R is a popular and flexible language that's used professionally in a wide variety of contexts. We teach R for data analysis and machine learning, for example, but if you wanted to apply your R skills in another area, R is used in finance, academia, and business, just to name a few.

Moreover, R data skills can be really useful even if you have no aspiration to become a full-time data scientist or programmer. Having some data analysis skills with R can be useful for a wide variety of jobs — if you work with spreadsheets, chances are there are things you could be doing faster and better with a little R knowledge. 

How much money do R programmers make?

This is difficult to answer, because most people with R skills work in research or data science, and they have other technical skills like SQL, too. ZipRecruiter lists the average R developer salary as $130,000 in the US (as of April 2021).

The average salary for a data scientist is pretty similar — $121,000 according to Indeed.com as of April 2021.

Should I learn base R or tidyverse first?

This is a popular debate topic in the R community. Here at Dataquest, we teach a mix of base R and tidyverse methods in our Introduction to Data Analysis in R course. We are big fans of the tidyverse because it is powerful, intuitive, and fun to use.

But to have a complete understanding of tidyverse tools, you'll need to understand some base R syntax and have an understanding of data types in R. For these reasons, we find it most effective to teach a mix of base R and tidyverse methods in our introductory R courses.

I needed a resource for beginners; something to walk me through the basics with clear, detailed instructions. That is exactly what I got in Dataquest’s Introduction to R course.

Because of Dataquest, I started graduate school with a strong foundation in R, which I use every day while working with data.

Ryan Quinn - Doctoral Student at Boston University

11 Real World Applications for Python Skills
https://www.dataquest.io/blog/real-world-python-use-cases/
Mon, 12 Apr 2021 19:18:54 +0000

Python is one of the most frequently-recommended programming languages. You’ve probably heard people say that’s because it’s relatively easy to learn — and that’s true! But is Python actually useful? What are some of the real-world applications for Python skills once you’ve got them?

In this post, we’ll look at some of the most common use-cases for Python. We’ll also look at a few situations where Python probably isn’t the best choice.

That said, it’s important to keep in mind that Python is an incredibly versatile language. People use it for all kinds of things. The broad real-world use cases we’ll cover here are really just the tip of the iceberg!

Who uses Python today?

The short answer: millions of developers, along with a lot of other folks. A 2019 estimate put the number of Python developers at 8.2 million. StackOverflow’s 2020 developer survey ranks Python as one of the most popular and widely-used languages among developers. And as of April 2021, Indeed.com is listing nearly 100,000 open jobs that require Python.

Of course, there are also quite a lot of people who use Python that wouldn’t be captured by these sorts of statistics and surveys. Python isn’t just used by developers! It’s used by marketers, researchers, data scientists, kids, hobbyists, IT professionals, and all sorts of other people. You don’t have to make your entire living writing Python to get some real benefits from learning it!

At a professional level, though, Python is very useful. For example:

What companies use Python?

Here’s just a short list of a few of the companies that use Python:

  1. Google and subsidiaries like Youtube use Python for a wide variety of things. In fact, Youtube was built using mostly Python!
  2. Industrial Light and Magic, the company behind the special effects of Star Wars and hundreds of other films, has been using Python for years for its CGI and lighting work.
  3. Facebook and subsidiaries like Instagram use Python for various elements of their infrastructure. Instagram is built entirely using Python and its Django framework.
  4. iRobot, the folks who make the Roomba vacuum, use Python to develop the software for their robots.
  5. NASA and associated institutions like the Jet Propulsion Lab use Python for research and scientific purposes.
  6. Netflix uses Python for server-side data analysis and for a wide variety of back-end apps that help keep the massive streaming service online.
  7. Reddit runs on Python and its web.py framework.
  8. IBM, Intel, and a variety of other hardware companies use Python for hardware testing.
  9. Chase, Goldman Sachs, and many other financial firms use Python for financial analysis and market forecasting.
  10. Quora is yet another huge social media platform that’s built using lots of Python.

And that’s just the tip of the iceberg! In fact, these days most large companies are probably using Python at some level. A good way to check is to search a job site like LinkedIn or Indeed for company name + python. Often, you’ll find that companies are looking for people with Python skills.

So, everybody’s using Python. What are they using Python to do? Let’s dive into some real-world applications for Python.

How is Python used in the real world?

1. Data Analytics

As companies across every industry collect more and more data, they need people who can make sense of it. Often, that means hiring data analysts with Python skills.

Python is popular for data analysis work because of powerful libraries like numpy and pandas, which make data cleaning and analysis tasks relatively straightforward, even when working with massive datasets. There are also Python libraries that support a wide variety of other data analytics tasks, from scraping the web with Beautiful Soup to visualizing data with Matplotlib.

Software tools like Jupyter Notebook make it easy for data analysts to create easy-to-repeat analyses, or add text and visualizations that make their work understandable even to people without coding skills.

Example use case: An ecommerce website wants to understand its users better. A data analyst at the company could use Python to analyze the company’s sales, highlight predictable trends, and uncover areas for improvement.
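
A stripped-down version of that analysis might look like the sketch below, which aggregates a handful of invented orders into monthly revenue and flags the weakest month. Real work would, of course, involve far more data and context:

```python
import pandas as pd

# Hypothetical ecommerce orders; real data would come from the company's database.
orders = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2021-01-05", "2021-01-20", "2021-02-03", "2021-02-25", "2021-03-14"]
    ),
    "amount": [35.0, 120.0, 59.0, 80.0, 42.5],
})

# Monthly revenue is a simple, trackable trend.
monthly = orders.set_index("order_date")["amount"].resample("M").sum()
print(monthly)

# Flag the weakest month as a possible area for improvement.
print("Weakest month:", monthly.idxmin().strftime("%B %Y"))
```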

2. Data Science/”AI”

Python is also incredibly popular for more advanced data work in the realm of machine learning. Powerful libraries like scikit-learn and TensorFlow make implementing popular machine learning algorithms very straightforward, and more specialized libraries exist to help with a wide variety of specific machine learning tasks from image recognition to content generation.

Almost anything you see being discussed as “AI” in the news is some sort of machine learning implementation. And an awful lot of that machine learning is being done with Python.

Example use case: A video streaming platform wants to increase user engagement and stickiness. A data science team could use Python to build a predictive model that recommends videos to users based on factors such as their watch history, viewership habits, what videos other users with similar habits watched, etc.
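
As a rough illustration of the workflow (not of any particular company's system), here's a minimal scikit-learn sketch: it generates synthetic features standing in for watch-history signals, trains a random forest classifier, and checks accuracy on held-out data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for "did the user watch the recommended video?" data;
# a real project would use engineered features from watch history.
X, y = make_classification(n_samples=1000, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```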

3. Web Development

As evidenced in the list of companies above, Python is a very popular language for web app development. Many of the websites you use every day were built using Python and popular Python web frameworks such as Django and Flask. Although the pages themselves are rendered with HTML and CSS, Python underlies these visual elements on many sites, driving functionality, managing databases and user accounts, and much more.

Example use case: A company needs to build a new version of its website with specific features. A web developer could build the new site with Python and Django, using the flexibility and power they offer to support any specific or custom features the company needs.
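
For a sense of why Python web frameworks are popular, here's about the smallest possible Flask app, exposing a single hypothetical health-check endpoint. A real site built with Flask or Django would layer templates, databases, and authentication on top of this:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# A minimal endpoint; a production site would add templates, a database,
# user accounts, and so on (often via a fuller framework like Django).
@app.route("/api/health")
def health():
    return jsonify(status="ok")

if __name__ == "__main__":
    app.run(debug=True)  # serves on http://127.0.0.1:5000 by default
```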

4. Game Development

Python is used in the development of indie video games, thanks to the existence of convenient libraries such as PyGame. (Noticing a pattern? Whatever your use case is, there’s probably already a few Python libraries out there designed to help with it).

Python isn’t used as frequently in the development of higher-budget games – if your goal is to build a photorealistic 3D world, Python’s relatively slow speed and high memory usage mean it isn’t the ideal language for the job. Python is sometimes used to build the systems that underlie these games, though. Games including Battlefield 2, Eve Online, The Sims 3, Civilization IV, and World of Tanks use Python, although none of them was coded entirely in Python.

Example use case: A small team wants to build a creative indie side-scrolling game. The devs could choose to work with Python to take advantage of the convenience of PyGame, and the relative ease of learning how to do new things in Python.

5. Software Development

Python is widely used in software development, across a wide variety of real-world applications. The line between software development and web development is a bit blurry these days, since almost all software is built to work on the web even when there’s also a desktop app. Dropbox is a good example of a modern software development company that does both, and Python was used to build Dropbox’s desktop app. Similarly, Spotify has both web and desktop apps, and Python was used to build a number of the background services that make them work.

Of course, Python is also used at many companies to develop internal software tools.

Example use case: A company plans to build a new email client. The developers choose to use Python because they know they’ll be able to create web and desktop clients using Python and its relevant libraries.

6. Data Engineering

Many of the Python libraries that make it a great option for data analysts and data scientists also make Python an important language for data engineers. Data engineers use Python for tasks such as building pipelines, combining datasets, cleaning data, working with APIs, automating various data processes, etc.

Example use case: A company has a lot of data, but it’s stored in various formats and databases, making it time-consuming for analysts to find and work with. A data engineer could use their Python skills to build a pipeline that automates collection from the various sources, joins and cleans the data, and makes it easier for analysts to access and filter.

7. Robotics

Python is a popular language in the field of robotics, both among hobbyists and professionals. On the hobbyist end of the spectrum, Python is frequently used together with the Raspberry Pi hardware platform, which allows for flexible and affordable experimentation. In business, Python is one of the languages commonly used for robotic process automation (RPA), and it’s been used to do things like code industrial robot arms that can work in tandem with each other.

Example use case: A company orders a number of robotic arms for a manufacturing facility. Engineers could use Python to program their behavior, taking advantage of the language’s high-level readability to make it easier for everyone to understand what the arms are meant to be doing.

8. Automation

Python is great for automating repetitive tasks, and there are almost endless real world use cases for Python automation. For example, Python is a popular tool in DevOps because it makes automating systems and processes efficient and transparent. But outside the realm of software development, it’s also widely used to automate everything from complex systems to simple, personal processes like filling in a spreadsheet or responding to emails.

Example use case: A company reports its sales in monthly Excel spreadsheets from each region that have to be manually combined to build company-wide quarterly reports. Rather than do this time-intensive task by hand, an employee writes a Python script that combines all of the spreadsheets and produces each quarterly report automatically.
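
A bare-bones version of that script might look like the sketch below. The folder layout and column names (quarter, region, revenue) are assumptions for the example, and reading Excel files with pandas also requires the openpyxl package:

```python
import glob
import pandas as pd

# Hypothetical layout: one Excel file per region, all with the same columns.
frames = [pd.read_excel(path) for path in glob.glob("reports/*_sales.xlsx")]
combined = pd.concat(frames, ignore_index=True)

# Roll the monthly regional data up into one quarterly, company-wide report.
quarterly = combined.groupby(["quarter", "region"], as_index=False)["revenue"].sum()
quarterly.to_excel("quarterly_report.xlsx", index=False)
```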

9. Hardware interfacing and control

Python’s ability to control hardware goes beyond robotics — in fact, it is used in all sorts of real world hardware-control applications. For example, this convenient Python library makes using Python for a variety of industrial control applications possible.

Example use case: An engineer at a company needs to write software to control a complex HVAC system. They could write code in Python that can send commands to and receive data from the system’s sensors and hardware controllers.

10. Education and training

Because it’s a very high-level, “readable” language that also has a variety of practical uses, Python is a very popular first language for people who want to learn programming. The wide variety of tutorials, videos, interactive courses, and other educational materials available for Python makes it arguably the easiest programming language to learn.

Example use case: A company wants its data analysis team to be able to move beyond the limitations of Excel and SQL. They choose to get the team training in Python, knowing that they will have a wide variety of learning resources to choose from.

11. Personal convenience

In this article, we’ve been mostly focused on business use cases for Python. But many of Python’s commercial use cases are also applicable on a personal level, too. Python can be used to analyze your own data, to automate boring or repetitive elements of your job, or even to create art!

Example use case: Here’s a very personal example — when I wanted to stop myself from sitting for long hours at a stretch, I used my beginner Python skills to write a little script that would pop up alerts at whatever interval I wanted, play a sound of my choosing, and prompt me to do a little exercise based on parameters that I could tweak.
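
A stripped-down take on that idea, without the sound and pop-up parts (which would need extra, platform-specific libraries), could be as simple as:

```python
import time
from datetime import datetime

INTERVAL_MINUTES = 50  # tweak to taste

# A bare-bones "stand up and stretch" reminder that prints a message
# at a fixed interval until you stop the script.
while True:
    time.sleep(INTERVAL_MINUTES * 60)
    print(f"[{datetime.now():%H:%M}] Time to stand up and move for a minute!")
```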

What is Python not good for?

Python is a great and versatile language, but it’s not the best solution for everything. Here are a couple examples of areas where Python might not be the best choice or the most common choice for real-world commercial applications.

Mobile app development

While you certainly can develop mobile apps in Python, you’ll need to make use of third-party layers to make them work across Android and iOS phones. These extra layers can make Python apps less efficient, which means Python isn’t always the best choice for mobile app development (although depending on your specific app’s requirements, it might be fine).

If you are interested in developing mobile apps with Python, a variety of options exist. One of the most popular is the Kivy framework.

Of course, you can always use the web development power of Python together with a framework like Django to make web apps that work well in mobile browsers, too.

Things that require high speed or high memory usage

In part because Python is a high-level language, it’s not always the fastest or most efficient option. For many use cases, this distinction won’t matter — you’ll never notice the extra tenth of a millisecond you might gain from using C++. But if, for example, you’re working on a high-speed 3D-rendered video game, Python’s speed and memory constraints will probably be too limiting.

Similarly, if you’re doing something like writing an operating system, Python isn’t a great choice because its inefficiencies will be layered on top of each other as users run programs within the main program that is the OS.

When high speed and memory performance is critical, Python probably isn’t the best option. However, in many cases — including all of the use cases described above — the minor sacrifices we make in speed and efficiency by using Python are far outweighed by the conveniences it offers.

Where can you learn Python skills?

  • Youtube. There are thousands of free Python tutorials on Youtube, covering almost every conceivable use case.
  • Dataquest. Interactive courses are a great option that makes it easier to get started, since you don’t have to figure out how to install and run Python locally.
  • Udemy. If you learn well from video lectures, there are hundreds of Python courses to learn from here.
  • Coursera. University-branded video lecture courses that cover a number of different Python topics are available.
  • EdX. University-branded Python video lecture courses are available here too.
  • Books. Many Python books, including popular ones like this, are available for free if you don’t mind reading them on a device.
  • Classes and bootcamps. There are many in-person learning options for Python, too, although these tend to be the most expensive way to learn.

Data Engineer, Data Analyst, Data Scientist — What’s the Difference?
https://www.dataquest.io/blog/data-analyst-data-scientist-data-engineer/
Mon, 12 Apr 2021 07:00:00 +0000

In the fast-growing field of data, the "big three" job roles are data engineer, data analyst, and data scientist. Figure out which is the best fit for you.

Data engineer, data analyst, and data scientist — these are job titles you'll often hear mentioned together when people are talking about the fast-growing field of data science.

There are plenty of other job titles in data science and data analytics too. But here, we're going to talk about:

  1. The "big three" roles (data analyst, data scientist, and data engineer)
  2. How they differ from each other
  3. Which role is best for you

Although precisely how these roles are defined can vary from company to company, there are big differences between what you might be doing each day as a data analyst, data scientist, or data engineer.

We're going to dig into each of these specific roles in more depth.

What is a Data Analyst?

Data analysts deliver value to their companies by taking data, using it to answer questions, and communicating the results to help make business decisions.

Common tasks done by data analysts include data cleaning, performing analysis and creating data visualizations.

Depending on the industry, the data analyst could go by a different title (e.g. Business Analyst, Business Intelligence Analyst, Operations Analyst, Database Analyst). Regardless of title, the data analyst is a generalist who can fit into many roles and teams to help others make better data-driven decisions.

What do data analysts do?

The data analyst has the potential to turn a traditional business into a data-driven one.  Their core responsibility is to help others track progress and optimize their focus.

How can a marketer use analytics data to help launch their next campaign? How can a sales representative better identify which demographics to target? How can a CEO better understand the underlying reasons behind recent company growth? These are all questions that the data analyst provides the answer to by performing analysis and presenting the results.  

While data analyst positions are often "entry-level" jobs in the wider field of data, not all analysts are junior level. As effective communicators with mastery over technical tools, data analysts are critical for companies that have segregated technical and business teams.

An effective data analyst will take the guesswork out of business decisions and help the entire organization thrive. The data analyst must be an effective bridge between different teams by analyzing new data, combining different reports, and translating the outcomes. In turn, this is what allows the organization to maintain an accurate pulse check on its growth. 

The nature of the skills required will depend on the company's specific needs, but these are some common tasks: 

  • Cleaning and organizing raw data. 
  • Using descriptive statistics to get a big-picture view of their data. 
  • Analyzing interesting trends found in the data. 
  • Creating visualizations and dashboards to help the company interpret and make decisions with the data. 
  • Presenting the results of a technical analysis to business clients or internal teams. 

The data analyst brings significant value to both the technical and non-technical sides of an organization. Whether running exploratory analyses or explaining executive dashboards, the analyst fosters a greater connection between teams. 

How much money do data analysts make?

As the most entry-level of the "big three" data roles, data analysts typically earn less than data scientists or data engineers. According to Indeed.com as of April 6, 2021, the average data analyst in the United States earns a salary of $72,945, plus a yearly bonus of $2,500.

Experienced data analysts at top companies can make significantly more, however. Senior data analysts at companies such as Facebook and Target reported salaries of around $130,000 as of April 2021.

Data roles, including data analyst roles, also sometimes come with stock options and other non-salary-based compensation.

Sound interesting to you? Start learning on our Data Analyst career paths:

What is a Data Scientist?

A data scientist is a specialist who applies their expertise in statistics and building machine learning models to make predictions and answer key business questions.

A data scientist still needs to be able to clean, analyze, and visualize data, just like a data analyst. However, a data scientist will have more depth and expertise in these skills, and will also be able to train and optimize machine learning models.

What do data scientists do?

The data scientist is an individual who can provide immense value by tackling more open-ended questions and leveraging their knowledge of advanced statistics and algorithms. If the analyst focuses on understanding data from the past and present perspectives, then the scientist focuses on producing reliable predictions for the future.

The data scientist will uncover hidden insights by applying both supervised (e.g. classification, regression) and unsupervised (e.g. clustering, neural networks, anomaly detection) learning methods in their machine learning models. They are essentially training mathematical models that will allow them to better identify patterns and derive accurate predictions.

The following are examples of work performed by data scientists:

  • Evaluating statistical models to determine the validity of analyses.
  • Using machine learning to build better predictive algorithms.
  • Testing and continuously improving the accuracy of machine learning models.
  • Building data visualizations to summarize the conclusion of an advanced analysis.

Data scientists bring an entirely new approach and perspective to understanding data. While an analyst may be able to describe trends and translate those results into business terms, the scientist will raise new questions and be able to build models to make predictions based on new data.

How much money do data scientists make?

Data science salaries can vary quite a lot, since the role itself varies from company to company. According to Indeed.com as of April 6, 2021, the average data scientist in the United States earns a salary of $121,050.

Experienced data scientists at top companies can make significantly more. Senior data scientists at companies such as Twitter reported salaries of around $178,000 as of April 2021.

Data scientists who focus on building machine learning skills can also look at machine learning engineer roles, which command an average yearly salary of $149,924 in the United States as of April 2021.

Sound good to you? Start learning on our Data Scientist career path:

What is a Data Engineer?

Data engineers build and optimize the systems that allow data scientists and analysts to perform their work.

Every company depends on its data to be accurate and accessible to individuals who need to work with it. The data engineer ensures that any data is properly received, transformed, stored, and made accessible to other users.

What do data engineers do?

The data engineer establishes the foundation that the data analysts and scientists build upon. Data engineers are responsible for constructing data pipelines and often have to use complex tools and techniques to handle data at scale. Unlike the previous two career paths, data engineering leans a lot more toward a software development skill set.

At larger organizations, data engineers can have different focuses such as leveraging data tools, maintaining databases, and creating and managing data pipelines. Whatever the focus may be, a good data engineer allows a data scientist or analyst to focus on solving analytical problems, rather than having to move data from source to source.

The data engineer’s mindset is often more focused on building and optimization. The following are examples of tasks that a data engineer might be working on:

  • Building APIs for data consumption.
  • Integrating external or new datasets into existing data pipelines.
  • Applying feature transformations for machine learning models on new data.
  • Continuously monitoring and testing the system to ensure optimized performance.

How much money do data engineers make?

Data engineers are incredibly in demand at the moment, and as a result they command the highest average salary of the three roles. According to Indeed.com as of April 7, 2021, the average data engineer in the United States earns a salary of $130,287, with an additional yearly bonus of $5,000. 

Experienced data engineers at top companies can make much more. For example, senior data engineers at Netflix report salaries of more than $300,000 per year as of April 2021.

Start learning on the Data Engineer career path:

Quiz: Which Role is Best For You?

Below, we've created a quick, four-question quiz that will help give you an idea of which role might be the best fit:

Hopefully this quiz has given you an idea of where you might want to start your journey in the data science industry.

If you didn't get the answer you were hoping for, don't worry — it's just a quick quiz, and there's a lot of overlap between the skills and tasks required for all three job roles!

The real answer to the question of data analyst vs. data scientist vs. data engineer is something that only you can answer. After all, it's your career!

Your Data-Driven Career Path

Now that we’ve explored these three data-driven careers, the question remains — where do you fit in? You've already taken our quiz, but let's take a more in-depth look at how you can really decide what's best for you.

The key is to understand that these are three fundamentally different ways to work with data.

The data engineer is working on the backend, continuously improving data pipelines to ensure that the data the organization relies upon is accurate and available. They will leverage all sorts of different tools to ensure the data is processed correctly and that the right data is available to anyone who needs it.

A good data engineer saves a lot of time and effort for the rest of the organization.

The data analyst may then extract a new data set using a custom API that the engineer built and begin identifying interesting trends in that data and running analyses on anomalies. The analyst will summarize and present their results in a clear way that allows non-technical teammates to understand what the analysis means.

Finally, the data scientist will likely build upon the analyst’s initial findings and research to derive deeper insights. Whether by training machine learning models or by running advanced statistical analyses, the data scientist is going to provide a brand new perspective into not just what has happened in the past, but what may be possible for the near future.

Regardless of your specific path, curiosity is a natural prerequisite of all three of these careers. The ability to use data to ask better questions and run more precise experiments is the entire purpose of a data-driven career. Furthermore, the data science field is constantly evolving and thus, there is a great need to continuously learn more.

At Dataquest, we have educational paths available to those who are interested in pursuing data engineer, data analyst, or data scientist roles in this fast-growing sector. Sign up and start learning more about these positions for free! 

And to all the current and future data analysts, scientists, and engineers out there — good luck and keep learning! 

Have an idea which job you're most interested in?

Click the button below to check out the full learning path for each role, and start learning today!

The post Data Engineer, Data Analyst, Data Scientist — What’s the Difference? appeared first on Dataquest.

]]>
Python Practice: Free Ways To Improve Your Python Skills https://www.dataquest.io/blog/python-practice/ Tue, 06 Apr 2021 16:37:19 +0000 https://www.dataquest.io/?p=28519 Getting good Python practice can help solidify your coding skills. Here are some of the best resources for practicing Python:

The post Python Practice: Free Ways To Improve Your Python Skills appeared first on Dataquest.

]]>
getting python practice in isn't always easy! image of astronaut shooting arrows at a target with python on it

Whether you’re just getting started on your learning journey or you’re looking to brush up before a job interview, getting the right kind of Python practice can make a big difference.

Studies on learning have repeatedly shown that people learn best by doing. But where and how can you get your Python practice in?

Free interactive Python practice:

Click on any of these links to sign up for a free account and dive into an interactive practice session where you’ll be writing real code!

What’s listed above is actually just the tip of the iceberg; we have many additional free practice problems and free interactive lessons as well.

Python project ideas

  • 45 Python Project Ideas for Beginners — These are great ideas for beginner projects, but many could also be easily converted into more advanced projects if you’re a more experienced Python developer looking for some practice.
  • 63 Free Python Tutorials — These free Python tutorials for data science run the gamut from beginner to very advanced, and all of them make fun, easy-to-expand projects for some guided Python practice.

The web is full of thousands of other Python tutorials too. As long as you've got a solid foundation in the Python basics, you can find great practice through many of them, although their quality and accuracy levels can vary depending on the author.

Frequently Asked Questions

Where can I practice Python programming?

  1. Dataquest.io has dozens of free interactive practice questions, as well as free interactive lessons, project ideas, tutorials, and more.
  2. HackerRank is a great site for practice that’s also interactive.
  3. CodingGame is a fun platform for practice that supports Python.
  4. Edabit has Python challenges that can be good for practicing or self-testing.

You can also practice Python using all of the interactive lessons listed above.

How can I practice Python at home?

  1. Install Python on your machine. You can download it directly here, or download a program like Anaconda Individual Edition that makes the process easier. Or you can find an interactive online platform like Dataquest and write code in your browser without having to install anything.
  2. Find a good Python project or some practice problems to work on.
  3. Make detailed plans. Scheduling your practice sessions will make you more likely to follow through.
  4. Join an online community. It’s always great to get help from a real person. Reddit has great Python communities, and Dataquest’s community is great if you’re learning Python data skills.

Can I learn Python in 30 days?

In 30 days, you can definitely learn enough Python to be able to build some cool things. You won’t be able to master Python that quickly, but you could learn to complete a specific project or do things like automate some aspects of your job.

Read more about how long it takes to learn Python.

Can I practice Python on mobile?

Yes, there are lots of apps that allow you to practice Python on both iOS and Android. However, this should not be your primary form of practice if you aspire to use Python in your career — it’s good to practice installing and working with Python on desktops and laptops since that’s how most professional programming work is done.

How quickly can you learn Python?

You can learn the fundamentals of Python in a weekend. If you’re diligent, you can learn enough to complete small projects and genuinely impact your work within a month or so. Mastering Python takes much longer, but you don’t need to become a master to get things done!

Read more about how long it takes to learn Python.

Learn Python the right way!

Our free guide to learning Python has helped thousands and thousands of learners, and it works with any learning platform you choose!

The post Python Practice: Free Ways To Improve Your Python Skills appeared first on Dataquest.

]]>
You Need Data Skills to Future-Proof Your Career https://www.dataquest.io/blog/data-skills-to-future-proof-your-career/ Mon, 05 Apr 2021 17:34:14 +0000 https://www.dataquest.io/?p=28360 No matter what industry you're in, you need data skills to future-proof your career.  You might be thinking: Vik is the CEO of a company that teaches data science - of course he'd say that! But stick with me for a few more paragraphs, I'll walk you through how data was key to all of the […]

The post You Need Data Skills to Future-Proof Your Career appeared first on Dataquest.

]]>
data skills can help with a wide variety of careers

No matter what industry you're in, you need data skills to future-proof your career

You might be thinking: Vik is the CEO of a company that teaches data science - of course he'd say that! But stick with me for a few more paragraphs, and I'll walk you through how data was key to all of the jobs I've had.

I worked in quite a few different roles before I started Dataquest. I was a loader at UPS, a logistics supervisor at Pepsi, a US diplomat, a data science consultant, and a machine learning engineer at edX.

(If that sounds like a strange career path, you can read more of my story here: I Barely Graduated College, And That's Okay.)

Looking back, every single one of my jobs had a data component. To excel, whether I was loading boxes into trucks or creating essay grading algorithms, I needed to be able to use data.

And as time has gone on, these roles have become more and more data-centric. Someone walking into those jobs today would have even more opportunities to incorporate data into their roles than I did.

Using Data at UPS

It might surprise you to learn that a loader at UPS needs some data skills. To explain why, I'll need to discuss a bit of how UPS works.

You're probably familiar with the iconic brown UPS delivery truck. These trucks pick up packages on a route, then go back to a hub to unload.

There, the packages are sorted and loaded into tractor trailers to go to their destination hub.

But a package going from Chicago to San Francisco doesn't go straight there. It goes through intermediary hubs before it gets to the destination.

My part in this was loading packages into tractor trailers at a hub. To get the right packages onto the right truck, I would scan packages as I loaded. This enabled UPS to track how productive I was. It also enabled UPS to track packages across their hubs.

I used the scan data to track how many packages I and other people on my team loaded each night. I optimized which trucks people on my team worked in based on the data. I also used the data to make decisions about when to ask for more hires.

Data also enables UPS to optimize their routes, staff their hubs, and anticipate future demand (especially seasonal demand).

For example: you may have heard that UPS famously doesn't have its delivery trucks turn left. This is due to an algorithm to optimize routes. UPS also hires for peak season every year by using data to anticipate demand.

As UPS moves to automate its hubs, data skills are more and more important for employees to have. It's now possible to track exactly how many packages are flowing through a hub at any given time, and where they're going.

Even a small improvement in the efficiency of a hub can make UPS millions of dollars. Data skills are actually getting more important to UPS over time, and are a core part of their strategy.

Using Data at Pepsi

At Pepsi, I worked in a factory that made soda. This job required using even more data than I had used at UPS.

After the soda was made, we'd either store it in our warehouse or ship it to other warehouses. From the warehouse, we'd load the soda onto route trucks or tractor trailers to be delivered to customers (supermarkets, convenience stores, etc).

We didn't want to make too much or too little of any type of soda, so we had to forecast customer demand. We also had to know how much soda was in our warehouse, so we didn't end up with too much inventory.

My job was to figure out how much soda was in our warehouse, and then feed that into our production schedule.

Amazingly, the way we figured out how much soda was in our warehouse was to count it several times a day. We knew how much soda we were making, and how much was flowing out, but when I joined, there was no way to combine that data to make better estimates. I improved our estimates by combining these signals.

This type of data work is becoming increasingly important to Pepsi. Having too much or too little of any type of soda can cost millions, and data can improve profits. It's no surprise that Pepsi is investing heavily in data training across the company.

Using Data as a Diplomat

My job as a US diplomat at the State Department was where I interacted with data the least. But data was still a component of my job.

I interviewed applicants for immigrant and non-immigrant visas. Given how many people had been pre-qualified for interviews, I knew how many people I and other diplomats would have to interview each day.

We also tracked how many people each person interviewed daily, and our visa approval rates. Approval rates are important because you don't want to reject qualified applicants, or approve unqualified ones. This helped us optimize how we worked, although we admittedly didn't use data as much as we could have.

In my experience, government is the sector where data usage is the least sophisticated. But that's changing. The State Department is appointing a Chief Data Officer. Thousands of diplomats are now being trained in how to use data effectively.

The reason for this is that data can help diplomats be more effective — it can help them approve the right people for visas. I can even imagine a future where visa approval is an automated process, without a human in the loop.

Data can also help to produce more nuanced reporting on the economics and politics of individual countries. One example of this is the humanitarian data the State Department already publishes.

Data skills are just starting to be important at the State Department, but they are poised to become a major part of diplomacy in the near future.

You need data skills

My own career experiences illustrate that data is important whether you're in shipping, manufacturing, or government. And that's just the tip of the iceberg.

Data is transforming roles in almost every industry, including healthcare, finance, and travel. Companies like UPS and Pepsi are using data as a competitive advantage. In the next few years, these companies will need more and more people with data skills.

There are many more open data roles than people with the right skills. So companies will need to build data skills internally, not through hiring. The people who succeed and get promoted at data-savvy companies will be people who can understand and work with data effectively.

There's never been a better time to learn data skills. I made the transition right after my time at the State Department, and data skills have taken my career to places I never imagined. If you're ready to take the next step, Dataquest is a great place to start.

The post You Need Data Skills to Future-Proof Your Career appeared first on Dataquest.

]]>
Tutorial: Web Scraping with Python Using Beautiful Soup https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/ Tue, 30 Mar 2021 18:11:34 +0000 https://www.dataquest.io/?p=28252 Learn how to scrape the web with Python! The internet is an absolutely massive source of data — data that we can access using web scraping and Python! In fact, web scraping is often the only way we can access data. There is a lot of information out there that isn't available in convenient CSV exports […]

The post Tutorial: Web Scraping with Python Using Beautiful Soup appeared first on Dataquest.

]]>
web scraping with python and beautiful soup

Learn how to scrape the web with Python!

The internet is an absolutely massive source of data — data that we can access using web scraping and Python!

In fact, web scraping is often the only way we can access data. There is a lot of information out there that isn't available in convenient CSV exports or easy-to-connect APIs. And websites themselves are often valuable sources of data — consider, for example, the kinds of analysis you could do if you could download every post on a web forum.

To access those sorts of on-page datasets, we'll have to use web scraping. 

Don’t worry if you’re still a total beginner!

In this tutorial we’re going to cover how to do web scraping with Python from scratch, starting with some answers to frequently-asked questions.

Then, we’ll work through an actual web scraping project, focusing on weather data.

web scraping weather data with python

We'll work together to scrape weather data from the web to support a weather app.

But before we start writing any Python, we've got to cover the basics! If you’re already familiar with the concept of web scraping, feel free to scroll past these questions and jump right into the tutorial!

The Fundamentals of Web Scraping:


What is Web Scraping in Python?

Some websites offer data sets that are downloadable in CSV format, or accessible via an Application Programming Interface (API). But many websites with useful data don’t offer these convenient options.

Consider, for example, the National Weather Service’s website. It contains up-to-date weather forecasts for every location in the US, but that weather data isn’t accessible as a CSV or via API. It has to be viewed on the NWS site:

nws

If we wanted to analyze this data, or download it for use in some other app, we wouldn’t want to painstakingly copy-paste everything. Web scraping is a technique that lets us use programming to do the heavy lifting. We’ll write some code that looks at the NWS site, grabs just the data we want to work with, and outputs it in the format we need.

In this tutorial, we’ll show you how to perform web scraping using Python 3 and the Beautiful Soup library. We’ll be scraping weather forecasts from the National Weather Service, and then analyzing them using the Pandas library.

But to be clear, lots of programming languages can be used to scrape the web! We also teach web scraping in R, for example. For this tutorial, though, we'll be sticking with Python.


How Does Web Scraping Work?

When we scrape the web, we write code that sends a request to the server that’s hosting the page we specified. The server will return the source code — HTML, mostly — for the page (or pages) we requested.

So far, we're essentially doing the same thing a web browser does — sending a server request with a specific URL and asking the server to return the code for that page.

But unlike a web browser, our web scraping code won't interpret the page's source code and display the page visually. Instead, we'll write some custom code that filters through the page's source code looking for specific elements we’ve specified, and extracting whatever content we’ve instructed it to extract.

For example, if we wanted to get all of the data from inside a table that was displayed on a web page, our code would be written to go through these steps in sequence:

  1. Request the content (source code) of a specific URL from the server
  2. Download the content that is returned
  3. Identify the elements of the page that are part of the table we want
  4. Extract and (if necessary) reformat those elements into a dataset we can analyze or use in whatever way we require.

If that all sounds very complicated, don't worry! Python and Beautiful Soup have built-in features designed to make this relatively straightforward.
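
To make those four steps concrete, here's a minimal sketch using the requests and Beautiful Soup libraries we'll work with below. The URL is a placeholder, and it assumes the page contains a table element:

import requests
from bs4 import BeautifulSoup

# Steps 1 and 2: request the page and download its content (placeholder URL).
page = requests.get("https://example.com/page-with-a-table")
soup = BeautifulSoup(page.content, "html.parser")

# Step 3: identify the element we want (here, the first table on the page).
table = soup.find("table")

# Step 4: extract the cell text from each row into a list of lists.
rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
    rows.append(cells)

print(rows)

Don't worry if parts of this aren't clear yet; the rest of the tutorial walks through each of these pieces in detail.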

One thing that’s important to note: from a server’s perspective, requesting a page via web scraping is the same as loading it in a web browser. When we use code to submit these requests, we might be “loading” pages much faster than a regular user, and thus quickly eating up the website owner’s server resources.


Why Use Python for Web Scraping?

As previously mentioned, it’s possible to do web scraping with many programming languages.

However, one of the most popular approaches is to use Python and the Beautiful Soup library, as we'll do in this tutorial.

Learning to do this with Python will mean that there are lots of tutorials, how-to videos, and bits of example code out there to help you deepen your knowledge once you’ve mastered the Beautiful Soup basics.


Is Web Scraping Legal?

Unfortunately, there’s not a cut-and-dry answer here. Some websites explicitly allow web scraping. Others explicitly forbid it. Many websites don’t offer any clear guidance one way or the other.

Before scraping any website, we should look for a terms and conditions page to see if there are explicit rules about scraping. If there are, we should follow them. If there are not, then it becomes more of a judgement call.

Remember, though, that web scraping consumes server resources for the host website. If we’re just scraping one page once, that isn’t going to cause a problem. But if our code is scraping 1,000 pages once every ten minutes, that could quickly get expensive for the website owner.

Thus, in addition to following any and all explicit rules about web scraping posted on the site, it’s also a good idea to follow these best practices:

Web Scraping Best Practices:

  • Never scrape more frequently than you need to.
  • Consider caching the content you scrape so that it’s only downloaded once.
  • Build pauses into your code using functions like time.sleep() to keep from overwhelming servers with too many requests too quickly (see the sketch just below this list).
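
As a rough illustration of that last point, here's a minimal, hypothetical scraping loop. The URLs are placeholders, and the length of the pause should depend on the site you're scraping:

import time
import requests

# Placeholder list of pages to scrape; swap in real URLs for your own project.
urls = [
    "https://example.com/page-1",
    "https://example.com/page-2",
]

for url in urls:
    response = requests.get(url)
    # ... parse response.content with Beautiful Soup here ...
    time.sleep(10)  # pause 10 seconds between requests to go easy on the server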

In our case for this tutorial, the NWS’s data is public domain and its terms do not forbid web scraping, so we’re in the clear to proceed.

Learn to scrape the web with Python, right in your browser!

Our interactive APIs and Web Scraping in Python skill path will help you learn the skills you need to unlock new worlds of data with Python.

dataquest-learn-data-science-online

(No credit card required!)

The Components of a Web Page

Before we start writing code, we need to understand a little bit about the structure of a web page. We'll use the site's structure to write code that gets us the data we want to scrape, so understanding that structure is an important first step for any web scraping project.

When we visit a web page, our web browser makes a request to a web server. This request is called a GET request, since we’re getting files from the server. The server then sends back files that tell our browser how to render the page for us. These files will typically include:

  • HTML — the main content of the page.
  • CSS — used to add styling to make the page look nicer.
  • JS — Javascript files add interactivity to web pages.
  • Images — image formats, such as JPG and PNG, allow web pages to show pictures.

After our browser receives all the files, it renders the page and displays it to us.

There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping. When we perform web scraping, we’re interested in the main content of the web page, so we look primarily at the HTML.

HTML

HyperText Markup Language (HTML) is the language that web pages are created in. HTML isn’t a programming language, like Python, though. It’s a markup language that tells a browser how to display content. 

HTML has many functions that are similar to what you might find in a word processor like Microsoft Word — it can make text bold, create paragraphs, and so on.

If you're already familiar with HTML, feel free to jump to the next section of this tutorial. Otherwise, let’s take a quick tour through HTML so we know enough to scrape effectively.

HTML consists of elements called tags. The most basic tag is the <html> tag. This tag tells the web browser that everything inside of it is HTML. We can make a simple HTML document just using this tag:


<html>
</html>

We haven’t added any content to our page yet, so if we viewed our HTML document in a web browser, we wouldn’t see anything:

Right inside an html tag, we can put two other tags: the head tag, and the body tag.

The main content of the web page goes into the body tag. The head tag contains data about the title of the page, and other information that generally isn’t useful in web scraping:


<html>
<head>
</head>
<body>
</body>
</html>

We still haven’t added any content to our page (that goes inside the body tag), so if we open this HTML file in a browser, we still won’t see anything:

You may have noticed above that we put the head and body tags inside the html tag. In HTML, tags are nested, and can go inside other tags.

We’ll now add our first content to the page, inside a p tag. The p tag defines a paragraph, and any text inside the tag is shown as a separate paragraph:


<html>
<head>
</head>
<body>
<p>
Here's a paragraph of text!
</p>
<p>
Here's a second paragraph of text!
</p>
</body>
</html>

Rendered in a browser, that HTML file will look like this:

Here’s a paragraph of text!

Here’s a second paragraph of text!

Tags have commonly used names that depend on their position in relation to other tags:

  • child — a child is a tag inside another tag. So the two p tags above are both children of the body tag.
  • parent — a parent is the tag another tag is inside. Above, the html tag is the parent of the body tag.
  • sibling — a sibling is a tag that is nested inside the same parent as another tag. For example, head and body are siblings, since they’re both inside html. Both p tags are siblings, since they’re both inside body.

We can also add properties to HTML tags that change their behavior. Below, we'll add some extra text and hyperlinks using the a tag.


<html>
<head>
</head>
<body>
<p>
Here's a paragraph of text!
<a href="https://www.dataquest.io">Learn Data Science Online</a>
</p>
<p>
Here's a second paragraph of text!
<a href="https://www.python.org">Python</a> </p>
</body></html>

Here’s how this will look:

Here’s a paragraph of text! Learn Data Science Online

Here’s a second paragraph of text! Python

In the above example, we added two a tags. a tags are links, and tell the browser to render a link to another web page. The href property of the tag determines where the link goes.

a and p are extremely common html tags. Here are a few others:

  • div — indicates a division, or area, of the page.
  • b — bolds any text inside.
  • i — italicizes any text inside.
  • table — creates a table.
  • form — creates an input form.

For a full list of tags, look here.

Before we move into actual web scraping, let’s learn about the class and id properties. These special properties give HTML elements names, and make them easier to interact with when we’re scraping.

One element can have multiple classes, and a class can be shared between elements. Each element can only have one id, and an id can only be used once on a page. Classes and ids are optional, and not all elements will have them.

We can add classes and ids to our example:


<html>
<head>
</head>
<body>
<p class="bold-paragraph">
Here's a paragraph of text!
<a href="https://www.dataquest.io" id="learn-link">Learn Data Science Online</a>
</p>
<p class="bold-paragraph extra-large">
Here's a second paragraph of text!
<a href="https://www.python.org" class="extra-large">Python</a>
</p>
</body>
</html>

Here’s how this will look:

Here’s a paragraph of text! Learn Data Science Online

Here’s a second paragraph of text! Python

As you can see, adding classes and ids doesn’t change how the tags are rendered at all.

The requests library

Now that we understand the structure of a web page, it's time to get into the fun part: scraping the content we want!

The first thing we’ll need to do to scrape a web page is to download the page. We can download pages using the Python requests library.

The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us. There are several different types of requests we can make using requests, of which GET is just one. If you want to learn more, check out our API tutorial.

Let’s try downloading a simple sample website, https://dataquestio.github.io/web-scraping-pages/simple.html.

We’ll need to first import the requests library, and then download the page using the requests.get method:

import requests
page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
page
<Response [200]>

After running our request, we get a Response object. This object has a status_code property, which indicates if the page was downloaded successfully:

page.status_code
200

A status_code of 200 means that the page downloaded successfully. We won’t fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.

We can print out the HTML content of the page using the content property:

page.content
<!DOCTYPE html>
<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

Parsing a page with BeautifulSoup

As you can see above, we now have downloaded an HTML document.

We can use the BeautifulSoup library to parse this document, and extract the text from the p tag.

We first have to import the library, and create an instance of the BeautifulSoup class to parse our document:

from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

We can now print out the HTML content of the page, formatted nicely, using the prettify method on the BeautifulSoup object.

print(soup.prettify())
<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>

This step isn't strictly necessary, and we won't always bother with it, but it can be helpful to look at prettified HTML because it makes the structure of the page, and where tags are nested, easier to see.

As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the children property of soup.

Note that children returns a list generator, so we need to call the list function on it:

list(soup.children)
['html', '\n', <html> <head> <title>A simple example page</title> </head> <body> <p>Here is some simple content for this page.</p> </body> </html>]

The above tells us that there are two tags at the top level of the page — the initial <!DOCTYPE html> tag, and the <html> tag. There is a newline character (\n) in the list as well. Let’s see what the type of each element in the list is:

[type(item) for item in list(soup.children)]
[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

As we can see, all of the items are BeautifulSoup objects:

  • The first is a Doctype object, which contains information about the type of the document.
  • The second is a NavigableString, which represents text found in the HTML document.
  • The final item is a Tag object, which contains other nested tags.

The most important object type, and the one we’ll deal with most often, is the Tag object.

The Tag object allows us to navigate through an HTML document, and extract other tags and text. You can learn more about the various BeautifulSoup objects here.

We can now select the html tag and its children by taking the third item in the list:

html = list(soup.children)[2]

Each item in the list returned by the children property is also a BeautifulSoup object, so we can also use the children property on html.

Now, we can find the children inside the html tag:

list(html.children)
['\n', <head> <title>A simple example page</title> </head>, '\n', <body> <p>Here is some simple content for this page.</p> </body>, '\n']

As we can see above, there are two tags here, head, and body. We want to extract the text inside the p tag, so we’ll dive into the body:

body = list(html.children)[3]

Now, we can get the p tag by finding the children of the body tag:

list(body.children)
['\n', <p>Here is some simple content for this page.</p>, '\n']

We can now isolate the p tag:

p = list(body.children)[1]

Once we’ve isolated the tag, we can use the get_text method to extract all of the text inside the tag:

p.get_text()
'Here is some simple content for this page.'

Finding all instances of a tag at once

What we did above was useful for figuring out how to navigate a page, but it took a lot of commands to do something fairly simple.

If we want to find all instances of a particular tag, we can instead use the find_all method, which will find every instance of that tag on the page.

soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')
[<p>Here is some simple content for this page.</p>]

Note that find_all returns a list, so we’ll have to loop through it, or use list indexing, to extract text:

soup.find_all('p')[0].get_text()
'Here is some simple content for this page.'

If you instead only want to find the first instance of a tag, you can use the find method, which will return a single BeautifulSoup object:

soup.find('p')
<p>Here is some simple content for this page.</p>

Searching for tags by class and id

We introduced classes and ids earlier, but it probably wasn’t clear why they were useful.

Classes and ids are used by CSS to determine which HTML elements to apply certain styles to. But when we're scraping, we can also use them to specify the elements we want to scrape.

To illustrate this principle, we’ll work with the following page:

<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <div>
            <p class="inner-text first-item" id="first">
                First paragraph.
            </p>
            <p class="inner-text">
                Second paragraph.
            </p>
        </div>
            <p class="outer-text first-item" id="second">
                <b>
                First outer paragraph.
                </b>
            </p>
            <p class="outer-text">
                <b>
                Second outer paragraph.
                </b>
            </p>
    </body>
</html>

We can access the above document at the URL https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html.

Let’s first download the page and create a BeautifulSoup object:

page = requests.get("https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup
<html>
<head>
<title>A simple example page
</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
First paragraph.
</p><p class="inner-text">
Second paragraph.
</p></div>
<p class="outer-text first-item" id="second"><b>
First outer paragraph.
</b></p><p class="outer-text"><b>
Second outer paragraph.
</b>
</p>
</body>
</html>

Now, we can use the find_all method to search for items by class or by id. In the below example, we’ll search for any p tag that has the class outer-text:

soup.find_all('p', class_='outer-text')
[<p class="outer-text first-item" id="second"> <b> First outer paragraph. </b> </p>, <p class="outer-text"> <b> Second outer paragraph. </b> </p>]

In the below example, we’ll look for any tag that has the class outer-text:

soup.find_all(class_="outer-text")
<p class="outer-text first-item" id="second">
<b>
First outer paragraph.
</b>
</p>, <p class="outer-text">
<b>
Second outer paragraph.
</b>
</p>]

We can also search for elements by id:

soup.find_all(id="first")
[<p class="inner-text first-item" id="first">
First paragraph.
</p>]

Using CSS Selectors

We can also search for items using CSS selectors. These selectors are how the CSS language allows developers to specify HTML tags to style. Here are some examples:

  • p a — finds all a tags inside of a p tag.
  • body p a — finds all a tags inside of a p tag inside of a body tag.
  • html body — finds all body tags inside of an html tag.
  • p.outer-text — finds all p tags with a class of outer-text.
  • p#first — finds all p tags with an id of first.
  • body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.

You can learn more about CSS selectors here.

BeautifulSoup objects support searching a page via CSS selectors using the select method. We can use CSS selectors to find all the p tags in our page that are inside of a div like this:

soup.select("div p")
[<p class="inner-text first-item" id="first">
First paragraph.
</p>, <p class="inner-text">
Second paragraph.
</p>]

Note that the select method above returns a list of BeautifulSoup objects, just like find and find_all.

Downloading weather data

We now know enough to proceed with extracting information about the local weather from the National Weather Service website!

The first step is to find the page we want to scrape. We’ll extract weather information about downtown San Francisco from this page.

an image of the site we will use for our python web scraping

Specifically, let's extract data about the extended forecast.

As we can see from the image, the page has information about the extended forecast for the next week, including time of day, temperature, and a brief description of the conditions.

Exploring page structure with Chrome DevTools

The first thing we’ll need to do is inspect the page using Chrome Devtools. If you’re using another browser, Firefox and Safari have equivalents.

You can start the developer tools in Chrome by clicking View -> Developer -> Developer Tools. You should end up with a panel at the bottom of the browser like what you see below. Make sure the Elements panel is highlighted:

Chrome Developer Tools

The elements panel will show you all the HTML tags on the page, and let you navigate through them. It’s a really handy feature!

By right clicking on the page near where it says “Extended Forecast”, then clicking “Inspect”, we’ll open up the tag that contains the text “Extended Forecast” in the elements panel:

The extended forecast text

We can then scroll up in the elements panel to find the “outermost” element that contains all of the text that corresponds to the extended forecasts. In this case, it’s a div tag with the id seven-day-forecast:

The div that contains the extended forecast items.

If we click around on the console, and explore the div, we’ll discover that each forecast item (like “Tonight”, “Thursday”, and “Thursday Night”) is contained in a div with the class tombstone-container.

Time to Start Scraping!

We now know enough to download the page and start parsing it. In the below code, we will:

  • Download the web page containing the forecast.
  • Create a BeautifulSoup class to parse the page.
  • Find the div with id seven-day-forecast, and assign to seven_day
  • Inside seven_day, find each individual forecast item.
  • Extract and print the first forecast item.
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())
<div class="tombstone-container">
	<p class="period-name">
		Tonight
		<br>
		<br/>
		</br>
	</p>
	<p>
		<img alt="Tonight: Mostly clear, with a low around 49. West northwest wind 12 to 17 mph decreasing to 6 to 11 mph after midnight. Winds could gust as high as 23 mph. " class="forecast-icon" src="newimages/medium/nfew.png" title="Tonight: Mostly clear, with a low around 49. West northwest wind 12 to 17 mph decreasing to 6 to 11 mph after midnight. Winds could gust as high as 23 mph. "/>
	</p>
	<p class="short-desc">
		Mostly Clear
	</p>
	<p class="temp temp-low">
		Low: 49 °F
	</p>
</div>

Extracting information from the page

As we can see, inside the forecast item tonight is all the information we want. There are four pieces of information we can extract:

  • The name of the forecast item — in this case, Tonight.
  • The description of the conditions — this is stored in the title property of img.
  • A short description of the conditions — in this case, Mostly Clear.
  • The temperature low — in this case, 49 degrees.

We’ll extract the name of the forecast item, the short description, and the temperature first, since they’re all similar:

period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)
Tonight
Mostly Clear
Low: 49 °F

Now, we can extract the title attribute from the img tag. To do this, we just treat the BeautifulSoup object like a dictionary, and pass in the attribute we want as a key:

img = tonight.find("img")
desc = img['title']
print(desc)
Tonight: Mostly clear, with a low around 49. West northwest wind 12 to 17 mph decreasing to 6 to 11 mph after midnight. Winds could gust as high as 23 mph.

Extracting all the information from the page

Now that we know how to extract each individual piece of information, we can combine our knowledge with CSS selectors and list comprehensions to extract everything at once.

In the below code, we will:

  • Select all items with the class period-name inside an item with the class tombstone-container in seven_day.
  • Use a list comprehension to call the get_text method on each BeautifulSoup object.
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods
['Tonight',
'Thursday',
'ThursdayNight',
'Friday',
'FridayNight',
'Saturday',
'SaturdayNight',
'Sunday',
'SundayNight']

As we can see above, our technique gets us each of the period names, in order.

We can apply the same technique to get the other three fields:

short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(short_descs)
print(temps)
print(descs)
['Mostly Clear', 'Sunny', 'Mostly Clear', 'Sunny', 'Slight ChanceRain', 'Rain Likely', 'Rain Likely', 'Rain Likely', 'Chance Rain']
['Low: 49 °F', 'High: 63 °F', 'Low: 50 °F', 'High: 67 °F', 'Low: 57 °F', 'High: 64 °F', 'Low: 57 °F', 'High: 64 °F', 'Low: 55 °F']
['Tonight: Mostly clear, with a low around 49. West northwest wind 12 to 17 mph decreasing to 6 to 11 mph after midnight. Winds could gust as high as 23 mph. ', 'Thursday: Sunny, with a high near 63. North wind 3 to 5 mph. ', 'Thursday Night: Mostly clear, with a low around 50. Light and variable wind becoming east southeast 5 to 8 mph after midnight. ', 'Friday: Sunny, with a high near 67. Southeast wind around 9 mph. ', 'Friday Night: A 20 percent chance of rain after 11pm. Partly cloudy, with a low around 57. South southeast wind 13 to 15 mph, with gusts as high as 20 mph. New precipitation amounts of less than a tenth of an inch possible. ', 'Saturday: Rain likely. Cloudy, with a high near 64. Chance of precipitation is 70%. New precipitation amounts between a quarter and half of an inch possible. ', 'Saturday Night: Rain likely. Cloudy, with a low around 57. Chance of precipitation is 60%.', 'Sunday: Rain likely. Cloudy, with a high near 64.', 'Sunday Night: A chance of rain. Mostly cloudy, with a low around 55.']

Combining our data into a Pandas Dataframe

We can now combine the data into a Pandas DataFrame and analyze it. A DataFrame is an object that can store tabular data, making data analysis easy. If you want to learn more about Pandas, check out our free to start course here.

In order to do this, we’ll call the DataFrame class, and pass in each list of items that we have. We pass them in as part of a dictionary.

Each dictionary key will become a column in the DataFrame, and each list will become the values in the column:

import pandas as pd
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc":descs
})
weather
desc period short_desc temp
0 Tonight: Mostly clear, with a low around 49. W… Tonight Mostly Clear Low: 49 °F
1 Thursday: Sunny, with a high near 63. North wi… Thursday Sunny High: 63 °F
2 Thursday Night: Mostly clear, with a low aroun… ThursdayNight Mostly Clear Low: 50 °F
3 Friday: Sunny, with a high near 67. Southeast … Friday Sunny High: 67 °F
4 Friday Night: A 20 percent chance of rain afte… FridayNight Slight ChanceRain Low: 57 °F
5 Saturday: Rain likely. Cloudy, with a high ne… Saturday Rain Likely High: 64 °F
6 Saturday Night: Rain likely. Cloudy, with a l… SaturdayNight Rain Likely Low: 57 °F
7 Sunday: Rain likely. Cloudy, with a high near… Sunday Rain Likely High: 64 °F
8 Sunday Night: A chance of rain. Mostly cloudy… SundayNight Chance Rain Low: 55 °F

We can now do some analysis on the data. For example, we can use a regular expression and the Series.str.extract method to pull out the numeric temperature values:

temp_nums = weather["temp"].str.extract("(?P<temp_num>\d+)", expand=False)
weather["temp_num"] = temp_nums.astype('int')
temp_nums
0 49
1 63
2 50
3 67
4 57
5 64
6 57
7 64
8 55
Name: temp_num, dtype: object

We could then find the mean of all the high and low temperatures:

weather["temp_num"].mean()
58.444444444444443

We could also only select the rows that happen at night:

is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
is_night
0 True
1 False
2 True
3 False
4 True
5 False
6 True
7 False
8 True
Name: temp, dtype: bool
weather[is_night]
desc period short_desc temp temp_num is_night
0 Tonight: Mostly clear, with a low around 49. W… Tonight Mostly Clear Low: 49 °F 49 True
2 Thursday Night: Mostly clear, with a low aroun… ThursdayNight Mostly Clear Low: 50 °F 50 True
4 Friday Night: A 20 percent chance of rain afte… FridayNight Slight ChanceRain Low: 57 °F 57 True
6 Saturday Night: Rain likely. Cloudy, with a l… SaturdayNight Rain Likely Low: 57 °F 57 True
8 Sunday Night: A chance of rain. Mostly cloudy… SundayNight Chance Rain Low: 55 °F 55 True

Next Steps For This Web Scraping Project

If you've made it this far, congratulations! You should now have a good understanding of how to scrape web pages and extract data. Of course, there's still a lot more to learn!

If you want to go further, a good next step would be to pick a site and try some web scraping on your own. Some good examples of data to scrape are:

  • News articles
  • Sports scores
  • Weather forecasts
  • Stock prices
  • Online retailer prices

You may also want to keep scraping the National Weather Service, and see what other data you can extract from the page, or about your own city.
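
For example, the detailed descriptions we already collected in descs mention wind speeds. Here's a rough, illustrative sketch that pulls the first wind speed (in mph) out of each description; the regular expression is an assumption about how the NWS phrases wind speeds, so treat it as a starting point rather than a complete solution:

import re

# Grab the first number that appears before "mph" in each description.
# Descriptions without a wind speed get None.
wind_speeds = []
for d in descs:
    match = re.search(r"(\d+)(?:\s+to\s+\d+)?\s+mph", d)
    wind_speeds.append(int(match.group(1)) if match else None)

print(wind_speeds)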

Alternatively, if you want to take your web scraping skills to the next level, you can check out our interactive course, which covers both the basics of web scraping and using Python to connect to APIs. With those two skills under your belt, you'll be able to collect lots of unique and interesting datasets from sites all over the web!

Learn to scrape the web with Python, right in your browser!

Our interactive APIs and Web Scraping in Python skill path will help you learn the skills you need to unlock new worlds of data with Python.

dataquest-learn-data-science-online

(No credit card required!)

The post Tutorial: Web Scraping with Python Using Beautiful Soup appeared first on Dataquest.

]]>
Data Analytics Certification: Do You Need a Certificate to Get a Job as a Data Analyst? https://www.dataquest.io/blog/data-analytics-certification/ Thu, 18 Mar 2021 16:03:53 +0000 https://www.dataquest.io/?p=28157 If you’re interested in becoming a data analyst, or even just interested in adding some data skills to your resume, you’ve probably wondered: do I need some kind of data analytics certification?Finding the real answer to this question is tricky. There are a million data analytics certificate programs out there, and they all have a […]

The post Data Analytics Certification: Do You Need a Certificate to Get a Job as a Data Analyst? appeared first on Dataquest.

]]>
data-analytics-certified

If you’re interested in becoming a data analyst, or even just interested in adding some data skills to your resume, you’ve probably wondered: do I need some kind of data analytics certification?

Finding the real answer to this question is tricky. There are a million data analytics certificate programs out there, and they all have a financial incentive to say that you need their certificate. Heck, here at Dataquest, we have a Data Analyst career path that awards you a certificate!

But here’s the honest truth: no, you do NOT need a certification to get a job as a data analyst.

Now, that doesn’t mean that data analytics certification programs aren’t valuable. But it does mean that you need to think about your investment carefully, because the certificate itself — the actual piece of paper and/or LinkedIn flair — is likely worth nothing.

We’re going to talk about different certification programs and how to assess them. But first, we probably need to explain why the certification itself won’t help you.

Employers don’t care about data analytics certificates. Here’s why.

When I was researching Dataquest’s data science career guide, I spent a lot of time talking to people in the industry about what makes a good entry-level candidate for roles like Data Analyst.

In fact, I have almost 200 pages of interview transcripts with senior data scientists, hiring managers, recruiters, etc. that are all focused on that specific subject: what makes a candidate stand out when applying for entry-level data analyst positions?

You know what word never appears once in those 200 pages? Certification.

(The word certificate doesn’t appear, either.)

The reason for this is pretty straightforward. From an employer perspective, certificates aren’t a good predictor of how effective someone will be at actually doing the job.

This is particularly true in the realm of data analytics, because very few certificate programs actually require much real data work.

MOOC platform courses, for example, typically consist of a series of video lectures, punctuated with multiple-choice and fill-in-the-blank quizzes. They may or may not have a “capstone” project at the end.

Best case scenario, seeing that certification on a resume means that the recipient has completed one data analysis project. That’s not enough to be meaningful to an employer who knows that your effectiveness on the job will be measured by successful end-to-end data analysis project completion, not by your ability to score well on multiple-choice quizzes.

For example, here's a sample question from a real IBM/Coursera MOOC on data skills. Imagine this from an employer's perspective — does being able to answer this kind of question prove an applicant knows how to actually use this algorithm?

MOOC data quiz question

Probably not.

While some certification programs are more rigorous than others, there are simply too many certifications out there for employers to bother worrying about.

When a hiring manager looks at your resume, you have about seven seconds to get their attention. They’re not going to waste their time doing research to figure out whether the certification program you chose is any good.

It’s worth mentioning that brand doesn’t matter here, either. A university degree on your resume will impress a recruiter. But a university certification? Employers are well aware that’s a very different thing. Often, university-branded certificate programs (both online and off) aren’t even operated by the university. They’re run by for-profit companies who license the university’s brand and video lecture recordings.

What do employers want to see on a data analyst’s resume?

We’ve written a lengthy guide to data science and data analysis resumes, but the most important lesson is this: employers need to see proof that you can do the work.

Nobody will pay you to do something that you’ve never done before.

The best way to prove you can do the work is relevant work experience, but if you’re looking for your first job in the field, you won’t have that. That’s OK! You can prove you’ve done the work another way: showcasing your data analysis projects.

We’ve got an in-depth guide to data analysis project portfolios too, so I won’t repeat all those lessons here. But long story short: the more relevant your projects are to the job you’re applying for, the better your chances will be.

For entry-level positions, that’s what employers are looking for in the seven seconds they spend scanning your resume. They want to see projects using the skills required for their role, doing the kinds of analyses needed for their role. Seeing that you’ve already done the kind of work they’re hiring for is far, far more important to most hiring managers than any certification.

You'll get about 7 seconds of an employer's attention on your resume. Use them wisely.

Are data analytics certifications useless? No!

None of this means that certification programs are useless, of course. It just means you need to assess them with the knowledge that the certificate brand you choose probably isn’t going to help you get a job. What will help you get a job are the skills you learn over the course of the program.

Also, it’s important to note that while certificates likely won’t help your job candidacy, they’re also not going to hurt your chances. Most employers will simply ignore them — so we recommend listing them only near the end of your resume — and almost no one will see them as sufficient proof that you can do the job. Some recruiters do, however, see certificates as a sign that a job candidate is actively looking to learn and improve their skill set.

Since many other applicants will have certificates too, this isn’t likely to set you apart from other candidates. Having highly relevant projects is your best chance at doing that. But listing a certificate or two to show you’re serious about learning and growing is never a bad idea.

How to assess certificate programs

The single most important thing you can get from any certification program is the skills you learn, and that should be your most important consideration. Important questions to ask include:

  • How does this program teach? Does it use video lectures? Interactive coding lessons? In-person classes? Everybody learns differently, so you probably know what works best for you, but the science suggests that generally speaking, the more hands-on the teaching method, the better.
  • What does this program teach? Does it cover the most important data analyst skills in enough depth? SQL is one area where many certificate programs skimp because it’s not exciting, but it’s the single most important skill for anyone interested in data to learn. If you don’t already have statistics knowledge, finding a program that covers basic statistics is also important.

Other important factors to consider in your decision include:

  • Cost. Certification programs can range from a few hundred dollars to tens of thousands! What kind of return can you expect on your investment?
  • Time requirements. Some certificate programs, like Dataquest’s, are self-serve — you can begin whenever you want, and study as fast or as slow as you want. Others are cohort-based and time-sensitive — you might only be able to join a class at certain times of the year, or only be able to join live classes at specific times of day.
  • Prerequisites. Some programs require specific degrees, or prior experience and/or coursework.
  • Third-party reviews. Any data analytics certification program with a half-decent marketing team can write a landing page full of happy learner quotes. But what do real learners have to say about the program? Third-party review sites like Switchup, G2, and Course Report are all good places to do some research.

When in doubt, try it out! Many platforms offer free trials, or free courses. For example, you can sign up for a free account with Dataquest and complete any of our 60+ free lessons to get a feel for the different types of content and the teaching style.

If a platform or certification program doesn't give you any opportunity to sample their product, that could be a bit of a red flag. Since many platforms and programs do allow you to "try before you buy," it hardly makes sense to spend hundreds or thousands of dollars on a learning product before you're sure their teaching style works for you!

One thing you definitely need to consider before choosing a certification program: what's your budget? Costs can vary widely.

Analytics certifications compared:

There are an absolutely huge number of data analyst certificates out there. Below, we’ll compare a few of the most popular types of certification programs, so that you have a better idea of how each option stacks up.


Dataquest

Cost: $294 (on sale) for a full year of access.

Type: Online, self-serve.

Platform: Hands-on browser-based coding interface

Topics covered: Python, SQL, statistics, command line/shell, Git

Prerequisites: None.

Time constraints: None. (Most students meet their goals in less than a year of part-time study).

Switchup.org Review Average: 4.85 out of 5


General Assembly Data Analytics

Cost: $3,950 or higher (loan options available)

Type: Online or in-person bootcamp.

Platform: In-person or online virtual classroom

Topics covered: SQL, Excel, Tableau

Prerequisites: None.

Time constraints: Must join a specific session, must attend courses at specific times. (However, new sessions start frequently so you won’t have to wait long to join).

Switchup.org Review Average: 4.28 out of 5


Thinkful Data Analytics Immersion

Cost: $12,250 or higher (loan options available)

Type: Online.

Platform: Online virtual classroom.

Topics covered: Python, SQL, Machine Learning

Prerequisites: None.

Time constraints: Full-time for four months, or part-time (20-30 hours per week) for six months.

Switchup.org Review Average: 4.65 out of 5


Springboard Data Analytics Track

Cost: $5,500 or higher (loan options available)

Type: Online.

Platform: Online virtual classroom.

Topics covered: Python, SQL

Prerequisites: None, although you do have to apply and be accepted.

Time constraints: Must wait for the next cohort to begin, then the program length is six months.

Switchup.org Review Average: 4.67 out of 5


Of course, there are many other options, but these are just examples. As you can see, there are significant differences between these programs. The most obvious one is cost — the costs here range from less than $300 to over $12,000! — but there are other meaningful differences, too.

For example, user reviews: despite being the most affordable option, Dataquest also has the highest average review score.

Time constraints also vary dramatically, from programs like Dataquest or General Assembly that you can start immediately or very soon after making your decision, to programs like Springboard that require an application process and waiting for a cohort to start.

Most important, probably: what topics are actually covered? All of the programs cover SQL — that’s a good sign! General Assembly’s program, which covers just SQL, Excel, and Tableau, may be aimed at less technically demanding analyst roles. On the other hand, the Thinkful program covers machine learning, which isn’t typically required for data analyst roles. Dataquest appears to be the only one of these options with substantive coverage of probability and statistics.

This is not to say that any of these programs is “best.” Obviously, you’re reading this article on the Dataquest site, and we’re very proud of our platform, but we also value honesty, and there’s no way any single platform is going to be the best option for everyone.

What about test-based certifications?

There are a number of certification programs, like Certified Analytics Professional (CAP) or Cloudera's CCA Data Analyst, that offer no education at all. These are tests you can take (if you're willing to pay a few hundred dollars), and you'll receive a certification if you pass.

Are these a good investment? Generally not. There are specific jobs that may favor these certifications, but few require them. And there's no real evidence that employers are interested in them. As previously mentioned, none of the data analytics employers, recruiters, and hiring managers we spoke with mentioned certifications.

A more quantitative analysis confirms this theory. As of this writing, there are about 39,000 open data jobs listed on Indeed.com in the United States. Of these, fewer than 100 require a CAP certification, and fewer than 20 mention wanting to see CCA.

Put another way: that's fewer than 120 of roughly 39,000 listings, so, estimating conservatively, about 99.7% of all data jobs don't require these certifications.

In fact, only 15% of the data jobs on Indeed include the word "certification" at all. Many of the certifications listed are software-specific certifications related to a company's specific tech stack. And some of that 15% consists of job listings that simply read "Certifications: None."

Any way you look at it, the demand for generic data analytics certifications is not high. The vast majority of data jobs do not require or even mention these kinds of certificates — if you do need a certificate for a job, it's likely to be something software specific, such as an AWS certification for a company that does a lot of cloud-based data processing.


So what’s the best data analytics certification option for you? That’s going to come down to a personal decision based on factors like:

  • What is your budget?
  • How much free time do you have to study?
  • Which data analyst skills, if any, do you already have?
  • What is your desired timeline?

Whatever decision you make, though, now you’ll be making it with your eyes open. Now that you know the name on the certificate doesn’t really matter when it comes to getting a job, you’ll be free to focus more on what does matter: learning the right skills and building great projects to prove your skills to potential employers.

SQL Operators: 6 Different Types (w/ Examples) https://www.dataquest.io/blog/sql-operators/ Tue, 16 Mar 2021 18:31:43 +0000 https://www.dataquest.io/?p=27691 We have previously covered why you need to learn SQL to get a data job in 2021, as well as publishing a full list of SQL commands to help you get started. Next, we’re going to be looking at SQL operators.We’re going to cover what exactly SQL operators are, before providing a comprehensive list of […]


We have previously covered why you need to learn SQL to get a data job in 2021, as well as publishing a full list of SQL commands to help you get started. Next, we’re going to be looking at SQL operators.

We’re going to cover what exactly SQL operators are, before providing a comprehensive list of the different types with full examples for each.

If you're trying to learn SQL and reading these types of articles makes you want to bang your head against the wall, you're not alone. 

As with any new skill, people prefer to learn in different ways. That's why we created our interactive SQL courses. Regardless of where you are in your SQL journey, we've got a course for you.

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why watch video lectures when you can learn by doing?

What are SQL operators?

An SQL operator is a special word or character used to perform tasks. These tasks can be anything from complex comparisons, to basic arithmetic operations. Think of an SQL operator as similar to how the different buttons on a calculator function.

SQL operators are primarily used within the WHERE clause of an SQL statement. This is the part of the statement that is used to filter data by a specific condition or conditions.

There are six types of SQL operators that we are going to cover: Arithmetic, Bitwise, Comparison, Compound, Logical and String.

Arithmetic operators

Arithmetic operators are used for mathematical operations on numerical data, such as adding or subtracting.

+ (Addition)

The + symbol adds two numbers together.

SELECT 10 + 10;

- (Subtraction)

The - symbol subtracts one number from another.
SELECT 10 - 10;

* (Multiplication)

The * symbol multiplies two numbers together.
SELECT 10 * 10;

/ (Division)

The / symbol divides one number by another.
SELECT 10 / 10;

% (Remainder/Modulus)

The % symbol (sometimes referred to as Modulus) returns the remainder of one number divided by another.
SELECT 10 % 10;

Bitwise operators

A bitwise operator performs bit manipulation between two expressions of the integer data type. Bitwise operators convert the integers into binary bits and then perform the AND (&), OR (|), exclusive OR (^), or NOT (~) operation on each individual bit, before finally converting the binary result back into an integer.

Just a quick reminder: a binary number in computing is a number made up of 0s and 1s.

& (Bitwise AND)

The & symbol (Bitwise AND) compares each individual bit in a value with its corresponding bit in the other value. In the following example, we are using just single bits. Because the value of @BitOne is different to @BitTwo, a 0 is returned.
DECLARE @BitOne BIT = 1
DECLARE @BitTwo BIT = 0
SELECT @BitOne & @BitTwo;

But what if we make the value of both the same? In this instance, it would return a 1.

DECLARE @BitOne BIT = 1
DECLARE @BitTwo BIT = 1
SELECT @BitOne & @BitTwo;

Obviously this is just for variables that are type BIT. What would happen if we started using numbers instead? Take the example below:

DECLARE @BitOne INT = 230
DECLARE @BitTwo INT = 210
SELECT @BitOne & @BitTwo;

The answer returned here would be 194.

You might be thinking, “How on earth is it 194?!” and that’s perfectly understandable. To explain why, we first need to convert the two numbers into their binary form:

@BitOne (230) - 11100110
@BitTwo (210) - 11010010

Now, we have to go through each bit and compare (so the 1st bit in @BitOne and the 1st bit in @BitTwo). If both numbers are 1, we record a 1. If one or both are 0, then we record a 0:

@BitOne (230) - 11100110
@BitTwo (210) - 11010010
Result        - 11000010

The binary we are left with is 11000010, which is equal to a decimal value of 194.

Confused yet? Don’t worry! Bitwise operators can be confusing to understand, but they’re rarely used in practice.

&= (Bitwise AND Assignment)

The &= symbol (Bitwise AND Assignment) does the same as the Bitwise AND (&) operator but then sets the value of a variable to the result that is returned.
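
For example, a minimal T-SQL sketch reusing the 230 and 210 values from earlier:

DECLARE @BitOne INT = 230
SET @BitOne &= 210        -- equivalent to @BitOne = @BitOne & 210
SELECT @BitOne;           -- returns 194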

| (Bitwise OR)

The | symbol (Bitwise OR) performs a bitwise logical OR operation between two values. Let’s revisit our example from before:

DECLARE @BitOne INT = 230
DECLARE @BitTwo INT = 210
SELECT @BitOne | @BitTwo;

In this instance, we have to go through each bit again and compare, but this time if EITHER number is a 1, then we record a 1. If both are 0, then we record a 0:

@BitOne (230) - 11100110
@BitTwo (210) - 11010010
Result        - 11110110

The binary we are left with is 11110110, which equals a numeric value of 246.

|= (Bitwise OR Assignment)

The |= symbol (Bitwise OR Assignment) does the same as the Bitwise OR (|) operator but then sets the value of a variable to the result that is returned.
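
A quick sketch, again with the same two values:

DECLARE @BitOne INT = 230
SET @BitOne |= 210        -- equivalent to @BitOne = @BitOne | 210
SELECT @BitOne;           -- returns 246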

^ (Bitwise exclusive OR)

The ^ symbol (Bitwise exclusive OR) performs a bitwise logical exclusive OR operation between two values.
DECLARE @BitOne INT = 230
DECLARE @BitTwo INT = 210
SELECT @BitOne ^ @BitTwo;

In this example, we compare each bit and return 1 if one, but NOT both bits are equal to 1.

@BitOne (230) - 11100110
@BitTwo (210) - 11010010
Result        - 00110100

The binary we are left with is 00110100, which equals a numeric value of 52.

^= (Bitwise exclusive OR Assignment)

The ^= symbol (Bitwise exclusive OR Assignment) does the same as the Bitwise exclusive OR (^) operator but then sets the value of a variable to the result that is returned.
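
And one more minimal sketch with the same values:

DECLARE @BitOne INT = 230
SET @BitOne ^= 210        -- equivalent to @BitOne = @BitOne ^ 210
SELECT @BitOne;           -- returns 52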

Comparison operators

A comparison operator is used to compare two values and test whether they are the same.

= (Equal to)

The = symbol is used to filter results that equal a certain value. In the below example, this query will return all customers that have an age of 20.

SELECT * FROM customers
WHERE age = 20;

!= (Not equal to)

The != symbol is used to filter results that do not equal a certain value. In the below example, this query will return all customers that don't have an age of 20.
SELECT * FROM customers
WHERE age != 20;

> (Greater than)

The > symbol is used to filter results where a column’s value is greater than the queried value. In the below example, this query will return all customers that have an age above 20.

SELECT * FROM customers
WHERE age > 20;

!> (Not greater than)

The !> symbol is used to filter results where a column’s value is not greater than the queried value. In the below example, this query will return all customers that do not have an age above 20.

SELECT * FROM customers
WHERE age !> 20;

< (Less than)

The < symbol is used to filter results where a column’s value is less than the queried value. In the below example, this query will return all customers that have an age below 20.

SELECT * FROM customers
WHERE age < 20;

!< (Not less than)

The !< symbol is used to filter results where a column’s value is not less than the queried value. In the below example, this query will return all customers that do not have an age below 20.
SELECT * FROM customers
WHERE age !< 20;

>= (Greater than or equal to)

The >= symbol is used to filter results where a column’s value is greater than or equal to the queried value. In the below example, this query will return all customers that have an age equal to or above 20.

SELECT * FROM customers
WHERE age >= 20;

<= (Less than or equal to)

The <= symbol is used to filter results where a column’s value is less than or equal to the queried value. In the below example, this query will return all customers that have an age equal to or below 20.
SELECT * FROM customers
WHERE age <= 20;

<> (Not equal to)

The <> symbol performs the exact same operation as the != symbol and is used to filter results that do not equal a certain value. You can use either, but <> is the SQL-92 standard.
SELECT * FROM customers
WHERE age <> 20;

Compound operators

Compound operators perform an operation on a variable and then set the result of the variable to the result of the operation. Think of it as doing a = a (+,-,*,etc) b.

+= (Add equals)

The += operator will add a value to the original value and store the result in the original value. The below example sets a value of 10, then adds 5 to the value and prints the result (15).

DECLARE @addValue int = 10
SET @addValue += 5
PRINT CAST(@addvalue AS VARCHAR);

This can also be used on strings. The below example will concatenate two strings together and print “dataquest”.

DECLARE @addString VARCHAR(50) = 'data'
SET @addString += 'quest'
PRINT @addString;

-= (Subtract equals)

The -= operator will subtract a value from the original value and store the result in the original value. The below example sets a value of 10, then subtracts 5 from the value and prints the result (5).

DECLARE @addValue int = 10
SET @addValue -= 5
PRINT CAST(@addvalue AS VARCHAR);

*= (Multiply equals)

The *= operator will multiply a value by the original value and store the result in the original value. The below example sets a value of 10, then multiplies it by 5 and prints the result (50).
DECLARE @addValue int = 10
SET @addValue *= 5
PRINT CAST(@addvalue AS VARCHAR);

/= (Divide equals)

The /= operator will divide a value by the original value and store the result in the original value. The below example sets a value of 10, then divides it by 5 and prints the result (2).
DECLARE @addValue int = 10
SET @addValue /= 5
PRINT CAST(@addvalue AS VARCHAR);

%= (Modulo equals)

The %= operator will divide a value by the original value and store the remainder in the original value. The below example sets a value of 10, then divides it by 5 and prints the remainder (0).
DECLARE @addValue int = 10
SET @addValue %= 5
PRINT CAST(@addvalue AS VARCHAR);

Logical operators

Logical operators are those that return true or false, such as the AND operator, which returns true when both expressions are met.

ALL

The ALL operator returns TRUE if all of the subquery values meet the specified condition. In the below example, we are filtering all users who have an age that is greater than the highest age of users in London.
SELECT first_name, last_name, age, location
FROM users
WHERE age > ALL (SELECT age FROM users WHERE location = 'London');

ANY/SOME

The ANY operator returns TRUE if any of the subquery values meet the specified condition. In the below example, we are filtering all products which have any record in the orders table. The SOME operator achieves the same result.
SELECT product_name
FROM products
WHERE product_id = ANY (SELECT product_id FROM orders);

AND

The AND operator returns TRUE if all of the conditions separated by AND are true. In the below example, we are filtering users that have an age of 20 and a location of London.

SELECT *
FROM users
WHERE age = 20 AND location = 'London';

BETWEEN

The BETWEEN operator filters your query to only return results that fit a specified range.

SELECT *
FROM users
WHERE age BETWEEN 20 AND 30;

EXISTS

The EXISTS operator is used to filter data by looking for the presence of any record in a subquery.

SELECT name
FROM customers
WHERE EXISTS
(SELECT order_id FROM orders WHERE customer_id = 1);

IN

The IN operator lets you specify multiple values in a WHERE clause.

SELECT *
FROM users
WHERE first_name IN ('Bob', 'Fred', 'Harry');

LIKE

The LIKE operator searches for a specified pattern in a column. (For more information on how/why the % is used here, see the section on the wildcard character operator).

SELECT *
FROM users
WHERE first_name LIKE '%Bob%';

NOT

The NOT operator returns results if the condition or conditions are not true.
SELECT *
FROM users
WHERE first_name NOT IN ('Bob', 'Fred', 'Harry');

OR 

The OR operator returns TRUE if any of the conditions separated by OR are true. In the below example, we are filtering users that have an age of 20 or a location of London.

SELECT *
FROM users
WHERE age = 20 OR location = 'London';

IS NULL

The IS NULL operator is used to filter results with a value of NULL.
SELECT *
FROM users
WHERE age IS NULL;

String operators

String operators are primarily used for string concatenation (combining two or more strings together) and string pattern matching.

+ (String concatenation)

The + operator can be used to combine two or more strings together. The below example would output ‘dataquest’.
SELECT 'data' + 'quest';

+= (String concatenation assignment)

The += is used to combine two or more strings and store the result in the original variable. The below example sets a variable of ‘data’, then adds ‘quest’ to it, giving the original variable a value of ‘dataquest’.

DECLARE @strVar VARCHAR(50)
SET @strVar = 'data'
SET @strVar += 'quest'
PRINT @strVar;

% (Wildcard)

The % symbol - sometimes referred to as the wildcard character - is used to match any string of zero or more characters. The wildcard can be used as either a prefix or a suffix. In the below example, the query would return any user with a first name that starts with ‘dan’.

SELECT *
FROM users
WHERE first_name LIKE 'dan%';

[] (Character(s) matches)

The [] is used to match any character within the specific range or set that is specified between the square brackets. In the below example, we are searching for any users that have a first name that begins with a d and a second character that is somewhere in the range c to r.

SELECT *
FROM users
WHERE first_name LIKE 'd[c-r]%';

[^] (Character(s) not to match)

The [^] is used to match any character that is not within the specific range or set that is specified between the square brackets. In the below example, we are searching for any users that have a first name that begins with a d and a second character that is not a.

SELECT *
FROM users
WHERE first_name LIKE 'd[^a]%';

_ (Wildcard match one character)

The _ symbol - sometimes referred to as the underscore character - is used to match any single character in a string comparison operation. In the below example, we are searching for any users that have a first name that begins with a d and has a third character that is n. The second character can be any letter.

SELECT *
FROM users
WHERE first_name LIKE 'd_n%';

More helpful SQL resources:

Try the best SQL learning resource of all: interactive SQL courses you can take right in your browser. Sign up for a FREE account and start learning!

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why watch video lectures when you can learn by doing?

Why Jorge Prefers Dataquest Over DataCamp for Learning Data Analysis https://www.dataquest.io/blog/dataquest-datacamp-learn-data-analysis/ Mon, 15 Mar 2021 17:38:33 +0000 https://www.dataquest.io/?p=17304 When Jorge Varade decided he wanted to learn data analysis, he tried both DataCamp and Dataquest, and found he strongly preferred the latter. Here's why.


A lot of data science learners are interested in the question of DataCamp versus Dataquest. Jorge Varade, who's spent time on both platforms, has a strong opinion about which one is best. 

When Jorge finished his degree in business administration, he knew he wanted to get involved with analytics. An internship at Ralph Lauren in sales and marketing analytics taught him the basics of Excel and some more advanced techniques like pivot tables, but he was hungry for more. “I was really interested in data analysis,” he says, “but I wanted to do something more related to, well, Python.”

That’s how he ended up at Belgrave Valley, a London-based bootcamp that uses Dataquest to teach data science programming skills to students.

DataCamp vs. Dataquest

He came into bootcamp without any real programming experience. He had tried messing around a little bit on his own, he says, looking at videos and trying out a few different platforms, including DataCamp. But he hadn’t really made any headway. “I was a beginner,” he says.

When he got to Belgrave Valley and started using Dataquest, that changed quickly. “What I’ve learned in Python and SQL comes from Belgrave Valley and Dataquest,” he says.

For Jorge, what made the difference was how Dataquest forced him to think and apply what he was learning at each step. That proved to be a strong contrast with the other platforms he’d tried:

I wanted to try DataCamp to see how it was. I’ve done courses from the two platforms, DataCamp and Dataquest, and in my opinion I think Dataquest is much better because it makes you make an effort. On DataCamp, the code you have to write is almost already written for you, so you don’t learn too much. Dataquest makes you use your head and apply the things that you’re learning. I prefer the way [Dataquest] teaches, I think you do it really well.

“Also the projects were really good,” he says. “I really like the type of projects you have.”

DQ vs. DC by the numbers

Jorge's point — that Dataquest provides a more meaningful learning experience by asking you to write the code — is something that we hear a lot from Dataquest learners. But we certainly don't expect you to take our word for it!

Instead, let's look at how the two sites stack up on third-party review sites to see whether most data science learners agree with Jorge:

Here's how the scores compare on three third-party review sites (Dataquest vs. DataCamp):

  • 4.85 out of 5 vs. 4.62 out of 5
  • 4.7 out of 5 vs. 4.3 out of 5
  • 4.94 out of 5 vs. 4.05 out of 5

These numbers are current as of March 16, 2021, and the message is pretty clear — third-party reviewers consistently rate Dataquest above DataCamp. Jorge is not alone.

Getting a Job in Data

Since finishing his Dataquest courses at the bootcamp program, Jorge has been working as a data analyst — first on a two-week contract for a bank, and then in a full-time role analyzing auto marketing data at Mediacom. After a few months in that role, he moved into another data analyst role — this time at HelloFresh.

But he’s not resting on his laurels. He finished the Data Analyst in Python path, he says, and now he’s switching over to the Data Scientist path so he can keep adding to his skill set. “I will continue [subscribing to] Dataquest,” he says, “because I really like how you explain the courses and the content.”

For other Dataquest students aiming to get jobs as data analysts, Jorge recommends spending as much time as possible studying, and really immersing yourself in your learning. “Really focus on how Python works, how SQL works, and how the data analysis world works,” he says. “If you really like data analysis, then spend time on it.”

Want to follow in Jorge’s footsteps? Click below to get started with a free account — in less than five minutes from now, you'll be writing your first code, and on your way to becoming a data analyst!

How Long Does It Take to Learn SQL? https://www.dataquest.io/blog/how-long-learn-sql/ Wed, 10 Mar 2021 15:35:47 +0000 https://www.dataquest.io/?p=27961 How long does it take to learn SQL? It depends on your goals and your background, so we've broken down a variety of scenarios for you.


How long does it really take to learn SQL?

SQL is a critical skill for pretty much anyone who works with data or databases. And while learning a new programming language is never a walk in the park, we’ve got some good news — it typically doesn’t take too long to get the basics of SQL down. But how long does it take to really learn SQL?

The answer really depends on you, your goals, and your background. So rather than try to give you a one-size-fits-all answer, we've gamed out a number of different scenarios. Let’s dive into the details.

What is SQL?

To understand what it takes to learn SQL, it’s important to understand what SQL actually is. In the previous section, I called it a programming language, but it would be more accurate to say that SQL is a query language.

A query language is a type of programming language that’s built for one thing: interacting with databases. When you’ve learned SQL, you’ll use it to do things such as:

  • Get the specific data you want from databases
  • Join elements from different data tables in your database together (see the example after this list)
  • Perform calculations, analysis, and filter data to answer questions
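
For example, joining two tables might look something like the sketch below (the orders and customers tables and their columns are purely hypothetical):

SELECT customers.name, orders.order_id, orders.total
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id;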

If you’ve got data that’s stored in a SQL-based database—and you probably do, since most companies use some form of SQL-based database management system—SQL is the tool that’ll help you quickly select and work with the specific data you need.

What this means is that SQL isn’t a full-on programming language in the way that Python (for example) is. SQL isn’t going to be the language you use to do something like code a video game, or build a mobile app. It’s really only useful for tasks that involve working with data in databases.

But that’s a good thing! Because SQL is specifically focused on working with data, there’s less for you to learn, and most of the SQL educational materials you encounter will be focused on using SQL for common data tasks.

Why Learn SQL?

We’ve written a whole article on why you should learn SQL, with up-to-date jobs information from 2021. The full article is definitely worth reading, but in case you don’t want to click through, here’s are a few quick reasons:

  • Almost every company uses some kind of SQL-based database to store data. MySQL, Oracle, Microsoft SQL Server, etc. — all of these are SQL-based database management systems, and that means that SQL skills will be needed to work efficiently with databases at almost any company.
  • SQL enables you to work more efficiently and transparently than Excel. SQL enables you to work with huge datasets quickly, and because it’s a written language, everything that you do is transparent and easy to understand, adapt, and repeat. No hidden cell formulas to go looking for, and no more complicated VLOOKUP nightmares!
  • SQL skills are in demand. This is particularly true in the realm of data science, but even jobs in unrelated fields like marketing are increasingly asking for SQL skills, as analyzing and acting on data becomes an increasingly important part of many jobs.

How Much Time Will it Really Take to Learn SQL?

The answer to this question depends on both your background and your goals for learning SQL.

So, rather than giving you a one-size-fits-all answer, let’s break this down into a few different scenarios. Each scenario assumes you're starting from scratch with SQL, and are looking to learn up through and including the specified skill level.

Feel free to jump to whichever of these subheadings describes you best:

(Note: all of the time estimates here assume you already have a full-time job and, like most adults, are limited to just a few hours a week of study time. If you can devote more time each week to studying, you’ll progress even faster).

No programming experience, and want to learn through basic SQL

Maybe your job isn’t technical, but you’re interested in learning a bit more from your company’s data, or running a few specific queries regularly to understand more about the impact of the work you’re doing. If you’ve never written code before, but you’d like to learn enough SQL to run a quick query to answer questions every now and then, this section is for you.

The fundamentals of SQL really won’t take very long to learn. Our first SQL course, for example, takes most people about an hour to complete.

Because you don’t have prior experience with programming languages, you’ll probably want to set aside a little extra time to wrap your head around everything. And you’ll definitely want to set aside some extra time for practice.

Even so, you should expect to be able to learn the fundamentals of SQL — how to query specific data tables from your database, how to select specific columns from those tables, how to do basic math with SQL, and how to limit the output your queries return — in the space of a few hours, or a weekend at most.
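
As a rough sketch of that level, a query using those fundamentals might look something like this (the products table and its columns are made up; the row-limiting keyword also varies by database, with LIMIT working in MySQL, PostgreSQL, and SQLite, while SQL Server uses TOP):

SELECT name, price, price * 0.9 AS sale_price  -- pick specific columns and do some basic math
FROM products                                  -- the table we're querying
LIMIT 10;                                      -- cap the output at 10 rows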

No programming experience, and want to learn through intermediate SQL

If you don’t have prior coding experience, but you’re expecting to use SQL pretty regularly, and take on some more complicated tasks like joining different tables together to create new tables for analysis, this section describes you.

How much time this takes will vary a bit from person to person, but you should expect it to take anywhere from a single weekend to a few weeks (we’re assuming that you have a full-time job already and are only able to study occasionally, during your free time).

If you’re studying with Dataquest, this section would map onto our first two or three SQL courses, depending on how much you need to learn for your specific use case. You can probably complete all three courses (not counting the guided projects) in around five or six hours, but you should definitely set aside extra time for practice and to work through the projects to cement your learning.

No programming experience, and want to learn through advanced SQL

If you don’t have coding experience but you’re looking to land a role that’s heavily reliant on SQL skills, like a data analyst job or perhaps even a data engineering job, this section is for you.

You’ll want to learn everything from the basics through advanced queries, but you’ll probably also want to learn skills like creating databases using PostgreSQL.

Depending on how deep you need to go, this is likely to take anywhere from a month to several months, because you’ll be learning advanced queries, but you’ll also need to cover topics like building and optimizing databases, database security, etc.
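
To give a sense of the database-building side, here's a minimal PostgreSQL-style sketch (the table, columns, and index are made up for illustration):

CREATE TABLE users (
    user_id    SERIAL PRIMARY KEY,
    email      TEXT NOT NULL UNIQUE,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_users_created_at ON users (created_at);  -- speeds up date-based queries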

Note that if you’re looking for a job like data engineer, SQL skills are not the only thing you’ll need to learn, so the amount of time it takes you to get to job-ready will be quite a bit longer than the time it takes to learn SQL. Some data analyst jobs will also have additional technical requirements, like some knowledge of Python programming, although there are analyst jobs that only require SQL.

Prior programming experience, and want to learn through basic SQL

If you’ve already got some experience with programming languages and you just want to learn enough to query your company’s database for the right tables — maybe you’re planning to pull that data into Python or R for analysis — this section is for you.

The basics of SQL likely will take you just an hour or two to learn. You’ll probably find it to be refreshingly straightforward compared to other programming languages, as SQL is quite readable.

Prior programming experience, and want to learn through intermediate SQL

If you’ve already got some coding experience, but you anticipate using SQL fairly regularly to do things like join data tables on different columns and filter for the specific data you need, this section is for you.

Precisely how long it takes will depend on how far you want to go with SQL, but you will probably be able to comfortably work through the material in our first two or three SQL courses in a week. Completing the guided projects may extend that time a little further, but you’ll probably be able to start querying your company’s database and using your new SQL skills in meaningful ways within just a few hours of beginning your study.

Prior programming experience, and want to learn through advanced SQL

If you’ve already got some coding experience but you’re looking to move into a full-time role that’s going to require a lot of SQL work, this section is for you.

You’ll want to learn all of the querying skills from the previous section, but you may also need to learn more about creating databases, optimizing them, and ensuring that they are secure. This means you’ll need to spend additional time learning about things like PostgreSQL, and think about the SQL skills you need from a data engineer’s perspective.

This will probably take you a month or two, although it’s worth noting that these kinds of roles will generally also require other technical skills that’ll take additional time to learn if you don’t already know them.

Ready to get started?

SQL can unlock a whole new world of data that makes all of your work more efficient and more impactful. Sign up for a free Dataquest account and you can try out our SQL Fundamentals course and see just how quickly you can make real progress learning SQL!

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why watch video lectures when you can learn by doing?

SQL vs T-SQL: Understanding the Differences https://www.dataquest.io/blog/sql-vs-t-sql/ Thu, 04 Mar 2021 17:18:12 +0000 https://www.dataquest.io/?p=27854 SQL or T-SQL — which one do you need to learn? SQL and T-SQL both see heavy use in the database and data science industries. But what exactly are they? These two query languages are very similar, both in name and in what they can do, so the distinction between them can be difficult to […]


SQL or T-SQL — which one do you need to learn?

SQL and T-SQL both see heavy use in the database and data science industries. But what exactly are they? These two query languages are very similar, both in name and in what they can do, so the distinction between them can be difficult to understand.

In this post, we're going to:

  • Define what Standard SQL and T-SQL are
  • Investigate the differences between them
  • Provide examples of each
  • Summarize which you should be learning and why

If blog articles aren’t your preferred method of learning, consider Dataquest’s SQL courses. Many of our students prefer the structure that comes with it, and the fact that they can apply what they learn with real code.

What is Standard SQL?

Standard SQL, usually referred to simply as "SQL," is a type of programming language called a query language. Query languages are used for communicating with a database.

SQL is used for adding, retrieving, or updating data stored in a database, and it works across many different types of databases. That means that if you learn the basics of SQL, you will be in a good position for a career in data.

Databases and the data stored within them are a core part of how many companies operate. An easy example is a retailer that might store order or customer information in a database. SQL is a programming language that allows the company to work with that data.
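
For instance, an analyst at that retailer might count orders per customer with a query along these lines (the table and column names are hypothetical):

SELECT customer_id, COUNT(*) AS total_orders
FROM orders
GROUP BY customer_id
ORDER BY total_orders DESC;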

What is T-SQL?

T-SQL, which stands for Transact-SQL and is sometimes referred to as TSQL, is an extension of the SQL language used primarily within Microsoft SQL Server. This means that it provides all the functionality of SQL but with some added extras.

You can think of it a bit like a SQL dialect — it's very similar to regular SQL, but it has a few extras and differences that make it unique.

Despite the clear and rigid specifications of standard SQL, it does allow for database companies to add their own extensions to set them apart from other products. T-SQL is an example of this for Microsoft SQL Server databases — T-SQL is central to the software and runs most operations within it. 

Most major database vendors offer their own SQL language extensions for their own products, and T-SQL is one of the most widely used examples of these (because Microsoft SQL Server is popular).

Put simply: when you are writing queries within Microsoft SQL Server, you are effectively using T-SQL. All applications that communicate with SQL Server, regardless of the application's user interface, do so by sending T-SQL statements to the server.

However, in addition to SQL Server, other database management systems (DBMS) also support T-SQL. Another Microsoft product, Microsoft Azure SQL Database, supports most features of T-SQL.

T-SQL has been designed to make working with those databases that support it easier and more efficient.

What is the difference between SQL and T-SQL?

Now we have covered the basics of both, let's take a look at the main differences:

Difference #1

The obvious difference is in what they are designed for: SQL is a query language used for manipulating data stored in a database. T-SQL is also a query language, but it's an extension of SQL that is primarily used in Microsoft SQL Server databases and software.

Difference #2

SQL is open-source. T-SQL is developed and owned by Microsoft.

Difference #3

SQL statements are executed one at a time, also known as "non-procedural." T-SQL executes statements in a "procedural" way, meaning that the code will be processed as a block, logically and in a structured order.

There are advantages and disadvantages to each approach, but from a learner perspective, this difference isn't too important. You'll be able to get and work with the data you want in either language, it's just that the way you go about doing that will vary a bit depending on which language you're using and the specifics of your query.
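
To illustrate the procedural style, here's a minimal T-SQL sketch (the users table is hypothetical): variables, control flow, and multiple statements are processed together as one batch.

DECLARE @avg_age INT;

SELECT @avg_age = AVG(age) FROM users;

IF @avg_age > 30
    PRINT 'Average age is above 30'
ELSE
    PRINT 'Average age is 30 or below';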

Difference #4

On top of these more general differences, SQL and T-SQL also have some slightly different command key words. T-SQL also features functions that are not part of regular SQL.

An example of this is how we select the top X number of rows. In standard SQL, we would use the LIMIT keyword. In T-SQL, we use the TOP keyword.

Both of these commands do the same thing, as we can see in the examples below. Both queries will return the top ten rows in the users table ordered by the age column.

SQL Example

SELECT *
FROM users
ORDER BY age
LIMIT 10;

T-SQL Example

SELECT TOP 10 *
FROM users
ORDER BY age;

Difference #5

Finally, and as referenced before, T-SQL offers functionality that does not appear in regular SQL. One example is the ISNULL function. This will replace NULL values coming from a specific column. The below would return an age of “0” for any rows that have a value of NULL in the age column.

SELECT ISNULL(age, 0)
FROM users;

(There are ways of doing this in standard SQL too, of course, but the commands are slightly different.)
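
For comparison, one common way to get a similar result in standard SQL is the COALESCE function, which returns the first non-NULL value in its argument list:

SELECT COALESCE(age, 0)
FROM users;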

These are just a couple of code differences to give you an idea of how the two compare, but of course, there are many more. You can learn more about SQL commands with our extensive guide. And of course, Microsoft has documentation for working with T-SQL.

Which is better to learn?

If you want to work with databases in any way, or if you're seeking a data job, learning SQL is a necessity.

As T-SQL is an extension of SQL, you will need to learn the basics of SQL before starting. If you learn T-SQL first, you will end up picking up knowledge of standard SQL anyway.

With most things, which you choose to learn should depend on what you are trying to achieve. If you are going to be working with Microsoft SQL server, then it is worth learning more about T-SQL. If you are a beginner looking to get started in using databases, then begin with learning about SQL.

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why watch video lectures when you can learn by doing?

How to Learn Python (Step-by-Step) in 2021 https://www.dataquest.io/blog/learn-python-the-right-way/ Wed, 24 Feb 2021 09:16:00 +0000 https://dq.t79ae38x-liquidwebsites.com/?p=7016 Learn Python the right way, avoid the "cliff of boring," and give yourself the best chance to actually learn to code by following these steps.


What's the best way to learn Python? It doesn't have to feel like scaling a cliff!

Python is an important programming language to know — it's widely used in fields like data science, web development, software engineering, game development, and automation. But what's the best way to learn Python? That can be difficult and painful to figure out. I know that from experience.

Covid-19 Update: Has the Best Way to Learn Python Changed?

Nope! The Covid-19 pandemic has certainly disrupted in-person Python instructional opportunities like bootcamps, university programs, etc. But the best way to learn Python hasn't changed.

As you'll discover in this article, the right way to learn Python involves working on personal projects so that you truly care about what you're doing and are motivated to continue. This is rarely possible with in-person instruction — you and the rest of your Python class will get assigned the same generic practice problems because otherwise it's too difficult for the teacher to grade!

Learning Python on your own presents its own challenges, of course, but in our experience the single most important factor in success or failure is your personal level of motivation. And since most people aren't personally passionate about Python syntax, the way to maintain your motivation and passion for learning over the long haul is to work on projects that mean something to you.

And of course, you can still work from and learn from others remotely. The Dataquest community is an active, inclusive space for Python learners to share, work together, and learn from each other.

There are also many other ways to learn with others or from others without being in the same physical space! Finding a mentor online and having Google Meet or Zoom sessions can be very helpful when you're in the later stages of your learning and starting to think about careers.

 

One of the things that I found most frustrating when I was learning Python was how generic all the learning resources were. I wanted to learn how to make websites using Python, but it seemed like every learning resource wanted me to spend two long, boring, months on Python syntax before I could even think about doing what interested me.

This mismatch made learning Python quite intimidating for me. I put it off for months. I got a couple of lessons into the Codecademy tutorials, then stopped. I looked at Python code, but it was foreign and confusing:

from django.http import HttpResponse
def index(request):
    return HttpResponse("Hello, world. You're at the polls index.")

The above code is from the tutorial for Django, a popular Python website development framework. Experienced programmers will often throw snippets like the above at you. “It’s easy!”, they’ll promise.

But even a few seemingly simple lines of code can be incredibly confusing. For instance, why are some lines indented? What’s django.http? Why are some things in parentheses?

Understanding how everything fits together when you don’t know much Python can be very hard.

The problem is that you need to understand the building blocks of the Python language to build anything interesting. The above code snippet creates a view, which is one of the key building blocks of a website using the popular MVC architecture. If you don’t know how to write the code to create a view, it isn’t really possible to make a dynamic website.

Most tutorials assume that you need to learn all of Python syntax before you can start doing anything interesting. This is what leads to months spent just on syntax, when what you really want to be doing is analyzing data, or building a website, or creating an autonomous drone.

All that time spent on syntax rather than what you want to be doing causes your motivation to ebb away, and often ends with you calling the whole thing off.

I like to think of this as the “cliff of boring”. You need to be able to climb the “cliff of boring” to make it to the “land of interesting stuff you work on” (better name pending).

But you don't have to spend months on that cliff.


Learning Python syntax doesn't have to feel like this.

After facing the “cliff of boring” a few times and walking away, I found a process that worked better for me. In fact, I think this is the best way to learn Python.

What worked was blending learning the basics with building interesting things. I spent as little time as possible learning the basics, then immediately dove into creating things that interested me.

In this blog post, I’ll walk you through step by step how to replicate this process, regardless of why you want to learn Python.

Step 1: Figure Out What Motivates You to Learn Python

Before you start diving into learning Python online, it’s worth asking yourself why you want to learn it. This is because it’s going to be a long and sometimes painful journey. Without enough motivation, you probably won’t make it through. For example, I slept through high school and college programming classes when I had to memorize syntax and I wasn’t motivated. On the other hand, when I needed to use Python to build a website to automatically score essays, I stayed up nights to finish it.

Figuring out what motivates you will help you figure out an end goal, and a path that gets you there without boredom. You don’t have to figure out an exact project, just a general area you’re interested in as you prepare to learn Python.

Pick an area you’re interested in, such as:

  • Data science / Machine learning
  • Mobile apps
  • Websites
  • Games
  • Data processing and analysis
  • Hardware / Sensors / Robots
  • Scripts to automate your work

Yes, you can make robots using Python! From the Raspberry Pi Cookbook.

Figure out one or two areas that interest you, and you’re willing to stick with. You’ll be gearing your learning towards them, and eventually will be building projects in them.

Step 2: Learn the Basic Syntax

Unfortunately, this step can’t be skipped. You have to learn the very basics of Python syntax before you dive deeper into your chosen area. You want to spend the minimum amount of time on this, as it isn’t very motivating. 

Here are some good resources to help you learn the basics:

I can’t emphasize enough that you should only spend the minimum amount of time possible on basic syntax. The quicker you can get to working on projects, the faster you will learn. You can always refer back to the syntax when you get stuck later. You should ideally only spend a couple of weeks on this phase, and definitely no more than a month.

Also, a quick note: learn Python 3, not Python 2. Unfortunately a lot of "learn Python" resources online still teach Python 2, but you should definitely learn Python 3. Python 2 is no longer supported, so bugs and security holes will not be fixed!

Step 3: Make Structured Projects

Once you’ve learned the basic syntax, it’s possible to start making projects on your own. Projects are a great way to learn, because they let you apply your knowledge. Unless you apply your knowledge, it will be hard to retain it. Projects will push your capabilities, help you learn new things, and help you build a portfolio to show to potential employers.

However, very freeform projects at this point will be painful — you’ll get stuck a lot, and need to refer to documentation. Because of this, it’s usually better to make more structured projects until you feel comfortable enough to make projects completely on your own. Many learning resources offer structured projects, and these projects let you build interesting things in the areas you care about while still preventing you from getting stuck.

Let’s look at some good resources for structured projects in each area:

Data science / Machine learning

  • Dataquest — Teaches you Python and data science interactively. You analyze a series of interesting datasets ranging from CIA documents to NBA player stats. You eventually build complex algorithms, including neural networks and decision trees.
  • Python for Data Analysis — written by the author of a major Python data analysis library, it’s a good introduction to analyzing data in Python.
  • Scikit-learn documentation — Scikit-learn is the main Python machine learning library. It has some great documentation and tutorials.
  • CS109 — this is a Harvard class that teaches Python for data science. They have some of their projects and other materials online.

Mobile Apps

  • Kivy guide — Kivy is a tool that lets you make mobile apps with Python. They have a guide on how to get started.

Websites

  • Flask tutorial — Flask is a popular web framework for Python. This is the introductory tutorial.
  • Bottle tutorial — Bottle is another web framework for Python. This is how to get started with it.
  • How To Tango With Django — A guide to using Django, a complex Python web framework.

Games

An example of a game you can make with Pygame. This is Barbie Seahorse Adventures 1.0, by Phil Hassey.

Hardware / Sensors / Robots

Scripts to Automate Your Work

Once you’ve done a few structured projects in your own area, you should be able to move into working on your own projects. But, before you do, it’s important to spend some time learning how to solve problems.

Step 4: Work on Python Projects on Your Own

Once you’ve completed some structured projects, it’s time to work on projects on your own to continue to learn Python better. You’ll still be consulting resources and learning concepts, but you’ll be working on what you want to work on. Before you dive into working on your own projects, you should feel comfortable debugging errors and problems with your programs. Here are some resources you should be familiar with:

  • StackOverflow — a community question and answer site where people discuss programming issues. You can find Python-specific questions here.
  • Google — the most commonly used tool of every experienced programmer. Very useful when trying to resolve errors. Here’s an example.
  • Python documentation — a good place to find reference material on Python.

Once you have a solid handle on debugging issues, you can start working on your own projects. You should work on things that interest you. For example, I worked on tools to trade stocks automatically very soon after I learned programming.

Here are some tips for finding interesting projects:

  • Extend the projects you were working on previously, and add more functionality.
  • Check out our list of Python projects for beginners.
  • Go to Python meetups in your area, and find people who are working on interesting projects.
  • Find open source packages to contribute to.
  • See if any local nonprofits are looking for volunteer developers.
  • Find projects other people have made, and see if you can extend or adapt them. Github is a good place to find these.
  • Browse through other people’s blog posts to find interesting project ideas.
  • Think of tools that would make your every day life easier, and build them.

Remember to start very small. It’s often useful to start with things that are very simple so you can gain confidence. It’s better to start a small project that you finish than a huge project that never gets done. At Dataquest, we have guided projects that give you small, data-science-related tasks that you can build on.

It’s also useful to find other people to work with for more motivation.

If you really can’t think of any good project ideas, here are some in each area we’ve discussed:

Data Science / Machine Learning Project Ideas

  • A map that visualizes election polling by state.
  • An algorithm that predicts the weather where you live.
  • A tool that predicts the stock market.
  • An algorithm that automatically summarizes news articles.

You could make a more interactive version of this map. From RealClearPolitics.

Mobile App Project Ideas

  • An app to track how far you walk every day.
  • An app that sends you weather notifications.
  • A realtime location-based chat.

Website Project Ideas

  • A site that helps you plan your weekly meals.
  • A site that allows users to review video games.
  • A notetaking platform.

Python Game Project Ideas

  • A location-based mobile game, where you capture territory.
  • A game where you write code to solve puzzles.

Hardware / Sensors / Robots Project Ideas

  • Sensors that monitor your home temperature and let you monitor your house remotely.
  • A smarter alarm clock.
  • A self-driving robot that detects obstacles.

Work Automation Project Ideas

  • A script to automate data entry.
  • A tool to scrape data from the web.

My first project on my own was adapting my automated essay scoring algorithm from R to Python. It didn’t end up looking pretty, but it gave me a sense of accomplishment, and started me on the road to building my skills.

The key is to pick something and do it. If you get too hung up on picking the perfect project, there’s a risk that you’ll never make one.

Step 5: Keep working on harder projects

Keep increasing the difficulty and scope of your projects. If you’re completely comfortable with what you’re building, it means it’s time to try something harder.

You can choose a new project that pushes you a little beyond your current abilities, or revisit an old one and make it more ambitious.

Here are some ideas for when that time comes:

  • Try teaching a novice how to build a project you made.
  • Can you scale up your tool? Can it work with more data, or can it handle more traffic?
  • Can you make your program run faster?
  • Can you make your tool useful for more people?
  • How would you commercialize what you’ve made?

Going forward

At the end of the day, Python is evolving all the time. There are only a few people who can legitimately claim to completely understand the language, and they created it.

You’ll need to be constantly learning and working on projects. If you do this right, you’ll find yourself looking back on your code from 6 months ago and thinking about how terrible it is. If you get to this point, you’re on the right track. Working only on things that interest you means that you’ll never get burned out or bored.

Python is a really fun and rewarding language to learn, and I think anyone can get to a high level of proficiency in it if they find the right motivation.

I hope this guide has been useful on your journey. If you have any other resources to suggest, please let us know!

Find out more about how you can learn Python and add this skill to your portfolio by visiting Dataquest.

Common Python Questions:


Is it hard to learn Python?

Learning Python can certainly be challenging, and you're likely to have frustrating moments. Staying motivated to keep learning is one of the biggest challenges.

However, if you take the step-by-step approach I've outlined here, you should find that it's easy to power through frustrating moments, because you'll be working on projects that genuinely interest you.

Can you learn Python for free?

There are lots of free Python learning resources out there — just here at Dataquest, we have dozens of free Python tutorials and our interactive data science learning platform, which teaches Python, is free to sign up for and includes many free missions. The internet is full of free Python learning resources!

The downside to learning for free is that to learn what you want, you'll probably need to patch together a bunch of different free resources. You'll spend extra time researching what you need to learn next, and then finding free resources that teach it. Platforms that cost money may offer better teaching methods (like the interactive, in-browser coding Dataquest offers), and they also save you the time of having to find and build your own curriculum.

Can you learn Python from scratch (with no coding experience)?

Yes. At Dataquest, we've had many learners start with no coding experience and go on to get jobs as data analysts, data scientists, and data engineers. Python is a great language for programming beginners to learn, and you don't need any prior experience with code to pick it up. 

How long does it take to learn Python?

Learning a programming language is a bit like learning a spoken language — you're never really done, because programming languages evolve and there's always more to learn! However, you can get to a point of being able to write simple-but-functional Python code pretty quickly.

How long it takes to get to job-ready depends on your goals, the job you're looking for, and how much time you can dedicate to study. But for some context, Dataquest learners we surveyed in 2020 reported reaching their learning goals in less than a year — many in less than six months — with less than ten hours of study per week.

How can I learn Python faster?

Unfortunately, there aren't really any secret shortcuts! The best thing you can do is find a platform that teaches Python (or build a curriculum for yourself) specifically for the skill you want to learn (for example, Python for game dev, or Python for data science).

This should ensure that you're not wasting any time learning things you won't actually need for your day-to-day Python work. But make no mistake, whatever you want to do with Python, it'll take some time to learn!

Do you need a Python certification to find work?

We've written about Python certificates in depth, but the short answer is: probably not. Different companies and industries have different standards, but in data science, certificates don't carry much weight. Employers care about the skills you have — being able to show them a GitHub full of great Python code is much more important than being able to show them a certificate.

Should you learn Python 2 or 3?

We've written about Python 2 or Python 3 as well, but the short answer is this: learn Python 3. A few years ago, this was still a topic of debate, and some extreme predictions even claimed that Python 3 would "kill Python." That hasn't happened, and today, Python 3 is everywhere.

Is Python a good language to learn in 2021?

Yes. Python is a popular and flexible language that's used professionally in a wide variety of contexts.

We teach Python for data science and machine learning, for example, but if you wanted to apply your Python skills in another area, Python is used in finance, web development, software engineering, game development, etc.

If you're working with data, Python is the most in-demand programming language you could learn. Here's data from open job postings on Indeed.com in February of 2021:

  • Python is the most important skill listed in data scientist job postings
  • Python is the most important skill listed in data engineer job postings
  • Python is the second most important skill listed in data analyst job postings

As you can see, Python is a critical skill, and it's listed above every other technical skill in data scientist and data engineering job postings. It ranks second, behind only SQL, in data analyst job postings. Many jobs in all three areas will require both Python and SQL skills, but SQL is a query language. In terms of programming skills, Python is most in-demand.

(Incidentally, we're sometimes asked why Dataquest doesn't teach Julia for data science. The charts above probably answer that question — our curriculum is very focused on real-world skills, and we choose what courses to make based on an analysis of data job postings so that we can be sure the skills you learn at Dataquest are helpful in the real world.)

Moreover, Python data skills can be really useful even if you have no aspiration to become a full-time data scientist or programmer. Having some data analysis skills with Python can be useful for a wide variety of jobs — if you work with spreadsheets, chances are there are things you could be doing faster and better with a little Python. 

The post How to Learn Python (Step-by-Step) in 2021 appeared first on Dataquest.

]]>
The Best Way to Learn SQL (According to Seasoned Devs) https://www.dataquest.io/blog/best-way-to-learn-sql/ Thu, 18 Feb 2021 00:49:47 +0000 https://www.dataquest.io/?p=27610 What's the best way to learn SQL? With all of the resources available, learning SQL the “right way” can be difficult. Finding the best way to learn SQL is tricky because everyone learns things differently. But, after training tens of thousands of students — seeing what works and what doesn’t — we’ve come up with […]

The post The Best Way to Learn SQL (According to Seasoned Devs) appeared first on Dataquest.

]]>
what's the best way to learn SQL?

What's the best way to learn SQL?

With all of the resources available, learning SQL the “right way” can be difficult. Finding the best way to learn SQL is tricky because everyone learns things differently. But, after training tens of thousands of students — seeing what works and what doesn’t — we’ve come up with a few easy steps that anyone can follow. Here’s the best way to learn SQL:

Step 1: Determine why you want to learn SQL

Before you dive into a SQL course, it’s important to be sure you have a good answer for the question “why should I learn SQL?”

That’s because while SQL isn’t too difficult to learn, no learning journey is ever totally smooth sailing. You will likely face moments of frustration and confusion. If you don’t have a good reason to learn SQL, it’s going to be very easy for you to quit in those moments.

There’s no one answer to that question, but here are some of the most common reasons people want to learn SQL:

  • You’re feeling bottlenecked by Excel and sick of VLOOKUP
  • You want to be able to access your company’s data easily, on-demand
  • You want to be able to work with bigger datasets quickly
  • You want to get a job as a data analyst, data scientist, or data engineer (and you know that SQL is the single most important skill for those jobs)
  • You want to create transparent, repeatable data processes to reduce repetitive tasks

Of course, those are just a few broad reasons. You need to find a reason that speaks to you. It may be something very specific, like a particular question you’d like to answer about your customers, or a particular dashboard you’d like to build.

(Can you build a dashboard with SQL? Sort of. We’ll get to that later!)

Step 2: Learn the basic syntax

This typically isn’t people’s favorite part of learning a programming language (or in this case, a query language). But it can’t be avoided. There’s just no way you can get to a functional level of SQL without being able to look at something like this and know what’s going on:

SELECT c.name capital_city, f.name country
FROM facts f
INNER JOIN (
        SELECT * FROM cities
                WHERE capital = 1
                ) c ON c.facts_id = f.id
LIMIT 10;

Thankfully, learning this may be easier than you think. While that probably looks complicated and confusing at first glance, SQL’s syntax is actually pretty straightforward. And the list of SQL commands — the all-caps words like SELECT in the code above — that you’ll actually use on a regular basis is short.

The key to being successful with this step is to power through it as quickly as you can. Set aside a few hours to work through Dataquest’s first SQL course all at once. Or pick another learning resource and set aside enough time to get through the basics.

What’s most important here is that you don’t drag this out. You want to get to the point of being able to actually do things with SQL as quickly as possible, because being able to dig into real problems and find the answers is a powerful motivator. That’s what’s really going to keep you motivated and learning, so we want to get you to that point as fast as possible.

Step 3: Start working on guided projects

As soon as you’ve learned the basics, it’s time to start diving into actual projects using SQL.

If you’re learning with us at Dataquest, this is built into the curriculum, with interactive guided projects that challenge you to use your new SQL skills to query and analyze real databases for answers.

If you’re not learning with Dataquest, we suggest the next best thing: guided projects and tutorials.

You need to find something that will give you a bit of structure and guidance, because at this stage, the process of trying to build a full SQL project from scratch would probably be frustrating. You want something that you can try to do on your own, but that also offers some guidance you can look to when you get lost or aren’t sure what to do next.

For example, here’s a tutorial on joins in SQL. This would be great practice, but try to work through it on your own, checking the code snippets only to make sure you’re right after you’ve written your own queries.

Remember, the goal here is to work on guided projects with increasing independence. If you’re simply copy-pasting code from a tutorial, you won’t be learning much, so be sure you’ve given it your best effort before you check the answer.

Step 4: Familiarize yourself with helpful SQL resources

Once you’ve worked through some guided projects, it’s time to step out on your own. The good news: you can work with exactly the data you want, to answer exactly the questions you want. How motivating!

The bad news: there’s no answer key you can check! So before you start your first project, it’s helpful to bookmark a few useful SQL resources. Remember, there’s no shame whatsoever in Googling for answers — even the most seasoned SQL developers and users do this frequently!

Useful SQL resources:

  • Learning SQL 2nd Edition (PDF) — This O’Reilly book on the basics of SQL is available for free in PDF format, and makes a good reference.
  • StackOverflow SQL questions — Chances are, any SQL question you’ll have has already been answered here. But if it hasn’t, create an account and ask it for yourself!
  • Github — If SQL is your first foray into the world of programming, you may not have an account here. If that’s the case, set one up and start learning how to use it! Github is great for sharing your own SQL projects with the world (and potential employers), and it’s also an amazing resource for looking at other people’s code.
  • /r/SQL — Reddit has a SQL community that’s large, active, and (mostly) happy to answer questions.
  • The Dataquest community — Our community is active, friendly, and ready to help you with all your SQL questions. Best of all, it’s open to everyone — you don’t have to be a Dataquest subscriber to get help there.

Step 5: Build your own SQL projects

Now that you know some good places to look for help when you get stuck, it’s time to start working on your own SQL projects.

This is where the answer you came up with in Step 1 really starts to matter. Knowing why you want to learn SQL will probably help you answer the question: what projects should I work on?

The short answer? Work on projects you care about. If you’re learning SQL because you’re sick of Excel slowing you down at work, then your first project should probably be figuring out how to do those work tasks more efficiently with SQL.

If you’re learning SQL because you want a particular job, you should work on SQL projects that are as close as possible to what you’ll actually be doing when you get the job. For example, if your passion is crunching data to help decrease carbon emissions and make things more energy-efficient, then you’ll probably want to work on projects that relate to that goal.

We should note: this step can be a little challenging if you’re not working at a company, or if you don’t want to use company data for your projects. Finding a SQL database that’s freely available to all that contains exactly the kind of data you want to work with can be difficult, depending on your goals.

But never fear! While it takes a little bit of extra effort, it is possible to convert any downloadable data you find in CSV format (or something similar) into a SQL database format such as a SQLite table. There are even sites that can make the conversion process pretty easy.
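For example, here's a minimal sketch of one common approach, using Python's pandas library and the built-in sqlite3 module (the names my_data.csv, my_project.db, and my_table are just placeholders for this illustration):

import sqlite3
import pandas as pd

# Read the CSV into a DataFrame, then write it out as a table in a SQLite database.
# "my_data.csv", "my_project.db", and "my_table" are placeholder names.
df = pd.read_csv("my_data.csv")

conn = sqlite3.connect("my_project.db")  # creates the database file if it doesn't exist
df.to_sql("my_table", conn, if_exists="replace", index=False)

# From here on, the data can be queried with ordinary SQL.
print(pd.read_sql_query("SELECT * FROM my_table LIMIT 5;", conn))
conn.close()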

Whatever data you want to work with, with a little digging, you should be able to find a way to work with it using SQL.

And don’t forget: share your SQL projects on your Github when you’re finished with them. Go back and update them when you learn something new!

Step 6: Make more advanced projects

The final step is essentially a continuation of Step 5, and you can repeat this for as long as you’d like. The key to continued learning here is that you have to ramp up the challenge.

Once you’ve learned how to build the SQL project that initially motivated you — maybe you’ve written a query that replaces your old Excel workflow — it can be tempting to keep doing projects along those same lines.

Doing the same thing over and over is good for retention, but it’ll stunt your growth. It’s best to try to ensure that with each new project, you’re learning or trying at least one new thing — something you don’t already know how to do.

This could mean you’re taking on an entirely new project, or it could mean revisiting an old project to give it new complexity.

It could also mean taking on challenges you may not have thought about previously, such as:

  • Can you integrate your SQL skills with a tool like Mode to produce a dashboard?
  • Can you teach someone else how to query your company database using SQL?

At this point, you’ll have the skills to do more or less anything you want with SQL — not because you know how to do everything, but because your project-building process has taught you how to find the answers to anything you don’t know.

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why watch video lectures when you can learn by doing?

Frequently-asked questions about SQL:

Is SQL difficult to learn?

That’s a very personal question — what’s very easy to one person may seem very difficult to the next, and vice versa. However, most people find SQL pretty easy to learn, especially when compared to full-on programming languages like Python or R.

That’s because unlike a “full” programming language, SQL is a query language. It’s built specifically for interacting with relational database management systems such as Microsoft SQL Server, Oracle, SQLite, MySQL, etc. For that reason, there’s not as much to learn, and some of the more complex concepts that exist by necessity in more holistic programming languages aren’t a factor in SQL.

That said, the fact that most people find SQL relatively easy to learn does not mean that you will, or that you should feel ashamed if you find it challenging! Particularly if this is your first foray into the world of programming, you should be ready for a challenge.

(But don’t worry. No matter what your background is, you can totally learn SQL. Our community is here to help you anytime you need it!)

SQL or Python: which is better to learn?

The answer to this question really depends on your goals. They’re very different things.

SQL is a query language. It’s really only useful for interacting with, filtering, and lightly analyzing data from databases. It offers a lot of power for working with data in those contexts, but it can’t do all the things a full programming language like Python can do.

Python is a programming language. That makes it a bit more complex to learn, but it also means it can do a lot more. You can analyze data in Python, but you can also use it to build machine learning models. Or make video games. Or program a robot. Or design art.

If you work with data often — if you’re opening spreadsheets every day and you know what VLOOKUP is — there’s a good chance you’d benefit from learning both languages.

At Dataquest, we teach both Python and SQL as part of our Data Analyst and Data Scientist career paths. Both skills are required for full-time data jobs (although R can be substituted for Python, learning SQL is non-negotiable).

Can you learn SQL on your own?

Yes. We’ve seen thousands of students do exactly that, working through our interactive SQL courses on their own time, at their own pace.

Even if you’re not using Dataquest, it is absolutely possible to learn SQL on your own. Having a supportive community you can turn to for help certainly can make things easier, though!

How fast can you learn SQL?

The short answer is: pretty fast.

The longer answer is you can learn the basics — enough to be functional — quite quickly. Even with part-time study (for example, working on SQL in the evenings after a full-time job), many learners who’ve never coded before can reach their goals and be able to complete independent SQL projects in just a few months.

If you have some programming experience and/or you’re willing to spend a bit more time each day studying, you can learn enough SQL to accomplish your goals even faster than that!

But with that said, learning SQL is like learning any language — there’s never really an end point. Even the pros who are using SQL at work every day still learn new things now and then. Learning any programming language should be considered a lifelong journey, not something that starts and ends over the span of a few months.

Master SQL by writing SQL!

  • Build SQL projects
  • No setup required
  • In-demand skills

Why watch video lectures when you can learn by doing?

The post The Best Way to Learn SQL (According to Seasoned Devs) appeared first on Dataquest.

]]>
SQL Commands: The Complete List (w/ Examples) https://www.dataquest.io/blog/sql-commands/ Wed, 17 Feb 2021 15:47:05 +0000 https://www.dataquest.io/?p=27592 What can you do with SQL? Here's a reference guide to the most commonly-used SQL commands, with code examples.

The post SQL Commands: The Complete List (w/ Examples) appeared first on Dataquest.

]]>
shouting sql commands at your database with a megaphone

To get a data job in 2021, you are going to need to learn SQL. As with any language and especially when you are a beginner, it can be useful to have a list of common SQL commands and operators in one place to refer to whenever you need it — we’d like to be that place for you!

Below is a comprehensive list of SQL commands, organized by the top-level command each falls under (e.g. SELECT TOP is listed within the SELECT category). 

If you’re on a journey to learn SQL and you’ve been frustrated by the lack of structure or the dull curriculum composed of Google searches, then you may like Dataquest’s interactive SQL courses. Whether you’re a beginner trying to get job-ready or a seasoned developer looking to stay sharp, there’s a SQL course for you. 

List of SQL Commands

SELECT

SELECT is probably the most commonly-used SQL statement. You'll use it pretty much every time you query data with SQL. It allows you to define what data you want your query to return.

For example, in the code below, we’re selecting a column called name from a table called customers.

SELECT name
FROM customers;

SELECT *

SELECT used with an asterisk (*) will return all of the columns in the table we're querying.

SELECT * FROM customers;

SELECT DISTINCT

SELECT DISTINCT only returns data that is distinct — in other words, if there are duplicate records, it will return only one copy of each.

The code below would return only rows with a unique name from the customers table.

SELECT DISTINCT name
FROM customers;

SELECT INTO

SELECT INTO copies the specified data from one table into a new table.

SELECT * INTO customers_backup
FROM customers;

SELECT TOP

SELECT TOP returns only the top x rows, or the top x percent of rows, from a table. (TOP is SQL Server syntax; most other databases use LIMIT for the same purpose.)

The code below would return the top 50 results from the customers table:

SELECT TOP 50 * FROM customers;

The code below would return the top 50 percent of the customers table:

SELECT TOP 50 PERCENT * FROM customers;

AS

AS renames a column or table with an alias that we can choose. For example, in the code below, we’re renaming the name column as first_name:

SELECT name AS first_name
FROM customers;

FROM

FROM specifies the table we're pulling our data from:

SELECT name
FROM customers;

WHERE

WHERE filters your query to only return results that match a set condition. We can use this together with conditional operators like =, >, <, >=, <=, etc.

SELECT name
FROM customers
WHERE name = 'Bob';

AND

AND combines two or more conditions in a single query. All of the conditions must be met for the result to be returned.

SELECT name
FROM customers
WHERE name = 'Bob' AND age = 55;

OR

OR combines two or more conditions in a single query. Only one of the conditions must be met for a result to be returned.

SELECT name
FROM customers
WHERE name = 'Bob' OR age = 55;

BETWEEN

BETWEEN filters your query to return only results that fit a specified range.

SELECT name
FROM customers
WHERE age BETWEEN 45 AND 55;

LIKE

LIKE searches for a specified pattern in a column. In the example code below, any row with a name that includes the characters Bob would be returned.

SELECT name
FROM customers
WHERE name LIKE '%Bob%';

Other operators for LIKE:

  • %x — will select all values that end with x
  • %x% — will select all values that include x
  • x% — will select all values that begin with x
  • x%y — will select all values that begin with x and end with y
  • _x% — will select all values that have x as the second character
  • x_% — will select all values that begin with x and are at least two characters long. You can add additional _ characters to extend the length requirement, i.e. x___%

IN

IN allows us to specify multiple values we want to select for when using the WHERE command.

SELECT name
FROM customers
WHERE name IN ('Bob', 'Fred', 'Harry');

IS NULL

IS NULL will return only rows with a NULL value.

SELECT name
FROM customers
WHERE name IS NULL;

IS NOT NULL

IS NOT NULL does the opposite — it will return only rows without a NULL value.

SELECT name
FROM customers
WHERE name IS NOT NULL;

CREATE

CREATE can be used to set up a database, table, index or view.


CREATE DATABASE

CREATE DATABASE creates a new database, assuming the user running the command has the correct admin rights.

CREATE DATABASE dataquestDB;

CREATE TABLE

CREATE TABLE creates a new table inside a database. The terms int and varchar(255) in this example specify the datatypes of the columns we're creating.

CREATE TABLE customers (
    customer_id int,
    name varchar(255),
    age int
);

CREATE INDEX

CREATE INDEX generates an index for a table. Indexes are used to retrieve data from a database faster.

CREATE INDEX idx_name
ON customers (name);

CREATE VIEW

CREATE VIEW creates a virtual table based on the result set of an SQL statement. A view is like a regular table (and can be queried like one), but it is not saved as a permanent table in the database.

CREATE VIEW [Bob Customers] AS
SELECT name, age
FROM customers
WHERE name = 'Bob';

DROP

DROP statements can be used to delete entire databases, tables or indexes.

It goes without saying that the DROP command should only be used where absolutely necessary.


DROP DATABASE

DROP DATABASE deletes the entire database, including all of its tables, indexes, etc., as well as all the data within it.

Again, this is a command we want to be very, very careful about using!

DROP DATABASE dataquestDB;

DROP TABLE

DROP TABLE deletes a table as well as the data within it.

DROP TABLE customers;

DROP INDEX

DROP INDEX deletes an index within a database.

DROP INDEX idx_name;

UPDATE

The UPDATE statement is used to update data in a table. For example, the code below would update the age of any customer named Bob in the customers table to 56.

UPDATE customers
SET age = 56
WHERE name = 'Bob';

DELETE

DELETE removes rows from a table. Used on its own it deletes every row; combined with a WHERE clause it deletes only the rows that meet a specific condition.

DELETE FROM customers
WHERE name = 'Bob';

ALTER TABLE

ALTER TABLE allows you to add or remove columns from a table. In the code snippets below, we’ll add and then remove a column for surname. The text varchar(255) specifies the datatype of the column.

ALTER TABLE customers
ADD surname varchar(255);
ALTER TABLE customers
DROP COLUMN surname;

We hope this is a helpful resource,

but it's not the best way to learn SQL

Learn SQL by actually doing it!

Interactive lessons allow you to write, run, and check queries right in your browser window.

AGGREGATE FUNCTIONS (COUNT/SUM/AVG/MIN/MAX)

An aggregate function performs a calculation on a set of values and returns a single result.

COUNT

COUNT returns the number of rows that match the specified criteria. In the code below, we’re using *, so the total row count for customers would be returned.

SELECT COUNT(*)
FROM customers;

SUM

SUM returns the total sum of a numeric column.

SELECT SUM(age)
FROM customers;

AVG

AVG returns the average value of a numeric column.

SELECT AVG(age)
FROM customers;

MIN

MIN returns the minimum value of a numeric column.

SELECT MIN(age)
FROM customers;

MAX

MAX returns the maximum value of a numeric column.

SELECT MAX(age)
FROM customers;

GROUP BY

The GROUP BY statement groups rows with the same values into summary rows. The statement is often used with aggregate functions. For example, the code below will display the average age for each name that appears in our customers table.

SELECT name, AVG(age)
FROM customers
GROUP BY name;

HAVING

HAVING filters results in much the same way as the WHERE clause. The difference is that HAVING is applied after aggregation, so it can filter on aggregate functions, which WHERE cannot.

The below example would return the number of rows for each name, but only for names with more than 2 records.

SELECT COUNT(customer_id), name
FROM customers
GROUP BY name
HAVING COUNT(customer_id) > 2;

ORDER BY 

ORDER BY sets the order of the returned results. The order will be ascending by default.

SELECT name
FROM customers
ORDER BY age;

DESC

DESC will return the results in descending order.

SELECT name
FROM customers
ORDER BY age DESC;

OFFSET

The OFFSET statement works with ORDER BY and specifies the number of rows to skip before starting to return rows from the query.

SELECT name
FROM customers
ORDER BY age
OFFSET 10 ROWS;

FETCH

FETCH specifies the number of rows to return after the OFFSET clause has been processed. The OFFSET clause is mandatory, while the FETCH clause is optional.

SELECT name
FROM customers
ORDER BY age
OFFSET 10 ROWS
FETCH NEXT 10 ROWS ONLY;

JOINS (INNER, LEFT, RIGHT, FULL)

A JOIN clause is used to combine rows from two or more tables. The four types of JOIN are INNER, LEFT, RIGHT and FULL.


INNER JOIN

INNER JOIN selects records that have matching values in both tables.

SELECT name
FROM customers
INNER JOIN orders
ON customers.customer_id = orders.customer_id;

LEFT JOIN

LEFT JOIN returns all records from the left table, along with any matching records from the right table. In the below example the left table is customers.

SELECT name
FROM customers
LEFT JOIN orders
ON customers.customer_id = orders.customer_id;

RIGHT JOIN

RIGHT JOIN returns all records from the right table, along with any matching records from the left table. In the below example the right table is orders.

SELECT name
FROM customers
RIGHT JOIN orders
ON customers.customer_id = orders.customer_id;

FULL JOIN

FULL JOIN (also written FULL OUTER JOIN) returns all records from both tables, matching rows together where possible. Think of it as the “OR” JOIN compared with the “AND” JOIN (INNER JOIN).

SELECT name
FROM customers
FULL OUTER JOIN orders
ON customers.customer_id = orders.customer_id;

EXISTS

EXISTS is used to test for the existence of any record in a subquery. The example below returns the names of customers who have placed at least one order.

SELECT name
FROM customers
WHERE EXISTS
(SELECT 1 FROM orders WHERE orders.customer_id = customers.customer_id);

GRANT

GRANT gives a particular user access to database objects such as tables, views or the database itself. The below example would give SELECT and UPDATE access on the customers table to a user named “usr_bob”.

GRANT SELECT, UPDATE ON customers TO usr_bob;

REVOKE

REVOKE removes a user's permissions for a particular database object.

REVOKE SELECT, UPDATE ON customers FROM usr_bob;

SAVEPOINT

SAVEPOINT allows you to identify a point in a transaction to which you can later roll back. Similar to creating a backup.

SAVEPOINT SAVEPOINT_NAME;

COMMIT

COMMIT is for saving every transaction to the database. A COMMIT statement will release any existing savepoints that may be in use and once the statement is issued, you cannot roll back the transaction.

DELETE FROM customers
WHERE name = 'Bob';
COMMIT;

ROLLBACK

ROLLBACK is used to undo transactions which are not saved to the database. This can only be used to undo transactions since the last COMMIT or ROLLBACK command was issued. You can also rollback to a SAVEPOINT that has been created before.

ROLLBACK TO SAVEPOINT_NAME;

TRUNCATE

TRUNCATE TABLE removes all data entries from a table in a database, but keeps the table and structure in place. Similar to DELETE.

TRUNCATE TABLE customers;

UNION

UNION combines multiple result-sets using two or more SELECT statements and eliminates duplicate rows.

SELECT name FROM customers
UNION
SELECT name FROM orders;

UNION ALL

UNION ALL combines multiple result-sets using two or more SELECT statements and keeps duplicate rows.

SELECT name FROM customers
UNION ALL
SELECT name FROM orders;

We hope this page serves as a helpful quick-reference guide to SQL commands. But if you really want to build your SQL skills, copy-pasting code won't cut it. Check out our interactive SQL courses and start learning by doing!

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why watch video lectures when you can learn by doing?

The post SQL Commands: The Complete List (w/ Examples) appeared first on Dataquest.

]]>
SQL vs MySQL: A Simple Guide to the Differences https://www.dataquest.io/blog/sql-vs-mysql/ Thu, 11 Feb 2021 23:25:51 +0000 https://www.dataquest.io/?p=27560 SQL and MySQL are important tools for working with data. But what are they, exactly, and how are they different? Let's clear that up.

The post SQL vs MySQL: A Simple Guide to the Differences appeared first on Dataquest.

]]>
sql or mysql, what's the difference?

SQL and MySQL are two of the most popular data management tools in the world. But for a beginner, or even someone with more experience, the difference between the two can be confusing. 

In this post, we’re going to define what SQL and MySQL are, investigate the differences between them and dive into some of the alternative products out there.

Must-Know SQL-Related Key Terms

Before we get started, let us explain a couple of the key terms that we will be using throughout. Feel free to skip this section if you want to get straight into the article, or come back to it later for a refresher.

Database

A database is a set of data stored in a computer and it is usually structured in a way that makes the data easily accessible. 

Relational Database Management System

A relational database is a type of database that allows us to identify and access data in relation to another piece of data in the database. It stores data in rows and columns in a series of tables to make processing and querying efficient.

A simple example of a relational database: imagine a small business, Company X, that takes orders from customers. It sets up two tables in its database:

  • Customer_information_table (which has fields for customer_id, address, phone_number, etc…)
  • Customer_orders_table (which has fields for customer_id, product, quantity, etc…)

The two tables have a relationship (they share the customer_id field). That's what makes this a relational database. 

In Company X’s warehouse, they process orders by going through records in the Customer orders table. But they can also use the customer_id in that orders table to grab more information about a customer from the Customer information table.

Not only is this a more efficient way of storing data but it means that if you need to update a customer's information you can do so in one place (the Customer information table), rather than having to update multiple tables with redundant information.
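As a rough sketch (the table names, column names, and sample rows here are purely illustrative, not a prescribed schema), Company X's two related tables might look something like this in SQLite, run from Python:

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the example

# Customer details are stored once; each order refers back to them via customer_id.
conn.executescript("""
CREATE TABLE customer_information (
    customer_id  INTEGER PRIMARY KEY,
    address      TEXT,
    phone_number TEXT
);
CREATE TABLE customer_orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer_information(customer_id),
    product     TEXT,
    quantity    INTEGER
);
""")

conn.execute("INSERT INTO customer_information VALUES (1, '123 Main St', '555-0100');")
conn.executemany("INSERT INTO customer_orders VALUES (?, ?, ?, ?);",
                 [(1, 1, 'widget', 3), (2, 1, 'gadget', 1)])

# The shared customer_id column is the relationship: both orders pull the same
# customer details, even though those details are stored only once.
for row in conn.execute("""
    SELECT o.order_id, o.product, c.address, c.phone_number
      FROM customer_orders o
      JOIN customer_information c ON o.customer_id = c.customer_id;
"""):
    print(row)
conn.close()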

Our article on SQL basics goes into more detail on relational databases. Most modern databases are set up this way because they are easier to manage, more flexible, and more scalable.

Relational databases are sometimes referred to as RDBMS — Relational Database Management Systems.

Storage Engine

A storage engine is a piece of software that a database management system uses to create, read and update data from a database.

Open Source

Open source simply means software in which the original source code is made freely available to all and may be redistributed and modified.

What is SQL?

SQL stands for Structured Query Language and is pronounced “S.Q.L.” or “Sequel”. It is a special kind of programming language that is used for communicating with a database.

If you want to add, retrieve, or update data in a database you can use SQL to do that. 

This is important because most companies store their data in databases. There are many types of databases and most of them speak SQL. We will discuss two of these in this article (MySQL and SQL Server), but there are many others such as PostgreSQL, IBM Db2, and Amazon Aurora, just to name a few.

Learning the basics of SQL will likely serve you well with whichever database you or your company uses.

Fun Fact: SQL became the official standard of the American National Standards Institute (ANSI) in 1986, and of the International Organization for Standardization (ISO) in 1987. Although it has been around for decades, it's still widely used and very in-demand today!

What is MySQL?

MySQL is an open source Relational Database Management System (RDBMS) owned by Oracle.

It is an extremely popular tool for several reasons. Firstly, its open source status means it is completely free to use. Experienced developers can even dive right in and change its source code to suit their needs, if they wish.

Even though MySQL is free to use, Oracle does offer premier support services which you can buy through a commercial license.

MySQL is also heavily supported and users can run the software on a variety of platforms and operating systems including Windows, Linux, UNIX and more.

MySQL provides various storage engines for its tables: MyISAM, InnoDB, Merge, MEMORY (HEAP), ARCHIVE, CSV and FEDERATED. 

For example, the CSV engine will store the data in a CSV file format. This could be used to migrate data into alternative, non-SQL applications such as spreadsheet software.

Each of these storage engines has its own advantages and disadvantages. Prior to creating your database, it is important to understand each and choose the most appropriate one for your tables to maximize the performance of the database.

We’ve barely scratched the surface of what MySQL can offer. However, it should be enough to understand the differences between SQL and MySQL.

Fun Fact: MySQL owes its name to one of the founders - Michael "Monty" Widenius - who named it after his daughter My.

What is the difference between SQL and MySQL?

In a nutshell, SQL is a language for querying databases and MySQL is an open source database product. 

SQL is used for accessing, updating and maintaining data in a database and MySQL is an RDBMS that allows users to keep the data that exists in a database organized.

SQL does not change (much), as it is a language. MySQL updates frequently as it is a piece of software.

In layman's terms, SQL could be seen as a bank teller and MySQL could be seen as the bank. You need the bank teller (SQL) to communicate with the bank (MySQL) and you need the bank to manage the money (the data). They work in tandem but they are completely different.

What is SQL Server?

Like MySQL, SQL Server is a relational database management system. However, unlike MySQL, SQL Server is not open source. It is owned by Microsoft and there are several editions available, depending on the users’ needs and budget.

One of these editions is called SQL Server Express and is free to download and distribute. It comprises a database engine specifically targeted at embedded and smaller-scale applications.

A common question for those new to the field is “are SQL and SQL Server the same thing?”. In a word: no. The difference between the two is similar to the difference we laid out between SQL and MySQL. SQL is a language for querying databases and SQL Server is a system for managing relational databases.

In terms of MySQL vs SQL Server, there’s no right answer for every organization.

If you’re a startup company strapped for cash, you’re likely to opt for MySQL.

If you’re a large company looking to run high volumes of activity on a database, then you might lean towards SQL Server.

When it comes down to it, each of these systems has its own advantages and disadvantages.

Why should I use SQL?

If you want a job in data then you’re going to need to learn SQL. It’s well supported, it’s the most commonly used language in data science and it’s constantly in high demand.

Check out our article on SQL certification for more details on why it’s such an important skill to learn.

Why should I use MySQL?

You should use MySQL if you are looking to set up a database that is cheap (or free!), secure and reliable. You can download the software and be up-and-running in a matter of minutes.

You will then need to learn the SQL language to start using it effectively.

Conclusion

As we can see, it's difficult to actually compare SQL and MySQL. While they are related (and have similar names), they do completely different things and can be used individually or in tandem depending on what you are trying to achieve.

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why passively watch video lectures when you can learn by doing?

The post SQL vs MySQL: A Simple Guide to the Differences appeared first on Dataquest.

]]>
SQL Interview Questions — Real Questions to Prep for Your Job Interview https://www.dataquest.io/blog/sql-interview-questions/ Mon, 01 Feb 2021 22:46:42 +0000 https://www.dataquest.io/?p=27411 A lot of the SQL interview questions you'll find on the web are generic: "What is SQL?" You'll never be asked that. We've got real questions to help you prep.

The post SQL Interview Questions — Real Questions to Prep for Your Job Interview appeared first on Dataquest.

]]>
sql-interview-questions

If you're looking for a job in data, chances are you're going to have to answer some SQL interview questions, or complete some kind of SQL test.

That's because SQL skills are required for most data jobs. We dug into the data in depth in this post about why you should learn SQL, but the short version is this: more than half of all data analyst, data scientist, and data engineer jobs in 2021 list SQL as a requirement.

The importance of SQL is especially stark for data analyst roles:

Skills listed in data analyst job posts, SQL is the most in-demand skill

SQL is far and away the most in-demand skill for Data Analyst roles. Data: Indeed.com, 1/29/2021.

Preparing for SQL questions in a job interview

We've written an extensive guide on job interviews in data science. You should be aware that SQL will almost certainly play a role in your interview process, especially if you're looking at Data Analyst roles.

Every company does things differently, but here are a few of the more common ways companies test SQL skills:

  • In-person (or video) interview where you're asked SQL questions or given SQL problems to solve.
  • Take-home SQL task or tasks.
  • In-person (or video) live coding session where you're asked to use SQL skills to answer questions in real time.
  • Whiteboard coding session where you're asked to demonstrate your SQL skills by sketching out queries on a whiteboard.

If you're not comfortable writing SQL queries already, there's no time like the present to sign up for a free account and dive into our interactive SQL courses. But let's say you're already a SQL master. You're still going to want some practice!

And that's where you may encounter a problem.

Online practice question lists are (mostly) terrible

If you Google "SQL Interview questions," you're going to find a bunch of articles that list questions like these (these are all real questions pulled from top-ranking articles):

  • What is SQL?
  • What is a database?
  • What are tables?
  • What is a join?

You get the idea. And we suppose it's possible that you'll be asked "what is SQL" in a job interview. But it's definitely not likely.

Much more likely: the SQL interview questions you'll face will be asking you to solve real problems with SQL, or asking you to answer trickier questions that test your working knowledge.

We've compiled some of these questions below, and provided expandable answers so that you can test yourself, and then check to make sure you're right.

Test yourself with real SQL interview questions:

Question 1

Given the table below, write a SQL query that retrieves the personal data about alumni who scored above 16 on their calculus exam.

alumni

student_id  name     surname  birth_date  faculty
347         Daniela  Lopes    1991-04-26  Medical School
348         Robert   Fischer  1991-03-09  Mathematics

evaluation

student_id  class_id  exam_date   grade
347         74        2015-06-19  16
347         87        2015-06-06  20
348         74        2015-06-19  13

curricula

class_id  class_name  professor_id  semester
74        algebra     435           2015_summer
87        calculus    532           2015_summer
46        statistics  625           2015_winter

Click to reveal answer

There are several possible answers. Here’s one:

SELECT a.name, a.surname, a.birth_date, a.faculty
  FROM alumni AS a
 INNER JOIN evaluation AS e
       ON a.student_id=e.student_id
 INNER JOIN curricula AS c
       ON e.class_id = c.class_id
 WHERE c.class_name = 'calculus' AND e.grade>16;

Question 2

We’ll work with the beverages table. Its first rows are given below.

id  name        launch_year  fruit_pct  contributed_by
1   Bruzz       2007         45         Sam Malone
2   Delightful  2008         41         Sam Malone
3   Nice        2015         42         Sam Malone

Write a query to extract only beverages where fruit_pct is between 35 and 40 (including both ends).

Click to reveal answer

There are several possible answers. Here’s one:

SELECT *
  FROM beverages
 WHERE fruit_pct BETWEEN 35 AND 40;

Question 3

We’ll work with the beverages table again. Its first rows are given below.

id  name        launch_year  fruit_pct  contributed_by
1   Bruzz       2007         45         Sam Malone
2   Delightful  2008         41         Sam Malone
3   Nice        2015         42         Sam Malone

Write a query to extract only beverages whose contributor has just one name.

Click to reveal answer

There are several possible answers. Here’s one:

SELECT *
  FROM beverages
 WHERE contributed_by NOT LIKE '% %';

Question 4

We’ll work with the beverages table again. Its first rows are given below.

id  name        launch_year  fruit_pct  contributed_by
1   Bruzz       2007         45         Sam Malone
2   Delightful  2008         41         Sam Malone
3   Nice        2015         42         Sam Malone

Write a query that finds the average fruit_pct by contributor and displays it in ascending order.

Click to reveal answer

There are several possible answers. Here’s one:

SELECT contributed_by, AVG(fruit_pct) AS mean_fruit
  FROM beverages
 GROUP BY contributed_by
 ORDER BY mean_fruit;

Question 5

Take a look at the query given below:

SELECT column, AGG_FUNC(column_or_expression)
  FROM a_table
 INNER JOIN some_table
       ON a_table.column = some_table.column
 WHERE a_condition
 GROUP BY column
HAVING some_condition
 ORDER BY column
 LIMIT 5;

In what order does SQL run the clauses? Select the correct option from the list of choices below:

  1. SELECT, FROM, WHERE, GROUP BY
  2. FROM, WHERE, HAVING, SELECT, LIMIT
  3. SELECT, FROM, INNER JOIN, GROUP BY
  4. FROM, SELECT, LIMIT, WHERE

Click to reveal answer

The correct option is 2. It goes like this:

  1. The SQL engine fetches the data from the tables (FROM and INNER JOIN)
  2. Filters it (WHERE)
  3. Aggregates the data (GROUP BY)
  4. Filters the aggregated data (HAVING)
  5. Selects the columns and expressions to display (SELECT)
  6. Orders the remaining data (ORDER BY)
  7. Limits the results (LIMIT)

Question 6

What is the purpose of an index in a database table?

Click to reveal answer

The purpose of an index in a database table is to improve the speed of looking through that table's data. The standard analogy is that it's (usually) much faster to look up something in a book by looking at its index than by flipping every page until we find what we want.
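To see this in action, here's a small sketch using Python's built-in sqlite3 module (the table, sample data, and index name are made up for the illustration). SQLite's EXPLAIN QUERY PLAN output shows the lookup switching from a full table scan to an index search once the index exists.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT, age INTEGER);")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?);",
                 [(i, "name_" + str(i), 20 + i % 50) for i in range(1000)])

# Without an index, SQLite scans the whole table to find a matching name.
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM customers WHERE name = 'name_42';").fetchall())

# With an index on name, the same lookup can jump straight to the matching rows.
conn.execute("CREATE INDEX idx_name ON customers (name);")
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM customers WHERE name = 'name_42';").fetchall())
conn.close()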

Question 7

What rows of my_table does the following query yield? Give a descriptive answer.

SELECT *
  FROM my_table
 WHERE 1 = 1.0;

Click to reveal answer

It returns the whole table because 1=1.0 always evaluates to true.

 

Question 8

What rows of my_table does the following query yield? Give a descriptive answer.

SELECT *
  FROM my_table
 WHERE NULL = NULL;

Click to reveal answer

It returns no rows because, by definition, NULL does not equal itself.
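If you want to see this for yourself, here's a quick sketch using Python's sqlite3 module (my_table and its contents are invented for the demo). The NULL = NULL comparison never matches a row, while IS NULL does.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (val INTEGER);")
conn.executemany("INSERT INTO my_table VALUES (?);", [(1,), (None,)])

# NULL = NULL evaluates to NULL (unknown), never true, so no rows are returned.
print(conn.execute("SELECT * FROM my_table WHERE NULL = NULL;").fetchall())  # []

# IS NULL is the correct way to test for missing values.
print(conn.execute("SELECT * FROM my_table WHERE val IS NULL;").fetchall())  # [(None,)]
conn.close()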

 

More resources for SQL interview prep

We'll be adding new questions to that list over time, but in the interim, here are some more helpful resources for review during your SQL interview question prep:

Of course, don't forget to bookmark this post, because we'll be adding more SQL interview questions for you to quiz yourself with over time!

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why passively watch video lectures when you can learn by doing?

SQL questions written by Bruno Cunha.

The post SQL Interview Questions — Real Questions to Prep for Your Job Interview appeared first on Dataquest.

]]>
SQL Basics — Hands-On Beginner SQL Tutorial Analyzing Bike-Sharing https://www.dataquest.io/blog/sql-basics/ Mon, 01 Feb 2021 09:00:00 +0000 https://dq.t79ae38x-liquidwebsites.com/2017/05/09/sql-basics/ Learn the SQL basics and go hands-on querying databases as you analyze bike rental data in this free beginner SQL tutorial.

The post SQL Basics — Hands-On Beginner SQL Tutorial Analyzing Bike-Sharing appeared first on Dataquest.

]]>
learn the basics of sql while analyzing bike sharing data

Although learning anything new can be intimidating, mastering the SQL basics is actually not as difficult as you might think.

In this tutorial, we're going to dig into SQL basics from the perspective of a total beginner to get you up and running with this crucial skill.

Let's start by answering a few questions:

Who can benefit from learning SQL basics?

SQL, pronounced "sequel" (or S-Q-L, if you prefer), is a critical tool for data analysts, data scientists, and a wide variety of professionals in other roles, including marketing, finance, HR, sales, and much more.

SQL is the most important language for getting a job in data, but that's just the tip of the iceberg — since most companies store their data in SQL-based databases, almost anyone who works with company data or spreadsheets can benefit from learning SQL.

What is SQL?

SQL stands for Structured Query Language. A query language is a kind of programming language that's designed to facilitate retrieving specific information from databases, and that's exactly what SQL does. To put it simply, SQL is the language of databases.

That matters because most companies store their data in databases. And while there are many types of databases (MySQL, PostgreSQL, Microsoft SQL Server), most of them speak SQL. Once you've got SQL basics under your belt, you'll be able to work with any of them.

Even if you're planning to do your analysis with another language like Python, at most companies, chances are you'll need to use SQL to retrieve the data you need from the company's database. As of this writing, there are more than 75,000 open SQL jobs listed on Indeed in the US alone.

So let's get started with learning SQL!

(If you'd prefer to learn interactively, writing and running SQL queries from your browser, you should sign up for free and check out our SQL Fundamentals course!)

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why watch video lectures when you can learn by doing?

In this tutorial we'll be working with a dataset from the bike-sharing service Hubway, which includes data on over 1.5 million trips made with the service.

hubway bike sharing

We'll start by looking a little bit at databases, what they are and why we use them, before starting to write some queries of our own in SQL.

If you'd like to follow along you can download the hubway.db file here (130 MB).

SQL Basics: Relational Databases

A relational database is a database that stores related information across multiple tables and allows you to query information in more than one table at the same time.

It's easier to understand how this works by thinking through an example. Imagine you're a business and you want to keep track of your sales information. You could set up a spreadsheet in Excel with all of the information you want to keep track of as separate columns: Order number, date, amount due, shipment tracking number, customer name, customer address, and customer phone number.

spreadsheet example sql tutorial

This setup would work fine for tracking the information you need to begin with, but as you start to get repeat orders from the same customer, you'll find that their name, address, and phone number get stored in multiple rows of your spreadsheet.

As your business grows and the number of orders you're tracking increases, this redundant data will take up unnecessary space and generally decrease the efficiency of your sales tracking system. You might also run into issues with data integrity. There's no guarantee, for example, that every field will be populated with the correct data type or that the name and address will be entered exactly the same way every time.

sql basics tutorial tables example

With a relational database, like the one in the above diagram, you avoid all of these issues. You could set up two tables, one for orders and one for customers. The 'customers' table would include a unique ID number for each customer, along with the name, address and phone number we were already tracking. The 'orders' table would include your order number, date, amount due, tracking number and, instead of a separate field for each item of customer data, it would have a column for the customer ID.

This enables us to pull up all of the customer info for any given order, but we only have to store it once in our database rather than listing it out again for every single order.
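
To make that structure concrete, here's a rough sketch of how the two tables might be defined in SQL. The exact column names and types here are illustrative, not taken from a real database:

CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,   -- unique ID for each customer
    name TEXT,
    address TEXT,
    phone TEXT
);

CREATE TABLE orders (
    order_number INTEGER PRIMARY KEY,
    order_date TEXT,
    amount_due REAL,
    tracking_number TEXT,
    customer_id INTEGER REFERENCES customers(customer_id)   -- links each order to a customer
);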

Our Data Set

Let's start by taking a look at our database. The database has two tables, trips and stations. To begin with, we'll just look at the trips table. It contains the following columns:

  • id — A unique integer that serves as a reference for each trip
  • duration — The duration of the trip, measured in seconds
  • start_date — The date and time the trip began
  • start_station — An integer that corresponds to the id column in the stations table for the station the trip started at
  • end_date — The date and time the trip ended
  • end_station — The 'id' of the station the trip ended at
  • bike_number — Hubway's unique identifier for the bike used on the trip
  • sub_type — The subscription type of the user. "Registered" for users with a membership, "Casual" for users without a membership
  • zip_code — The zip code of the user (only available for registered members)
  • birth_date — The birth year of the user (only available for registered members)
  • gender — The gender of the user (only available for registered members)

Our Analysis

With this information and the SQL commands we'll learn shortly, here are some questions that we'll try to answer over the course of this post:

  • What was the duration of the longest trip?
  • How many trips were taken by 'registered' users?
  • What was the average trip duration?
  • Do registered or casual users take longer trips?
  • Which bike was used for the most trips?
  • What is the average duration of trips by users over the age of 30?

The SQL commands we'll use to answer these questions are:

  • SELECT
  • WHERE
  • LIMIT
  • ORDER BY
  • GROUP BY
  • AND
  • OR
  • MIN
  • MAX
  • AVG
  • SUM
  • COUNT

Installation and Setup

For the purposes of this tutorial, we will be using SQLite, a lightweight database engine that we can access through Python's built-in sqlite3 module. SQLite has shipped with Python since version 2.5, so if you have Python installed you'll almost certainly have SQLite as well. Python and the sqlite3 module can easily be installed and set up with Anaconda if you don't already have them.

Using Python to run our SQL code allows us to import the results into a Pandas dataframe, which makes it easier to display our results in an easy-to-read format. It also means we can perform further analysis and visualization on the data we pull from the database, although that will be beyond the scope of this tutorial.

Alternatively, if we don't want to use or install Python, we can run SQLite3 from the command line. Simply download the "precompiled binaries" from the SQLite3 web page and use the following code to open the database:

~$ sqlite3 hubway.db
SQLite version 3.14.0 2016-07-26 15:17:14
Enter ".help" for usage hints.
sqlite>

From here we can just type in the query we want to run and we will see the data returned in our terminal window.

An alternative to using the terminal is to connect to the SQLite database via Python. This would allow us to use a Jupyter notebook, so that we could see the results of our queries in a neatly formatted table.

To do this, we'll define a function that takes our query (stored as a string) as an input and shows the result as a formatted dataframe:

import sqlite3
import pandas as pd

# Connect to the SQLite database file
db = sqlite3.connect('hubway.db')

# Run a query (passed in as a string) and return the results as a pandas dataframe
def run_query(query):
    return pd.read_sql_query(query, db)

Of course, we don't have to use Python with SQL. If you're an R programmer already, our SQL Fundamentals for R Users course would be a great place to start.

SELECT

The first command we'll work with is SELECT. SELECT will be the foundation of almost every query we write - it tells the database which columns we want to see. We can either specify columns by name (separated by commas) or use the wildcard * to return every column in the table.

In addition to the columns we want to retrieve, we also have to tell the database which table to get them from. To do this we use the keyword FROM followed by the name of the table. For example, if we wanted to see the start_date and bike_number for every trip in the trips table, we could use the following query:

SELECT start_date, bike_number FROM trips;

In this example, we started with the SELECT command so that the database knows we want it to find us some data. Then we told the database we were interested in the start_date and bike_number columns. Finally we used FROM to let the database know that the columns we want to see are part of the trips table.

One important thing to be aware of when writing SQL queries is that we'll want to end every query with a semicolon (;). Not every SQL database actually requires this, but some do, so it's best to form this habit.

LIMIT

The next command we need to know before we start to run queries on our Hubway database is LIMIT. LIMIT simply tells the database how many rows you want it to return.

The SELECT query we looked at in the previous section would return the requested information for every row in the trips table, but sometimes that could mean a lot of data. We might not want all of it. If, instead, we wanted to see the start_date and bike_number for the first five trips in the database, we could add LIMIT to our query as follows:

SELECT start_date, bike_number FROM trips LIMIT 5;

We simply added the LIMIT command and then a number representing the number of rows we want to be returned. In this instance we used 5, but you can replace that with any number to get the appropriate amount of data for the project you're working on.

We will use LIMIT a lot in our queries on the Hubway database in this tutorial — the trips table contains over 1.5 million rows of data and we certainly don't need to display all of them!

Let's run our first query on the Hubway database. First we will store our query as a string and then use the function we defined earlier to run it on the database. Take a look at the following example:

query = 'SELECT * FROM trips LIMIT 5;'
run_query(query)
id duration start_date start_station end_date end_station bike_number sub_type zip_code birth_date gender
0 1 9 2011-07-28 10:12:00 23 2011-07-28 10:12:00 23 B00468 Registered '97217 1976.0 Male
1 2 220 2011-07-28 10:21:00 23 2011-07-28 10:25:00 23 B00554 Registered '02215 1966.0 Male
2 3 56 2011-07-28 10:33:00 23 2011-07-28 10:34:00 23 B00456 Registered '02108 1943.0 Male
3 4 64 2011-07-28 10:35:00 23 2011-07-28 10:36:00 23 B00554 Registered '02116 1981.0 Female
4 5 12 2011-07-28 10:37:00 23 2011-07-28 10:37:00 23 B00554 Registered '97214 1983.0 Female

This query uses * as a wildcard instead of specifying columns to return. This means the SELECT command has given us every column in the trips table. We also used the LIMIT function to restrict the output to the first five rows of the table.

You will often see people capitalize the command keywords in their queries (a convention we'll follow throughout this tutorial), but this is mostly a matter of preference. Capitalization makes the code easier to read, but it doesn't actually affect the code's function in any way. If you prefer to write your queries with lowercase commands, they will still execute correctly.

Our previous example returned every column in the trips table. If we were only interested in the duration and start_date columns, we could replace the wildcard with the column names as follows:

query = 'SELECT duration, start_date FROM trips LIMIT 5'
run_query(query)
duration start_date
0 9 2011-07-28 10:12:00
1 220 2011-07-28 10:21:00
2 56 2011-07-28 10:33:00
3 64 2011-07-28 10:35:00
4 12 2011-07-28 10:37:00

ORDER BY

The final command we need to know before we can answer the first of our questions is ORDER BY. This command allows us to sort the database on a given column.

To use it, we simply specify the name of the column we would like to sort on. By default, ORDER BY sorts in ascending order. If we want to state the sort order explicitly, we can add the keyword ASC for ascending order or DESC for descending order.

For example, if we wanted to sort the trips table from the shortest duration to the longest we could add the following line to our query:

ORDER BY duration ASC
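
Here's a quick sketch (not a query we'll run below, just an illustration) of how that clause fits into a complete query:

SELECT duration, start_date
FROM trips
ORDER BY duration ASC
LIMIT 5;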

With the SELECT, LIMIT and ORDER BY commands in our repertoire, we can now attempt to answer our first question: What was the duration of the longest trip?

To answer this question, it's helpful to break it down into sections and identify which commands we will need to address each part.

First we need to pull the information from the duration column of the trips table. Then, to find which trip is the longest, we can sort the duration column in descending order. Here's how we might work this through to come up with a query that will get the information we're looking for:

  • Use SELECT to retrieve the duration column FROM the trips table
  • Use ORDER BY to sort the duration column and use the DESC keyword to specify that you want to sort in descending order
  • Use LIMIT to restrict the output to 1 row

Using these commands in this way will return the single row with the longest duration, which will provide us the answer to our question.

One more thing to note — as your queries add more commands and get more complicated, you may find it easier to read if you separate them onto multiple lines. This, like capitalization, is a matter of personal preference. It doesn't affect how the code runs (the system just reads the code from the beginning until it reaches the semicolon), but it can make your queries clearer and easier to follow. In Python, we can separate a string onto multiple lines by using triple quote marks.

Let's go ahead and run this query and find out how long the longest trip lasted.

query = '''
SELECT duration FROM trips
ORDER BY duration DESC
LIMIT 1;
'''
run_query(query)
duration
0 9999

Now we know that the longest trip lasted 9999 seconds, or a little over 166 minutes. With a maximum value of 9999, however, we don't know whether this is really the length of the longest trip or if the database was only set up to allow a four digit number.

If it's true that particularly long trips are being cut short by the database, then we might expect to see a lot of trips at 9999 seconds where they reach the limit. Let's try running the same query as before, but adjust the LIMIT to return the 10 highest durations to see if that's the case:

query = '''
SELECT duration FROM trips
ORDER BY duration DESC
LIMIT 10;
'''
run_query(query)
duration
0 9999
1 9998
2 9998
3 9997
4 9996
5 9996
6 9995
7 9995
8 9994
9 9994

What we see here is that there aren't a whole bunch of trips at 9999, so it doesn't look like we're cutting off the top end of our durations, but it's still difficult to tell whether that's the real length of the trip or just the maximum allowed value.

Hubway charges additional fees for rides over 30 minutes (somebody keeping a bike for 9999 seconds would have to pay an extra $25 in fees) so it's plausible that they decided 4 digits would be sufficient to track the majority of rides.

WHERE

The previous commands are great for pulling out sorted information for particular columns, but what if there is a specific subset of the data we want to look at? That's where WHERE comes in. The WHERE command allows us to use a logical operator to specify which rows should be returned. For example, you could use the following command to return every trip taken with bike B00400:

WHERE bike_number = "B00400"

You'll also notice that we use quote marks in this query. That's because the bike_number is stored as a string. If the column contained numeric data types the quote marks would not be necessary.
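
As a quick illustration (this particular query isn't part of our analysis), here's how that WHERE clause slots into a complete query:

SELECT duration, start_date
FROM trips
WHERE bike_number = "B00400"
LIMIT 5;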

Let's write a query that uses WHERE to return every column in the trips table for each row with a duration longer than 9990 seconds:

query = '''
SELECT * FROM trips
WHERE duration > 9990;
'''
run_query(query)
id duration start_date start_station end_date end_station bike_number sub_type zip_code birth_date gender
0 4768 9994 2011-08-03 17:16:00 22 2011-08-03 20:03:00 24 B00002 Casual
1 8448 9991 2011-08-06 13:02:00 52 2011-08-06 15:48:00 24 B00174 Casual
2 11341 9998 2011-08-09 10:42:00 40 2011-08-09 13:29:00 42 B00513 Casual
3 24455 9995 2011-08-20 12:20:00 52 2011-08-20 15:07:00 17 B00552 Casual
4 55771 9994 2011-09-14 15:44:00 40 2011-09-14 18:30:00 40 B00139 Casual
5 81191 9993 2011-10-03 11:30:00 22 2011-10-03 14:16:00 36 B00474 Casual
6 89335 9997 2011-10-09 02:30:00 60 2011-10-09 05:17:00 45 B00047 Casual
7 124500 9992 2011-11-09 09:08:00 22 2011-11-09 11:55:00 40 B00387 Casual
8 133967 9996 2011-11-19 13:48:00 4 2011-11-19 16:35:00 58 B00238 Casual
9 147451 9996 2012-03-23 14:48:00 35 2012-03-23 17:35:00 33 B00550 Casual
10 315737 9995 2012-07-03 18:28:00 12 2012-07-03 21:15:00 12 B00250 Registered '02120 1964 Male
11 319597 9994 2012-07-05 11:49:00 52 2012-07-05 14:35:00 55 B00237 Casual
12 416523 9998 2012-08-15 12:11:00 54 2012-08-15 14:58:00 80 B00188 Casual
13 541247 9999 2012-09-26 18:34:00 54 2012-09-26 21:21:00 54 T01078 Casual

As we can see, this query returned 14 different trips, each with a duration longer than 9990 seconds. Something that stands out about this query is that all but one of the results has a sub_type of "Casual". Perhaps this is an indication that "Registered" users are more aware of the extra fees for long trips. Maybe Hubway could do a better job of conveying their pricing structure to Casual users to help them avoid overage charges.

We can already see how even a beginner-level command of SQL can help us answer business questions and find insights in our data.

Returning to WHERE, we can also combine multiple logical tests in our WHERE clause using AND or OR. If, for example, in our previous query we had only wanted to return the trips with a duration over 9990 seconds that also had a sub_type of Registered, we could use AND to specify both conditions.

Here's another personal preference recommendation: use parentheses to separate each logical test, as demonstrated in the code block below. This isn't strictly required for the code to function, but parentheses make your queries easier to understand as you increase the complexity.

Let's run that query now. We already know it should only return one result, so it should be easy to check that we've got it right:

query = '''
SELECT * FROM trips
WHERE (duration >= 9990) AND (sub_type = "Registered")
ORDER BY duration DESC;
'''
run_query(query)
id duration start_date start_station end_date end_station bike_number sub_type zip_code birth_date gender
0 315737 9995 2012-07-03 18:28:00 12 2012-07-03 21:15:00 12 B00250 Registered '02120 1964.0 Male
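
We haven't used OR on its own yet, so here's a short sketch showing how it could return trips taken with either of two bikes. The bike numbers here are just examples:

SELECT duration, bike_number
FROM trips
WHERE (bike_number = "B00400") OR (bike_number = "B00550")
LIMIT 5;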

The next question we set out at the beginning of the post is "How many trips were taken by 'registered' users?" To answer it, we could run the same query as above and modify the WHERE expression to return all of the rows where sub_type is equal to 'Registered' and then count them up.

However, SQL actually has a built-in command to do that counting for us, COUNT.

COUNT allows us to shift the calculation to the database and save us the trouble of writing additional scripts to count up results. To use it, we simply include COUNT(column_name) instead of (or in addition to) the columns you want to SELECT, like this:

SELECT COUNT(id)
FROM trips

In this instance, it doesn't matter which column we choose to count because every column should have data for each row in our query. But sometimes a query might have missing (or "null") values for some rows. If we're not sure whether a column contains null values we can run our COUNT on the id column — the id column is never null, so we can be sure our count won't have missed anything.

We can also use COUNT(1) or COUNT(*) to count up every row in our query. It's worth noting that sometimes we might actually want to run COUNT on a column with null values. For example, we might want to know how many rows in our database have missing values for a column.
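
Here's a small sketch that illustrates the difference, using the gender column (which we know is missing for casual users):

SELECT COUNT(*) AS "All Trips",             -- counts every row
       COUNT(gender) AS "Trips With Gender" -- counts only rows where gender is not null
FROM trips;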

Let's take a look at a query to answer our question. We can use SELECT COUNT(*) to count up the total number of rows returned and WHERE sub_type = "Registered" to make sure we only count up the trips taken by Registered users.

query = '''
SELECT COUNT(*)
FROM trips
WHERE sub_type = "Registered";
'''
run_query(query)
COUNT(*)
0 1105192

This query worked, and has returned the answer to our question. But the column heading isn't particularly descriptive. If someone else were to look at this table, they wouldn't be able to understand what it meant. If we want to make our results more readable, we can use AS to give our output an alias (or nickname). Let's re-run the previous query but give our column heading an alias of Total Trips by Registered Users:

query = '''
SELECT COUNT(*) AS "Total Trips by Registered Users"
FROM trips
WHERE sub_type = "Registered";
'''
run_query(query)
Total Trips by Registered Users
0 1105192

Aggregate Functions

COUNT is not the only mathematical trick SQL has up its sleeves. We can also use SUM, AVG, MIN and MAX to return the sum, average, minimum and maximum of a column respectively. These, along with COUNT, are known as aggregate functions.
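
Here's a quick sketch (not one of our analysis questions) showing how several aggregate functions can be combined in a single query:

SELECT MIN(duration) AS "Shortest Trip",
       MAX(duration) AS "Longest Trip",
       SUM(duration) AS "Total Seconds Ridden"
FROM trips;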

So to answer our third question, "What was the average trip duration?", we can use the AVG function on the duration column (and, once again, use AS to give our output column a more descriptive name):

query = '''
SELECT AVG(duration) AS "Average Duration"
FROM trips;
'''
run_query(query)
Average Duration
0 912.409682

It turns out that the average trip duration is 912 seconds, which is about 15 minutes. This makes some sense, since we know that Hubway charges extra fees for trips over 30 minutes. The service is designed for riders to take short, one-way trips.

What about our next question, do registered or casual users take longer trips? We already know one way to answer this question — we could run two SELECT AVG(duration) FROM trips queries with WHERE clauses that restrict one to "Registered" and one to "Casual" users.
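
For reference, that two-query approach would look roughly like this (a sketch, not something we'll run here):

SELECT AVG(duration) AS "Average Duration (Registered)"
FROM trips
WHERE sub_type = "Registered";

SELECT AVG(duration) AS "Average Duration (Casual)"
FROM trips
WHERE sub_type = "Casual";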

Let's do it a different way, though. SQL also includes a way to answer this question in a single query, using the GROUP BY command.

GROUP BY

GROUP BY separates rows into groups based on the contents of a particular column and allows us to perform aggregate functions on each group.

To get a better idea of how this works, let's take a look at the gender column. Each row can have one of three possible values in the gender column, "Male", "Female" or Null (missing; we don't have gender data for casual users).

When we use GROUP BY, the database will separate out each of the rows into a different group based on the value in the gender column, in much the same way that we might separate a deck of cards into different suits. We can imagine making two piles, one of all the males, one of all the females.

Once we have our two separate piles, the database will perform any aggregate functions in our query on each of them in turn. If we used COUNT, for example, the query would count up the number of rows in each pile and return the value for each separately.
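
Sticking with the gender example, here's a minimal sketch of what that looks like in practice. Note that rows with a missing (null) gender would form a group of their own:

SELECT gender, COUNT(*) AS "Number of Trips"
FROM trips
GROUP BY gender;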

Let's walk through exactly how to write a query to answer our question of whether registered or casual users take longer trips.

  • As with each of our queries so far, we'll start with SELECT to tell the database which information we want to see. In this instance, we'll want sub_type and AVG(duration).
  • We'll also include GROUP BY sub_type to separate out our data by subscription type and calculate the averages of registered and casual users separately.

Here's what the code looks like when we put it all together:

query = '''
SELECT sub_type, AVG(duration) AS "Average Duration"
FROM trips
GROUP BY sub_type;
'''
run_query(query)
sub_type Average Duration
0 Casual 1519.643897
1 Registered 657.026067

That's quite a difference! On average, registered users take trips that last around 11 minutes whereas casual users are spending almost 25 minutes per ride. Registered users are likely taking shorter, more frequent trips, possibly as part of their commute to work. Casual users, on the other hand, are spending around twice as long per trip.

It's possible that casual users tend to come from demographics (tourists, for example) that are more inclined to take longer trips to make sure they get around and see all the sights. Once we've discovered this difference in the data, there are many ways the company might be able to investigate it to better understand what's causing it.

For the purposes of this tutorial, however, let's move on. Our next question was: which bike was used for the most trips? We can answer this using a very similar query. Take a look at the following example and see if you can figure out what each line is doing — we'll go through it step by step afterwards so you can check that you got it right:

query = '''
SELECT bike_number as "Bike Number", COUNT(*) AS "Number of Trips"
FROM trips
GROUP BY bike_number
ORDER BY COUNT(*) DESC
LIMIT 1;
'''
run_query(query)
Bike Number Number of Trips
0 B00490 2120

As you can see from the output, bike B00490 took the most trips. Let's run through how we got there:

  • The first line is a SELECT clause to tell the database we want to see the bike_number column and a count of every row. It also uses AS to tell the database to display each column with a more useful name.
  • The second line uses FROM to specify that the data we're looking for is in the trips table.
  • The third line is where things start to get a little tricky. We use GROUP BY to tell the COUNT function on line 1 to count up each value for bike_number separately.
  • On line four we have an ORDER BY clause to sort the table in descending order and make sure our most-used bike is at the top.
  • Finally we use LIMIT to restrict the output to the first row, which we know will be the bike that was used in the highest number of trips because of how we sorted the data on line four.

Arithmetic Operators

Our final question is a little more tricky than the others. We want to know the average duration of trips by registered members over the age of 30.

We could just figure out in our heads the year in which 30-year-olds were born and then plug it in, but a more elegant solution is to use arithmetic operations directly within our query. SQL allows us to use +, -, * and / to perform an arithmetic operation on an entire column at once.

query = '''
SELECT AVG(duration) FROM trips
WHERE (2017 - birth_date) > 30;
'''
run_query(query)
AVG(duration)
0 923.014685
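
Only registered users have a birth_date in this dataset, so the result above is effectively limited to registered members already (2017 minus a null birth_date is null, so those rows fail the comparison). If we wanted to make that restriction explicit, a sketch like this would also work:

SELECT AVG(duration) AS "Average Duration"
FROM trips
WHERE (2017 - birth_date) > 30
  AND (sub_type = "Registered");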

JOIN

So far we've been looking at queries that only pull data from the trips table. However, one of the reasons SQL is so powerful is that it allows us to pull data from multiple tables in the same query.

Our bike-sharing database contains a second table, stations. The stations table contains information about every station in the Hubway network and includes an id column that is referenced by the trips table.

Before we start to work through some real examples from this database, though, let's look back at the hypothetical order tracking database from earlier. In that database we had two tables, orders and customers, and they were connected by the customer_id column.

Let's say we wanted to write a query that returned the order_number and name for every order in the database. If they were both stored in the same table, we could use the following query:

SELECT order_number, name
FROM orders;

Unfortunately, the order_number column and the name column are stored in two different tables, so we have to add a few extra steps. Let's take a moment to think through the additional things the database will need to know before it can return the information we want:

  • Which table is the order_number column in?
  • Which table is the name column in?
  • How is the information in the orders table connected to the information in the customers table?

To answer the first two of these questions, we can include the table names for each column in our SELECT command. The way we do this is simply to write the table name and the column name separated by a period (.). For example, instead of SELECT order_number, name we would write SELECT orders.order_number, customers.name. Adding the table names here helps the database find the columns we're looking for by telling it which table to look in for each.

To tell the database how the orders and customers tables are connected, we use JOIN and ON. JOIN specifies which tables should be connected and ON specifies which columns in each table are related.

We're going to use an inner join, which means that rows will only be returned where there is a match in the columns specified in ON. In this example, we will want to use JOIN on whichever table we didn't include in the FROM command. So we can either use FROM orders INNER JOIN customers or FROM customers INNER JOIN orders.

As we discussed earlier, these tables are connected on the customer_id column in each table. Therefore, we will want to use ON to tell the database that these two columns refer to the same thing like this:

ON orders.customer_id = customers.customer_id

Once again we use the . to make sure the database knows which table each of these columns is in. So when we put all of this together, we get a query that looks like this:

SELECT orders.order_number, customers.name
FROM orders
INNER JOIN customers
ON orders.customer_id = customers.customer_id

This query will return the order number of every order in the database along with the customer name that is associated with each.

Returning to our Hubway database, we can now write some queries to see JOIN in action.

Before we get started, we should take a look at the rest of the columns in the stations table. Here's a query to show us the first 5 rows so we can see what the stations table looks like:

query = '''
SELECT * FROM stations
LIMIT 5;
'''
run_query(query)
id station municipality lat lng
0 3 Colleges of the Fenway Boston 42.340021 -71.100812
1 4 Tremont St. at Berkeley St. Boston 42.345392 -71.069616
2 5 Northeastern U / North Parking Lot Boston 42.341814 -71.090179
3 6 Cambridge St. at Joy St. Boston 42.361284999999995 -71.06514
4 7 Fan Pier Boston 42.353412 -71.044624

  • id — A unique identifier for each station (corresponds to the start_station and end_station columns in the trips table)
  • station — The station name
  • municipality — The municipality that the station is in (Boston, Brookline, Cambridge or Somerville)
  • lat — The latitude of the station
  • lng — The longitude of the station

Like before, we'll try to answer some questions with this data:

  • Which station is the most frequent starting point?
  • Which stations are most frequently used for round trips?
  • How many trips start and end in different municipalities?

Let's start with the first of these: which station is the most frequent starting point? We'll work through it step by step:

  • First we want to use SELECT to return the station column from the stations table and the COUNT of the number of rows.
  • Next we specify the tables we want to JOIN and tell the database to connect them ON the start_station column in the trips table and the id column in the stations table.
  • Then we get into the meat of our query - we GROUP BY the station column in the stations table so that our COUNT will count up the number of trips for each station separately
  • Finally, we can ORDER BY our COUNT and LIMIT the output to a manageable number of results.

Here's the full query:
query = '''
SELECT stations.station AS "Station", COUNT(*) AS "Count"
FROM trips INNER JOIN stations
ON trips.start_station = stations.id
GROUP BY stations.station
ORDER BY COUNT(*) DESC
LIMIT 5;
'''
run_query(query)
Station Count
0 South Station - 700 Atlantic Ave. 56123
1 Boston Public Library - 700 Boylston St. 41994
2 Charles Circle - Charles St. at Cambridge St. 35984
3 Beacon St / Mass Ave 35275
4 MIT at Mass Ave / Amherst St 33644

If you're familiar with Boston, you'll understand why these are the most popular stations. South Station is one of the main commuter rail stations in the city, Charles Street runs along the river close to some nice scenic routes, and Boylston and Beacon streets are right downtown near a number of office buildings.

The next question we'll look at is which stations are most frequently used for round trips? We can use much the same query as before. We will SELECT the same output columns and JOIN the tables in the same way, but this time we'll add a WHERE clause to restrict our COUNT to trips where the start_station is the same as the end_station.

query = '''
SELECT stations.station AS "Station", COUNT(*) AS "Count"
FROM trips INNER JOIN stations
ON trips.start_station = stations.id
WHERE trips.start_station = trips.end_station
GROUP BY stations.station
ORDER BY COUNT(*) DESC
LIMIT 5;
'''
run_query(query)
Station Count
0 The Esplanade - Beacon St. at Arlington St. 3064
1 Charles Circle - Charles St. at Cambridge St. 2739
2 Boston Public Library - 700 Boylston St. 2548
3 Boylston St. at Arlington St. 2163
4 Beacon St / Mass Ave 2144

As we can see, a number of these stations are the same as the previous question but the amounts are much lower. The busiest stations are still the busiest stations, but the lower numbers overall suggest that people are typically using Hubway bikes to get from point A to point B rather than cycling around for a while before returning to where they started.

There is one significant difference here — the Esplanade, which was not one of the overall busiest stations in our first query, appears to be the busiest for round trips. Why? Well, a picture is worth a thousand words. This certainly looks like a nice spot for a bike ride:

esplanade

On to the next question: how many trips start and end in different municipalities? This one takes things a step further. To answer it, we need to JOIN the trips table to the stations table twice: once ON the start_station column and then ON the end_station column.

In order to do this, we have to create an alias for the stations table so that we are able to differentiate between data that relates to the start_station and data that relates to the end_station. We can do this in exactly the same way we've been creating aliases for individual columns to make them display with a more intuitive name, using AS.

For example we can use the following code to JOIN the stations table to the trips table using an alias of 'start'. We can then combine 'start' with our column names using . to refer to data that comes from this specific JOIN (rather than the second JOIN we will do ON the end_station column):

INNER JOIN stations AS start ON trips.start_station = start.id

Here's what the final query will look like when we run it. Note that we've used <> to represent "is not equal to", but != would also work.

query = '''
SELECT COUNT(trips.id) AS "Count"
FROM trips INNER JOIN stations AS start
ON trips.start_station = start.id
INNER JOIN stations AS end
ON trips.end_station = end.id
WHERE start.municipality <> end.municipality;
'''
run_query(query)
Count
0 309748

This shows that about 300,000 out of 1.5 million trips (or 20%) ended in a different municipality than they started — further evidence that people mostly use Hubway bicycles for relatively short journeys rather than longer trips between towns.

If you've made it this far, congratulations! You've begun to master the basics of SQL. We have covered a number of important commands, SELECT, LIMIT, WHERE, ORDER BY, GROUP BY and JOIN, as well as aggregate and arithmetic functions. These will give you a strong foundation to build on as you continue your SQL journey.

You've mastered the SQL basics. Now what?

After finishing this beginner SQL tutorial, you should be able to pick up a database you find interesting and write queries to pull out information. A good first step might be to continue working with the Hubway database to see what else you can find out. Here are some other questions you might want to try and answer:

  • How many trips incurred additional fees (lasted longer than 30 minutes)? (See the sketch after this list for one way to start.)
  • Which bike was used for the longest total time?
  • Did registered or casual users take more round trips?
  • Which municipality had the longest average duration?
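
As a nudge in the right direction for the first of those questions, here's a minimal sketch. It assumes the extra fees kick in after 1,800 seconds (30 minutes):

SELECT COUNT(*) AS "Trips Over 30 Minutes"
FROM trips
WHERE duration > 1800;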

If you would like to take things a step further, check out our interactive SQL courses, which cover everything you'll need to know from beginning to advanced-level SQL for data analyst and data scientist jobs.

You also might want to read our post about exporting the data from your SQL queries into Pandas or check out our SQL Cheat Sheet and our article on SQL certification.

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why watch video lectures when you can learn by doing?

The post SQL Basics — Hands-On Beginner SQL Tutorial Analyzing Bike-Sharing appeared first on Dataquest.

]]>
Want a Job in Data? Learn SQL. https://www.dataquest.io/blog/why-sql-is-the-most-important-language-to-learn/ Fri, 29 Jan 2021 10:00:00 +0000 https://dq.t79ae38x-liquidwebsites.com/2018/01/17/why-sql-is-the-most-important-language-to-learn/ Learning SQL might not be as "sexy" as learning Python or R, but it's a fundamental skill for almost every data scientist and data analyst job. Here's why.

The post Want a Job in Data? Learn SQL. appeared first on Dataquest.

]]>
why-learn-sql

Why do you need to learn SQL?

1. SQL is used everywhere.
2. It’s in high demand because so many companies use it.
3. SQL is still the most popular language for data work in 2021.

COVID-19 Update: Is this career-related advice still relevant?

Yes! While the COVID-19 pandemic has dramatically increased the number of data professionals working from home, it hasn't changed the way companies store their data — which is still mostly in SQL-based database systems.


SQL is old. There, I said it.

I first heard about SQL in 1997. I was in high school, and as part of a computing class we were working with databases in Microsoft Access. The computers we used were outdated, and the class was boring. Even then, it seemed that SQL was ancient.

SQL dates back almost 50 years to 1970 when Edgar Codd, a computer scientist working for IBM, wrote a paper describing a new system for organizing data in databases. By the end of the decade, several prototypes of Codd’s system had been built, and a query language — the Structured Query Language (SQL) — was born to interact with these databases.

In the years since, it has been widely adopted. For decades, learning SQL — which can be pronounced either “sequel” or “S.Q.L.”, by the way — has been a rite of passage for programmers who need to work with databases.

But why should someone who wants to get a job in data spend time learning this ‘ancient’ language in 2021?

Why not spend all your time mastering Python/R, or focusing on ‘sexier’ data skills, like Deep Learning, Scala, and Spark?

While knowing the fundamentals of a more general-purpose language like Python or R is critical, ignoring SQL will make it much harder to get a job in data. Here are three key reasons why:

1. SQL is everywhere

Almost all of the biggest names in tech use SQL. Uber, Netflix, Airbnb — the list goes on. Even within companies like Facebook, Google, and Amazon, which have built their own high-performance database systems, data teams use SQL to query data and perform analysis.

Companies using SQL
Image: StackShare.io

And it’s not just tech companies: companies big and small use SQL. A quick job search on LinkedIn, for example, will show you that more companies are looking for SQL skills than are looking for Python or R skills. SQL may be old, but it’s ubiquitous.

Data Scientist and former Dataquest student Vicknesh got his first job as a Data Analyst. He quickly found himself using SQL daily: “SQL is so pervasive, it permeates everything here. It’s like the SQL syntax persists through time and space. Everything uses SQL or a derivative of SQL.”

vicknesh-quote-using-sql-job

2. SQL is in demand

If you want to get a job in data, your focus should be the skills that employers want.

To demonstrate the importance of SQL specifically in data-related jobs, in early 2021 I analyzed more than 32,000 data jobs advertised on Indeed, looking at key skills mentioned in job ads with ‘data’ in the title. 

skills required for jobs in data, as listed on linkedin. sql is the most in-demand.

SQL is the most in-demand technical skill for data jobs. (Data: Indeed.com, 1/29/2021)

As we can see, SQL is the most in-demand skill among all jobs in data, appearing in 42.7% of all job postings. 

Interestingly, the proportion of data jobs listing SQL actually seems to be increasing! When I performed this same analysis in 2017, SQL was also the most in-demand skill, but it was listed in 35.7% of ads. 

If you're looking for your first job in data, it turns out knowing SQL is even more critical.

Most entry-level jobs in data are Data Analyst roles, so I took a look at job ads with ‘data analyst’ in the title, and those numbers are even more conclusive:

Skills listed in data analyst job posts, SQL is the most in-demand skill

SQL is easily the most in-demand skill for Data Analyst roles. (Data: Indeed.com, 1/29/2021)

For data analyst roles, SQL is again the most in-demand skill, listed in 57.4% of all data analyst jobs. SQL appears in 1.5 times as many "data analyst" job postings as Python, and nearly 2.5 times as many job postings as R.

There's no doubt that if you're looking for a role as a data analyst, learning SQL should be at the top of your to-do list.

In fact, even if you're interested in more advanced roles, SQL skills are critical.

I performed the same analysis on "Data Scientist" and "Data Engineer" job postings, and while SQL isn't the top skill for either of those jobs, it's still listed in 58.2% of data scientist job postings, and 56.4% of data engineer job postings.

skills listed for data scientists and data engineers

SQL is listed in more than half of all DS (left) and DE (right) job roles (Data: Indeed.com, 1/29/2021)

That means that even if you're a Python master already, you're going to miss out on 3 out of 5 data science and data engineer job openings unless you've got SQL skills on your resume, too.

Long story short: yes, you need to learn SQL, for any role in the data science industry. (You do not need a SQL certification, though!)

It will not only make you more qualified for these jobs, it will set you apart from other candidates who’ve only focused on the “sexy” stuff like machine learning in Python.

3. SQL is still the top language for data work

SQL is more popular among data scientists and data engineers than even Python or R. In fact, it's one of the most-used languages in the entire tech industry!

In the chart below, the "most used" technologies from StackOverflow’s 2020 developer survey, we can see that SQL eclipses even Python in terms of popularity. In fact, it's the third-most-popular language among all developers:

sql use among all developers

Source: StackOverflow 2020 Developer Survey

But we're concerned specifically with jobs within the field of data science, so let's filter things down a little further. If we dig into the raw data from the 2020 survey, we find that SQL is even more important and widely-used in the context of data jobs.

In the complete dataset, which StackOverflow has released here, we can see that among developers who work with data (including data scientists, data analysts, database administrators, data engineers, etc.), more than 70% use SQL — more than any other language.

What Languages Do People with Jobs in Data Use

Data Source: StackOverflow 2020 Survey

And if we filter down still further, into just data scientists and analysts, we can see that SQL is still the most popular technology. 65% of data scientists and data analysts said they used SQL, compared to 64% for Python, and 28% for R.

What Languages Do People with Data Scientist_Analyst Jobs Use

Data Source: StackOverflow 2020 Survey

In other words: SQL is the most-used language in data science, according to the 10,000+ data professionals who responded to StackOverflow's 2020 survey.

Despite lots of hype around NoSQL, Hadoop, and other technologies, SQL remains the most popular language for data work, and one of the most popular languages for developers of all stripes.

So, what’s the best way to learn SQL?

Now that we know why we should learn SQL, the obvious question is: how?

There are literally thousands of SQL courses online, but most of them don’t prepare you for using SQL in the real world. The best way to illustrate this is to look at the queries they teach you to write:

The way most courses teach SQL

The queries above demonstrate the complexity of the SQL taught at the end of SQL courses by three of the more popular online learning sites. The problem is that real-world SQL doesn’t look like that. Real-world SQL looks like this:

What SQL actually looks like

When you’re answering business questions with data, you often write SQL queries that need to combine data from lots of tables, wrangling it into its final form.

The end result is students finding themselves unprepared to get the jobs they want, just like this recent post from a data science forum:

complex_sql_slack-2

What we’re doing about it

Here at Dataquest, we believe that SQL competency is one of the key skills for anyone who wants to get a job in data.

We’re not suggesting you learn SQL instead of Python and/or R, but instead thoroughly learn SQL as your second language — becoming familiar with writing queries at a high level.

We understand that learning SQL is incredibly important for data science, and that’s why we offer a number of interactive SQL courses in our Data Analyst and Data Scientist paths. Our Data Engineering path also includes a couple of unique SQL courses.

We've also put together a downloadable SQL Cheat Sheet as a useful reference for the SQL basics.

Our interactive courses are written with the goal of equipping our students with the skills they need at the level they’ll need. You won’t spend time watching videos — instead, you’ll be writing your first queries in minutes, and be on your way to mastering the most important data skill.

While we start from zero, our courses go beyond the basics so you can become a SQL master. As an example, the ‘real-life’ SQL image above is taken from our SQL Intermediate course.

You can sign up and complete the first mission in each course for free, and we encourage you to try them out and let us know what you think.

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why passively watch video lectures when you can learn by doing?

We Love SQL!

I hope I’ve persuaded you that mastering SQL is key to starting your career in data. While it’s easy to be distracted by the latest and greatest new language or framework, learning SQL will pay dividends on your path to break into the data industry.

It might just be the most important language you learn.

The post Want a Job in Data? Learn SQL. appeared first on Dataquest.

]]>
SQL Cheat Sheet — SQL Reference Guide for Data Analysis https://www.dataquest.io/blog/sql-cheat-sheet/ Wed, 20 Jan 2021 21:35:26 +0000 https://www.dataquest.io/?p=27055 Whether you’re learning SQL through one of our interactive SQL courses or by some other means, it can be really helpful to have a SQL cheat sheet.Bookmark this article, or download and print the PDF, and keep it handy for quick reference the next time you’re writing an SQL query!Our SQL cheat sheet goes a […]

The post SQL Cheat Sheet — SQL Reference Guide for Data Analysis appeared first on Dataquest.

]]>

Whether you’re learning SQL through one of our interactive SQL courses or by some other means, it can be really helpful to have a SQL cheat sheet.

Bookmark this article, or download and print the PDF, and keep it handy for quick reference the next time you’re writing an SQL query!

Our SQL cheat sheet goes a bit more in-depth than this handwritten one!

Need to brush up on your SQL before you're ready for the cheat sheet? Check out our interactive online SQL Fundamentals course, read about why you should learn SQL, or do some research about SQL certifications and whether you'll need one.

SQL Basics

SQL stands for Structured Query Language. It is a system for querying — requesting, filtering, and outputting — data from relational databases.

Developed in the 1970s, SQL was originally called SEQUEL. For this reason, today it is sometimes pronounced “Sequel” and sometimes pronounced “S.Q.L.” Either pronunciation is acceptable.

Although there are many “flavors” of SQL, SQL in some form can be used for querying data from most relational database systems, including MySQL, SQLite, Oracle, Microsoft SQL Server, PostgreSQL, IBM DB2, Microsoft Azure SQL Database, Apache Hive, and others.

SQL Cheat Sheet: Fundamentals

Performing calculations with SQL

Performing a single calculation:
SELECT 1320+17;

Performing multiple calculations:
SELECT 1320+17, 1340-3, 7*191, 8022/6;

Performing calculations with multiple numbers:
SELECT 1*2*3, 1+2+3;

Renaming results:
SELECT 2*3 AS mult, 1+2+3 AS nice_sum;


Selecting tables, columns, and rows:

Remember: The order of clauses matters in SQL. SQL uses the following order of precedence: FROM, SELECT, LIMIT.

Display the whole table:

SELECT *
  FROM table_name;

Select specific columns from a table:

SELECT column_name_1, column_name_2
  FROM table_name;

Display the first 10 rows on a table:

SELECT *
  FROM table_name
  LIMIT 10;

Adding comments to your SQL queries

Adding single-line comments:

-- First comment
SELECT column_1, column_2, column_3 -- Second comment
  FROM table_name; -- Third comment

Adding block comments:

/*
This comment
spans over
multiple lines
 */
SELECT column_1, column_2, column_3
  FROM table_name;

SQL Intermediate: Joins & Complex Queries

Many of these examples use table and column names from the real SQL databases that learners work with in our interactive SQL courses. For more information, sign up for a free account and try one out!


Joining data in SQL:

Joining tables with INNER JOIN:

SELECT column_name_1, column_name_2 FROM table_name_1
INNER JOIN table_name_2 ON table_name_1.column_name_1 = table_name_2.column_name_1;

Joining tables using a LEFT JOIN:

SELECT * FROM facts
LEFT JOIN cities ON cities.facts_id = facts.id;

Joining tables using a RIGHT JOIN:

SELECT f.name country, c.name city
FROM cities c
RIGHT JOIN facts f ON f.id = c.facts_id;

Joining tables using a FULL OUTER JOIN:

SELECT f.name country, c.name city
FROM cities c
FULL OUTER JOIN facts f ON f.id = c.facts_id;

Sorting a column without specifying a column name:

SELECT name, migration_rate FROM facts
ORDER BY 2 desc; -- 2 refers to migration_rate column

Using a join within a subquery, with a limit:

SELECT c.name capital_city, f.name country
FROM facts f
INNER JOIN (
    SELECT * FROM cities
    WHERE capital = 1
    ) c ON c.facts_id = f.id
LIMIT 10;

Joining data from more than two tables:

SELECT [column_names] FROM [table_name_one]
    [join_type] JOIN [table_name_two] ON [join_constraint]
    [join_type] JOIN [table_name_three] ON [join_constraint]
    ...
    [join_type] JOIN [table_name_n] ON [join_constraint];

Other common SQL operations:

Combining columns into a single column:

SELECT
		album_id,
		artist_id,
		"album id is " || album_id col_1,
		"artist id is " || artist_id col2,
		album_id || artist_id col3
FROM album LIMIT 3;

Matching part of a string:

SELECT
	first_name,
	last_name,
	phone
FROM customer
WHERE first_name LIKE "%Jen%";

Using if/then logic in SQL with CASE:

CASE
	WHEN [comparison_1] THEN [value_1]
	WHEN [comparison_2] THEN [value_2]
	ELSE [value_3]
	END
AS [new_column_name]
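
As a concrete illustration of the template above, here's a minimal sketch using the customer table from the earlier examples. The "Domestic"/"International" labels are made up for the sake of the example:

SELECT
    first_name,
    last_name,
    CASE
        WHEN country = "USA" THEN "Domestic"
        ELSE "International"
        END AS customer_type
FROM customer
LIMIT 5;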

Using the WITH clause:

WITH track_info AS
(
    SELECT
        t.name,
        ar.name artist,
        al.title album_name
    FROM track t
    INNER JOIN album al ON al.album_id = t.album_id
    INNER JOIN artist ar ON ar.artist_id = al.artist_id
)
SELECT * FROM track_info
WHERE album_name = "Jagged Little Pill";

Creating a view:

CREATE VIEW chinook.customer_2 AS
SELECT * FROM chinook.customer;

Dropping a view:

DROP VIEW chinook.customer_2;

Selecting rows that occur in one or more SELECT statements:

[select_statement_one]
UNION
[select_statement_two];

Selecting rows that occur in both SELECT statements:

SELECT * from customer_usa
INTERSECT
SELECT * from customer_gt_90_dollars;

Selecting rows that occur in the first SELECT statement but not the second SELECT statement:

SELECT * from customer_usa
EXCEPT
SELECT * from customer_gt_90_dollars;

Chaining WITH statements:

WITH
usa AS
	(
	SELECT * FROM customer
	WHERE country = "USA"
	),
last_name_g AS
	(
	SELECT * FROM usa
	WHERE last_name LIKE "G%"
	),
state_ca AS
	(
	SELECT * FROM last_name_g
	WHERE state = "CA"
	)
SELECT
	first_name,
	last_name,
	country,
	state
FROM state_ca

Important Concepts and Resources:

Reserved words

Reserved words are words that cannot be used as identifiers (such as variable names or function names) in a programming language, because they have a specific meaning in the language itself. Here is a list of reserved words in SQL.

Download the SQL Cheat Sheet PDF

Click on the button below to download the cheat sheet (PDF, 3 MB, color).

Looking for more than just a quick reference? Dataquest's interactive SQL courses will help you get hands-on with SQL as you learn to build the complex queries you'll need to write for real-world data work.

Click the button below to sign up for a free account and start learning right now!

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why watch video lectures when you can learn by doing?

The post SQL Cheat Sheet — SQL Reference Guide for Data Analysis appeared first on Dataquest.

]]>
Do You Need a SQL Certification to Get a Data Job in 2021? https://www.dataquest.io/blog/sql-certification/ Tue, 19 Jan 2021 23:17:27 +0000 https://www.dataquest.io/?p=27022 If you want to work in data, do you need a SQL certification? That’s a question that can be difficult to answer, especially with different organizations pushing to get you to spend money on their certificate programs. Table Of Contents (click to expand) 1Do you need to learn SQL? Yes.2Do you need a SQL certificate? […]

The post Do You Need a SQL Certification to Get a Data Job in 2021? appeared first on Dataquest.

]]>

If you want to work in data, do you need a SQL certification? That’s a question that can be difficult to answer, especially with different organizations pushing to get you to spend money on their certificate programs.

sql-certification

Will getting a SQL certification actually help you get a job?

Do you need to learn SQL? Yes.

Before we dive into the certification question, it’s worth asking: do you need to learn SQL at all to get a job in data?

Short answer: yes, you do. SQL is an absolutely critical skill for working with data. It may be a decades-old language, but it’s still as relevant as ever.

In fact, we recently crunched some numbers from 2020 and found that SQL is still the most commonly-used language in data science, ahead of even Python or R!

So you need to learn SQL. That begs another question: do you need a SQL certificate of some kind? Is having a certification going to be helpful when it’s time to apply for jobs?

Do you need a SQL certificate? It depends.

If you aspire to work as a data analyst or data scientist, the answer is no, you do not need a SQL certificate.

You certainly need SQL skills for these jobs, but certification won’t be required. In fact, it probably won’t even help.

Case in point: I recently interviewed data science hiring managers, recruiters, and other professionals for a data science career guide, asking them about the skills and qualifications they wanted to see in great data analyst and data scientist job candidates. The transcripts of these interviews, put together, cover nearly 200 pages!

Throughout those 200 pages, the term “SQL” is mentioned a lot, because that’s a skill that most hiring managers want to see. But “certification” and “certificate”? Those words don’t appear in the transcripts at all. Not a single person I spoke to thought certificates were important enough to even mention them in the context of data analyst and data scientist jobs.

In other words: the people who hire data analysts and data scientists typically don’t care about certifications. Having a SQL certificate on your resume isn’t likely to impact their decision making.

employer who is skeptical

You can put a SQL certificate on your resume, but you can't make employers care about it.

You may wonder: why not? The short answer is that there’s no “standard” certification for the SQL that’s required for these roles. And there are so many different online and offline SQL certification options that employers can’t really assess whether or not they’re meaningful. It’s easier for employers to simply look at an applicant’s project portfolio — that’s a much more tangible, trustworthy representation of their SQL skills than a certification.

(Many data science employers also incorporate a SQL skills test or SQL interview questions into their hiring process so that they get an even clearer picture of your SQL skills before making a hiring decision.)

If you aspire to work in something closer to database administration, or you’re looking at a very specific company or industry, it gets a little bit blurrier. There are many “flavors” of SQL tied to different database systems and tools. There may be official certifications associated with the specific flavor of SQL a company uses that are valuable, or even mandatory.

For example, if you’re applying for a database job at a company that uses Microsoft’s SQL Server, getting one of Microsoft’s Azure Database Administrator certificates could be helpful. If you’re applying for a job at a company that uses Oracle, getting an Oracle Database SQL certification may be required.

But again, in data analysis and data science roles, these kinds of certifications are rarely required. The different sub-flavors of SQL rarely differ too much from “base” SQL, and employers generally won’t be concerned about whether you’ve mastered the minutiae of a particular brand’s proprietary tweaks.

They just want to see proof that you’ve got the fundamental SQL skills required to access and filter the data you need. Certifications don’t really prove that you have a particular skill, so the best way to demonstrate your SQL knowledge on a job application is to include projects that show off your SQL skill, not to list certifications.

Is there an “official” SQL certificate?

In a word, no. While there are official certifications for some of the proprietary sub-flavors of SQL and SQL-based database technologies, there’s no official certification for SQL itself.

There are, of course, a wide variety of unofficial SQL certifications with varying levels of quality, rigor, and price. But because there’s no official certification, or even a widely-accepted standard, none of these certifications are particularly useful because employers don’t trust them as proof of SQL skill.

Will getting a certificate help with the job hunt?

Yes and no. As mentioned above, having the right certification can definitely help for database jobs.

For data analyst and data scientist jobs, the certification itself typically is not helpful. Employers aren’t looking for a certificate on your resume, and they’re not likely to care whether or not they see one.

I spoke to a wide variety of employers about what makes a great data science resume, and over nearly 200 pages of interview transcripts, not one of them mentioned wanting to see certifications.

They do want to see proof of SQL skills, though, and this means that SQL certifications can be very useful if they’re teaching you the things you need to know. In that sense, getting a certification can help you with the job hunt — but it’s important to remember that the value you’re getting is in the skills you’re learning.

this is a stock photo that is not related to sql specifically

The code in this image is not SQL, but it's a pretty cool-looking stock photo, right?

Is getting a SQL certification worth it for data science?

It depends on whether the certification program is teaching you valuable skills or just giving you a bullet point for your LinkedIn. The former can be worth it; the latter is definitely not.

Price is also an important consideration. Even if you have the money to spend thousands on a SQL certification, there’s no good reason to pay that much when you can learn SQL interactively and get certified for a much lower price on platforms like Dataquest.

Of course, the best way to determine if something is worth it is always to try it for yourself. At Dataquest, you can sign up for a free account and dive right into learning SQL. That way, you’ll know you’re making the right decision when you decide to invest in learning SQL skills with us.

How can you learn SQL?

Learning SQL on your own can be challenging, because to actually practice anything you’re learning, you’ll need to find data, set up a local SQL database, and figure out how to connect to it. That creates a lot of up-front work, especially if you don’t have any prior experience with SQL!

At Dataquest, we have interactive online SQL courses that allow you to write and run real queries right in your browser. There’s no need to download anything or set anything up locally. Just sign up (it’s free) and you’ll be writing and running your first query in less than five minutes.

And unlike some of the other platforms out there, we don’t just drop you into the deep end after covering the simple queries — we’ll walk you through the more complex SQL queries that are a part of everyday data science work in the real world.

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why watch video lectures when you can learn by doing?

If you already know some SQL, you might also find this SQL cheat sheet we put together helpful!

What SQL certificate is best?

As we’ve mentioned above, there’s a good chance you don’t need a SQL certificate. But if you do feel you need one, or you’d just like to have one, here are some of the best certifications and the specific reasons you might want to have them.

  • Dataquest’s SQL courses. These are great options for learning SQL for data science and data analysis. They’ll take you hands-on with real SQL databases, and show you how to write queries to pull, filter, and analyze the data you need. All of our SQL courses offer certifications that you can add to your LinkedIn after you’ve completed them. They also include guided projects that you can complete and add to your GitHub and resume!
  • MTA: Database Fundamentals. This is a Microsoft certification that covers some of the fundamentals of SQL for database administration. It is focused on Microsoft’s SQL Server product, but many of the skills it covers will be relevant to other SQL-based relational database systems.
  • Microsoft’s Azure Database Administrator certificate. This is a great option if you’re applying to database administrator jobs at companies that use Microsoft SQL Server. Note that if you’re looking into this, you’ll see mentions of Microsoft’s MCSA certification, too. However, that certification is older, and is being retired in January 2021. The Azure certification linked above is the newest and most relevant certification related to Microsoft SQL Server.
  • Oracle Database SQL certification. This could be a good certification for anyone who’s interested in database jobs at companies that use Oracle.
  • Koenig SQL certifications. Koenig offers a variety of SQL-related certification programs, although they tend to be quite pricey (over US$1,000 for most programs). Most of these certifications are specific to particular database technologies (like Microsoft SQL Server) rather than being aimed at building general SQL knowledge, and would probably be best for those who know they’ll need training in a specific type of database for a job as a database administrator.

Should I get a university, edX, or Coursera certification in SQL?

One option that can seem appealing if you’re interested in a more general SQL certification is getting certified through a university-affiliated program, either online or in person. For example, there’s a Stanford program at edX, programs affiliated with UC Davis and the University of Michigan at Coursera, etc.

These programs can seem to offer some of the cachet of a university degree without the expense or the time commitment. Unfortunately, hiring managers don’t see them that way.

stanford university campus

This is Stanford University. Unfortunately, getting a Stanford certificate from edX will not trick employers into thinking you went here.

Employers know that a Stanford certificate and a Stanford degree, for example, are very different things. Even if the certificate program uses video lectures from real courses, employers know that certificate programs rarely include rigorous testing or project assessment. Often, they don’t even do anything to verify student identities!

Most online university certificate programs follow a basic formula:

  • You watch video lectures to learn the material.
  • You take multiple-choice or fill-in-the-blank quizzes to test your knowledge.
  • If you complete any kind of hands-on project, it is ungraded, or graded by your peers (other learners in your cohort).

This format is immensely popular because it is the most economical way for universities to monetize their course material.

All they have to do is record some lectures, write a few quizzes, and then hundreds of thousands of students can move through the courses with no additional effort or expense required.

Employers know that these certificate programs are not rigorous. Often it’s possible to complete an online programming certification without ever having written or run a line of code!

What employers want to see on a resume is proof of your SQL skills, and they know that this type of certificate doesn’t prove anything.

So, university certificates aren’t going to impress anyone on your resume! But can they be a valuable resource for actually learning SQL?

Theoretically, yes. But they have some major shortcomings.

First and most important: they generally don’t include any hands-on practice. You can certainly try to set up practice on your own, but if your course isn’t requiring you to do this, it’s easy to put it off or forget.

Going hands-on and actually writing and running SQL queries is imperative, though. So is working with real data. The best way to learn to do these critical professional tasks is by doing them, not by watching a professor talk about them.

That’s why at Dataquest, we have an interactive online platform that lets you write and run real SQL queries on real data right from your browser window. As you’re learning new SQL concepts, you’ll be immediately applying them in a real-world setting.

dataquest sql learning platform looks like this

This is how we teach SQL at Dataquest

And after each course, you’ll be asked to synthesize your new learnings into a longer-form guided project that you can customize and put on your resume and GitHub!

Second, because of the peer-graded projects and lecture-based format of university certificate courses, they tend to come with time constraints. You’ll have to wait for a set date for a course to open before you can start. You’ll have to dedicate a set amount of time to watching each lecture, and you’ll have to finish by a certain date to get some of the benefits.

This isn’t true of interactive learning platforms like Dataquest, which you can start at any time. Because our courses are not lecture-based, or even video-based at all, your study sessions can be completely flexible. You can learn and apply something new in as little as five minutes!

But don’t just take our word for it! You can sign up for a free account and within a few minutes, you’ll have written your first SQL query.

Why not give it a try?

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why watch video lectures when you can learn by doing?

The post Do You Need a SQL Certification to Get a Data Job in 2021? appeared first on Dataquest.

SQL Joins Tutorial: Working with Databases https://www.dataquest.io/blog/sql-joins-tutorial/ Mon, 18 Jan 2021 22:47:06 +0000 https://www.dataquest.io/?p=21484 Learn how to master joins in the SQL joins tutorial. Learn to use inner, left, right, and outer joins while analyzing CIA factbook data.

The post SQL Joins Tutorial: Working with Databases appeared first on Dataquest.

sql-joins-tutorial

SQL joins don’t have to be this challenging!

When first learning SQL, it’s common to work with data in a single table. In the real world, databases generally have data in more than one table. If we want to be able to work with that data, we’ll have to combine multiple tables within a query. In this SQL joins tutorial, we’ll learn how to use joins to select data from multiple tables.

We’ll assume that you know the fundamentals of working in SQL, including filtering, sorting, aggregation, and subqueries. If you don’t, our SQL Fundamentals course teaches all of these concepts, and you can sign up and start that course for free.

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why passively watch video lectures when you can learn by doing?

The Factbook Database

We’re going to be using a version of the CIA World Factbook (Factbook) database that has two tables. The first table is called facts, and each row represents a country from the Factbook. Here are the first 5 rows of the facts table:

id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
1 af Afghanistan 652230 652230 0 32564342 2.32 38.57 13.89 1.51
2 al Albania 28748 27398 1350 3029278 0.30 12.92 6.58 3.30
3 ag Algeria 2381741 2381741 0 39542166 1.84 23.67 4.31 0.92
4 an Andorra 468 468 0 85580 0.12 8.13 6.96 0.00
5 ao Angola 1246700 1246700 0 19625353 2.78 38.78 11.49 0.46

In addition to the facts table is a second table called cities which contains information on major urban areas from countries in the Factbook (for the rest of this tutorial, we’ll use the word ‘cities’ to mean the same as ‘major urban areas’). Let’s take a look at the first few rows of this table and a description of what each column represents:

id name population capital facts_id
1 Oranjestad 37000 1 216
2 Saint John’S 27000 1 6
3 Abu Dhabi 942000 1 184
4 Dubai 1978000 0 184
5 Sharjah 983000 0 184
  • id – A unique ID for each city.
  • name – The name of the city.
  • population – The population of the city.
  • capital – Whether the city is a capital city: 1 if it is, 0 if it isn’t.
  • facts_id – The ID of the country, from the facts table.

The last column is of particular interest to us, as it is a column of data that also exists in our original facts table. This link between tables is important as it’s used to combine the data in our queries. Below is a schema diagram, which shows the two tables in our database, the columns within them and how the two are linked.

schema diagram

The line in the schema diagram clearly shows the link between the id column in the facts table and the facts_id column in the cities table.

If you’d like to download the database to follow along on your own computer, you can download the dataset as a SQLite database.

Our First SQL Join

The most common way to join data using SQL is using an inner join. The syntax for an inner join is:

SELECT [column_names] FROM [table_name_one]
INNER JOIN [table_name_two] ON [join_constraint];

The inner join clause is made up of two parts:

  • INNER JOIN, which tells the SQL engine the name of the table you wish to join in your query, and that you wish to use an inner join.
  • ON, which tells the SQL engine what columns to use to join the two tables.

Joins are usually used in a query after the FROM clause. Let’s look at a basic inner join where we combine the data from both of our tables.

SELECT * FROM facts
INNER JOIN cities ON cities.facts_id = facts.id
LIMIT 5;

Let’s look at the line of the query with the join in it:

  • INNER JOIN cities: This tells the SQL engine that we wish to join the cities table to our query using an inner join.
  • ON cities.facts_id = facts.id: This tells the SQL engine which columns to use when joining the data, following the syntax table_name.column_name.

You might presume that SELECT * FROM facts means the query will return only columns from the facts table. However, when the * wildcard is used with a join, it gives you all columns from both tables. Here is the result of this query:

id code name area area_land area_water population population_growth birth_rate death_rate migration_rate id name population capital facts_id
216 aa Aruba 180 180 0 112162 1.33 12.56 8.18 8.92 1 Oranjestad 37000 1 216
6 ac Antigua and Barbuda 442 442 0 92436 1.24 15.85 5.69 2.21 2 Saint John’S 27000 1 6
184 ae United Arab Emirates 83600 83600 0 5779760 2.58 15.43 1.97 12.36 3 Abu Dhabi 942000 1 184
184 ae United Arab Emirates 83600 83600 0 5779760 2.58 15.43 1.97 12.36 4 Dubai 1978000 0 184
184 ae United Arab Emirates 83600 83600 0 5779760 2.58 15.43 1.97 12.36 5 Sharjah 983000 0 184

This query gives us all columns from both tables and every row where there is a match between the id column from facts and the facts_id from cities, limited to the first 5 rows.

Understanding SQL Inner Joins

We’ve now joined the two tables to give us extra information about each row in cities. Let’s take a closer look at how this inner join works.

An inner join works by including only rows from each table that have a match as specified using the ON clause. Let’s look at a diagram of how the join we wrote above works. We have included a selection of rows that best illustrate the join:

inner join

Our inner join will include:

  • Rows from the cities table that have a cities.facts_id that matches a facts.id from facts.

Our inner join will not include:

  • Rows from the cities table that have a cities.facts_id that don’t match any facts.id from facts.
  • Rows from the facts table that have a facts.id that don’t match any cities.facts_id from cities.

You can see this represented as a Venn diagram:

Inner join Venn diagram

We already know how to use aliases to specify custom names for columns, e.g.:

SELECT AVG(population) AS average_population

We can also create aliases for table names, which makes queries with joins easier to both read and write. Instead of:

SELECT * FROM facts
INNER JOIN cities ON cities.facts_id = facts.id

We can write:

SELECT * FROM facts AS f
INNER JOIN cities AS c ON c.facts_id = f.id

Just like with column names, using AS is optional. We can get the same result by writing:

SELECT * FROM facts f
INNER JOIN cities c ON c.facts_id = f.id

We can also combine aliases with wildcards – for instance, using the aliases created above, c.* would give us all columns from the table cities.

While our earlier query included both columns from the ON clause, we don’t need to use either column from our ON clause in our final list of columns. This is useful as it means we can show only the information we’re interested in, rather than having to include the two join columns every time.

Let’s use what we’ve learned to build on our original query. We’ll:

  • Join cities to facts using an INNER JOIN.
  • Use aliases for table names.
  • Include, in order:

    • All columns from cities.
    • The name column from facts aliased to country_name.
  • Include only the first 5 rows.
SELECT
    c.*,
    f.name country_name
FROM facts f
INNER JOIN cities c ON c.facts_id = f.id
LIMIT 5;

id name population capital facts_id country_name
1 Oranjestad 37000 1 216 Aruba
2 Saint John’S 27000 1 6 Antigua and Barbuda
3 Abu Dhabi 942000 1 184 United Arab Emirates
4 Dubai 1978000 0 184 United Arab Emirates
5 Sharjah 983000 0 184 United Arab Emirates

Practicing Inner Joins in SQL

Let’s practice writing a query to answer a question from our database using an inner join. Say we want to produce a table of countries and their capital cities from our database using what we’ve learned so far. Our first step is to think about what columns we’ll need in our final query. We’ll need:

  • The name column from facts
  • The name column from cities

Given that we’ve identified that we need data from two tables, we need to think about how to join them. The schema diagram from earlier indicated that there is only one column in each table that links them together, so we can use an inner join with those columns to join the data.

So far, thinking through our question we can already write most of our query (it’s almost identical to the previous query we wrote):

SELECT
    f.name,
    c.name
FROM cities c
INNER JOIN facts f ON f.id = c.facts_id;

The last part of our process is to make sure we have the correct rows. From the previous sections, we know that a query like this will return all rows from cities that have a corresponding match from facts in the facts_id column. We’re only interested in the capital cities from the cities table, so we’ll need to use a WHERE clause on the capital column, which has a value of 1 if the city is a capital, and 0 if it isn’t:

WHERE c.capital = 1

We can now put this all together to write a query that answers our question. We’ll limit it to just the first 10 rows so the amount of output is manageable.

SELECT
    f.name country,
    c.name capital_city
FROM cities c
INNER JOIN facts f ON f.id = c.facts_id
WHERE c.capital = 1
LIMIT 10;

country capital_city
Aruba Oranjestad
Antigua and Barbuda Saint John’S
United Arab Emirates Abu Dhabi
Afghanistan Kabul
Algeria Algiers
Azerbaijan Baku
Albania Tirana
Armenia Yerevan
Andorra Andorra La Vella
Angola Luanda

Left Joins in SQL

As we mentioned earlier, an inner join will not include any rows where there is not a mutual match from both tables. This means there could be information we are not seeing in our query where rows don’t match.

We can use SQL queries to explore this:

SELECT COUNT(DISTINCT(name)) FROM facts;

count
261
SELECT COUNT(DISTINCT(facts_id)) FROM cities;

count
210

By running these two queries, we can see that there are some countries in the facts table that don’t have corresponding cities in the cities table, which indicates we may have some incomplete data.

Let’s look at how we can create a query to explore the missing data using a new type of join: the left join.

A left join includes all the rows that an inner join will select, plus any rows from the first (or left) table that don’t have a match in the second table. We can see this represented as a Venn diagram.

Venn diagram left join

Let’s look at an example by replacing INNER JOIN with LEFT JOIN in the first query we wrote, and looking at the same selection of rows from our earlier diagram:

SELECT * FROM facts
LEFT JOIN cities ON cities.facts_id = facts.id

left join

Here we can see that for the rows where facts.id doesn’t match any values in cities.facts_id (237, 238, 240, and 244), the rows are still included in the results. When this happens, all of the columns from the cities table are populated with null values.

We can use these null values to filter our results to just the countries that don’t exist in cities with a WHERE clause. When making a comparison to null in SQL, we use the IS keyword, rather than the = sign. If we want to select rows where a column is null we can write:

WHERE column_name IS NULL

If we want to select rows where a column name isn’t null, we use:

WHERE column_name IS NOT NULL
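As a quick aside (this query isn't part of the original walkthrough), the IS NOT NULL version would keep only the countries that do have at least one match in cities:

SELECT DISTINCT f.name country
FROM facts f
LEFT JOIN cities c ON c.facts_id = f.id
-- keep only countries with a matching city; DISTINCT removes the
-- duplicate country rows that would otherwise appear once per city
WHERE c.name IS NOT NULL;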

Let’s use a left join to explore the countries that don’t exist in the cities table.

SELECT
    f.name country,
    f.population
FROM facts f
LEFT JOIN cities c ON c.facts_id = f.id
WHERE c.name IS NULL;

country population
Kosovo 1870981
Monaco 30535
Nauru 9540
San Marino 33020
Singapore 5674472
Holy See (Vatican City) 842
Taiwan 23415126
European Union 513949445
Ashmore and Cartier Islands
Christmas Island 1530
Cocos (Keeling) Islands 596
Coral Sea Islands
Heard Island and McDonald Islands
Norfolk Island 2210
Hong Kong 7141106
Macau 592731
Clipperton Island
French Southern and Antarctic Lands
Saint Barthelemy 7237
Saint Martin 31754
Curacao 148406
Sint Maarten 39689
Cook Islands 9838
Niue 1190
Tokelau 1337
Bouvet Island
Jan Mayen
Svalbard 1872
Akrotiri 15700
British Indian Ocean Territory
Dhekelia 15700
Gibraltar 29258
Guernsey 66080
Jersey 97294
Montserrat 5241
Pitcairn Islands 48
South Georgia and South Sandwich Islands
Navassa Island
Wake Island
United States Pacific Island Wildlife Refuges
Antarctica 0
Gaza Strip 1869055
Paracel Islands
Spratly Islands
West Bank 2785366
Arctic Ocean
Atlantic Ocean
Indian Ocean
Pacific Ocean
Southern Ocean
World 7256490011

Looking through the results of the query we just wrote, we can see a number of different reasons that countries don’t have corresponding values in cities:

  • Countries with small populations and/or no major urban areas (which are defined as having populations of over 750,000), e.g., San Marino, Kosovo, and Nauru.
  • City-states, such as Monaco and Singapore.
  • Territories that are not themselves countries, such as Hong Kong, Gibraltar, and the Cook Islands.
  • Regions and oceans that aren’t countries, such as the European Union and the Pacific Ocean.
  • Genuine cases of missing data, such as Taiwan.

It’s important whenever you use inner joins to be mindful that you might be excluding important data, especially if you are joining based on columns that aren’t linked in the database schema.

Right Joins and Outer Joins

There are two less-common join types that SQLite does not support but that you should be aware of. The first is a right join. A right join, as the name indicates, is exactly the opposite of a left join. While the left join includes all rows from the table before the JOIN clause, the right join includes all rows from the table named in the JOIN clause. We can see a right join in the Venn diagram below:

Venn diagram of a right join

The following two queries, one using a left join and one using a right join, produce identical results.

SELECT f.name country, c.name city
FROM facts f
LEFT JOIN cities c ON c.facts_id = f.id
LIMIT 5;

SELECT f.name country, c.name city
FROM cities c
RIGHT JOIN facts f ON f.id = c.facts_id
LIMIT 5;

The main reason to use a right join is when you are joining more than two tables. In these cases, a right join can be preferable because it avoids restructuring your whole query just to join one more table. Outside of this, right joins are used fairly rarely, so for simple joins it’s better to use a left join than a right join, as that will make your query easier for others to read and understand.

The other join type not supported by SQLite is a full outer join. A full outer join will include all rows from the tables on both sides of the join. We can see a full outer join in the Venn diagram below:

Venn diagram of a full outer join

Like right joins, full outer joins are reasonably uncommon. The standard SQL syntax for a full outer join is:

SELECT f.name country, c.name city
FROM cities c
FULL OUTER JOIN facts f ON f.id = c.facts_id
LIMIT 5;

When joining cities and facts with a full outer join, the result will be the same as our left and right joins above, because there are no values in cities.facts_id that don’t exist in facts.id.
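If you do need full-outer-join behavior in a version of SQLite that lacks FULL OUTER JOIN, one common workaround (a sketch, not part of the original tutorial) is to combine a left join with a UNION ALL of the reversed left join, keeping only the rows the first half missed:

-- all rows from facts, matched to cities where possible
SELECT f.name country, c.name city
FROM facts f
LEFT JOIN cities c ON c.facts_id = f.id
UNION ALL
-- add any cities rows that have no matching facts row
SELECT f.name country, c.name city
FROM cities c
LEFT JOIN facts f ON f.id = c.facts_id
WHERE f.id IS NULL;

In this database the second SELECT returns no rows, since every cities.facts_id matches a facts.id, so the combined result is the same as the plain left join.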

Let’s look at the Venn diagrams of each join type side by side, which should help you compare the differences of each of the four joins we’ve discussed so far.

Join Venn Diagram

Next, let’s practice using joins to answer some questions about our data.

Finding the Most Populous Capital Cities

Previously, we’ve used column names when specifying order for our query results, like so:

SELECT
    name,
    migration_rate
FROM facts
ORDER BY migration_rate desc;

There is a handy shortcut we can use in our queries which lets us skip the column names, and instead use the order in which the columns appear in the SELECT clause. In this instance, migration_rate is the second column in our SELECT clause, so we can just use 2 instead of the column name:

SELECT
    name,
    migration_rate
FROM facts
ORDER BY 2 desc;

You can use this shortcut in either the ORDER BY or GROUP BY clauses. Be mindful that you want to ensure your queries are still readable, so typing the full column name may be better for more complex queries.

Let’s use what we’ve learned to produce a list of the top 10 capital cities by population. Because we are not interested in countries from facts that don’t have corresponding cities in cities, we should use an INNER JOIN.

SELECT
    c.name capital_city,
    f.name country,
    c.population
FROM facts f
INNER JOIN cities c ON c.facts_id = f.id
WHERE c.capital = 1
ORDER BY 3 DESC
LIMIT 10;

capital_city country population
Tokyo Japan 37217000
New Delhi India 22654000
Mexico City Mexico 20446000
Beijing China 15594000
Dhaka Bangladesh 15391000
Buenos Aires Argentina 13528000
Manila Philippines 11862000
Moscow Russia 11621000
Cairo Egypt 11169000
Jakarta Indonesia 9769000

Combining SQL Joins with Subqueries

Subqueries can be used to substitute parts of queries, allowing us to find the answers to more complex questions. We can also join to the result of a subquery, just like we could a table.

Here’s an example of using a join and a subquery to produce a table of countries and their capital cities, like we did earlier in this tutorial.

subqueries

Reading subqueries can be overwhelming at first, so we’ll break down what happens in this example in several steps. The important thing to remember is that the result of any subquery is always calculated first, so we read from the inside out.

  • The subquery, in the red box, is calculated first. This simple query selects all columns from cities, filtering rows that are marked as capital cities by having a value for capital of 1.
  • The INNER JOIN joins the subquery result, aliased as c, to the facts table based on the ON clause.
  • Two columns are selected from the results of the join:

    • f.name, aliased as country.
    • c.name, aliased as capital_city.
  • The results are limited to the first 10 rows.
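The query itself appears only as an image above, so here is a text version reconstructed from the steps just described (the exact formatting is a guess, but the logic follows the description):

SELECT
    f.name country,
    c.name capital_city
FROM facts f
INNER JOIN (
            -- the subquery: only the rows flagged as capital cities
            SELECT * FROM cities
            WHERE capital = 1
           ) c ON c.facts_id = f.id
LIMIT 10;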

Below is the output of this query:

country capital_city
Aruba Oranjestad
Antigua and Barbuda Saint John’S
United Arab Emirates Abu Dhabi
Afghanistan Kabul
Algeria Algiers
Azerbaijan Baku
Albania Tirana
Armenia Yerevan
Andorra Andorra La Vella
Angola Luanda

Using this example as a model, we’ll write a similar query to find non-capital cities with populations of over 10 million.

SELECT
    c.name city,
    f.name country,
    c.population population
FROM facts f
INNER JOIN (
            SELECT * FROM cities
            WHERE capital = 0
            AND population > 10000000
           ) c ON c.facts_id = f.id
ORDER BY 3 DESC;

city country population
New York-Newark United States 20352000
Shanghai China 20208000
Sao Paulo Brazil 19924000
Mumbai India 19744000
Marseille-Aix-en-Provence France 14890100
Kolkata India 14402000
Karachi Pakistan 13876000
Los Angeles-Long Beach-Santa Ana United States 13395000
Osaka-Kobe Japan 11494000
Istanbul Turkey 11253000
Lagos Nigeria 11223000
Guangzhou China 10849000

SQL Challenge: Complex Query with Joins and Subqueries

Let’s take everything we’ve learned before and use it to write a more complex query. It’s not uncommon to find that ‘thinking in SQL’ takes a bit of getting used to, so don’t be discouraged if this query seems difficult to understand at first. It will get easier with practice!

When you’re writing complex queries with joins and subqueries, it helps to follow this process:

  • Think about what data you need in your final output
  • Work out which tables you’ll need to join, and whether you will need to join to a subquery.

    • If you need to join to a subquery, write the subquery first.
  • Then start writing your SELECT clause, followed by the join and any other clauses you will need.
  • Don’t be afraid to write your query in steps, running it as you go. For instance, you can run your subquery as a standalone query first to make sure it looks like you want before writing the outer query.

We will be writing a query to find the countries where the urban center (city) population is more than half of the country’s total population. There are multiple ways to write this query but we’ll step through one approach.

We can start by writing a query to sum all the urban populations for cities in each country. We can do this without a join by grouping on the facts_id (we’ll use a limit in our example below to keep the output manageable):

SELECT
    facts_id,
    SUM(population) urban_pop
FROM cities
GROUP BY 1
LIMIT 5;

facts_id urban_pop
1 3097000
10 172000
100 1127000
101 5000
102 546000

Next, we’ll join the facts table to that subquery, selecting the country name, urban population and total population (again, we’ve used a limit to keep things tidy):

SELECT
    f.name country,
    c.urban_pop,
    f.population total_pop
FROM facts f
INNER JOIN (
            SELECT
                facts_id,
                SUM(population) urban_pop
            FROM cities
            GROUP BY 1
           ) c ON c.facts_id = f.id
LIMIT 5;

country urban_pop total_pop
Afghanistan 3097000 32564342
Austria 172000 8665550
Libya 1127000 6411776
Liechtenstein 5000 37624
Lithuania 546000 2884433

Lastly, we’ll create a new column that divides the urban population by the total population, and use a WHERE and ORDER BY to filter/rank the results:

SELECT
    f.name country,
    c.urban_pop,
    f.population total_pop,
    (c.urban_pop / CAST(f.population AS FLOAT)) urban_pct
FROM facts f
INNER JOIN (
            SELECT
                facts_id,
                SUM(population) urban_pop
            FROM cities
            GROUP BY 1
           ) c ON c.facts_id = f.id
WHERE (c.urban_pop / CAST(f.population AS FLOAT)) > 0.5
ORDER BY urban_pct ASC;

country urban_pop total_pop urban_pct
Uruguay 1672000 3341893 0.500315
Congo, Republic of the 2445000 4755097 0.514185
Brunei 241000 429646 0.560927
New Caledonia 157000 271615 0.578024
Virgin Islands 60000 103574 0.579296
Falkland Islands (Islas Malvinas) 2000 3361 0.595061
Djibouti 496000 828324 0.598800
Australia 13789000 22751014 0.606083
Iceland 206000 331918 0.620635
Israel 5226000 8049314 0.649248
United Arab Emirates 3903000 5779760 0.675288
Puerto Rico 2475000 3598357 0.687814
Bahamas, The 254000 324597 0.782509
Kuwait 2406000 2788534 0.862819
Saint Pierre and Miquelon 5000 5657 0.883861
Guam 169000 161785 1.044596
Northern Mariana Islands 56000 52344 1.069846
American Samoa 64000 54343 1.177705

You can see that while our final query is complex, it’s much easier to understand if you build it step-by-step.

SQL Joins Tutorial: Next Steps

In this SQL joins tutorial, we learned:

  • The difference between inner and left joins.
  • The role of right and outer joins.
  • How to choose which join is appropriate for your task.
  • Using joins with subqueries, aggregate functions, and other SQL techniques.

Other resources that might interest you include our SQL cheat sheet, our article on SQL certification, our rundown of SQL interview questions for job interviews, and of course our interactive SQL courses. Click below to sign up and get started for free!

Learn SQL the right way!

  • Writing real queries
  • In your browser
  • On your schedule

Why passively watch video lectures when you can learn by doing?

The post SQL Joins Tutorial: Working with Databases appeared first on Dataquest.

45 Fun (and Unique) Python Project Ideas for Easy Learning https://www.dataquest.io/blog/python-projects-for-beginners/ Wed, 13 Jan 2021 09:01:12 +0000 https://www.dataquest.io/?p=20545 Building projects is an extremely successful way to learn, but building Python projects for beginners can be difficult. Learn how to build with success!

The post 45 Fun (and Unique) Python Project Ideas for Easy Learning appeared first on Dataquest.

python tutorials for data science

If I could give my former self one piece of advice when I was struggling to learn Python as a beginner, it would be this: create more Python projects.

Learning Python can be difficult. You can spend time reading a textbook or watching videos, but then struggle to actually put what you've learned into practice. Or you might spend a ton of time learning syntax and get bored or lose motivation. (That happened to me. A lot).

How can you increase your chances of success? By building Python projects. That way you're learning by actually doing what you want to do!

Python Projects: Why Are They So Important?

Building projects helped me bring together everything I was learning. Once I started building projects, I immediately felt like I was making more progress. 

Project-based learning is also the philosophy behind our teaching method at Dataquest, where we teach data science skills using Python. Why? Because time and time again, we’ve seen that it works!

Working on things that you care about helps you stick with your studies, even when the going gets tough.

But it can be difficult to build Python projects for beginners. Where do you start? What makes a good project? What do you do when you get stuck? In this article, we’re going to talk about:

  • What you need to do before you build your first project.
  • What makes a successful project.
  • Strategies to use when you get stuck.
  • Examples of how to select the perfect project.

Why Building Projects Is the Best Way to Learn

First, let's take a look at why a project-based learning approach is so effective.

Motivation: Have the Momentum to Keep Going

First, building Python projects helps you learn more effectively because you can choose a project or topic that interests you. 

This helps you stay motivated, which is important in preventing you from giving up when things get tough.

Efficiency: Only Learn What You Need To

The second reason a project-based approach works is that there's no gap between learning the skill and putting it into practice. You won't waste time learning irrelevant things, because you’ll be actively trying to learn the specific things you need to build your project. 

This also means you will get where you want to go a lot faster. If you’re trying to learn Python for data science by building data science projects, for example, you won’t be wasting time learning Python concepts that might be important for robotics programming but aren’t relevant to your data science goals.

Problem-Solving: Learn the Key Programming Skill

Problem-solving is a key skill when working with Python (or any other programming language). When you're building a project, you're going to have to come up with ways of approaching problems and solving them using code. 

Building projects thus forces you to practice what is perhaps the most important skill in programming. And the more practice you can give your brain in solving problems with code, the faster your skills will develop.

Portfolio: Use Your Projects to Help You Get a Job

The fourth and final reason that building Python projects works for beginners is that you can get a head-start on getting your first job (if that's your goal). 

When employers are looking to hire entry-level candidates, they want to see that you have the key skills they need. A great way of achieving this is having a portfolio of relevant projects that demonstrate your skills. 

If you’re looking for your first job in the field, employers are going to want to see tangible proof of your Python skills. In other words, they’re going to want to see what projects you’ve built. 

If you're interested, you can read more about building a portfolio in our Data Science Career Guide (which, while aimed specifically at people looking to get into data, has advice that's equally valuable if your goal is another application of Python!).

Before You Build Your First Python Project

If you have some programming experience, you might be able to dive straight into building a project. For most people, however, you'll need to take a little time to learn some of the basics of Python first. The idea here is to spend a small amount of time to learn these basics so you have what you need to dive into projects.

There are a few resources that you can use at this stage:

  • Learning with Dataquest
  • Codecademy - One of the best-known sites for learning the basics

Once you have learned some of the basics, it's normal to feel a bit overwhelmed. You are learning something totally new, after all. Even though you might not feel ready to start building a project, you probably are.

As a first step, you might like to try building a structured or guided project. Structured projects are important because they allow you to build something without having to start from scratch, which can be difficult if you're a beginner.

At Dataquest, we include guided projects in every course which are designed to help bridge the gap between learning from a course and being able to build a project on your own. An alternative path would be following along with Python tutorial blog posts that you can find on either the Dataquest site or on thousands of other sites online.

What Makes a Great Python Project for Beginners?

Now that it's time to build your Python project, you need to decide what to build! Choosing what to build is extremely important — it will impact whether your project will be successful or not. So what makes for a great Python project for beginners?

Choose a Topic You're Interested In

The first and most important factor is choosing a topic that interests you. If you're interested in what you're building, you'll have more motivation. Motivation is important because it's the momentum that carries you through when you hit roadblocks (more on that later!).

Some people might be motivated by sports, others by a project that relates to social good. Others might be motivated by something to do with finance or the stock market. You might be obsessed with movies or a favorite TV series. Whatever that "thing" is for you, that's what your project should be about.

Think about your goals

The second factor to consider is what your overall goal is in learning Python. If you want to get into web development, then a project that builds a small web app is ideal. If you want to get into data science, then a project that analyzes a dataset is a good choice. By aligning your project with your goals, you'll be taking yourself closer to your eventual goal, rather than going on a "detour".

Start Small

The last factor is not being too ambitious. It's natural to come up with a grand plan, e.g. "I want to build a website that allows people to build custom shot charts using NBA data." This project idea sounds like it is based around a motivating topic (presuming you like basketball) and intersects with a goal (learning to make websites).

The difficulty with this project choice is that it's too big. In order to execute it, a beginner will need to learn the basics of building an online application, how to store and retrieve a large amount of data, how to create shot-chart visualizations, and how to display them to a user upon request.

It's much better to start with an extremely small and simple version of your project and then add more functionality later. If you don't, it will take a long time before you get any sense of accomplishment from finishing and you might even give up. By starting small and expanding, you're much more likely to have success.

Matryoshka Dolls

Start from a small project and build it up over time

A better version of this project might be to create a simple web app that will show a single NBA statistic for a small selection of players. Once you've built that, you can choose to expand it out by adding more players, more statistics, or any other extra piece of complexity that might appeal to you.

Building Your Python Project: Roadblocks and Difficulties

You've learned the basics of Python, completed a guided project, selected the perfect topic for your first solo project, and you're ready to get started. After about half an hour, you run into a problem: there's something you don't know how to do!

I promise you that this will happen, and it's not a nice feeling. No one likes getting stuck. That said, what you're being presented with is an opportunity. These moments — roadblocks — are where the learning actually happens. The key is knowing how to research to get yourself around the roadblock and keep working.

The good news is that most of the time, someone has been in the same situation — with the same roadblock — as you are in right now. What you need to be able to do is find the resources left behind by those people. Enter: Google (or your favorite alternative search engine).

How to Search for Help

The key to being able to find help is constructing a search for information about a general version of the thing you want to do. 

Say you have a Python dictionary where the keys of the dictionaries are NBA player names and the values are how many games they've played. You're trying to find out which player has the most games.

Searching for “how to find out which NBA player has the most games in Python dictionary” probably isn’t going to be helpful, though. You need to construct a general form for your question, which in this case might be: "Find which key of a Python dictionary has the maximum value."

In fact, that exact Google search seems to bring us to a Stack Overflow question with answers that look helpful!
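To make that concrete, here's roughly what such an answer looks like in practice. This is just a small sketch; the player names and game counts below are made up for illustration:

# a dictionary mapping player names to games played (made-up numbers)
games_played = {"Player A": 1210, "Player B": 1430, "Player C": 990}

# max() compares the keys by their values when we pass key=games_played.get
most_games = max(games_played, key=games_played.get)
print(most_games)  # Player B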

Finding these general question forms can be tricky at first, but this is an important skill that almost every programmer uses daily, so don't be afraid to dive in there and get some practice. If you still can't find help, you might need to break your problem down into smaller chunks and search for each 'chunk' individually.

You'll find that most of your searches for help will end up on one of three places:

  • An online tutorial that explains the thing you want to do.
  • Stack Overflow (an online programming Q&A site) thread of someone in a similar situation.
  • The documentation for Python or the Python library you're using.

If you still aren't finding the answers, you should post your question on a place like Stack Overflow or the Dataquest community, where others might be able to answer your question. You may be surprised by how quickly other programmers will jump in to help out a beginner!

Python Project Examples

Now let’s look through a few fictional examples of people with interests and goals, and see how they can choose a Python project that suits their needs.

Data-Focused Danielle

Danielle wants to break into the data science space, and she's identified that an entry-level job in data is going to be an analyst-type role.

She loves Star Trek, so she's decided that an ideal project would be to analyze some data related to Star Trek episodes. 

In order to start small and build up, she's going to find a data set and summarize data about episodes (she'll probably use this list of places to find free data sets for projects to get started). 

Once she's done that, she plans to expand her project by creating some visualizations.

Fun Python project ideas for building data skills:

  • Find out How Much Money You've Spent on Amazon — Dig into your own spending habits with this beginner-level tutorial!
  • Analyze Your Own Netflix Data — Another beginner-to-intermediate tutorial that gets you working with your own personal data set.
  • Analyze Your Personal Facebook Posting Habits — Are you spending too much time posting on Facebook? The numbers don't lie, and you can find your numbers in this beginner-to-intermediate Python data project.
  • Analyze Survey Data — This walk-through will show you how to get Python set up and how to filter survey data from any data set you can find (or just use the sample data linked in the article).
  • Dataquest's Guided Projects — These guided projects walk you through building real-world data projects of increasing complexity, with suggestions for how each project can be expanded.
  • Analyze Everything — Grab a free data set that interests you and start poking around! If you get stuck or aren't sure where to start, our Python courses are here to help you, and you can try them for free!

Gamer Greg

Greg wants to learn Python in order to build games for fun and loves puzzles. 

Greg has decided that he's going to learn Python by building games using the Pygame library. He'll start by building a structured project using some Pygame tutorials, then go on to create a simple version of Rock, Paper, Scissors before gradually increasing the complexity of his projects.

Video Game

Building a video game using Python

Cool Python projects for game devs:

  • Rock, Paper, Scissors — Start your Python learning journey with a simple but fun game that everybody knows.
  • Build a Text Adventure Game — This is a classic Python beginner project (it also pops up in this book) that'll teach you a lot of basic game setup concepts that'll be useful for more advanced games in the future.
  • Guessing Game — This is another beginner-level project that'll help you learn and practice the basics.
  • Mad Libs — Learn how to make interactive Python Mad Libs!
  • Hangman — Another childhood classic that you can make in Python to stretch your skills.
  • Snake — This is a bit more complex, but it's a classic (and surprisingly fun) game to make and play.

Website Wanda

Wanda wants to get a job building websites using Python, and she loves fitness and exercising. She's going to start by following a tutorial for the Flask web framework, and then try to build a very basic website that she can use to log each time that she exercises.

Once she's built this simple version, she plans to expand and add new features one by one.

Simple Python projects for beginner web devs:

  • URL shortener — This free video course will show you how to build your own URL shortener like Bit.ly using Python and Django.
  • Build a Simple Web Page with Django — This is a very in-depth, from-scratch tutorial for building a website with Python and Django that even has cartoon illustrations!

App Dev Aaron

Aaron wants to learn Python so that he can build apps for mobile devices and the web. 

Easy Python projects for aspiring developers:

Additional Python Project Ideas

Still haven't found a project idea that appeals to you? Here are a whole bunch more, separated out (roughly) by experience level.

These aren't tutorials, they're ideas that you'll have to dig into and research on your own, but that's part of the fun! And it's part of the natural process of learning to code, and even working as a programmer. The pros Google for answers all the time — so don't be afraid to dive in and get your hands dirty!

Python Project Ideas: Beginner Level

  • Create a "Code" Generator that takes text as input and replaces each letter with another letter, and outputs the "encoded" message.
  • Build a "countdown calculator." Write some code that can take two dates as input, and calculate the amount of time between them. This will be a great way to familiarize yourself with Python's datetime module.
  • Write a Sorting Method. Given a list, can you write some code that sorts it alphabetically, or numerically? Yes, Python has this functionality built-in, but see if you can do it without using sort()!
  • Build an Interactive Quiz. Which Avenger are you? Build a personality or recommendation quiz that asks users some questions, stores their answers, and then performs some kind of calculation to give the user a personalized end result based on their answers.
  • Tic-Tac-Toe by Text. Build a Tic-Tac-Toe game that's playable like a text adventure. Can you make it print a text-based representation of the board after each move?
  • Make a Temperature/Measurement Converter. Write a script that can convert Fahrenheit to Celsius and back, or inches to centimeters and back, etc. How far can you take it?
  • Build a counter app. Take your first steps into the world of UI by building a very simple app that counts up by one each time a user clicks a button.
  • Build a number guessing game. Think of this as a bit like a text adventure, but with numbers. How far can you take it?
  • Build an alarm clock. This is borderline beginner/intermediate, but it's worth trying to build an alarm clock for yourself. Can you create different alarms? A snooze function?

Python Project Ideas: Intermediate Level

  • Build an Upgraded Code Generator. Starting with the project mentioned in the beginner section, see what you can do to make it more sophisticated. Can you make it generate different kinds of codes? Can you create a "decoder" app that reads encoded messages if the user inputs a secret key? Can you create a more sophisticated code that goes beyond simple letter replacement?
  • Make your Tic-Tac-Toe Game clickable. Building off the beginner project, now make a version of Tic-Tac-Toe that has an actual UI, and that you play by clicking on open squares. Challenge: can you write a simple "AI" opponent for a human player to play against?
  • Scrape some data to analyze. This could really be anything, from any website you like. The web is full of interesting data, and if you learn a little about web-scraping, you can collect some really unique datasets.
  • Build a Clock Website. How close can you get it to real-time? Can you implement different time zone selectors, and add in the "countdown calculator" functionality to calculate lengths of time?
  • Automate some of your job. This will vary, but many jobs have some kind of repetitive process that can be automated!
  • Automate your personal habits. Do you want to remember to stand up once every hour during work? How about writing some code that generates unique workout plans for you based on your goals and preferences? There are a variety of simple apps you can build for yourself to automate or enhance different aspects of your life.
  • Create a simple web browser. Build a simple UI that allows for URL entry and that can load webpages. PyWt will be helpful here! Can you add a "back" button, bookmarks, and other cool features?
  • Write a notes app. Create an app that helps people write and store notes. Can you think of some interesting and unique features to add?
  • Build a Typing Tester. This should show the user some text, then challenge them to type it, timing how long it takes them to finish and scoring them on their accuracy.
  • Create a "site updated" notification system. Ever get annoyed with having to refresh a website to see if an out-of-stock product has been relisted, or to see if any new news has been posted? Write a Python script that automatically checks a given URL for updates and informs you instantly when it identifies one. (Be careful not to overload the servers of whatever site you're checking, though — keep the time interval between each check reasonable.)
  • Recreate your favorite board game in Python. There are tons of options here, from something simple like Checkers all the way up to Risk or even more modern and advanced games like Ticket to Ride or Settlers of Catan. How close can you get to the real thing?
  • Build a Wikipedia Explorer. Build an app that displays a random Wikipedia page. The challenge here is in the details: can you add user-selected categories? Can you try a different "rabbit hole" version of the app where each article is randomly selected from the articles linked in the previous article? This might seem simple, but it can actually take some real web-scraping chops.

Python Project Ideas: Advanced Level

  • Build a Stock Market Prediction App. For this one, you'll need a source of stock market data and some machine learning chops, but tons of people have tried this, so there's a lot of source code out there to work from. 
  • Build a Chatbot. The challenge here isn't so much making the chatbot as making it good. Can you, for example, implement some Natural Language Processing techniques to make it sound more natural and spontaneous?
  • Program a robot. This requires some hardware (which isn't usually free) but there are lots of affordable options out there, and tons of learning resources too. Definitely look into Raspberry Pi if you're not already thinking along those lines.
  • Build an Image Recognition App. Starting with handwriting recognition is a good idea — Dataquest even has a guided project to help with that! — but once you've got that down, you can take it much bigger.
  • Make a Price Prediction Model. Pick an industry or product you're interested in, and build a machine learning model that predicts price changes.
  • Create your own Sentiment Analysis Model. Sure, there are plenty of pre-built ones out there, but can you collect a large corpus of text data and build one of your own? (Or, less challenging: optimize an existing sentiment analysis model for the particular text you're analyzing.)
  • Create an interactive map. This will require a mix of data skills and UI creation skills. Your map can display whatever you'd like — bird migrations, traffic data, crime reports — but it should be interactive in some way. How far can you take it?

 

Next Steps

Each of the examples in the previous section followed the advice on choosing a great Python project for beginners:

  • Think about what you're interested in and choose a project that overlaps with your interests to help with motivation.
  • Think about your goals in learning Python, and make sure your project moves you toward those goals.
  • Start small. Once you've built a small project you can either expand it or build another one.

Now you're ready to get started. If you haven't learned the basics of Python yet, I recommend diving in with Dataquest's Python Fundamentals course.

If you already know the basics, there’s no reason to hesitate! Now is the time to dive in and find your perfect Python project.

(If you're stuck for ideas, this article contains lots of ideas as well as some resources for structured projects.)

The post 45 Fun (and Unique) Python Project Ideas for Easy Learning appeared first on Dataquest.

SQL Tutorial: Selecting Ungrouped Columns Without Aggregate Functions https://www.dataquest.io/blog/sql-tutorial-selecting-ungrouped-columns-without-aggregate-functions/ Tue, 12 Jan 2021 18:12:51 +0000 https://www.dataquest.io/?p=25115 When is a SQL query that returns the correct answer actually wrong? In this tutorial, we're going to take a close look at a very common mistake. It's one that will actually return the right answer, but it's still a mistake that's important to avoid.That probably sounds rather mysterious, so let's dive right in. We'll […]

The post SQL Tutorial: Selecting Ungrouped Columns Without Aggregate Functions appeared first on Dataquest.

sql-columns

When is a SQL query that returns the correct answer actually wrong? In this tutorial, we're going to take a close look at a very common mistake. It's one that will actually return the right answer, but it's still a mistake that's important to avoid.

That probably sounds rather mysterious, so let's dive right in. We'll illustrate the SQL mistake you might not even know you're making, and highlight how to approach the problem correctly.

The Problem: Right Answer, But Wrong SQL Query

At Dataquest, one of our favorite databases to teach SQL with is Chinook — a database of the records of a fictitious online music store. In one of the courses that uses it, learners are challenged to find the customer from each country who has spent the most money.

They often end up creating the following CTE. It contains a row per customer with their name, country, and total amount spent:

country customer_name total_purchased
Argentina Diego Gutiérrez 39.6
Australia Mark Taylor 81.18
Austria Astrid Gruber 69.3
Belgium Daan Peeters 60.39
Brazil Luís Gonçalves 108.9
Canada François Tremblay 99.99
Chile Luis Rojas 97.02
Czech Republic František Wichterlová 144.54
Denmark Kara Nielsen 37.62
Finland Terhi Hämäläinen 79.2
France Wyatt Girard 99.99
Germany Fynn Zimmermann 94.05
Hungary Ladislav Kovács 78.21
India Manoj Pareek 111.87
Ireland Hugh O’Reilly 114.84
Italy Lucas Mancini 50.49
Netherlands Johannes Van der Berg 65.34
Norway Bjørn Hansen 72.27
Poland Stanisław Wójcik 76.23
Portugal João Fernandes 102.96
Spain Enrique Muñoz 98.01
Sweden Joakim Johansson 75.24
USA Jack Smith 98.01
United Kingdom Phil Hughes 98.01

We’ll call this CTE customer_country_purchases.
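That CTE isn't reproduced in full here, but it might look something like the sketch below. Treat this purely as an illustration: the exact table and column names depend on the version of the Chinook database you're using, so consider them assumptions rather than the course's actual code.

WITH customer_country_purchases AS (
    SELECT
        c.country,
        c.first_name || ' ' || c.last_name AS customer_name,  -- assumed column names
        SUM(i.total) AS total_purchased
    FROM customer c                                            -- assumed table names
    INNER JOIN invoice i ON i.customer_id = c.customer_id
    GROUP BY 1, 2
)
SELECT * FROM customer_country_purchases;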

Generally, they're getting to that output — which is correct — using this query:

SELECT country, customer_name,
       MAX(total_purchased)
  FROM customer_country_purchases
 GROUP BY country;

In English: select the country, the maximum amount spent for that country, and include the customer’s name.

This is a very natural try, and it yields correct output! However, as you may have expected from my wording, there’s more to this solution than meets the eye.

What's Wrong With That?

The goal of this post is to clarify what is objectionable about the approach above. To make it easier to visualize what’s going on, we’ll drop Chinook, and work with a smaller table. We’ll be using the elite_agent table.

(This is also a fictional database; think of it as a table of secret agents by city, gender, and age).

id city gender age
1 Lisbon M 21
2 Chicago F 20
3 New York F 20
4 Chicago M 27
5 Lisbon F 27
6 Lisbon M 19
7 Lisbon F 23
8 Chicago F 24
9 Chicago M 21

If you wish to experiment with it, here's a SQLite database with this table.
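If you'd rather create the table yourself instead of downloading the database, here's a minimal sketch that recreates it in SQLite from the rows shown above (the column types are assumptions):

-- recreate the elite_agent table from the rows shown above
CREATE TABLE elite_agent (
    id INTEGER PRIMARY KEY,
    city TEXT,
    gender TEXT,
    age INTEGER
);

INSERT INTO elite_agent (id, city, gender, age) VALUES
    (1, 'Lisbon', 'M', 21),
    (2, 'Chicago', 'F', 20),
    (3, 'New York', 'F', 20),
    (4, 'Chicago', 'M', 27),
    (5, 'Lisbon', 'F', 27),
    (6, 'Lisbon', 'M', 19),
    (7, 'Lisbon', 'F', 23),
    (8, 'Chicago', 'F', 24),
    (9, 'Chicago', 'M', 21);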

If we compare this table to the Chinook table we were using, we can see that they're very similar in terms of how we'll handle the data in each column:

  • country in the Chinook database is similar to city
  • name is similar to gender
  • total_purchased is similar to age

Given that, we can structure a query for this new table that's essentially the same as the problematic query we were looking at with Chinook:

SELECT city, gender,
       MAX(age) AS max_age
  FROM elite_agent
 GROUP BY city;

Code-wise, the queries are equivalent.

So what’s so wrong with them? Let’s start answering this question.

Presumably, this query’s goal is to determine the age of the oldest agent in each city. If we didn’t want the gender column, we would run the query below.

SELECT city,
       MAX(age) AS max_age
  FROM elite_agent
 GROUP BY city;

Here is the output using the SQLite engine:

city max_age
Chicago 27
Lisbon 27
New York 20

Because we grouped by city, each row represents a city. We also included the maximum age for each group.

If we include the gender, we'll reproduce the first query we saw for this table — the one that is incorrect.

Why