The Dataquest Download

Level up your data and AI skills, one newsletter at a time.

Each week, the Dataquest Download brings the latest behind-the-scenes developments at Dataquest directly to your inbox. Discover our top tutorial of the week to boost your data skills, get the scoop on any course changes, and pick up a useful tip to apply in your projects. We also spotlight standout projects from our students and share their personal learning journeys.

Hello, Dataquesters!

Here’s what we have in store for you in this edition:

Top Read: Learn how to set up PySpark locally with Docker, understand core Spark concepts like RDDs and SparkSession, and build your first distributed data processing app. No cluster needed.

From the Community: See how one learner built a spam filter with over 98% accuracy and another predicted heart disease using logistic regression, and get a clear breakdown of the difference between else and elif.

DQ Resources: Build and deploy automated ETL pipelines using Apache Airflow, starting locally and scaling to the cloud with AWS.

Struggling with slow Python scripts and crashing Excel files? It’s time to level up. This beginner-friendly tutorial walks you through setting up PySpark locally, explains Spark’s architecture in plain language, and shows you how to build your first distributed data processing app using Python.

  • Learn the role of SparkSession, RDDs, and SparkContext
  • Set up PySpark in Jupyter without configuration headaches
  • Understand how distributed computing tackles real-world data challenges
  • Run your first Spark job on real data—no cluster required

From the Community

Building a Spam Filter with Naive Bayes: Steve’s machine learning project hits over 98% accuracy using clean, efficient code and well-structured functions. Simple yet powerful.

Predicting Heart Disease Risk with Logistic Regression: Dimitar delivers a full end-to-end project with clear goals, strong EDA, and impactful visualizations, all tied together with great storytelling.

Else vs. Elif in Python: Raisa breaks down the difference between else and elif with clear examples. Great for anyone refining their Python logic.
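As a quick illustration of the distinction (a minimal sketch, not taken from Raisa's post; the function name and temperature thresholds are hypothetical), elif tests a further condition only when the branches above it were false, while else takes no condition and catches whatever is left:

```python
def describe_temperature(celsius):
    """Return a label for a temperature. Thresholds are illustrative only."""
    if celsius < 0:
        return "freezing"
    elif celsius < 20:
        # Checked only when the `if` condition above was False.
        return "mild"
    else:
        # No condition: runs when every branch above was False.
        return "warm"


# Try one value per branch.
labels = [describe_temperature(t) for t in (-5, 10, 30)]
print(labels)
```

A chain can have many elif branches but at most one else, and the else must come last.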

DQ Resources

Automate and Monitor ETL Pipelines Locally (Part I): Build a fully functional ETL pipeline running locally with Apache Airflow and Docker. Automate data tasks, monitor them through a visual UI, and quickly identify and fix any issues. No more manual runs or missed jobs.

Launch a Scalable, Cloud-Hosted ETL Pipeline (Part II): Deploy your ETL workflow to the cloud using AWS. You'll build a production-ready Airflow setup that includes cloud storage (S3), a relational database (RDS), IAM roles, and secure infrastructure, designed to scale and run reliably.

How to Choose the Right Cloud Service Provider: Not sure whether to go with AWS, Azure, or GCP? This 30-day roadmap helps you test all three using their free tiers so you can make an informed choice based on hands-on experience. Perfect for beginners!

What We're Reading

Estimating Memory Usage in pandas: Understand how pandas handles memory and file sizes with this clear breakdown. Helpful for anyone working with large datasets or optimizing data pipelines.

Train Your Own Vision Language Model with nanoVLM: Inspired by nanoGPT, nanoVLM lets you explore vision-language modeling in a beginner-friendly way, right from a free Colab notebook. Great for learning the fundamentals of multimodal AI.

Give 20%, Get $20: Time to Refer a Friend!

Now is the perfect time to share Dataquest with a friend. Gift a 20% discount, and for every friend who subscribes, earn a $20 bonus. Redeem your bonuses for digital gift cards or prepaid cards, or donate them to charity. Your choice!

High-fives from Vik, Celeste, Anna P, Anna S, Anishta, Bruno, Elena, Mike, Daniel, and Brayan.

2025-06-25

Struggling with Slow Python Scripts and Crashing Excel Files?

Explore PySpark locally, build your first Spark app, master ETL pipelines with Airflow on AWS, and learn from impressive community projects.
