The Dataquest Download

Level up your data and AI skills, one newsletter at a time.

Each week, the Dataquest Download brings the latest behind-the-scenes developments at Dataquest directly to your inbox. Discover our top tutorial of the week to boost your data skills, get the scoop on any course changes, and pick up a useful tip to apply in your projects. We also spotlight standout projects from our students and share their personal learning journeys.

Hello, Dataquesters!

Here’s what we have for you in this edition:

Top Read: Learn how to use ChromaDB for scalable semantic search. Load thousands of arXiv embeddings, use ANN indexing, and keep queries fast at real-world sizes. Read the blog

From the Community: Excel-powered sales cleaning and reporting, a Power BI learning report with actionable insights, and a deep dive on building more stable conversational chatbots. Join the discussion

What We’re Reading: MIT’s Project Iceberg on AI automation across 151M jobs, a three-year ChatGPT retrospective, chatting with chess games using Python plus LLMs, and the 1B-token recipe for better pre-training mixes. Learn more

In the previous embeddings tutorial, we built a semantic search system that matched research papers by meaning rather than keywords. It worked well for 500 papers. But that approach relied on brute-force comparisons: every query is scored against every stored embedding, so search time grows with the size of your dataset. At 5,000 papers, performance drops. At 50,000 or 500,000, it becomes unusable.
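To make the brute-force bottleneck concrete, here is a minimal sketch in plain Python. It is not the tutorial's actual code: the function names (`cosine_similarity`, `brute_force_search`) and the tiny 3-dimensional vectors are made up for illustration, and real embeddings have hundreds of dimensions. The point is only that each query touches every stored vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(query, embeddings, top_k=3):
    """Score the query against EVERY stored embedding, then keep the best top_k.

    One similarity computation per stored vector: work grows linearly
    with the number of papers, which is what hurts at large sizes.
    """
    scored = [
        (cosine_similarity(query, emb), idx)
        for idx, emb in enumerate(embeddings)
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(idx, score) for score, idx in scored[:top_k]]


# Hypothetical toy "paper embeddings" (illustrative values only).
papers = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]

results = brute_force_search([1.0, 0.0, 0.0], papers, top_k=2)
for idx, score in results:
    print(f"paper {idx}: similarity {score:.3f}")
```

With four vectors this is instant. With hundreds of thousands, the same loop runs on every single query, which is the cost an ANN index is designed to avoid.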

This tutorial shows how ChromaDB solves that scaling problem. You’ll load thousands of arXiv embeddings, build a vector database with approximate nearest neighbor (ANN) indexing, and run semantic searches that stay fast even with large collections. It’s your next step toward production-ready search systems.

From the Community

Sales Data Cleaning and Manufacturer Analysis: In his portfolio-level individual project, Israel used advanced Excel formulas to transform a highly unstructured dataset into a clean, analysis-ready table and built a dynamic end-to-end workflow for automated sales reporting.

Dataquest Learning Report: Nisha’s Power BI project explores the data from multiple angles and features high-quality, insightful visualizations of Dataquest lesson completions, along with actionable recommendations to improve completion rates.

Developing and Training AI-powered Conversational Chatbots: Kritika examines the key factors that shape the performance of AI-driven conversational chatbots and contribute to more stable behavior in real-world interactions.

What We’re Reading

MIT’s Project Iceberg: This study maps out the entire U.S. workforce (151M jobs and 32,000 skills) and finds that 11.7% of jobs are already automatable with current AI tools. This is the clearest data-driven case yet for why reskilling and upskilling matter now.

Three Years of ChatGPT—A Retrospective (2022–2025): This article breaks down the current AI landscape: what’s working, where the challenges lie, and why productivity gains haven’t fully arrived (yet).

Talk to Your Chess Games with Python + LLMs: PyBay 2025 recently published all of its sessions on YouTube, covering a wide range of Python-related topics. In this video, the presenter showcases his program that connects ChatGPT to chess, letting you ask in plain language why certain chess engine lines work or don’t.

The 1 Billion Token Challenge—Finding the Perfect Pre-training Mix: This article breaks down how the right blend of training data can boost model quality without inflating dataset size. It highlights a practical strategy for mixing sources that produces strong results with far less data, which makes it a compelling read for anyone interested in how modern LLMs are really built.

Give 20%, Get $20: Time to Refer a Friend!

Now is the perfect time to share Dataquest with a friend. Gift a 20% discount, and for every friend who subscribes, earn a $20 bonus. Redeem your bonuses for digital gift cards or prepaid cards, or donate them to charity. Your choice! Click here

High-fives from Vik, Celeste, Anna P, Anna S, Anishta, Bruno, Elena, Mike, Daniel, and Brayan.

2025-12-04

Learn How to Use ChromaDB for Scalable Semantic Search

Learn scalable semantic search with ChromaDB, explore community projects in Excel, Power BI, and chatbots, and read insights on AI work. Read More
2025-11-27

What it takes to build real-world ETL systems

Learn to build an Airflow pipeline with live Amazon data, explore community projects in BI and ML, and read insights on AI coding, NLV, and LangChain. Read More
2025-11-20

Build a real semantic search engine

Learn semantic similarity, build an AI search engine, explore community NLP and traffic insights, and read fresh takes on LLM poisoning. Read More
