The Dataquest Download
Level up your data and AI skills, one newsletter at a time.
Hello, Dataquesters!
Here’s what we have for you in this edition:
Top Read: Take your embedding skills further by learning how to measure semantic similarity and build a functional AI-powered search engine—no keyword matching required.
From the Community: Explore multidisciplinary NLP research in Telugu, traffic-pattern analysis, clean-code best practices, domain-specific R projects, and discussions on data type pitfalls and global healthcare innovation. Share your ideas on Python projects for beginners and interview prep forums.
What We’re Reading: LLM poisoning explained, plus a practical talk on building a personalized “Spotify Wrapped” using Elasticsearch and time-series analysis.

In our previous tutorial, we generated embeddings for 500 arXiv papers—now it’s time to put them to work. In this tutorial, you’ll learn how to measure meaning, not keywords, by comparing vectors using three essential similarity metrics. Then you’ll build a functional semantic search engine that ranks papers based on relevance, not literal matches. If you want to move from “I have embeddings” to “I can build real AI search systems,” this is your next step.
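To give a feel for what "comparing vectors" means in practice, here is a minimal sketch in plain Python. It is not the tutorial's code. This summary does not name the three metrics, so the sketch assumes a common trio: dot product, cosine similarity, and Euclidean distance. The paper titles, the three-dimensional vectors, and the `search` helper are all made up for illustration; real embeddings have hundreds of dimensions.

```python
import math

def dot(a, b):
    """Dot product: larger when vectors point the same way and are long."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (ignores vector length).
    Assumes neither vector is all zeros."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    """Straight-line distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(query_vec, paper_vecs, top_k=3):
    """Rank papers by cosine similarity to the query, best match first."""
    scored = [
        (title, cosine_similarity(query_vec, vec))
        for title, vec in paper_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy "embeddings" (hypothetical values, not real model output).
papers = {
    "Paper A": [0.9, 0.1, 0.0],
    "Paper B": [0.1, 0.9, 0.1],
    "Paper C": [0.7, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]

results = search(query, papers, top_k=2)
for title, score in results:
    print(f"{title}: {score:.3f}")
```

With these toy vectors, Paper A ranks first and Paper C second, because their vectors point in nearly the same direction as the query. No keywords are compared at any point; the ranking comes entirely from the geometry of the vectors.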
From the Community
The Metrical Poetry in Telugu: Boddu shared a research project, consisting of a new Python library and a website, that implements metrical poetry in Telugu, a language of India. The project is an excellent example of multidisciplinary work, combining computer science and linguistics to explore the characteristics of a language.
Beating the Queue—Exploring the I-94 Dataset: Melanie’s project showcases a meaningful and eye-catching title, an in-depth exploration of the effects of weather conditions, weekdays, and holiday seasons on traffic, an easy-to-read narrative, and a concise summary of the key factors that help predict traffic patterns.
Forums for Data Science Interview Discussions: Sagar is looking for dedicated communities and platforms where data scientists actively share their interview experiences and lessons learned, to make interview preparation easier and to get a sense of what to expect from companies.
Looking for Project Ideas for Python Beginners: Suheb is asking for ideas on small and simple Python projects that one can build right after learning the basics of Python, to practice new skills and strengthen understanding.
Writing Clean and Readable Python Code: Join your peers in a discussion about best practices for writing Python code that is not only technically correct but also easy to read, understand, and maintain—both for its author and for current and future colleagues.
Data Science Resources and Real-World Projects: Artur shared a collection of helpful data science resources (t-test functions, linear regression, and Quarto), along with three real-world, domain-specific research projects in R on bioinformatics topics that he has personally worked on.
Python Data Type Conversion Pitfalls: Check out this discussion to explore the kinds of mistakes that can occur when converting data in Python from one type to another (such as strings to numbers or floats to integers), and how to prevent such issues.
Building in Healthcare, Longevity Science, and Workforce Development: Venkatesh is working on ambitious global projects aimed at making healthcare affordable for all, combating aging, and creating opportunities for underrepresented entrepreneurs—and is open to collaboration.
What We're Reading
LLM poisoning: Researchers are discovering that even small changes to a model’s training data can secretly influence how it behaves. This article explains how “LLM poisoning” works, why it matters for the safety of AI systems, and what steps might prevent it in the future.
Building my own (accurate!) Spotify Wrapped: The EuroPython 2025 conference recently made all of its sessions available on YouTube. In this session, the speaker builds her own version of “Spotify Wrapped,” using Elasticsearch to analyze her listening trends from her own user-generated data. Working through queries, filters, aggregations, visualizations, and time-series analysis, she explores how search analytics can be applied to everyday use cases.
Give 20%, Get $20: Time to Refer a Friend!
Now is the perfect time to share Dataquest with a friend. Gift a 20% discount, and for every friend who subscribes, earn a $20 bonus. Use your bonuses for digital gift cards or prepaid cards, or donate them to charity. Your choice!
High-fives from Vik, Celeste, Anna P, Anna S, Anishta, Bruno, Elena, Mike, Daniel, and Brayan.