Vector Database Practice Project: Building a Knowledge Base Search System
You've learned how to use vector databases, chunk documents intelligently, filter with metadata, and combine semantic search with keyword matching. Now it's time to put everything together and build something with all those skills.
This project asks you to create a complete knowledge base search system from scratch. You'll collect your own data, make your own chunking decisions, choose your own database, and build a hybrid search that actually works. When you're done, you'll have a portfolio project that shows employers you can build production-quality vector search systems, not just follow tutorials.
How to Make This Project Portfolio-Ready
When technical recruiters or hiring managers review your version of this project, they're going to be asking themselves specific questions. Understanding what they care about helps you focus your efforts where they matter.
- Can this person make technical decisions?
They want to see that you choose technologies for good reasons. "I used Qdrant because it was in the tutorial," doesn't demonstrate decision-making. "I chose Qdrant because my queries heavily filter on multiple metadata fields, and I needed consistent performance across filter combinations," shows you understand tradeoffs and can match tools to requirements.
- Can this person evaluate their work?
Did you measure results systematically? Do you know if your solution actually works? Can you identify what could be improved? Showing evaluation methodology matters more than achieving perfect results. Everyone's first attempt has issues. What separates strong engineers is knowing how to measure and improve.
- Can this person handle real-world constraints?
Did you deal with API rate limits, data quality issues, or performance problems? Do you show resilience when things don't work the first time? Can you debug and iterate? Your README should tell this story. "I initially tried fixed-token chunking but discovered it split important context across chunks. After evaluation showed 60% recall, I switched to sentence-based chunking and improved to 87%."
- Can this person communicate technical concepts?
Is your documentation clear and useful? Can you explain complex decisions simply? Do you write for your audience? Your code matters, but the story you tell about the code matters more.
Your project should demonstrate all of this if you document it well. Focus on showing your thinking, not just your implementation.
What to Do If You Get Stuck
This project builds on everything you learned in the tutorial series. If you hit a wall on a particular concept or technique, here's where to review:
- Vector database basics and ChromaDB:
If you're struggling with collections, similarity search, or understanding HNSW indexing, review Introduction to Vector Databases using ChromaDB.
- Document chunking strategies:
If you're unsure how to chunk your documents or evaluate chunking quality, review Document Chunking Strategies for Vector Databases.
- Metadata design and hybrid search:
If you're having trouble with metadata schemas, filtering, or combining BM25 with vector search, review Metadata Filtering and Hybrid Search for Vector Databases.
- Production database setup and selection:
If you're stuck on pgvector, Qdrant, or Pinecone configuration, or you need help choosing between them, review Production Vector Databases.
- Semantic caching patterns:
If you want to add caching to your system (as an optional extension), review Semantic Caching and Memory Patterns for Vector Databases.
The tutorials give you the foundations. This project asks you to apply them to your own data and domain. When something isn't working, go back to the relevant tutorial, understand the pattern, then adapt it to your situation.
Project Requirements
Build a knowledge base search system that demonstrates mastery of vector databases, chunking strategies, and hybrid search. Your system should include the six components below.
Required Components Overview
- Data Collection: 1,000-3,000 documents from a new domain (not arXiv)
- Chunking Strategy: Implement and justify your approach
- Vector Database: Choose and set up one production database
- Hybrid Search: Combine semantic similarity with keyword matching
- Evaluation: Measure quality and performance with real queries
- Documentation: Make it portfolio-ready
Success Criteria
Your completed project should:
- Handle 1,000+ documents reliably, and report measured query latency (the focus is correctness + evaluation, not chasing speed at small scale)
- Demonstrate that hybrid search improves results (or document why it doesn't)
- Include rich metadata that enables meaningful filtering
- Show a clear evaluation methodology with test queries
- Provide comprehensive documentation that someone else could follow
- Present technical decisions with justified reasoning
Let's build this step by step.
Step 1: Data Collection
What you're building: A dataset of 1,000-3,000 documents from a domain you choose, with rich metadata for filtering.
Getting Your Data
We've created three data collection scripts that handle all the API complexity. Each script manages authentication, rate limiting, and retries, and saves output that's ready for your project.
Use one of the data collection folders from our tutorial repository:
Each option below lives in its own folder and includes:
- a data collection script
- a requirements.txt file with the dependencies needed to run it
You only need to use one of these to complete the project.
- Hugging Face (recommended)
Choose from curated datasets such as IMDb reviews, BBC news, or PubMed abstracts. No API keys and no rate limits. Produces *_documents.csv and *_documents_full.json (including body_text). Best default choice for predictable chunking and evaluation.
- Guardian
Collects ~2,000 Guardian articles across politics, technology, science, sport, and culture, including full text. Produces guardian_documents.csv and guardian_documents_full.json. Strong metadata for filtering (category, publication date, author, tags).
- NewsAPI
Collects ~1,000 recent articles discovered via NewsAPI. The script attempts to extract full text from each article URL by default, but some sites may block scraping or return partial content. Produces newsapi_documents.csv and newsapi_documents_full.json, with additional scrape and debug fields in the CSV.
See the data-collection-scripts/ folder in our tutorial repository for setup instructions and usage details for each option.
Quick note on predictability + runtime: Hugging Face and Guardian provide reliable body_text directly, which makes chunking and evaluation more predictable. NewsAPI’s own content field is often truncated, so this script attempts full-text extraction from each article URL by default. That scraping step can take noticeably longer and will fail on some sites (paywalls/anti-bot/JS-heavy pages). When extraction fails, the script falls back to the article’s description + truncated content, so expect some shorter/noisier documents. That extra step is why Hugging Face is the recommended default.
Using Your Own Data
If you have access to your own dataset (work documentation if permitted, personal knowledge base, or exported notes), you can use it instead. Make sure you have permission, at least 1,000 documents, meaningful metadata you can extract, and text-based content.
Important: Only use data you're legally allowed to store, embed, and publish. Don't commit proprietary or copyrighted raw text to GitHub. If your data contains personally identifiable information (names, emails, addresses), strip it before processing. When in doubt, stick with the provided collection scripts, which use publicly available data sources with appropriate attribution.
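If you do need to strip obvious PII before processing, a minimal regex pass like the sketch below can catch emails and phone-like numbers. It's only a starting point (names and addresses need more than regex), and the patterns are illustrative assumptions, not a complete solution:
import re

# Rough patterns for emails and phone-like numbers (illustrative, not exhaustive)
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
PHONE_RE = re.compile(r'\+?\d[\d\s().-]{7,}\d')

def strip_basic_pii(text):
    """Replace obvious emails and phone numbers with placeholders."""
    text = EMAIL_RE.sub('[EMAIL]', text)
    text = PHONE_RE.sub('[PHONE]', text)
    return text

cleaned = strip_basic_pii("Contact jane.doe@example.com or +1 (555) 123-4567")
print(cleaned)  # Contact [EMAIL] or [PHONE]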
Building It
Here's what running a collection script looks like:
# Download and set up your API key in .env file
# Example: Guardian collection
# 1. The script handles everything
# Just run: python collect_guardian_data.py
# 2. You'll get two files:
# guardian_documents.csv - metadata for your vector database
# guardian_documents_full.json - complete text (stored as body_text) for chunking
# 3. Load and explore your data
import pandas as pd
df = pd.read_csv('guardian_documents.csv')
print(f"Collected {len(df)} documents")
print("\nDocuments by category:")
print(df['category'].value_counts())
print(f"\nAverage word count: {df['word_count'].mean():.0f} words")
Take the time to actually read 5-10 sample documents. You need to understand what you're working with before you can chunk it intelligently or decide what metadata matters.
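For example, a quick way to skim a few full-text documents (this assumes the Guardian output files from above; swap in your own filenames and fields):
import json
import random

with open('guardian_documents_full.json', 'r', encoding='utf-8') as f:
    docs = json.load(f)

# Print a handful of random documents so you can judge quality and structure
for doc in random.sample(docs, 5):
    print(f"\n=== {doc.get('title', '(no title)')} [{doc.get('category', 'unknown')}] ===")
    print(doc.get('body_text', '')[:500])  # first 500 characters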
Quality Checks
Before moving to the next step, verify your data looks good:
| Quality Indicator | Red Flag | Green Light |
|---|---|---|
| Document count | Fewer than 500 documents | 1,000+ documents collected |
| Document length | All identical lengths (suggests truncation) OR suspiciously short documents for sources that should provide full text | Reasonable distribution (100–5,000 words typical) |
| Metadata coverage | Missing metadata for 20%+ of documents | Consistent metadata across all documents |
| Metadata fields | Only 1–2 fields available | 3+ meaningful fields for filtering |
| Data quality | Lots of corrupted text, parsing errors | Clean, readable documents |
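A rough script for running these checks on your CSV might look like this (column names assume the Guardian output used above; adjust them for your source):
import pandas as pd

df = pd.read_csv('guardian_documents.csv')

# Document count and length distribution
print(f"Documents: {len(df)}")
print(df['word_count'].describe())

# Metadata coverage: percent of missing values per field
print("\nMissing values per field (%):")
print((df.isna().mean() * 100).round(1))

# How many distinct values each metadata field has (useful for filtering)
for col in ['category', 'author', 'publication_date']:
    if col in df.columns:
        print(f"{col}: {df[col].nunique()} distinct values")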
Common Issues
| Symptom | Cause | Solution |
|---|---|---|
| "Invalid API key" error | Missing or incorrect key in .env file |
Verify you copied the key correctly and check for extra spaces |
| Collection stops partway | Hit API rate limit | Scripts handle this automatically with retries. Wait briefly and it will continue. |
| Articles all from same date | API parameters too restrictive | Check script comments for how to adjust date ranges |
| Missing metadata fields | API or dataset schema change, or source doesn’t provide the field | Double-check the API or dataset schema; some sources genuinely don’t expose author or tags |
Decision Guide: Choosing Your Data Source
Still deciding which script to use? Here's how to choose:
| Choose This | When You Want |
|---|---|
| HuggingFace | Quickest start (about 5 minutes), ability to try different domains, and no API key setup |
| Guardian | Rich journalism with excellent metadata and multi-category content. A strong all-around choice. |
| NewsAPI | Very current headlines across many sources, but full text is often truncated. Getting complete articles may require URL fetching and extraction, and some sources may block access. |
Can't decide? Start with Hugging Face. It downloads complete text immediately (no API keys, no rate limits, no scraping), so chunking and evaluation are much more predictable.
Guardian is a great next choice if you want live journalism data with strong metadata.
Treat NewsAPI as “advanced mode” because the API often returns truncated content and getting full text usually requires scraping article URLs (some sites will block it).
Step 2: Chunking and Embedding
What you're building: Document chunks with embeddings, ready to load into your vector database.
At this point, think in terms of documents and chunks, not rows or tables.
Each document produces multiple chunks, and each chunk becomes a searchable unit.
Choosing Your Chunking Strategy
You learned three approaches to chunking. Here's when to use each with your data:
| Strategy | Best When | Implementation |
|---|---|---|
| Sentence-based | Documents have clear sentence boundaries, such as articles, papers, or documentation | Target 400–600 words per chunk and group complete sentences |
| Fixed token | Documents lack structure, such as chat logs, transcripts, or code, or when consistent chunk sizes are required | Use 512 tokens with a 100-token overlap (about 20%) |
| Structure-aware | Documents have explicit structure, such as markdown headers, HTML tags, or defined sections | Respect section boundaries and split large sections only when necessary |
For most datasets (news articles, documentation, papers), sentence-based chunking works well. It preserves semantic coherence while reducing storage requirements compared to fixed-token approaches.
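If your data looks more like the middle row of the table (unstructured text), a fixed-token variant is straightforward. The sketch below uses tiktoken purely as an example tokenizer; the tutorials may use a different one, and the encoding name here is an assumption:
import tiktoken

def chunk_by_tokens(text, chunk_tokens=512, overlap_tokens=100):
    """Split text into fixed-size token windows with overlap."""
    enc = tiktoken.get_encoding("cl100k_base")  # assumption: any tokenizer with encode/decode works
    tokens = enc.encode(text)
    chunks = []
    step = chunk_tokens - overlap_tokens
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_tokens]
        chunks.append(enc.decode(window))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks
The rest of this step continues with the sentence-based approach, which is the better default for article-style data.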
Building It
Here's a sentence-based chunking implementation:
import pandas as pd
import json
import nltk
nltk.download('punkt', quiet=True)
from nltk.tokenize import sent_tokenize
def chunk_by_sentences(text, target_words=500, min_words=100):
"""
Chunk text by grouping sentences to target word count.
Args:
text: Document text to chunk
target_words: Target words per chunk (default 500)
min_words: Minimum words for valid chunk (default 100)
Returns:
List of text chunks
"""
sentences = sent_tokenize(text)
chunks = []
current_chunk = []
current_count = 0
for sentence in sentences:
sentence_words = len(sentence.split())
# Save current chunk if adding this sentence would exceed target
if current_count > 0 and current_count + sentence_words > target_words:
chunks.append(' '.join(current_chunk))
current_chunk = [sentence]
current_count = sentence_words
else:
current_chunk.append(sentence)
current_count += sentence_words
# Don't forget last chunk
if current_chunk and current_count >= min_words:
chunks.append(' '.join(current_chunk))
return chunks
# Load documents (works for Guardian / HF / NewsAPI outputs)
with open('guardian_documents_full.json', 'r', encoding='utf-8') as f:
docs = json.load(f)
all_chunks = []
chunk_metadata = []
for doc in docs:
text = doc['body_text']
chunks = chunk_by_sentences(text, target_words=500)
# Store each chunk with stable IDs and metadata linking back to source
for chunk_idx, chunk in enumerate(chunks):
chunk_id = f"{doc['id']}::chunk_{chunk_idx}" # stable ID used everywhere
all_chunks.append(chunk)
chunk_metadata.append({
'chunk_id': chunk_id,
'source_id': doc['id'],
'chunk_index': chunk_idx,
'title': doc.get('title', ''),
'category': doc.get('category', 'unknown'),
'publication_date': doc.get('publication_date', ''),
'author': doc.get('author', 'Unknown'),
'url': doc.get('url', ''),
'source': doc.get('source', '')
})
# These maps are your guardrails: you can always recover the right
# metadata/text by chunk_id, even if ordering changes later.
chunk_by_id = {m['chunk_id']: m for m in chunk_metadata}
text_by_id = {m['chunk_id']: all_chunks[i] for i, m in enumerate(chunk_metadata)}
print(f"Created {len(all_chunks)} chunks from {len(docs)} documents")
print(f"Average chunks per document: {len(all_chunks) / len(docs):.1f}")
print(f"Example chunk_id: {chunk_metadata[0]['chunk_id']}")
From here on, chunk_id is the only identifier you should trust. Treat list positions and DataFrame row order as accidental. If anything gets reloaded, merged, shuffled, or filtered, chunk_id is how you keep everything aligned.
Generating Embeddings
Now embed your chunks. This will take a while, depending on how many chunks you have. Expect 30-60 minutes for 5,000-10,000 chunks with proper rate limiting.
from cohere import ClientV2
from dotenv import load_dotenv
import os
import time
import numpy as np
import pandas as pd
load_dotenv()
co = ClientV2(api_key=os.getenv('COHERE_API_KEY'))
# Batch configuration for rate limit handling
batch_size = 15 # Conservative to avoid hitting limits
wait_time = 15 # Seconds between batches
# Your chunk size and overlap choices directly affect embedding costs
# More chunks = more API calls = higher cost and longer indexing time
# This is why evaluating chunking strategies matters
all_embeddings = []
num_batches = (len(all_chunks) + batch_size - 1) // batch_size
print(f"Generating embeddings for {len(all_chunks)} chunks...")
print(f"This will take approximately {(num_batches * wait_time) / 60:.0f} minutes")
for batch_idx in range(num_batches):
start_idx = batch_idx * batch_size
end_idx = min(start_idx + batch_size, len(all_chunks))
batch = all_chunks[start_idx:end_idx]
try:
response = co.embed(
texts=batch,
model='embed-v4.0',
input_type='search_document', # Important: use search_document for stored content
embedding_types=['float']
)
all_embeddings.extend(response.embeddings.float_)
if (batch_idx + 1) % 10 == 0:
print(f" Progress: {batch_idx + 1}/{num_batches} batches")
if batch_idx < num_batches - 1:
time.sleep(wait_time)
except Exception as e:
        # In production, implement exponential backoff and retry only on rate limit errors
# Save progress checkpoints to resume if something fails after 40 minutes
print(f" Hit rate limit or error: {e}")
print(f" Waiting 60 seconds before retry...")
time.sleep(60)
        # Retry this batch (in production: check error type, use exponential backoff)
response = co.embed(
texts=batch,
model='embed-v4.0',
input_type='search_document',
embedding_types=['float']
)
all_embeddings.extend(response.embeddings.float_)
if batch_idx < num_batches - 1:
time.sleep(wait_time)
# ---- Sanity checks (important for beginners) ----
# Embedding dims can vary by model/config. Don't assume they're always 1536.
if len(all_embeddings) != len(all_chunks):
raise ValueError("Embedding count does not match chunk count.")
embedding_dim = len(all_embeddings[0])
if any(len(e) != embedding_dim for e in all_embeddings):
raise ValueError("Inconsistent embedding dimensions detected.")
print(f"\nGenerated {len(all_embeddings)} embeddings")
print(f"Embedding dimension: {embedding_dim}")
# Save embeddings
embeddings_array = np.array(all_embeddings)
np.save('chunk_embeddings.npy', embeddings_array)
# Save metadata (includes chunk_id)
pd.DataFrame(chunk_metadata).to_csv('chunk_metadata.csv', index=False)
print("Saved to chunk_embeddings.npy and chunk_metadata.csv")
Quality Checks
| Quality Indicator | Red Flag | Green Light |
|---|---|---|
| Chunk count | Fewer than 2 chunks per document on average | 3–10 chunks per document is typical |
| Chunk sizes | Ranges from 10 words to 5,000 words, indicating the strategy broke | Chunks fall within a consistent range for the chosen strategy (for example, 300–700 words for sentence-based) |
| Source tracking | Unable to link chunks back to original documents | Clear metadata connects each chunk to its source document |
| Embedding count | Number of embeddings does not match the number of chunks | Embedding count matches chunks exactly |
| Sample quality | Chunks split mid-sentence or lose important context | Chunks contain complete, coherent thoughts |
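You can check most of these directly from the variables created above (a quick sketch, assuming all_chunks, chunk_metadata, and text_by_id from this step are still in memory):
from collections import Counter

# Chunk size distribution
chunk_words = [len(c.split()) for c in all_chunks]
print(f"Chunks: {len(all_chunks)}")
print(f"Words per chunk: min={min(chunk_words)}, "
      f"mean={sum(chunk_words) / len(chunk_words):.0f}, max={max(chunk_words)}")

# Chunks per document
chunks_per_doc = Counter(m['source_id'] for m in chunk_metadata)
print(f"Average chunks per document: {sum(chunks_per_doc.values()) / len(chunks_per_doc):.1f}")

# Eyeball a few chunks for coherence
for cid in list(text_by_id)[:3]:
    print(f"\n--- {cid} ---\n{text_by_id[cid][:300]}")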
Common Issues
| Symptom | Cause | Solution |
|---|---|---|
| "Rate limit exceeded" errors | Generating embeddings too fast | Increase wait_time to 20–30 seconds and reduce batch_size to 10 |
| Lost connection between chunks and docs | Forgot to add source_id to metadata | Add the document ID to chunk_metadata before embedding |
| Chunks wildly different sizes | Edge cases not handled, such as very short documents | Add minimum and maximum constraints to the chunking logic |
| Embedding generation slow | Normal behavior for large datasets | This is expected. For example, 5,000 chunks with 15-second delays takes roughly 45 minutes. |
Decision Guide: Chunk Size Selection
If you're unsure about chunk size parameters:
| Choose Smaller Chunks (200–300 words) | Choose Larger Chunks (600–800 words) |
|---|---|
| Queries target specific facts | Queries are conceptual or broad |
| Precision matters more than context | Context matters more than precision |
| Storage cost isn’t a concern | You want to minimize storage and API costs |
Most projects work well with 400-600 word chunks. Start there unless you have specific reasons to go smaller or larger.
Step 3: Vector Database Setup
What you're building: A production vector database loaded with your chunks and ready to query.
Choosing Your Database
You need to pick one of three options. Here's the decision framework:
| Choose pgvector If | Choose Qdrant If | Choose Pinecone If |
|---|---|---|
| You need the lowest possible query latency | You rely on heavy filtering across multiple text fields | You want zero operational overhead |
| You already have PostgreSQL infrastructure in place | You can accept HTTP API overhead | You can accept network latency overhead |
| Your team has strong SQL and PostgreSQL skills | You need consistent filter performance | Your team should focus on product features, not operations |
| You primarily filter on integers or dates | You’re comfortable running Docker or containerized services | Your scale may grow unpredictably over time |
Still unsure? Start with pgvector if you have Postgres experience, Qdrant if you don't. Both are solid choices that demonstrate production skills.
Note: The database loading examples below assume all_chunks is still in memory
from Step 2. If you are running this in a fresh session, reload chunk text from
disk (for example, from the same source used to generate chunk_metadata.csv).
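One way to do that reload, assuming the same chunk_by_sentences() function and parameters from Step 2 and the Guardian filenames used earlier (re-chunking is deterministic, so the chunk_ids line up):
import json
import pandas as pd

metadata = pd.read_csv('chunk_metadata.csv')

with open('guardian_documents_full.json', 'r', encoding='utf-8') as f:
    docs = json.load(f)

# Re-chunk and key every chunk by its stable chunk_id
text_by_id = {}
for doc in docs:
    for chunk_idx, chunk in enumerate(chunk_by_sentences(doc['body_text'], target_words=500)):
        text_by_id[f"{doc['id']}::chunk_{chunk_idx}"] = chunk

# Rebuild all_chunks in the same order as chunk_metadata.csv (and the saved embeddings)
all_chunks = [text_by_id[cid] for cid in metadata['chunk_id']]
assert len(all_chunks) == len(metadata)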
Building It: pgvector Example
Here's the complete setup for PostgreSQL with pgvector:
import psycopg2
from pgvector.psycopg2 import register_vector
import numpy as np
import pandas as pd
# This project uses whatever embedding dimension your saved file contains.
# Always set the DB dimension to match your actual embeddings.
embeddings = np.load('chunk_embeddings.npy')
EMBEDDING_DIM = embeddings.shape[1]
metadata = pd.read_csv('chunk_metadata.csv')
# In real systems, you often avoid storing full text in the vector table.
# For this practice project, storing content in Postgres is convenient and fine.
# (Cloud vector DB payloads often have stricter limits.)
# Connect to PostgreSQL
conn = psycopg2.connect(
host="localhost",
database="knowledge_base",
user="postgres",
password="your_password"
)
cur = conn.cursor()
# Enable pgvector extension FIRST
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.commit()
# THEN register vector type with Python driver
register_vector(conn)
# Create table for chunks
cur.execute(f"""
CREATE TABLE IF NOT EXISTS chunks (
id SERIAL PRIMARY KEY,
chunk_id TEXT UNIQUE, -- stable identifier used across systems
source_id TEXT,
chunk_index INTEGER,
title TEXT,
category TEXT,
publication_date TEXT,
author TEXT,
content TEXT,
embedding vector({EMBEDDING_DIM})
)
""")
conn.commit()
# Insert in batches
batch_size = 500
# This example assumes you're running Step 2 and Step 3 in the same session,
# so all_chunks, chunk_metadata.csv, and chunk_embeddings.npy still line up
# naturally.
# If you reload data in a new session, don't rely on order. Rebuild all_chunks
# by chunk_id from the JSON and join to chunk_metadata.csv by chunk_id.
for i in range(0, len(all_chunks), batch_size):
batch_end = min(i + batch_size, len(all_chunks))
for j in range(i, batch_end):
cur.execute("""
INSERT INTO chunks
(chunk_id, source_id, chunk_index, title, category, publication_date, author, content, embedding)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)
ON CONFLICT (chunk_id) DO NOTHING
""", (
metadata.iloc[j]['chunk_id'],
metadata.iloc[j]['source_id'],
int(metadata.iloc[j]['chunk_index']),
metadata.iloc[j]['title'],
metadata.iloc[j]['category'],
metadata.iloc[j]['publication_date'],
metadata.iloc[j]['author'],
all_chunks[j],
embeddings[j]
))
conn.commit()
print(f"Inserted {batch_end}/{len(all_chunks)} chunks")
# Create HNSW index for fast similarity search
print("Creating HNSW index...")
cur.execute("""
CREATE INDEX IF NOT EXISTS chunks_embedding_idx
ON chunks
USING hnsw (embedding vector_cosine_ops)
""")
conn.commit()
print(f"Setup complete! Loaded {len(all_chunks)} chunks into pgvector")
cur.close()
conn.close()
Building It: Qdrant Example
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import uuid
import numpy as np
import pandas as pd
embeddings = np.load('chunk_embeddings.npy')
EMBEDDING_DIM = embeddings.shape[1]
metadata = pd.read_csv('chunk_metadata.csv')
# Connect to Qdrant (assumes Docker running locally)
client = QdrantClient(host="localhost", port=6333)
collection_name = "knowledge_base"
# Recreate collection for a clean run (optional).
# If you prefer not to delete, check if it exists first.
client.recreate_collection(
collection_name=collection_name,
vectors_config=VectorParams(
size=EMBEDDING_DIM,
distance=Distance.COSINE
)
)
# Payload rule (simple): store IDs + filterable metadata + a short preview.
# Full chunk text stays in your local files (CSV/JSON)
# and is fetched by chunk_id when needed.
points = []
for idx in range(len(all_chunks)):
chunk_id = metadata.iloc[idx]['chunk_id']
point = PointStruct(
        id=str(uuid.uuid5(uuid.NAMESPACE_URL, chunk_id)),  # Qdrant point IDs must be unsigned ints or UUIDs
vector=embeddings[idx].tolist(),
payload={
'chunk_id': chunk_id,
'source_id': metadata.iloc[idx]['source_id'],
'chunk_index': int(metadata.iloc[idx]['chunk_index']),
'title': metadata.iloc[idx]['title'],
'category': metadata.iloc[idx]['category'],
'publication_date': metadata.iloc[idx]['publication_date'],
'author': metadata.iloc[idx]['author'],
'content_preview': all_chunks[idx][:200]
}
)
points.append(point)
# Upload in batches
batch_size = 100
for i in range(0, len(points), batch_size):
batch = points[i:i+batch_size]
client.upsert(collection_name=collection_name, points=batch)
print(f"Uploaded {min(i+batch_size, len(points))}/{len(points)} points")
print(f"Setup complete! Loaded {len(points)} chunks into Qdrant")
Building It: Pinecone Example
from pinecone import Pinecone, ServerlessSpec
import numpy as np
import pandas as pd
from dotenv import load_dotenv
import os
import time
load_dotenv()
pc = Pinecone(api_key=os.getenv('PINECONE_API_KEY'))
embeddings = np.load('chunk_embeddings.npy')
EMBEDDING_DIM = embeddings.shape[1]
metadata = pd.read_csv('chunk_metadata.csv')
index_name = "knowledge-base"
# Create serverless index (delete it manually in the Pinecone console if it already exists)
pc.create_index(
name=index_name,
dimension=EMBEDDING_DIM,
metric="cosine",
spec=ServerlessSpec(
cloud="aws",
region="us-east-1"
)
)
# Wait for index to be ready
while not pc.describe_index(index_name).status['ready']:
print("Waiting for index to be ready...")
time.sleep(1)
index = pc.Index(index_name)
# Keep metadata small: store IDs + filterable fields + a short preview.
# Full chunk text stays in local files and is fetched by chunk_id when needed.
vectors = []
for idx in range(len(all_chunks)):
chunk_id = metadata.iloc[idx]['chunk_id']
vectors.append({
'id': chunk_id,
'values': embeddings[idx].tolist(),
'metadata': {
'chunk_id': chunk_id,
'source_id': metadata.iloc[idx]['source_id'],
'chunk_index': int(metadata.iloc[idx]['chunk_index']),
'title': metadata.iloc[idx]['title'],
'category': metadata.iloc[idx]['category'],
'publication_date': metadata.iloc[idx]['publication_date'],
'author': metadata.iloc[idx]['author'],
'content_preview': all_chunks[idx][:200]
}
})
# Upload in batches
batch_size = 100
for i in range(0, len(vectors), batch_size):
batch = vectors[i:i+batch_size]
index.upsert(vectors=batch)
print(f"Uploaded {min(i+batch_size, len(vectors))}/{len(vectors)} vectors")
print("Waiting for vectors to be indexed...")
time.sleep(10)
print(f"Setup complete! Loaded {len(vectors)} vectors into Pinecone")
Quality Checks
| Quality Indicator | Red Flag | Green Light |
|---|---|---|
| Test query results | Returns random, unrelated chunks | Returns topically relevant chunks |
| Metadata filtering | Unable to filter results by category | Filters work correctly as expected |
| Query latency | Queries feel unreasonably slow for local execution | Queries feel appropriately fast for local execution |
| Result consistency | Different results returned for identical queries | Consistent results when repeating the same query |
| Data loading | Error messages appear during upload | All chunks load successfully without errors |
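A few quick verification queries for the pgvector setup (adapt the equivalents for Qdrant or Pinecone; the 'science' category value is just an example from the Guardian data):
import psycopg2

# Reconnect, since the loading script above closed its connection
conn = psycopg2.connect(host="localhost", database="knowledge_base",
                        user="postgres", password="your_password")
cur = conn.cursor()

# Did everything load?
cur.execute("SELECT COUNT(*) FROM chunks")
print("Chunks loaded:", cur.fetchone()[0])

# Does metadata look sensible?
cur.execute("SELECT category, COUNT(*) FROM chunks GROUP BY category ORDER BY COUNT(*) DESC")
for category, count in cur.fetchall():
    print(f"  {category}: {count}")

# Does a simple filter return rows? (empty results usually mean mismatched field values)
cur.execute("SELECT title FROM chunks WHERE category = %s LIMIT 3", ('science',))
print(cur.fetchall())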
Common Issues
| Symptom | Cause | Solution |
|---|---|---|
| "Dimension mismatch" error | Embedding size does not match database configuration | Verify that embeddings.shape[1] matches the database or index dimension, then recreate the collection or index if needed. |
| Very slow insertion | Batch size too small or network-related issues | Increase batch size to 500–1,000 and check network performance if using a cloud service. |
| Metadata filters return nothing | Field names don’t match or values are missing | Verify metadata field names exactly and check for null or missing values. |
| pgvector: "extension not found" | Tried to register vectors before creating the extension | Create the extension first, then call register_vector(). |
| Qdrant: "method not found" | Using outdated client methods | Check Qdrant Query API docs for current method names (APIs evolve over time). |
| Pinecone: vectors not queryable | Indexing not complete due to eventual consistency | Wait 30–60 seconds after upload before running queries. |
Step 4: Hybrid Search Implementation
What you're building: A search system that combines semantic similarity with keyword matching and returns ranked results.
Understanding the Components
Hybrid search needs three pieces working together:
- Vector similarity search - Returns semantically similar chunks
- BM25 keyword search - Returns chunks containing query terms
- Score combination - Merges both rankings into final results
Building It: BM25 Component
from rank_bm25 import BM25Okapi
import string
def simple_tokenize(text):
"""Basic tokenization for BM25 (good enough for this project)."""
text = text.lower()
text = text.translate(str.maketrans('', '', string.punctuation))
return text.split()
# Build BM25 index from all chunks
tokenized_chunks = [simple_tokenize(chunk) for chunk in all_chunks]
bm25 = BM25Okapi(tokenized_chunks)
# Map BM25 array indices -> chunk_id (this is how we "join" BM25 to vector DB results)
idx_to_chunk_id = [m['chunk_id'] for m in chunk_metadata]
# Test BM25 search
query = "climate change impacts on agriculture"
tokenized_query = simple_tokenize(query)
bm25_scores = bm25.get_scores(tokenized_query)
top_bm25_indices = bm25_scores.argsort()[::-1][:10]
print("Top 10 BM25 results:")
for idx in top_bm25_indices:
cid = idx_to_chunk_id[int(idx)]
print(f" Score: {bm25_scores[idx]:.2f} - {chunk_by_id[cid]['title']}")
Building It: Vector Search Component
# Example using pgvector (adapt for your chosen database)
# If you chose Qdrant/Pinecone, replace vector_search_pgvector with your
# DB’s query call that returns chunk_id.
import numpy as np
def vector_search_pgvector(cur, query_text, limit=10, category_filter=None):
"""
Search using vector similarity with optional metadata filter.
IMPORTANT:
- This function returns chunk_id (a stable string ID), not a SERIAL integer.
- We use chunk_id to join vector results with BM25 + local metadata.
"""
# Embed query
response = co.embed(
texts=[query_text],
model='embed-v4.0',
input_type='search_query', # Important: different from search_document
embedding_types=['float']
)
query_embedding = np.array(response.embeddings.float_[0])
if category_filter:
sql = """
SELECT chunk_id, title, category, content, embedding <=> %s AS distance
FROM chunks
WHERE category = %s
ORDER BY embedding <=> %s
LIMIT %s
"""
cur.execute(
sql,
(
                query_embedding,  # pass the numpy array; register_vector() adapts it to pgvector's vector type
                category_filter,
                query_embedding,
limit
)
)
else:
sql = """
SELECT chunk_id, title, category, content, embedding <=> %s AS distance
FROM chunks
ORDER BY embedding <=> %s
LIMIT %s
"""
cur.execute(
sql,
(
                query_embedding,
                query_embedding,
limit
)
)
results = cur.fetchall()
formatted = []
for doc in results:
formatted.append({
'chunk_id': doc[0],
'title': doc[1],
'category': doc[2],
'content': doc[3],
'distance': float(doc[4])
})
return formatted
Building It: Hybrid Combination
def hybrid_search(cur, query_text, alpha=0.5, limit=10, category_filter=None):
"""
Combine BM25 and vector search with weighted scoring.
ID handling:
- This project uses a stable chunk_id stored in metadata and in the vector database.
- Do NOT assume DB primary keys match Python list indices.
- We "join" BM25 and vector results using chunk_id.
Hybrid scoring note:
- BM25 scores and vector distances are on different scales.
- We convert both into simple 0–1-ish scores for blending.
- Treat the blended score as a ranking heuristic, not a meaningful absolute value.
Candidate set note (important even for this project):
- Don't compute hybrid scores over the entire corpus.
- Retrieve top-K candidates from each retriever, then fuse only those.
"""
# ---- BM25 component (scores by chunk_id) ----
tokenized_query = simple_tokenize(query_text)
bm25_scores = bm25.get_scores(tokenized_query)
# Normalize BM25 to 0-1 (heuristic, OK for this project)
max_bm25 = float(max(bm25_scores)) if max(bm25_scores) > 0 else 1.0
min_bm25 = float(min(bm25_scores))
bm25_norm_by_id = {}
for idx, score in enumerate(bm25_scores):
if max_bm25 > min_bm25:
norm = (float(score) - min_bm25) / (max_bm25 - min_bm25)
else:
norm = 0.0
bm25_norm_by_id[idx_to_chunk_id[idx]] = norm
# ---- Vector component (scores by chunk_id) ----
vector_results = vector_search_pgvector(
cur,
query_text,
limit=100, # retrieve more candidates for fusion
category_filter=category_filter
)
vector_score_by_id = {}
for r in vector_results:
chunk_id = r['chunk_id']
distance = float(r['distance'])
# Distances are ranking values, not universally comparable "similarity scores".
# We convert distance -> bounded score only so BM25 and vector can be blended.
# Do NOT interpret this as a probability or compare across metrics/databases.
vector_score_by_id[chunk_id] = 1 / (1 + distance)
# ---- Candidate set fusion (top-K union) ----
bm25_top_k = 100
top_bm25_indices = bm25_scores.argsort()[::-1][:bm25_top_k]
bm25_candidate_ids = {idx_to_chunk_id[int(i)] for i in top_bm25_indices}
vector_candidate_ids = set(vector_score_by_id.keys())
candidate_ids = bm25_candidate_ids | vector_candidate_ids
# ---- Weighted fusion by chunk_id ----
hybrid_by_id = {}
for chunk_id in candidate_ids:
bm25_part = bm25_norm_by_id.get(chunk_id, 0.0)
vec_part = vector_score_by_id.get(chunk_id, 0.0)
hybrid_by_id[chunk_id] = alpha * bm25_part + (1 - alpha) * vec_part
# ---- Top results ----
top_items = sorted(hybrid_by_id.items(), key=lambda x: x[1], reverse=True)[:limit]
final_results = []
for chunk_id, score in top_items:
meta = chunk_by_id[chunk_id]
text = text_by_id[chunk_id]
final_results.append({
'chunk_id': chunk_id,
'title': meta['title'],
'category': meta.get('category','unknown'),
'content': text[:200] + "...",
'hybrid_score': score,
'bm25_component': bm25_norm_by_id.get(chunk_id, 0.0),
'vector_component': vector_score_by_id.get(chunk_id, 0.0)
})
return final_results
# Test hybrid search (pgvector example)
# --- Reconnect for querying (pgvector) ---
import psycopg2
from pgvector.psycopg2 import register_vector
conn = psycopg2.connect(
host="localhost",
database="knowledge_base",
user="postgres",
password="your_password"
)
register_vector(conn)
cur = conn.cursor()
results = hybrid_search(cur, "climate change impacts on agriculture", alpha=0.5, limit=5)
print("\nTop 5 hybrid search results:")
for r in results:
print(f"\nTitle: {r['title']}")
print(f"Category: {r['category']}")
print(f"Hybrid score: {r['hybrid_score']:.3f} (BM25: {r['bm25_component']:.3f}, Vector: {r['vector_component']:.3f})")
print(f"Content: {r['content']}")
Quality Checks
| Quality Indicator | Red Flag | Green Light |
|---|---|---|
| Component independence | BM25 and vector components return identical results | Each component produces different rankings |
| Score combination | Changing alpha has no effect on result ordering | Different alpha values change the ranking order |
| Filter integration | Filters break hybrid search behavior | Metadata filters work correctly with hybrid scoring |
| Semantic queries | Pure keyword search wins on conceptual queries | Vector component contributes meaningfully on semantic queries |
| Keyword queries | Pure vector search wins on exact term matches | BM25 component improves results on keyword-based queries |
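A quick way to sanity-check component independence for a single query, using the pieces built earlier in this step (bm25, idx_to_chunk_id, simple_tokenize, and vector_search_pgvector):
query = "climate change impacts on agriculture"

# Top-10 chunk_ids from each component
bm25_scores = bm25.get_scores(simple_tokenize(query))
bm25_top = {idx_to_chunk_id[int(i)] for i in bm25_scores.argsort()[::-1][:10]}
vector_top = {r['chunk_id'] for r in vector_search_pgvector(cur, query, limit=10)}

overlap = len(bm25_top & vector_top)
print(f"Top-10 overlap between BM25 and vector: {overlap}/10")
# Identical top-10 sets on every query usually means one component isn't contributing.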
Common Issues
| Symptom | Cause | Solution |
|---|---|---|
| Hybrid identical to pure vector | BM25 component not working | Verify the BM25 index is built correctly and check tokenization. |
| Changing alpha has no effect | Scores are not normalized or one component returns zero values | Check score normalization and verify both components return non-zero scores. |
| All results from one component | Alpha is too extreme (0.0 or 1.0) or score scales differ wildly | Start with alpha = 0.5 and ensure scores are properly normalized. |
| Slow queries (2+ seconds) | Searches are running sequentially | Consider parallelizing vector and BM25 searches. |
| Filters return empty results | Metadata field names don’t match or values are missing | Verify exact field names and check for null or missing metadata. |
Decision Guide: Finding Optimal Alpha
Test different alpha values systematically:
| Alpha Value | Means | Best For |
|---|---|---|
| 0.0 | Pure vector search with no keyword component | Conceptual queries and semantic similarity |
| 0.3 | 30% keyword scoring and 70% vector scoring | Documents with rich vocabulary and a semantic focus |
| 0.5 | Balanced keyword and vector contribution | A safe default for mixed query types |
| 0.7 | 70% keyword scoring and 30% vector scoring | Cases where rare terms matter and exact matches are important |
| 1.0 | Pure BM25 keyword search with no vector component | Situations that require only keyword matching |
Run your test queries at different alpha values and measure which performs best for your domain.
Step 5: Evaluation and Iteration
What you're building: Evidence that your system works, with measurements proving hybrid search adds value.
Creating Test Queries
You need 10-15 queries where you know what the right answers should be. Here's how to create them:
# Read your data and identify diverse queries
# Example: Guardian news articles
test_queries = [
{
'query': 'climate change environmental impacts',
'expected_category': 'science',
'description': 'Semantic query about climate science'
},
{
'query': 'artificial intelligence machine learning development',
'expected_category': 'technology',
'description': 'Technical topic with clear keywords'
},
{
'query': 'election results voting democracy',
'expected_category': 'politics',
'description': 'Political coverage'
},
{
'query': 'olympic games sports competition',
'expected_category': 'sport',
'description': 'Sports coverage with specific terms'
},
{
'query': 'film review cinema entertainment',
'expected_category': 'culture',
'description': 'Arts and culture content'
},
# Add 5-10 more covering different query types
]
Mix query types: semantic queries, keyword-heavy queries, filtered queries, and edge cases.
Measuring Performance
def evaluate_search_strategy(search_func, test_queries, strategy_name, k=5):
"""
Evaluate a search strategy on test queries.
NOTE: This project uses a coarse proxy metric (category match), not true relevance judgments.
We use it here because it’s easy to create “expected” labels without manual annotation.
Args:
search_func: Function that takes query and returns results
test_queries: List of test query dicts
strategy_name: Name for this strategy (e.g., "Pure Vector", "Hybrid α=0.5")
k: Evaluate whether ANY of the top-k results match the expected category
Returns:
Dictionary with proxy success metrics
"""
results = {
'strategy': strategy_name,
'success_count': 0,
'total': len(test_queries),
'details': []
}
for test in test_queries:
query = test['query']
expected_category = test['expected_category'].lower()
search_results = search_func(query)
top_k = search_results[:k]
success = any(r['category'].lower() == expected_category for r in top_k)
if success:
results['success_count'] += 1
top_title = top_k[0]['title'] if top_k else "(no results)"
top_category = top_k[0]['category'] if top_k else "(no results)"
results['details'].append({
'query': query,
'expected_category': test['expected_category'],
'top_category': top_category,
'top_title': top_title,
'success_category_at_k': success
})
results['proxy_success_rate'] = results['success_count'] / results['total']
return results
# Pure vector: just set alpha=0.0
def pure_vector(q):
return hybrid_search(cur, q, alpha=0.0, limit=5)
# Pure BM25: alpha=1.0
def pure_bm25(q):
return hybrid_search(cur, q, alpha=1.0, limit=5)
vector_results = evaluate_search_strategy(pure_vector, test_queries, "Pure Vector", k=5)
print(f"\nPure Vector proxy success (category@5): {vector_results['proxy_success_rate']*100:.0f}%")
bm25_results = evaluate_search_strategy(pure_bm25, test_queries, "Pure BM25", k=5)
print(f"Pure BM25 proxy success (category@5): {bm25_results['proxy_success_rate']*100:.0f}%")
for a in [0.3, 0.5, 0.7]:
def hybrid_alpha(q, alpha=a):
return hybrid_search(cur, q, alpha=alpha, limit=5)
h = evaluate_search_strategy(hybrid_alpha, test_queries, f"Hybrid α={a}", k=5)
print(f"Hybrid α={a} proxy success (category@5): {h['proxy_success_rate']*100:.0f}%")
Measuring Query Latency
import time
def measure_latency(search_func, query, iterations=10):
"""Measure average query latency."""
times = []
# Warmup
for _ in range(3):
_ = search_func(query)
# Actual measurement
for _ in range(iterations):
start = time.time()
_ = search_func(query)
elapsed = (time.time() - start) * 1000 # Convert to ms
times.append(elapsed)
avg_time = sum(times) / len(times)
return avg_time
# Test latency
query = "climate change impacts on agriculture"
latency = measure_latency(lambda q: hybrid_search(cur, q, alpha=0.5), query)
print(f"\nAverage query latency: {latency:.0f}ms")
Quality Checks
| Quality Indicator | Red Flag | Green Light |
|---|---|---|
| Success rate | Below 50% on test queries | Above 70% on well-crafted queries |
| Hybrid improvement | Hybrid performs worse than pure vector | Hybrid improves results on 40% or more of queries |
| Query latency | Consistently over 2 seconds | Under 1 second for typical queries |
| Consistency | Same query produces different results | Repeating the same query gives consistent results |
What to Do When Results Are Poor
If your evaluation shows problems, here's how to diagnose and fix them:
| Poor Results On | Likely Cause | What to Try |
|---|---|---|
| All query types | Chunking broke the document context | Increase chunk size or switch to sentence-based chunking |
| Semantic queries | Embeddings are not working correctly | Verify that queries use the search_query input type and chunks use search_document |
| Keyword queries | BM25 component is not contributing | Confirm the BM25 index is built correctly and verify that alpha is not set too low |
| Filtered queries | Metadata-related issues | Verify metadata field names exactly and check for missing or null values |
When Hybrid Doesn't Help
Sometimes you'll discover pure vector search works just as well or even better than hybrid search.
This is a valid finding. Document it honestly:
# Evaluation Results
After testing pure vector, pure BM25, and hybrid search with α values from 0.3 to 0.7:
- Pure vector: 87% success rate
- Hybrid (α=0.5): 84% success rate
- Pure BM25: 71% success rate
**Conclusion**: Pure vector search performed best for this dataset.
**Why**: Guardian news articles have rich vocabulary and well-written content.
The semantic embeddings captured everything needed. Adding BM25 keyword
matching introduced noise without improving results.
**Decision**: Using pure vector search (hybrid with α=0.0) for final system.
You don’t always “win” by adding complexity. If testing shows a simpler approach works better, that’s a solid outcome, and worth documenting.
Step 6: Documentation and Presentation
What you're building: Professional documentation that makes this project portfolio-ready.
Writing Your README
Your README is the first thing anyone sees. Make it count.
# News Search System with Semantic Understanding
A hybrid search system for 2,000+ news articles that combines vector similarity
with keyword matching to help users find relevant content 10x faster than
traditional keyword search.
## What This Does
This system searches through 2,000 Guardian news articles across politics,
technology, science, sport, and culture. It understands queries semantically
(finding articles about "climate change impacts" even when they use terms like
"environmental consequences") while also supporting traditional keyword search
and metadata filtering.
## Technical Stack
- **Vector Database**: Qdrant (chosen for consistent 1.1x filtering overhead
across complex metadata queries)
- **Embeddings**: Cohere embed-v4.0 (dimension read from chunk_embeddings.npy)
- **Chunking**: Sentence-based with 500-word target (preserves semantic coherence,
reduces storage 44% vs fixed-token)
- **Hybrid Search**: BM25 + vector similarity with α=0.4 weighting
## Why These Choices
### Database: Qdrant
I chose Qdrant because my test queries heavily filter on multiple metadata
fields (category, date, author). Qdrant maintains consistent 1.1x overhead
regardless of filter complexity, compared to pgvector's variable 2.3x overhead
on text filters.
**Tradeoff**: Higher baseline latency (~50ms) due to HTTP API vs pgvector's in-process execution, but this was acceptable for my sub-200ms latency requirement.
### Chunking: Sentence-Based
News articles have clear sentence boundaries. Sentence-based chunking with
500-word targets created 4,500 chunks vs 8,200 with fixed-token approach,
reducing API costs by 45% while maintaining semantic coherence.
### Hybrid Search: α=0.4
Testing showed pure vector (87% success rate) slightly outperformed hybrid
α=0.5 (84%), but hybrid performed better on queries with rare technical terms.
Alpha=0.4 (60% vector, 40% keyword) balanced both cases, achieving 89% success
rate overall.
## Performance Metrics
- Average query latency: 127ms (measured; reported for transparency)
- Proxy quality: category match @10 = 94% on test queries
- Hybrid search improved **proxy** success on 67% of test queries vs pure vector
- Database choice: Qdrant handles complex filters without performance degradation
## Setup Instructions (Example)
This is a minimal example. Update file names, API keys, and database settings for your project.
1. Create and activate a virtual environment (recommended).
2. Install dependencies:
- `pip install -r requirements.txt`
3. Add API keys in a `.env` file (do not commit this file):
- `COHERE_API_KEY=...`
- (If using Pinecone) `PINECONE_API_KEY=...`
4. Collect data (example):
- `python collect_guardian_data.py`
5. Chunk + embed:
- Run the chunking script to create `chunk_metadata.csv`
- Run embedding generation to create `chunk_embeddings.npy`
6. Load into your vector database:
- pgvector: create the DB/table + insert docs + create HNSW index
- Qdrant/Pinecone: create collection/index + upsert vectors
7. Run search + evaluation:
- Build BM25 index
- Run hybrid search experiments and report results + latency
## Example Queries
**Query**: "climate change environmental impacts"
- Top result: "Climate crisis: How rising temperatures affect biodiversity"
- Category: Science (expected ✓)
- Hybrid score: 0.847
**Query**: "artificial intelligence ethics regulations"
- Top result: "EU AI Act: New framework for algorithmic accountability"
- Category: Technology (expected ✓)
- Hybrid score: 0.792
[Include 3-5 more examples...]
## What I Learned
The biggest surprise was discovering pure vector search nearly matched hybrid
performance. Guardian's professional journalism has rich vocabulary, making
semantic search highly effective. I kept hybrid search because it improved
results on technical queries, but simpler domains might not need it.
Handling rate limits during embedding generation was challenging. Implementing
exponential backoff reduced failures from 40% to under 5%.
Creating Your Architecture Diagram
Use one of these free tools:
- draw.io - Full-featured diagramming tool
- Excalidraw - Simple, hand-drawn style diagrams
- tldraw - Minimalist drawing tool
Your diagram should show:
- Data input (API → 2,000 articles)
- Processing pipeline (chunking → 4,500 chunks → Cohere embeddings)
- Storage (Qdrant with metadata)
- Query flow (user query → embedding → vector + BM25 → hybrid ranking)
- Results (top-k ranked chunks)
Focus on clarity over beauty. A simple box-and-arrow diagram works great.
Choosing Your Presentation Format
Pick ONE format that plays to your strengths:
| Format | Best When | Example |
|---|---|---|
| Demo Video (3–5 min) | You built a web interface that’s worth showing visually | Screen-record yourself querying the system and showing results update in real time |
| Blog Post | You made interesting technical or architectural decisions | Write up your database selection process and evaluation findings |
| Jupyter Notebook | You want to demonstrate methodology rigorously | Walk through evaluation with charts, showing code and results together |
| Web Interface | You want something others can actually use | Deploy a Streamlit app where people can try queries themselves |
For blog posts, see Building and Presenting Your Data Portfolio for excellent guidance on structure and presentation.
For web interfaces, Build a Web Interface for Your Chatbot with Streamlit walks through creating interactive demos.
Quality Checks
| Quality Indicator | Red Flag | Green Light |
|---|---|---|
| README clarity | Full of jargon with no clear explanation of what the project does | Someone unfamiliar with the project understands what you built |
| Technical justification | Technologies are listed without explaining why they were chosen | Decisions are explained with clear, specific reasoning |
| Setup instructions | Steps are missing or assume prior knowledge | Someone else can follow the instructions and run the project |
| Results demonstration | No examples, screenshots, or evaluation metrics provided | Concrete examples with real results are clearly shown |
| Professional polish | Code includes debug prints or commented-out sections | Clean code, organized structure, and working examples |
Common Issues
| Symptom | Cause | Solution |
|---|---|---|
| README too technical | Writing for yourself instead of the audience | Test it on a non-technical friend and simplify jargon where needed |
| No one can run your code | Hardcoded paths or missing dependencies | Test setup on a fresh machine and add a requirements.txt |
| Unclear what makes this special | Describes what was built, but not why | Add a Technical Decisions section explaining your reasoning |
| Too much detail | Including every implementation detail | Focus on the most interesting decisions and trade-offs, not every line of code |
Final Checklist
Before you consider this project complete:
Technical Implementation:
- Collected 1,000+ documents from new domain
- Implemented and justified chunking strategy
- Generated embeddings for all chunks
- Set up and loaded vector database
- Built hybrid search with metadata filtering
- Created 10+ test queries with known expected results
- Measured query latency and reported it (no fixed target at this scale)
- Compared hybrid vs pure approaches with actual data
Evaluation and Analysis:
- Success rate above 70% on test queries
- Hybrid search demonstrably improves results (or documented why it doesn't)
- Query latency measured and reported transparently
- Tested different alpha values and chose optimal
- Identified what works well and what could improve
Documentation:
- Comprehensive README explaining project
- Architecture diagram showing system flow
- Technical decisions documented with reasoning
- Setup instructions someone can follow
- Example queries with actual results
- Performance metrics and evaluation results
- One presentation format (video/blog/notebook/web interface)
Code Quality:
- Code organized and commented
- No hardcoded paths or API keys in code
- Requirements.txt or environment.yml included
- Can run on fresh environment following your instructions
Portfolio Readiness:
- Project public on GitHub
- README renders correctly on GitHub
- Someone unfamiliar could understand what you built
- Technical decisions justified, not just stated
- You can confidently explain this project in an interview
When you can check all these boxes, you have a portfolio project that demonstrates real vector database skills.
Next Steps
Congratulations on completing this project! You’ve built a working knowledge base search system using real data, from collection and chunking through embedding, indexing, hybrid retrieval, and evaluation. That’s a substantial piece of work, and it reflects the kinds of problems you’d encounter when building search systems outside of tutorials.
If you’d like to extend the project further, here are a few suggestions that mirror how these systems are often improved in practice:
-
Add Query Expansion
Use an LLM to reformulate user queries before searching to reduce vocabulary mismatches.
# Original query: "reducing storage requirements"
# Expanded: "reducing storage requirements compression optimization
#            resource savings data efficiency"
This can help surface relevant documents that use different terminology than the original query (see the sketch after this list).
-
Build Continuous Document Ingestion
Create a pipeline that automatically chunks, embeds, and indexes new documents as they arrive, instead of loading data once.
-
Implement Reranking
Retrieve a larger candidate set with fast vector search, then rerank the top results with a more accurate but slower model.
-
Create a Search Quality Dashboard
Track queries, results, and user interactions over time to measure quality and test improvements.
-
Add Multi-Language Support
Detect document language and use appropriate embedding models so the system works beyond English-only content.
-
Implement Document Versioning
Handle document updates without breaking search results or losing historical context.
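As a starting point for the query expansion idea above, here's a rough sketch using Cohere's chat endpoint and the co client, cur connection, and hybrid_search function from earlier steps. The model name and prompt are assumptions; use whichever chat model your account supports and tune the instruction for your domain:
def expand_query(co, query):
    """Ask an LLM to add closely related terms to a search query."""
    response = co.chat(
        model="command-r-plus-08-2024",  # assumption: substitute a chat model you have access to
        messages=[{
            "role": "user",
            "content": (
                "Rewrite this search query by appending 3-5 closely related terms. "
                "Return only the expanded query on a single line.\n\n"
                f"Query: {query}"
            )
        }]
    )
    return response.message.content[0].text.strip()

expanded = expand_query(co, "reducing storage requirements")
results = hybrid_search(cur, expanded, alpha=0.5, limit=5)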
As a portfolio project, this gives you a concrete system to talk through with interviewers, reviewers, or peers. You can explain how the system is structured, why you made certain design choices, how you evaluated the results, and what you’d improve with different constraints.
Before moving on, take some time to clean up the code and write clear documentation. That final pass is what turns this from a completed project into something you can confidently share and discuss.