Vector Database Practice Project: Building a Knowledge Base Search System

You've learned how to use vector databases, chunk documents intelligently, filter with metadata, and combine semantic search with keyword matching. Now it's time to put everything together and build something with all those skills.

This project asks you to create a complete knowledge base search system from scratch. You'll collect your own data, make your own chunking decisions, choose your own database, and build a hybrid search that actually works. When you're done, you'll have a portfolio project that shows employers you can build production-quality vector search systems, not just follow tutorials.

How to Make This Project Portfolio-Ready

When technical recruiters or hiring managers review your version of this project, they're going to be asking themselves specific questions. Understanding what they care about helps you focus your efforts where they matter.

  • Can this person make technical decisions?
    They want to see that you choose technologies for good reasons. "I used Qdrant because it was in the tutorial" doesn't demonstrate decision-making. "I chose Qdrant because my queries heavily filter on multiple metadata fields, and I needed consistent performance across filter combinations" shows you understand tradeoffs and can match tools to requirements.
  • Can this person evaluate their work?
    Did you measure results systematically? Do you know if your solution actually works? Can you identify what could be improved? Showing evaluation methodology matters more than achieving perfect results. Everyone's first attempt has issues. What separates strong engineers is knowing how to measure and improve.
  • Can this person handle real-world constraints?
    Did you deal with API rate limits, data quality issues, or performance problems? Do you show resilience when things don't work the first time? Can you debug and iterate? Your README should tell this story. "I initially tried fixed-token chunking but discovered it split important context across chunks. After evaluation showed 60% recall, I switched to sentence-based chunking and improved to 87%."
  • Can this person communicate technical concepts?
    Is your documentation clear and useful? Can you explain complex decisions simply? Do you write for your audience? Your code matters, but the story you tell about the code matters more.

Your project should demonstrate all of this if you document it well. Focus on showing your thinking, not just your implementation.

What to Do If You Get Stuck

This project builds on everything you learned in the tutorial series. If you hit a wall on a particular concept or technique, here's where to review:

  1. Vector database basics and ChromaDB:
    If you're struggling with collections, similarity search, or understanding HNSW indexing, review Introduction to Vector Databases using ChromaDB.
  2. Document chunking strategies:
    If you're unsure how to chunk your documents or evaluate chunking quality, review Document Chunking Strategies for Vector Databases.
  3. Metadata design and hybrid search:
    If you're having trouble with metadata schemas, filtering, or combining BM25 with vector search, review Metadata Filtering and Hybrid Search for Vector Databases.
  4. Production database setup and selection:
    If you're stuck on pgvector, Qdrant, or Pinecone configuration, or you need help choosing between them, review Production Vector Databases.
  5. Semantic caching patterns:
    If you want to add caching to your system (as an optional extension), review Semantic Caching and Memory Patterns for Vector Databases.

The tutorials give you the foundations. This project asks you to apply them to your own data and domain. When something isn't working, go back to the relevant tutorial, understand the pattern, then adapt it to your situation.

Project Requirements

Build a knowledge base search system that demonstrates mastery of vector databases, chunking strategies, and hybrid search. Your system should include the six components below.

Required Components Overview

  1. Data Collection: 1,000-3,000 documents from a new domain (not arXiv)
  2. Chunking Strategy: Implement and justify your approach
  3. Vector Database: Choose and set up one production database
  4. Hybrid Search: Combine semantic similarity with keyword matching
  5. Evaluation: Measure quality and performance with real queries
  6. Documentation: Make it portfolio-ready

Success Criteria

Your completed project should:

  • Handle 1,000+ documents reliably, and report measured query latency (the focus is correctness + evaluation, not chasing speed at small scale)
  • Demonstrate that hybrid search improves results (or document why it doesn't)
  • Include rich metadata that enables meaningful filtering
  • Show a clear evaluation methodology with test queries
  • Provide comprehensive documentation that someone else could follow
  • Present technical decisions with justified reasoning

Let's build this step by step.


Step 1: Data Collection

What you're building: A dataset of 1,000-3,000 documents from a domain you choose, with rich metadata for filtering.

Getting Your Data

We've created three data collection scripts that handle all the API complexity. Each script manages authentication, rate limiting, and retries, and saves output ready for your project.

Use one of the data collection folders from our tutorial repository. Each option below lives in its own folder and includes:

  • a data collection script
  • a requirements.txt file with the dependencies needed to run it

You only need to use one of these to complete the project.

  • Hugging Face (recommended)

    Choose from curated datasets such as IMDb reviews, BBC news, or PubMed abstracts. No API keys and no rate limits. Produces *_documents.csv and *_documents_full.json (including body_text). Best default choice for predictable chunking and evaluation.

  • The Guardian

    Collects ~2,000 Guardian articles across politics, technology, science, sport, and culture, including full text. Produces guardian_documents.csv and guardian_documents_full.json. Strong metadata for filtering (category, publication date, author, tags).

  • NewsAPI

    Collects ~1,000 recent articles discovered via NewsAPI. The script attempts to extract full text from each article URL by default, but some sites may block scraping or return partial content. Produces newsapi_documents.csv and newsapi_documents_full.json, with additional scrape and debug fields in the CSV.

See the data-collection-scripts/ folder in our tutorial repository for setup instructions and usage details for each option.

Quick note on predictability + runtime: Hugging Face and Guardian provide reliable body_text directly, which makes chunking and evaluation more predictable. NewsAPI’s own content field is often truncated, so this script attempts full-text extraction from each article URL by default. That scraping step can take noticeably longer and will fail on some sites (paywalls/anti-bot/JS-heavy pages). When extraction fails, the script falls back to the article’s description + truncated content, so expect some shorter/noisier documents. That extra step is why Hugging Face is the recommended default.

Using Your Own Data

If you have access to your own dataset (work documentation if permitted, personal knowledge base, or exported notes), you can use it instead. Make sure you have permission, at least 1,000 documents, meaningful metadata you can extract, and text-based content.

Important: Only use data you're legally allowed to store, embed, and publish. Don't commit proprietary or copyrighted raw text to GitHub. If your data contains personally identifiable information (names, emails, addresses), strip it before processing. When in doubt, stick with the provided collection scripts, which use publicly available data sources with appropriate attribution.

Building It

Here's what running a collection script looks like:

# Download and set up your API key in .env file
# Example: Guardian collection

# 1. The script handles everything
# Just run: python collect_guardian_data.py

# 2. You'll get two files:
# guardian_documents.csv - metadata for your vector database
# guardian_documents_full.json - complete text (stored as body_text) for chunking

# 3. Load and explore your data
import pandas as pd

df = pd.read_csv('guardian_documents.csv')
print(f"Collected {len(df)} documents")
print("\nDocuments by category:")
print(df['category'].value_counts())
print(f"\nAverage word count: {df['word_count'].mean():.0f} words")

Take the time to actually read 5-10 sample documents. You need to understand what you're working with before you can chunk it intelligently or decide what metadata matters.

Quality Checks

Before moving to the next step, verify your data looks good:

| Quality Indicator | Red Flag | Green Light |
|---|---|---|
| Document count | Fewer than 500 documents | 1,000+ documents collected |
| Document length | All identical lengths (suggests truncation) OR suspiciously short documents for sources that should provide full text | Reasonable distribution (100–5,000 words typical) |
| Metadata coverage | Missing metadata for 20%+ of documents | Consistent metadata across all documents |
| Metadata fields | Only 1–2 fields available | 3+ meaningful fields for filtering |
| Data quality | Lots of corrupted text, parsing errors | Clean, readable documents |
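
A quick way to run these checks is a short script over the collected CSV. This sketch assumes the Guardian output with the word_count and category columns shown earlier; adjust the file name and column names for your source.

import pandas as pd

df = pd.read_csv('guardian_documents.csv')

# Document count
print(f"Documents collected: {len(df)}")

# Length distribution (watch for identical or suspiciously short lengths)
print(df['word_count'].describe())

# Metadata coverage: percentage of missing values per field
missing_pct = (df.isna().mean() * 100).round(1)
print(missing_pct.sort_values(ascending=False))

# Spot-check a few random documents for readable, relevant text
print(df.sample(3, random_state=0)[['title', 'category', 'word_count']])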

Common Issues

| Symptom | Cause | Solution |
|---|---|---|
| "Invalid API key" error | Missing or incorrect key in .env file | Verify you copied the key correctly and check for extra spaces |
| Collection stops partway | Hit API rate limit | Scripts handle this automatically with retries. Wait briefly and it will continue. |
| Articles all from same date | API parameters too restrictive | Check script comments for how to adjust date ranges |
| Missing metadata fields | API or dataset schema change, or source doesn’t provide the field | Double-check the API or dataset schema; some sources genuinely don’t expose author or tags |

Decision Guide: Choosing Your Data Source

Still deciding which script to use? Here's how to choose:

| Choose This | When You Want |
|---|---|
| Hugging Face | Quickest start (about 5 minutes), ability to try different domains, and no API key setup |
| Guardian | Rich journalism with excellent metadata and multi-category content. A strong all-around choice. |
| NewsAPI | Very current headlines across many sources, but full text is often truncated. Getting complete articles may require URL fetching and extraction, and some sources may block access. |

Can't decide? Start with Hugging Face. It downloads complete text immediately (no API keys, no rate limits, no scraping), so chunking and evaluation are much more predictable.

Guardian is a great next choice if you want live journalism data with strong metadata.

Treat NewsAPI as “advanced mode” because the API often returns truncated content and getting full text usually requires scraping article URLs (some sites will block it).


Step 2: Chunking and Embedding

What you're building: Document chunks with embeddings, ready to load into your vector database.

At this point, think in terms of documents and chunks, not rows or tables. Each document produces multiple chunks, and each chunk becomes a searchable unit.

Choosing Your Chunking Strategy

You learned three approaches to chunking. Here's when to use each with your data:

| Strategy | Best When | Implementation |
|---|---|---|
| Sentence-based | Documents have clear sentence boundaries, such as articles, papers, or documentation | Target 400–600 words per chunk and group complete sentences |
| Fixed token | Documents lack structure, such as chat logs, transcripts, or code, or when consistent chunk sizes are required | Use 512 tokens with a 100-token overlap (about 20%) |
| Structure-aware | Documents have explicit structure, such as markdown headers, HTML tags, or defined sections | Respect section boundaries and split large sections only when necessary |

For most datasets (news articles, documentation, papers), sentence-based chunking works well. It preserves semantic coherence while reducing storage requirements compared to fixed-token approaches.

Building It

Here's a sentence-based chunking implementation:

import pandas as pd
import json
import nltk
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # newer NLTK releases need this for sent_tokenize
from nltk.tokenize import sent_tokenize

def chunk_by_sentences(text, target_words=500, min_words=100):
    """
    Chunk text by grouping sentences to target word count.

    Args:
        text: Document text to chunk
        target_words: Target words per chunk (default 500)
        min_words: Minimum words for valid chunk (default 100)

    Returns:
        List of text chunks
    """
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = []
    current_count = 0

    for sentence in sentences:
        sentence_words = len(sentence.split())

        # Save current chunk if adding this sentence would exceed target
        if current_count > 0 and current_count + sentence_words > target_words:
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentence]
            current_count = sentence_words
        else:
            current_chunk.append(sentence)
            current_count += sentence_words

    # Don't forget last chunk
    if current_chunk and current_count >= min_words:
        chunks.append(' '.join(current_chunk))

    return chunks

# Load documents (works for Guardian / HF / NewsAPI outputs)
with open('guardian_documents_full.json', 'r', encoding='utf-8') as f:
    docs = json.load(f)

all_chunks = []
chunk_metadata = []

for doc in docs:
    text = doc['body_text']
    chunks = chunk_by_sentences(text, target_words=500)

    # Store each chunk with stable IDs and metadata linking back to source
    for chunk_idx, chunk in enumerate(chunks):
        chunk_id = f"{doc['id']}::chunk_{chunk_idx}"  # stable ID used everywhere

        all_chunks.append(chunk)
        chunk_metadata.append({
            'chunk_id': chunk_id,
            'source_id': doc['id'],
            'chunk_index': chunk_idx,
            'title': doc.get('title', ''),
            'category': doc.get('category', 'unknown'),
            'publication_date': doc.get('publication_date', ''),
            'author': doc.get('author', 'Unknown'),
            'url': doc.get('url', ''),
            'source': doc.get('source', '')
        })

# These maps are your guardrails: you can always recover the right 
# metadata/text by chunk_id, even if ordering changes later.
chunk_by_id = {m['chunk_id']: m for m in chunk_metadata}
text_by_id = {m['chunk_id']: all_chunks[i] for i, m in enumerate(chunk_metadata)}

print(f"Created {len(all_chunks)} chunks from {len(docs)} documents")
print(f"Average chunks per document: {len(all_chunks) / len(docs):.1f}")
print(f"Example chunk_id: {chunk_metadata[0]['chunk_id']}")

From here on, chunk_id is the only identifier you should trust. Treat list positions and DataFrame row order as accidental. If anything gets reloaded, merged, shuffled, or filtered, chunk_id is how you keep everything aligned.
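
One practical consequence: persist the chunk text itself keyed by chunk_id, so later steps (or a fresh session) never depend on list order. Here's a minimal sketch; the file name chunk_texts.json is just a suggestion.

import json

# Save chunk text keyed by chunk_id so it can be reloaded later without
# relying on list positions or DataFrame row order.
with open('chunk_texts.json', 'w', encoding='utf-8') as f:
    json.dump(text_by_id, f, ensure_ascii=False)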

Generating Embeddings

Now embed your chunks. This will take a while, depending on how many chunks you have: with the conservative settings below (batches of 15 chunks, 15-second waits), 5,000 chunks takes well over an hour, and the script prints its own estimate before it starts.

from cohere import ClientV2
from dotenv import load_dotenv
import os
import time
import numpy as np
import pandas as pd

load_dotenv()
co = ClientV2(api_key=os.getenv('COHERE_API_KEY'))

# Batch configuration for rate limit handling
batch_size = 15  # Conservative to avoid hitting limits
wait_time = 15   # Seconds between batches

# Your chunk size and overlap choices directly affect embedding costs
# More chunks = more API calls = higher cost and longer indexing time
# This is why evaluating chunking strategies matters

all_embeddings = []
num_batches = (len(all_chunks) + batch_size - 1) // batch_size

print(f"Generating embeddings for {len(all_chunks)} chunks...")
print(f"This will take approximately {(num_batches * wait_time) / 60:.0f} minutes")

for batch_idx in range(num_batches):
    start_idx = batch_idx * batch_size
    end_idx = min(start_idx + batch_size, len(all_chunks))
    batch = all_chunks[start_idx:end_idx]

    try:
        response = co.embed(
            texts=batch,
            model='embed-v4.0',
            input_type='search_document',  # Important: use search_document for stored content
            embedding_types=['float']
        )
        all_embeddings.extend(response.embeddings.float_)

        if (batch_idx + 1) % 10 == 0:
            print(f"  Progress: {batch_idx + 1}/{num_batches} batches")

        if batch_idx < num_batches - 1:
            time.sleep(wait_time)

    except Exception as e:
        # In production, implement exp backoff and retry only on rate limits
        # Save progress checkpoints to resume if something fails after 40 minutes
        print(f"  Hit rate limit or error: {e}")
        print(f"  Waiting 60 seconds before retry...")
        time.sleep(60)

        # Retry this batch (in production: check error type, use exp backoff)
        response = co.embed(
            texts=batch,
            model='embed-v4.0',
            input_type='search_document',
            embedding_types=['float']
        )
        all_embeddings.extend(response.embeddings.float_)

        if batch_idx < num_batches - 1:
            time.sleep(wait_time)

# ---- Sanity checks (important for beginners) ----
# Embedding dims can vary by model/config. Don't assume they're always 1536.
if len(all_embeddings) != len(all_chunks):
    raise ValueError("Embedding count does not match chunk count.")

embedding_dim = len(all_embeddings[0])
if any(len(e) != embedding_dim for e in all_embeddings):
    raise ValueError("Inconsistent embedding dimensions detected.")

print(f"\nGenerated {len(all_embeddings)} embeddings")
print(f"Embedding dimension: {embedding_dim}")

# Save embeddings
embeddings_array = np.array(all_embeddings)
np.save('chunk_embeddings.npy', embeddings_array)

# Save metadata (includes chunk_id)
pd.DataFrame(chunk_metadata).to_csv('chunk_metadata.csv', index=False)

print("Saved to chunk_embeddings.npy and chunk_metadata.csv")

Quality Checks

| Quality Indicator | Red Flag | Green Light |
|---|---|---|
| Chunk count | Fewer than 2 chunks per document on average | 3–10 chunks per document is typical |
| Chunk sizes | Ranges from 10 words to 5,000 words, indicating the strategy broke | Chunks fall within a consistent range for the chosen strategy (for example, 300–700 words for sentence-based) |
| Source tracking | Unable to link chunks back to original documents | Clear metadata connects each chunk to its source document |
| Embedding count | Number of embeddings does not match the number of chunks | Embedding count matches chunks exactly |
| Sample quality | Chunks split mid-sentence or lose important context | Chunks contain complete, coherent thoughts |

Common Issues

| Symptom | Cause | Solution |
|---|---|---|
| "Rate limit exceeded" errors | Generating embeddings too fast | Increase wait_time to 20–30 seconds and reduce batch_size to 10 |
| Lost connection between chunks and docs | Forgot to add source_id to metadata | Add the document ID to chunk_metadata before embedding |
| Chunks wildly different sizes | Edge cases not handled, such as very short documents | Add minimum and maximum constraints to the chunking logic |
| Embedding generation slow | Normal behavior for large datasets | This is expected. With the conservative defaults (batches of 15 chunks, 15-second waits), 5,000 chunks takes well over an hour. |

Decision Guide: Chunk Size Selection

If you're unsure about chunk size parameters:

| Choose Smaller Chunks (200–300 words) | Choose Larger Chunks (600–800 words) |
|---|---|
| Queries target specific facts | Queries are conceptual or broad |
| Precision matters more than context | Context matters more than precision |
| Storage cost isn’t a concern | You want to minimize storage and API costs |

Most projects work well with 400-600 word chunks. Start there unless you have specific reasons to go smaller or larger.
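
If you want evidence rather than a guess, compare a few candidate targets on a sample of your documents. This sketch reuses the chunk_by_sentences function and the docs list from earlier in this step:

# Compare chunk counts and sizes for a few candidate target sizes
for target in (300, 500, 700):
    sample_chunks = []
    for doc in docs[:200]:  # a sample is enough for a rough comparison
        sample_chunks.extend(chunk_by_sentences(doc['body_text'], target_words=target))
    sizes = [len(c.split()) for c in sample_chunks]
    print(f"target={target}: {len(sample_chunks)} chunks, "
          f"avg {sum(sizes) / len(sizes):.0f} words per chunk")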


Step 3: Vector Database Setup

What you're building: A production vector database loaded with your chunks and ready to query.

Choosing Your Database

You need to pick one of three options. Here's the decision framework:

| Choose pgvector If | Choose Qdrant If | Choose Pinecone If |
|---|---|---|
| You need the lowest possible query latency | You rely on heavy filtering across multiple text fields | You want zero operational overhead |
| You already have PostgreSQL infrastructure in place | You can accept HTTP API overhead | You can accept network latency overhead |
| Your team has strong SQL and PostgreSQL skills | You need consistent filter performance | Your team should focus on product features, not operations |
| You primarily filter on integers or dates | You’re comfortable running Docker or containerized services | Your scale may grow unpredictably over time |

Still unsure? Start with pgvector if you have Postgres experience, Qdrant if you don't. Both are solid choices that demonstrate production skills.

Note: The database loading examples below assume all_chunks is still in memory from Step 2. If you are running this in a fresh session, reload chunk text from disk (for example, from the same source used to generate chunk_metadata.csv).
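
Here's a minimal reload sketch for a fresh session, assuming you saved chunk text keyed by chunk_id (as sketched in Step 2, just after the chunking code) alongside chunk_metadata.csv:

import json
import pandas as pd

# Reload chunk text and metadata, keyed by chunk_id
with open('chunk_texts.json', 'r', encoding='utf-8') as f:
    text_by_id = json.load(f)

metadata = pd.read_csv('chunk_metadata.csv')
chunk_metadata = metadata.to_dict('records')

# Rebuild everything in the same order as chunk_metadata.csv
all_chunks = [text_by_id[m['chunk_id']] for m in chunk_metadata]
chunk_by_id = {m['chunk_id']: m for m in chunk_metadata}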

Building It: pgvector Example

Here's the complete setup for PostgreSQL with pgvector:

import psycopg2
from pgvector.psycopg2 import register_vector
import numpy as np
import pandas as pd

# This project uses whatever embedding dimension your saved file contains.
# Always set the DB dimension to match your actual embeddings.
embeddings = np.load('chunk_embeddings.npy')
EMBEDDING_DIM = embeddings.shape[1]

metadata = pd.read_csv('chunk_metadata.csv')

# In real systems, you often avoid storing full text in the vector table.
# For this practice project, storing content in Postgres is convenient and fine.
# (Cloud vector DB payloads often have stricter limits.)

# Connect to PostgreSQL
conn = psycopg2.connect(
    host="localhost",
    database="knowledge_base",
    user="postgres",
    password="your_password"
)
cur = conn.cursor()

# Enable pgvector extension FIRST
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.commit()

# THEN register vector type with Python driver
register_vector(conn)

# Create table for chunks
cur.execute(f"""
    CREATE TABLE IF NOT EXISTS chunks (
        id SERIAL PRIMARY KEY,
        chunk_id TEXT UNIQUE,          -- stable identifier used across systems
        source_id TEXT,
        chunk_index INTEGER,
        title TEXT,
        category TEXT,
        publication_date TEXT,
        author TEXT,
        content TEXT,
        embedding vector({EMBEDDING_DIM})
    )
""")
conn.commit()

# Insert in batches
batch_size = 500

# This example assumes you're running Step 2 and Step 3 in the same session, 
# so all_chunks, chunk_metadata.csv, and chunk_embeddings.npy still line up 
# naturally.
# If you reload data in a new session, don't rely on order. Rebuild all_chunks 
# by chunk_id from the JSON and join to chunk_metadata.csv by chunk_id.
for i in range(0, len(all_chunks), batch_size):
    batch_end = min(i + batch_size, len(all_chunks))

    for j in range(i, batch_end):
        cur.execute("""
            INSERT INTO chunks
            (chunk_id, source_id, chunk_index, title, category, publication_date, author, content, embedding)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)
            ON CONFLICT (chunk_id) DO NOTHING
        """, (
            metadata.iloc[j]['chunk_id'],
            metadata.iloc[j]['source_id'],
            int(metadata.iloc[j]['chunk_index']),
            metadata.iloc[j]['title'],
            metadata.iloc[j]['category'],
            metadata.iloc[j]['publication_date'],
            metadata.iloc[j]['author'],
            all_chunks[j],
            embeddings[j]
        ))

    conn.commit()
    print(f"Inserted {batch_end}/{len(all_chunks)} chunks")

# Create HNSW index for fast similarity search
print("Creating HNSW index...")
cur.execute("""
    CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks
    USING hnsw (embedding vector_cosine_ops)
""")
conn.commit()

print(f"Setup complete! Loaded {len(all_chunks)} chunks into pgvector")

cur.close()
conn.close()

Building It: Qdrant Example

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import numpy as np
import pandas as pd
import uuid

embeddings = np.load('chunk_embeddings.npy')
EMBEDDING_DIM = embeddings.shape[1]

metadata = pd.read_csv('chunk_metadata.csv')

# Connect to Qdrant (assumes Docker running locally)
client = QdrantClient(host="localhost", port=6333)
collection_name = "knowledge_base"

# Recreate collection for a clean run (optional).
# If you prefer not to delete, check if it exists first.
client.recreate_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(
        size=EMBEDDING_DIM,
        distance=Distance.COSINE
    )
)

# Payload rule (simple): store IDs + filterable metadata + a short preview.
# Full chunk text stays in your local files (CSV/JSON) and is fetched by
# chunk_id when needed.
points = []
for idx in range(len(all_chunks)):
    chunk_id = metadata.iloc[idx]['chunk_id']
    point = PointStruct(
        # Qdrant point IDs must be unsigned integers or UUIDs, not arbitrary
        # strings, so derive a deterministic UUID from the stable chunk_id
        # (the original chunk_id stays in the payload below).
        id=str(uuid.uuid5(uuid.NAMESPACE_URL, chunk_id)),
        vector=embeddings[idx].tolist(),
        payload={
            'chunk_id': chunk_id,
            'source_id': metadata.iloc[idx]['source_id'],
            'chunk_index': int(metadata.iloc[idx]['chunk_index']),
            'title': metadata.iloc[idx]['title'],
            'category': metadata.iloc[idx]['category'],
            'publication_date': metadata.iloc[idx]['publication_date'],
            'author': metadata.iloc[idx]['author'],
            'content_preview': all_chunks[idx][:200]
        }
    )
    points.append(point)

# Upload in batches
batch_size = 100
for i in range(0, len(points), batch_size):
    batch = points[i:i+batch_size]
    client.upsert(collection_name=collection_name, points=batch)
    print(f"Uploaded {min(i+batch_size, len(points))}/{len(points)} points")

print(f"Setup complete! Loaded {len(points)} chunks into Qdrant")

Building It: Pinecone Example

from pinecone import Pinecone, ServerlessSpec
import numpy as np
import pandas as pd
from dotenv import load_dotenv
import os
import time

load_dotenv()
pc = Pinecone(api_key=os.getenv('PINECONE_API_KEY'))

embeddings = np.load('chunk_embeddings.npy')
EMBEDDING_DIM = embeddings.shape[1]

metadata = pd.read_csv('chunk_metadata.csv')

index_name = "knowledge-base"

# Create the serverless index (delete it manually in the Pinecone console if it already exists)
pc.create_index(
    name=index_name,
    dimension=EMBEDDING_DIM,
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)

# Wait for index to be ready
while not pc.describe_index(index_name).status['ready']:
    print("Waiting for index to be ready...")
    time.sleep(1)

index = pc.Index(index_name)

# Keep metadata small: store IDs + filterable fields + a short preview.
# Full chunk text stays in local files and is fetched by chunk_id when needed.
vectors = []
for idx in range(len(all_chunks)):
    chunk_id = metadata.iloc[idx]['chunk_id']
    vectors.append({
        'id': chunk_id,
        'values': embeddings[idx].tolist(),
        'metadata': {
            'chunk_id': chunk_id,
            'source_id': metadata.iloc[idx]['source_id'],
            'chunk_index': int(metadata.iloc[idx]['chunk_index']),
            'title': metadata.iloc[idx]['title'],
            'category': metadata.iloc[idx]['category'],
            'publication_date': metadata.iloc[idx]['publication_date'],
            'author': metadata.iloc[idx]['author'],
            'content_preview': all_chunks[idx][:200]
        }
    })

# Upload in batches
batch_size = 100
for i in range(0, len(vectors), batch_size):
    batch = vectors[i:i+batch_size]
    index.upsert(vectors=batch)
    print(f"Uploaded {min(i+batch_size, len(vectors))}/{len(vectors)} vectors")

print("Waiting for vectors to be indexed...")
time.sleep(10)

print(f"Setup complete! Loaded {len(vectors)} vectors into Pinecone")

Quality Checks

| Quality Indicator | Red Flag | Green Light |
|---|---|---|
| Test query results | Returns random, unrelated chunks | Returns topically relevant chunks |
| Metadata filtering | Unable to filter results by category | Filters work correctly as expected |
| Query latency | Queries feel unreasonably slow for local execution | Queries feel appropriately fast for local execution |
| Result consistency | Different results returned for identical queries | Consistent results when repeating the same query |
| Data loading | Error messages appear during upload | All chunks load successfully without errors |
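
A few quick sanity queries catch most loading problems early. This pgvector sketch assumes an open cursor (reconnect as in the loading script if you closed it); the same checks translate to collection counts and filtered queries in Qdrant or Pinecone.

# Row count should match the number of chunks you inserted
cur.execute("SELECT count(*) FROM chunks")
print("Chunks in database:", cur.fetchone()[0])

# Category breakdown confirms metadata loaded and filters have values to match
cur.execute("SELECT category, count(*) FROM chunks GROUP BY category ORDER BY count(*) DESC")
for category, count in cur.fetchall():
    print(f"  {category}: {count}")

# Spot-check a few rows for readable content
cur.execute("SELECT chunk_id, title, left(content, 80) FROM chunks LIMIT 3")
for row in cur.fetchall():
    print(row)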

Common Issues

| Symptom | Cause | Solution |
|---|---|---|
| "Dimension mismatch" error | Embedding size does not match database configuration | Verify that embeddings.shape[1] matches the database or index dimension, then recreate the collection or index if needed. |
| Very slow insertion | Batch size too small or network-related issues | Increase batch size to 500–1,000 and check network performance if using a cloud service. |
| Metadata filters return nothing | Field names don’t match or values are missing | Verify metadata field names exactly and check for null or missing values. |
| pgvector: "extension not found" | Tried to register vectors before creating the extension | Create the extension first, then call register_vector(). |
| Qdrant: "method not found" | Using outdated client methods | Check Qdrant Query API docs for current method names (APIs evolve over time). |
| Pinecone: vectors not queryable | Indexing not complete due to eventual consistency | Wait 30–60 seconds after upload before running queries. |

Step 4: Hybrid Search Implementation

What you're building: A search system that combines semantic similarity with keyword matching and returns ranked results.

Understanding the Components

Hybrid search needs three pieces working together:

  1. Vector similarity search - Returns semantically similar chunks
  2. BM25 keyword search - Returns chunks containing query terms
  3. Score combination - Merges both rankings into final results

Building It: BM25 Component

from rank_bm25 import BM25Okapi
import string

def simple_tokenize(text):
    """Basic tokenization for BM25 (good enough for this project)."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.split()

# Build BM25 index from all chunks
tokenized_chunks = [simple_tokenize(chunk) for chunk in all_chunks]
bm25 = BM25Okapi(tokenized_chunks)

# Map BM25 array indices -> chunk_id (this is how we "join" BM25 to vector DB results)
idx_to_chunk_id = [m['chunk_id'] for m in chunk_metadata]

# Test BM25 search
query = "climate change impacts on agriculture"
tokenized_query = simple_tokenize(query)
bm25_scores = bm25.get_scores(tokenized_query)

top_bm25_indices = bm25_scores.argsort()[::-1][:10]
print("Top 10 BM25 results:")
for idx in top_bm25_indices:
    cid = idx_to_chunk_id[int(idx)]
    print(f"  Score: {bm25_scores[idx]:.2f} - {chunk_by_id[cid]['title']}")

Building It: Vector Search Component

# Example using pgvector (adapt for your chosen database)
# If you chose Qdrant/Pinecone, replace vector_search_pgvector with your 
# DB’s query call that returns chunk_id.

import numpy as np

def vector_search_pgvector(cur, query_text, limit=10, category_filter=None):
    """
    Search using vector similarity with optional metadata filter.

    IMPORTANT:
    - This function returns chunk_id (a stable string ID), not a SERIAL integer.
    - We use chunk_id to join vector results with BM25 + local metadata.
    """
    # Embed query
    response = co.embed(
        texts=[query_text],
        model='embed-v4.0',
        input_type='search_query',  # Important: different from search_document
        embedding_types=['float']
    )
    query_embedding = np.array(response.embeddings.float_[0])

    if category_filter:
        sql = """
            SELECT chunk_id, title, category, content, embedding <=> %s AS distance
            FROM chunks
            WHERE category = %s
            ORDER BY embedding <=> %s
            LIMIT %s
        """
        # register_vector() adapts numpy arrays to the vector type,
        # so pass the query embedding array directly (not a Python list).
        cur.execute(
            sql,
            (
                query_embedding,
                category_filter,
                query_embedding,
                limit
            )
        )
    else:
        sql = """
            SELECT chunk_id, title, category, content, embedding <=> %s AS distance
            FROM chunks
            ORDER BY embedding <=> %s
            LIMIT %s
        """
        cur.execute(
            sql,
            (
                query_embedding,
                query_embedding,
                limit
            )
        )

    results = cur.fetchall()

    formatted = []
    for doc in results:
        formatted.append({
            'chunk_id': doc[0],
            'title': doc[1],
            'category': doc[2],
            'content': doc[3],
            'distance': float(doc[4])
        })

    return formatted

Building It: Hybrid Combination

def hybrid_search(cur, query_text, alpha=0.5, limit=10, category_filter=None):
    """
    Combine BM25 and vector search with weighted scoring.

    ID handling:
    - This project uses a stable chunk_id stored in metadata and in the vector database.
    - Do NOT assume DB primary keys match Python list indices.
    - We "join" BM25 and vector results using chunk_id.

    Hybrid scoring note:
    - BM25 scores and vector distances are on different scales.
    - We convert both into simple 0–1-ish scores for blending.
    - Treat the blended score as a ranking heuristic, not a meaningful absolute value.

    Candidate set note (important even for this project):
    - Don't compute hybrid scores over the entire corpus.
    - Retrieve top-K candidates from each retriever, then fuse only those.
    """
    # ---- BM25 component (scores by chunk_id) ----
    tokenized_query = simple_tokenize(query_text)
    bm25_scores = bm25.get_scores(tokenized_query)

    # Normalize BM25 to 0-1 (heuristic, OK for this project)
    max_bm25 = float(max(bm25_scores)) if max(bm25_scores) > 0 else 1.0
    min_bm25 = float(min(bm25_scores))

    bm25_norm_by_id = {}
    for idx, score in enumerate(bm25_scores):
        if max_bm25 > min_bm25:
            norm = (float(score) - min_bm25) / (max_bm25 - min_bm25)
        else:
            norm = 0.0
        bm25_norm_by_id[idx_to_chunk_id[idx]] = norm

    # ---- Vector component (scores by chunk_id) ----
    vector_results = vector_search_pgvector(
        cur,
        query_text,
        limit=100,  # retrieve more candidates for fusion
        category_filter=category_filter
    )

    vector_score_by_id = {}
    for r in vector_results:
        chunk_id = r['chunk_id']
        distance = float(r['distance'])

        # Distances are ranking values, not universally comparable "similarity scores".
        # We convert distance -> bounded score only so BM25 and vector can be blended.
        # Do NOT interpret this as a probability or compare across metrics/databases.
        vector_score_by_id[chunk_id] = 1 / (1 + distance)

    # ---- Candidate set fusion (top-K union) ----
    bm25_top_k = 100
    top_bm25_indices = bm25_scores.argsort()[::-1][:bm25_top_k]
    bm25_candidate_ids = {idx_to_chunk_id[int(i)] for i in top_bm25_indices}

    vector_candidate_ids = set(vector_score_by_id.keys())

    candidate_ids = bm25_candidate_ids | vector_candidate_ids

    # ---- Weighted fusion by chunk_id ----
    hybrid_by_id = {}
    for chunk_id in candidate_ids:
        bm25_part = bm25_norm_by_id.get(chunk_id, 0.0)
        vec_part = vector_score_by_id.get(chunk_id, 0.0)
        hybrid_by_id[chunk_id] = alpha * bm25_part + (1 - alpha) * vec_part

    # ---- Top results ----
    top_items = sorted(hybrid_by_id.items(), key=lambda x: x[1], reverse=True)[:limit]

    final_results = []
    for chunk_id, score in top_items:
        meta = chunk_by_id[chunk_id]
        text = text_by_id[chunk_id]

        final_results.append({
            'chunk_id': chunk_id,
            'title': meta['title'],
            'category': meta.get('category','unknown'), 
            'content': text[:200] + "...",
            'hybrid_score': score,
            'bm25_component': bm25_norm_by_id.get(chunk_id, 0.0),
            'vector_component': vector_score_by_id.get(chunk_id, 0.0)
        })

    return final_results

# Test hybrid search (pgvector example)
# --- Reconnect for querying (pgvector) ---
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect(
    host="localhost",
    database="knowledge_base",
    user="postgres",
    password="your_password"
)
register_vector(conn)
cur = conn.cursor()

results = hybrid_search(cur, "climate change impacts on agriculture", alpha=0.5, limit=5)
print("\nTop 5 hybrid search results:")
for r in results:
    print(f"\nTitle: {r['title']}")
    print(f"Category: {r['category']}")
    print(f"Hybrid score: {r['hybrid_score']:.3f} (BM25: {r['bm25_component']:.3f}, Vector: {r['vector_component']:.3f})")
    print(f"Content: {r['content']}")

Quality Checks

| Quality Indicator | Red Flag | Green Light |
|---|---|---|
| Component independence | BM25 and vector components return identical results | Each component produces different rankings |
| Score combination | Changing alpha has no effect on result ordering | Different alpha values change the ranking order |
| Filter integration | Filters break hybrid search behavior | Metadata filters work correctly with hybrid scoring |
| Semantic queries | Pure keyword search wins on conceptual queries | Vector component contributes meaningfully on semantic queries |
| Keyword queries | Pure vector search wins on exact term matches | BM25 component improves results on keyword-based queries |

Common Issues

| Symptom | Cause | Solution |
|---|---|---|
| Hybrid identical to pure vector | BM25 component not working | Verify the BM25 index is built correctly and check tokenization. |
| Changing alpha has no effect | Scores are not normalized or one component returns zero values | Check score normalization and verify both components return non-zero scores. |
| All results from one component | Alpha is too extreme (0.0 or 1.0) or score scales differ wildly | Start with alpha = 0.5 and ensure scores are properly normalized. |
| Slow queries (2+ seconds) | Searches are running sequentially | Consider parallelizing vector and BM25 searches. |
| Filters return empty results | Metadata field names don’t match or values are missing | Verify exact field names and check for null or missing metadata. |

Decision Guide: Finding Optimal Alpha

Test different alpha values systematically:

| Alpha Value | Means | Best For |
|---|---|---|
| 0.0 | Pure vector search with no keyword component | Conceptual queries and semantic similarity |
| 0.3 | 30% keyword scoring and 70% vector scoring | Documents with rich vocabulary and a semantic focus |
| 0.5 | Balanced keyword and vector contribution | A safe default for mixed query types |
| 0.7 | 70% keyword scoring and 30% vector scoring | Cases where rare terms matter and exact matches are important |
| 1.0 | Pure BM25 keyword search with no vector component | Situations that require only keyword matching |

Run your test queries at different alpha values and measure which performs best for your domain.
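
A quick qualitative sweep makes the tradeoff visible: run one query at several alpha values with the hybrid_search function from Step 4 and watch how the top results shift.

query = "climate change impacts on agriculture"

for alpha in (0.0, 0.3, 0.5, 0.7, 1.0):
    top = hybrid_search(cur, query, alpha=alpha, limit=3)
    titles = [r['title'] for r in top]
    print(f"alpha={alpha}: {titles}")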


Step 5: Evaluation and Iteration

What you're building: Evidence that your system works, with measurements proving hybrid search adds value.

Creating Test Queries

You need 10-15 queries where you know what the right answers should be. Here's how to create them:

# Read your data and identify diverse queries
# Example: Guardian news articles

test_queries = [
    {
        'query': 'climate change environmental impacts',
        'expected_category': 'science',
        'description': 'Semantic query about climate science'
    },
    {
        'query': 'artificial intelligence machine learning development',
        'expected_category': 'technology',
        'description': 'Technical topic with clear keywords'
    },
    {
        'query': 'election results voting democracy',
        'expected_category': 'politics',
        'description': 'Political coverage'
    },
    {
        'query': 'olympic games sports competition',
        'expected_category': 'sport',
        'description': 'Sports coverage with specific terms'
    },
    {
        'query': 'film review cinema entertainment',
        'expected_category': 'culture',
        'description': 'Arts and culture content'
    },
    # Add 5-10 more covering different query types
]

Mix query types: semantic queries, keyword-heavy queries, filtered queries, and edge cases.
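
For example, you might extend the list above with a keyword-heavy query and one that pairs naturally with a category filter. The categories here are just illustrations based on the Guardian example; use values that actually exist in your data.

test_queries.extend([
    {
        'query': 'Premier League transfer window signings',
        'expected_category': 'sport',
        'description': 'Keyword-heavy query with named entities'
    },
    {
        'query': 'government policy on renewable energy subsidies',
        'expected_category': 'politics',
        'description': 'Query that benefits from a category filter'
    },
])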

Measuring Performance

def evaluate_search_strategy(search_func, test_queries, strategy_name, k=5):
    """
    Evaluate a search strategy on test queries.

    NOTE: This project uses a coarse proxy metric (category match), not true relevance judgments.
    We use it here because it’s easy to create “expected” labels without manual annotation.

    Args:
        search_func: Function that takes query and returns results
        test_queries: List of test query dicts
        strategy_name: Name for this strategy (e.g., "Pure Vector", "Hybrid α=0.5")
        k: Evaluate whether ANY of the top-k results match the expected category

    Returns:
        Dictionary with proxy success metrics
    """
    results = {
        'strategy': strategy_name,
        'success_count': 0,
        'total': len(test_queries),
        'details': []
    }

    for test in test_queries:
        query = test['query']
        expected_category = test['expected_category'].lower()

        search_results = search_func(query)

        top_k = search_results[:k]
        success = any(r['category'].lower() == expected_category for r in top_k)

        if success:
            results['success_count'] += 1

        top_title = top_k[0]['title'] if top_k else "(no results)"
        top_category = top_k[0]['category'] if top_k else "(no results)"

        results['details'].append({
            'query': query,
            'expected_category': test['expected_category'],
            'top_category': top_category,
            'top_title': top_title,
            'success_category_at_k': success
        })

    results['proxy_success_rate'] = results['success_count'] / results['total']
    return results

# Pure vector: just set alpha=0.0
def pure_vector(q):
    return hybrid_search(cur, q, alpha=0.0, limit=5)

# Pure BM25: alpha=1.0
def pure_bm25(q):
    return hybrid_search(cur, q, alpha=1.0, limit=5)

vector_results = evaluate_search_strategy(pure_vector, test_queries, "Pure Vector", k=5)
print(f"\nPure Vector proxy success (category@5): {vector_results['proxy_success_rate']*100:.0f}%")

bm25_results = evaluate_search_strategy(pure_bm25, test_queries, "Pure BM25", k=5)
print(f"Pure BM25 proxy success (category@5): {bm25_results['proxy_success_rate']*100:.0f}%")

for a in [0.3, 0.5, 0.7]:
    def hybrid_alpha(q, alpha=a):
        return hybrid_search(cur, q, alpha=alpha, limit=5)

    h = evaluate_search_strategy(hybrid_alpha, test_queries, f"Hybrid α={a}", k=5)
    print(f"Hybrid α={a} proxy success (category@5): {h['proxy_success_rate']*100:.0f}%")

Measuring Query Latency

import time

def measure_latency(search_func, query, iterations=10):
    """Measure average query latency."""
    times = []

    # Warmup
    for _ in range(3):
        _ = search_func(query)

    # Actual measurement
    for _ in range(iterations):
        start = time.time()
        _ = search_func(query)
        elapsed = (time.time() - start) * 1000  # Convert to ms
        times.append(elapsed)

    avg_time = sum(times) / len(times)
    return avg_time

# Test latency
query = "climate change impacts on agriculture"
latency = measure_latency(lambda q: hybrid_search(cur, q, alpha=0.5), query)
print(f"\nAverage query latency: {latency:.0f}ms")

Quality Checks

| Quality Indicator | Red Flag | Green Light |
|---|---|---|
| Success rate | Below 50% on test queries | Above 70% on well-crafted queries |
| Hybrid improvement | Hybrid performs worse than pure vector | Hybrid improves results on 40% or more of queries |
| Query latency | Consistently over 2 seconds | Under 1 second for typical queries |
| Consistency | Same query produces different results | Repeating the same query gives consistent results |
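
The consistency check is easy to automate: run the same query twice and compare the returned chunk_ids.

query = "artificial intelligence machine learning development"
run_1 = [r['chunk_id'] for r in hybrid_search(cur, query, alpha=0.5, limit=5)]
run_2 = [r['chunk_id'] for r in hybrid_search(cur, query, alpha=0.5, limit=5)]
print("Consistent results:", run_1 == run_2)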

What to Do When Results Are Poor

If your evaluation shows problems, here's how to diagnose and fix them:

| Poor Results On | Likely Cause | What to Try |
|---|---|---|
| All query types | Chunking broke the document context | Increase chunk size or switch to sentence-based chunking |
| Semantic queries | Embeddings are not working correctly | Verify that queries use the search_query input type and chunks use search_document |
| Keyword queries | BM25 component is not contributing | Confirm the BM25 index is built correctly and verify that alpha is not set too low |
| Filtered queries | Metadata-related issues | Verify metadata field names exactly and check for missing or null values |
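
The details recorded by evaluate_search_strategy make this diagnosis concrete: print the misses and look for a pattern before changing anything.

# Inspect the queries that failed the category@k check
for detail in vector_results['details']:
    if not detail['success_category_at_k']:
        print(f"MISS: {detail['query']!r}")
        print(f"  expected: {detail['expected_category']}, "
              f"got: {detail['top_category']} ({detail['top_title']})")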

When Hybrid Doesn't Help

Sometimes you'll discover pure vector search works just as well or even better than hybrid search.

This is a valid finding. Document it honestly:

# Evaluation Results

After testing pure vector, pure BM25, and hybrid search with α values from 0.3 to 0.7:

- Pure vector: 87% success rate
- Hybrid (α=0.5): 84% success rate
- Pure BM25: 71% success rate

**Conclusion**: Pure vector search performed best for this dataset.

**Why**: Guardian news articles have rich vocabulary and well-written content.
The semantic embeddings captured everything needed. Adding BM25 keyword
matching introduced noise without improving results.

**Decision**: Using pure vector search (hybrid with α=0.0) for final system.

You don’t always “win” by adding complexity. If testing shows a simpler approach works better, that’s a solid outcome and one worth documenting.


Step 6: Documentation and Presentation

What you're building: Professional documentation that makes this project portfolio-ready.

Writing Your README

Your README is the first thing anyone sees. Make it count.

# News Search System with Semantic Understanding

A hybrid search system for 2,000+ news articles that combines vector similarity
with keyword matching to help users find relevant content 10x faster than
traditional keyword search.

## What This Does

This system searches through 2,000 Guardian news articles across politics,
technology, science, sport, and culture. It understands queries semantically
(finding articles about "climate change impacts" even when they use terms like
"environmental consequences") while also supporting traditional keyword search
and metadata filtering.

## Technical Stack

- **Vector Database**: Qdrant (chosen for consistent 1.1x filtering overhead
  across complex metadata queries)
- **Embeddings**: Cohere embed-v4.0 (dimension read from chunk_embeddings.npy)
- **Chunking**: Sentence-based with 500-word target (preserves semantic coherence,
  reduces storage 44% vs fixed-token)
- **Hybrid Search**: BM25 + vector similarity with α=0.4 weighting

## Why These Choices

### Database: Qdrant

I chose Qdrant because my test queries heavily filter on multiple metadata
fields (category, date, author). Qdrant maintains consistent 1.1x overhead
regardless of filter complexity, compared to pgvector's variable 2.3x overhead
on text filters.

**Tradeoff**: Higher baseline latency (~50ms) due to HTTP API vs pgvector's in-process execution, but this was acceptable for my sub-200ms latency requirement.

### Chunking: Sentence-Based

News articles have clear sentence boundaries. Sentence-based chunking with
500-word targets created 4,500 chunks vs 8,200 with fixed-token approach,
reducing API costs by 45% while maintaining semantic coherence.

### Hybrid Search: α=0.4

Testing showed pure vector (87% success rate) slightly outperformed hybrid
α=0.5 (84%), but hybrid performed better on queries with rare technical terms.
Alpha=0.4 (60% vector, 40% keyword) balanced both cases, achieving 89% success
rate overall.

## Performance Metrics

- Average query latency: 127ms (measured; reported for transparency)
- Proxy quality: category match @10 = 94% on test queries
- Hybrid search improved **proxy** success on 67% of test queries vs pure vector
- Database choice: Qdrant handles complex filters without performance degradation

## Setup Instructions (Example)

This is a minimal example. Update file names, API keys, and database settings for your project.

1. Create and activate a virtual environment (recommended).
2. Install dependencies:
   - `pip install -r requirements.txt`
3. Add API keys in a `.env` file (do not commit this file):
   - `COHERE_API_KEY=...`
   - (If using Pinecone) `PINECONE_API_KEY=...`
4. Collect data (example):
   - `python collect_guardian_data.py`
5. Chunk + embed:
   - Run the chunking script to create `chunk_metadata.csv`
   - Run embedding generation to create `chunk_embeddings.npy`
6. Load into your vector database:
   - pgvector: create the DB/table + insert docs + create HNSW index
   - Qdrant/Pinecone: create collection/index + upsert vectors
7. Run search + evaluation:
   - Build BM25 index
   - Run hybrid search experiments and report results + latency

## Example Queries

**Query**: "climate change environmental impacts"
- Top result: "Climate crisis: How rising temperatures affect biodiversity"
- Category: Science (expected ✓)
- Hybrid score: 0.847

**Query**: "artificial intelligence ethics regulations"
- Top result: "EU AI Act: New framework for algorithmic accountability"
- Category: Technology (expected ✓)
- Hybrid score: 0.792

[Include 3-5 more examples...]

## What I Learned

The biggest surprise was discovering pure vector search nearly matched hybrid
performance. Guardian's professional journalism has rich vocabulary, making
semantic search highly effective. I kept hybrid search because it improved
results on technical queries, but simpler domains might not need it.

Handling rate limits during embedding generation was challenging. Implementing
exponential backoff reduced failures from 40% to under 5%.

Creating Your Architecture Diagram

Use one of these free tools:

  • draw.io - Full-featured diagramming tool
  • Excalidraw - Simple, hand-drawn style diagrams
  • tldraw - Minimalist drawing tool

Your diagram should show:

  1. Data input (API → 2,000 articles)
  2. Processing pipeline (chunking → 4,500 chunks → Cohere embeddings)
  3. Storage (Qdrant with metadata)
  4. Query flow (user query → embedding → vector + BM25 → hybrid ranking)
  5. Results (top-k ranked chunks)

Focus on clarity over beauty. A simple box-and-arrow diagram works great.

Choosing Your Presentation Format

Pick ONE format that plays to your strengths:

| Format | Best When | Example |
|---|---|---|
| Demo Video (3–5 min) | You built a web interface that’s worth showing visually | Screen-record yourself querying the system and showing results update in real time |
| Blog Post | You made interesting technical or architectural decisions | Write up your database selection process and evaluation findings |
| Jupyter Notebook | You want to demonstrate methodology rigorously | Walk through evaluation with charts, showing code and results together |
| Web Interface | You want something others can actually use | Deploy a Streamlit app where people can try queries themselves |

For blog posts, see Building and Presenting Your Data Portfolio for excellent guidance on structure and presentation.

For web interfaces, Build a Web Interface for Your Chatbot with Streamlit walks through creating interactive demos.
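
If you go the web interface route, a minimal Streamlit sketch can be very small. This assumes the hybrid_search function, BM25 index, and database connection from Steps 3-4 are built (or imported) when the app starts; app.py is just a conventional file name.

# app.py
import streamlit as st

st.title("Knowledge Base Search")

query = st.text_input("Search query")
alpha = st.slider("Keyword weight (alpha)", 0.0, 1.0, 0.5)

if query:
    results = hybrid_search(cur, query, alpha=alpha, limit=5)
    for r in results:
        st.subheader(r['title'])
        st.caption(f"{r['category']} | hybrid score {r['hybrid_score']:.3f}")
        st.write(r['content'])

Run it locally with streamlit run app.py.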

Quality Checks

| Quality Indicator | Red Flag | Green Light |
|---|---|---|
| README clarity | Full of jargon with no clear explanation of what the project does | Someone unfamiliar with the project understands what you built |
| Technical justification | Technologies are listed without explaining why they were chosen | Decisions are explained with clear, specific reasoning |
| Setup instructions | Steps are missing or assume prior knowledge | Someone else can follow the instructions and run the project |
| Results demonstration | No examples, screenshots, or evaluation metrics provided | Concrete examples with real results are clearly shown |
| Professional polish | Code includes debug prints or commented-out sections | Clean code, organized structure, and working examples |

Common Issues

| Symptom | Cause | Solution |
|---|---|---|
| README too technical | Writing for yourself instead of the audience | Test it on a non-technical friend and simplify jargon where needed |
| No one can run your code | Hardcoded paths or missing dependencies | Test setup on a fresh machine and add a requirements.txt |
| Unclear what makes this special | Describes what was built, but not why | Add a Technical Decisions section explaining your reasoning |
| Too much detail | Including every implementation detail | Focus on the most interesting decisions and trade-offs, not every line of code |

Final Checklist

Before you consider this project complete:

Technical Implementation:

  • Collected 1,000+ documents from new domain
  • Implemented and justified chunking strategy
  • Generated embeddings for all chunks
  • Set up and loaded vector database
  • Built hybrid search with metadata filtering
  • Created 10+ test queries with known expected results
  • Measured query latency and reported it (no fixed target at this scale)
  • Compared hybrid vs pure approaches with actual data

Evaluation and Analysis:

  • Success rate above 70% on test queries
  • Hybrid search demonstrably improves results (or documented why it doesn't)
  • Query latency measured and reported transparently
  • Tested different alpha values and chose optimal
  • Identified what works well and what could improve

Documentation:

  • Comprehensive README explaining project
  • Architecture diagram showing system flow
  • Technical decisions documented with reasoning
  • Setup instructions someone can follow
  • Example queries with actual results
  • Performance metrics and evaluation results
  • One presentation format (video/blog/notebook/web interface)

Code Quality:

  • Code organized and commented
  • No hardcoded paths or API keys in code
  • Requirements.txt or environment.yml included
  • Can run on fresh environment following your instructions

Portfolio Readiness:

  • Project public on GitHub
  • README renders correctly on GitHub
  • Someone unfamiliar could understand what you built
  • Technical decisions justified, not just stated
  • You can confidently explain this project in an interview

When you can check all these boxes, you have a portfolio project that demonstrates real vector database skills.


Next Steps

Congratulations on completing this project! You’ve built a working knowledge base search system using real data, from collection and chunking through embedding, indexing, hybrid retrieval, and evaluation. That’s a substantial piece of work, and it reflects the kinds of problems you’d encounter when building search systems outside of tutorials.

If you’d like to extend the project further, here are a few suggestions that mirror how these systems are often improved in practice:

  1. Add Query Expansion

    Use an LLM to reformulate user queries before searching to reduce vocabulary mismatches.

    # Original query: "reducing storage requirements"
    # Expanded: "reducing storage requirements compression optimization
    #            resource savings data efficiency"

    This can help surface relevant documents that use different terminology than the original query.

  2. Build Continuous Document Ingestion

    Create a pipeline that automatically chunks, embeds, and indexes new documents as they arrive, instead of loading data once.

  3. Implement Reranking

    Retrieve a larger candidate set with fast vector search, then rerank the top results with a more accurate but slower model (see the sketch after this list).

  4. Create a Search Quality Dashboard

    Track queries, results, and user interactions over time to measure quality and test improvements.

  5. Add Multi-Language Support

    Detect document language and use appropriate embedding models so the system works beyond English-only content.

  6. Implement Document Versioning

    Handle document updates without breaking search results or losing historical context.
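
To make the reranking idea (item 3) concrete, here's a minimal sketch built on the Step 4 pieces. It assumes Cohere's rerank endpoint; the model name rerank-v3.5 is an assumption, so check the current Cohere docs before using it.

def search_with_rerank(cur, query, candidates=50, top_n=5):
    # 1. Retrieve a larger candidate set with fast hybrid search
    candidate_results = hybrid_search(cur, query, alpha=0.5, limit=candidates)

    # 2. Rerank the candidates' full text with a slower, more accurate model
    docs = [text_by_id[c['chunk_id']] for c in candidate_results]
    reranked = co.rerank(model="rerank-v3.5", query=query, documents=docs, top_n=top_n)

    # 3. Map reranked positions back to the original candidates
    return [
        {**candidate_results[r.index], 'rerank_score': r.relevance_score}
        for r in reranked.results
    ]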

As a portfolio project, this gives you a concrete system to talk through with interviewers, reviewers, or peers. You can explain how the system is structured, why you made certain design choices, how you evaluated the results, and what you’d improve with different constraints.

Before moving on, take some time to clean up the code and write clear documentation. That final pass is what turns this from a completed project into something you can confidently share and discuss.

About the author

Mike Levy

Mike is a life-long learner who is passionate about mathematics, coding, and teaching. When he's not sitting at the keyboard, he can be found in his garden or at a natural hot spring.