
Document Chunking Strategies for Vector Databases

In the previous tutorial, we built a vector database with ChromaDB and ran semantic similarity searches across 5,000 arXiv papers. Our dataset consisted of paper abstracts, each about 200 words long. These abstracts were perfect for embedding as single units: short enough to fit comfortably in an embedding model's context window, yet long enough to capture meaningful semantic content.

But here's a challenge we haven't faced yet: what happens when you need to search through full research papers, technical documentation, or long articles? A typical research paper runs around 10,000 words, and a comprehensive technical guide might reach 50,000. These documents are far too long to embed as single vectors.

When documents are too long, you need to break them into chunks. This tutorial teaches you how to implement different chunking strategies, evaluate their performance systematically, and understand the tradeoffs between approaches. By the end, you'll know how to make informed decisions about chunking for your own projects.

Why Chunking Still Matters

You might be thinking: "Modern LLMs can handle massive amounts of data. Can't I just embed entire documents?"

There are three reasons why chunking remains essential:

1. Embedding Models Have Context Limits

Many embedding models still have much smaller context limits than modern chat models, and long inputs are also more expensive to embed. Even when a model can technically handle a whole paper, you usually don't want one giant vector: smaller chunks give you better retrieval and lower cost.

2. Retrieval Quality Depends on Granularity

Imagine someone searches for "robotic manipulation techniques." If you embedded an entire 10,000-word paper as a single vector, that search would match the whole paper, even if only one 400-word section actually discusses robotic manipulation. Chunking lets you retrieve the specific relevant section rather than forcing the user to read an entire paper.

3. Semantic Coherence Matters

A single document might cover multiple distinct topics. A paper about machine learning for healthcare might discuss neural network architectures in one section and patient privacy in another. These topics deserve separate embeddings so each can be retrieved independently when relevant.

The question isn't whether to chunk, but how to chunk intelligently. That's what we're going to figure out together.

What You'll Learn

By the end of this tutorial, you'll be able to:

  • Understand why chunking strategies affect retrieval quality
  • Implement two practical chunking approaches: fixed token windows and sentence-based chunking
  • Generate embeddings for chunks and store them in ChromaDB
  • Build a systematic evaluation framework to compare strategies
  • Interpret real performance data showing when each strategy excels
  • Make informed decisions about chunk size and strategy for your projects
  • Recognize that query quality matters more than chunking strategy

Most importantly, you'll learn how to evaluate your chunking decisions using real measurements rather than guesses.

Dataset and Setup

For this tutorial, we're working with 20 full research papers from the same arXiv dataset we used previously. These papers are balanced across five computer science categories:

  • cs.CL (Computational Linguistics): 4 papers
  • cs.CV (Computer Vision): 4 papers
  • cs.DB (Databases): 4 papers
  • cs.LG (Machine Learning): 4 papers
  • cs.SE (Software Engineering): 4 papers

We extracted the full text from these papers, and here's what makes them perfect for learning about chunking:

  • Total corpus: 196,181 words
  • Average paper length: 9,809 words (compared to 200-word abstracts)
  • Range: 2,735 to 20,763 words per paper
  • Content: Real academic papers with typical formatting artifacts

These papers are long enough to require chunking, diverse enough to test strategies across topics, and messy enough to reflect real-world document processing.

Required Files

Download arxiv_metadata_and_papers.zip and extract it to your working directory. This archive contains:

  • arxiv_20papers_metadata.csv - Metadata for the 20 selected papers, including title, abstract, authors, published date, category, and arXiv ID
  • arxiv_fulltext_papers/ - Directory with 20 text files (one per paper)

You'll also need the same Python environment from the previous tutorial, plus two additional packages:

# If you're starting fresh, create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install required packages
# Python 3.12 with these versions:
# chromadb==1.3.4
# numpy==2.0.2
# pandas==2.2.2
# cohere==5.20.0
# python-dotenv==1.1.1
# nltk==3.9.1
# tiktoken==0.12.0

pip install chromadb numpy pandas cohere python-dotenv nltk tiktoken

Make sure you have a .env file with your Cohere API key:

COHERE_API_KEY=your_key_here
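A quick sanity check that the key is actually being picked up before you start embedding (this assumes the .env file sits in your working directory):

from dotenv import load_dotenv
import os

load_dotenv()
assert os.getenv('COHERE_API_KEY'), "COHERE_API_KEY not found; check your .env file"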

Loading the Papers

Let's load our papers and examine what we're working with:

import pandas as pd
from pathlib import Path

# Load paper metadata
df = pd.read_csv('arxiv_20papers_metadata.csv')
papers_dir = Path('arxiv_fulltext_papers')

print(f"Loaded {len(df)} papers")
print(f"\nPapers per category:")
print(df['category'].value_counts().sort_index())

# Calculate corpus statistics
total_words = 0
word_counts = []

for arxiv_id in df['arxiv_id']:
    with open(papers_dir / f"{arxiv_id}.txt", 'r', encoding='utf-8') as f:
        text = f.read()
        words = len(text.split())
        word_counts.append(words)
        total_words += words

print(f"\nCorpus statistics:")
print(f"  Total words: {total_words:,}")
print(f"  Average words per paper: {sum(word_counts) / len(word_counts):.0f}")
print(f"  Range: {min(word_counts):,} to {max(word_counts):,} words")

# Show a sample paper
sample_id = df['arxiv_id'].iloc[0]
with open(papers_dir / f"{sample_id}.txt", 'r', encoding='utf-8') as f:
    text = f.read()
    print(f"\nSample paper ({sample_id}):")
    print(f"  Title: {df[df['arxiv_id'] == sample_id]['title'].iloc[0]}")
    print(f"  Category: {df[df['arxiv_id'] == sample_id]['category'].iloc[0]}")
    print(f"  Length: {len(text.split()):,} words")
    print(f"  Preview: {text[:300]}...")
Loaded 20 papers

Papers per category:
category
cs.CL    4
cs.CV    4
cs.DB    4
cs.LG    4
cs.SE    4
Name: count, dtype: int64

Corpus statistics:
  Total words: 196,181
  Average words per paper: 9809
  Range: 2,735 to 20,763 words

Sample paper (2511.09708v1):
  Title: Efficient Hyperdimensional Computing with Modular Composite Representations
  Category: cs.LG
  Length: 11,293 words
  Preview: 1
Efficient Hyperdimensional Computing with Modular Composite Representations
Marco Angioli, Christopher J. Kymn, Antonello Rosato, Amy L outfi, Mauro Olivieri, and Denis Kleyko
Abstract —The modular composite representation (MCR) is
a computing model that represents information with high-
dimensional...

We have 20 papers averaging nearly 10,000 words each. Compare that to the 200-word abstracts we used previously, and you can see why chunking becomes necessary. A 10,000-word paper cannot be embedded as a single unit without losing the ability to retrieve specific relevant sections.

A Note About Paper Extraction

The papers you're working with were extracted from PDFs using PyPDF2. We've provided the extracted text files so you can focus on chunking strategies rather than PDF processing. The extraction process is straightforward but involves details that aren't central to learning about chunking.

If you're curious about how we downloaded the PDFs and extracted the text, or if you want to extend this work with different papers, you'll find the complete code in the Appendix at the end of this tutorial. For now, just know that we:

  1. Downloaded 20 papers from arXiv (4 from each category)
  2. Extracted text from each PDF using PyPDF2
  3. Saved the extracted text to individual files

The extracted text has minor formatting artifacts like extra spaces or split words, but that's realistic. Real-world document processing always involves some noise. The chunking strategies we'll implement handle this gracefully.

Strategy 1: Fixed Token Windows with Overlap

Let's start with the most common chunking approach in production systems: sliding a fixed-size window across the document with some overlap.

The Concept

Imagine reading a book through a window that shows exactly 500 words at a time. When you finish one window, you slide it forward by 400 words, creating a 100-word overlap with the previous window. This continues until you reach the end of the book.

Fixed token windows work the same way:

  1. Choose a chunk size (we'll use 512 tokens)
  2. Choose an overlap (we'll use 100 tokens, about 20%)
  3. Slide the window through the document
  4. Each window becomes one chunk

Why overlap? Concepts often span boundaries between chunks. If we chunk without overlap, we might split a crucial sentence or paragraph, losing semantic coherence. The 20% overlap ensures that even if something gets split, it appears complete in at least one chunk.
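To make the window arithmetic concrete, here's a tiny illustration of the index ranges that a 512-token window with 100-token overlap produces; the 2,000-token document length is just a made-up example:

chunk_size, overlap = 512, 100
step = chunk_size - overlap  # Each window starts 412 tokens after the previous one

n_tokens = 2000  # Pretend the document has 2,000 tokens
for start in range(0, n_tokens, step):
    end = min(start + chunk_size, n_tokens)
    print(f"tokens[{start}:{end}]")
# tokens[0:512], tokens[412:924], tokens[824:1336], tokens[1236:1748], tokens[1648:2000]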

Implementation

Let's implement this strategy. We'll use tiktoken for accurate token counting:

import tiktoken

def chunk_text_fixed_tokens(text, chunk_size=512, overlap=100):
    """
    Chunk text using fixed token windows with overlap.

    Args:
        text: The document text to chunk
        chunk_size: Number of tokens per chunk (default 512)
        overlap: Number of tokens to overlap between chunks (default 100)

    Returns:
        List of text chunks
    """
    # We'll use tiktoken just to approximate token lengths.
    # In production, you'd usually match the tokenizer to your embedding model.
    encoding = tiktoken.get_encoding("cl100k_base")

    # Tokenize the entire text
    tokens = encoding.encode(text)

    chunks = []
    start_idx = 0

    while start_idx < len(tokens):
        # Get chunk_size tokens starting from start_idx
        end_idx = start_idx + chunk_size
        chunk_tokens = tokens[start_idx:end_idx]

        # Decode tokens back to text
        chunk_text = encoding.decode(chunk_tokens)
        chunks.append(chunk_text)

        # Move start_idx forward by (chunk_size - overlap)
        # This creates the overlap between consecutive chunks
        start_idx += (chunk_size - overlap)

        # Stop if we've reached the end
        if end_idx >= len(tokens):
            break

    return chunks

# Test on a sample paper
sample_id = df['arxiv_id'].iloc[0]
with open(papers_dir / f"{sample_id}.txt", 'r', encoding='utf-8') as f:
    sample_text = f.read()

sample_chunks = chunk_text_fixed_tokens(sample_text)
print(f"Sample paper chunks: {len(sample_chunks)}")
print(f"First chunk length: {len(sample_chunks[0].split())} words")
print(f"First chunk preview: {sample_chunks[0][:200]}...")
Sample paper chunks: 51
First chunk length: 323 words
First chunk preview: 1 Efficient Hyperdimensional Computing with Modular Composite Representations
Marco Angioli, Christopher J. Kymn, Antonello Rosato, Amy L outfi, Mauro Olivieri, and Denis Kleyko
Abstract —The modular co...

Our sample paper produced 51 chunks with the first chunk containing 323 words. The implementation is working as expected.
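You can also eyeball the overlap by printing where one chunk ends and the next begins; because of the 100-token overlap, the same text should appear in both:

# The tail of chunk 0 should re-appear near the start of chunk 1
print("End of chunk 0:   ...", sample_chunks[0][-80:])
print("Start of chunk 1: ", sample_chunks[1][:80], "...")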

Processing All Papers

Now let's apply this chunking strategy to all 20 papers:

# Process all papers and collect chunks
all_chunks = []
chunk_metadata = []

for idx, row in df.iterrows():
    arxiv_id = row['arxiv_id']

    # Load paper text
    with open(papers_dir / f"{arxiv_id}.txt", 'r', encoding='utf-8') as f:
        text = f.read()

    # Chunk the paper
    chunks = chunk_text_fixed_tokens(text, chunk_size=512, overlap=100)

    # Store each chunk with metadata
    for chunk_idx, chunk in enumerate(chunks):
        all_chunks.append(chunk)
        chunk_metadata.append({
            'arxiv_id': arxiv_id,
            'title': row['title'],
            'category': row['category'],
            'chunk_index': chunk_idx,
            'total_chunks': len(chunks),
            'chunking_strategy': 'fixed_token_windows'
        })

print(f"Fixed token chunking results:")
print(f"  Total chunks created: {len(all_chunks)}")
print(f"  Average chunks per paper: {len(all_chunks) / len(df):.1f}")
print(f"  Average words per chunk: {sum(len(c.split()) for c in all_chunks) / len(all_chunks):.0f}")

# Check chunk size distribution
chunk_word_counts = [len(chunk.split()) for chunk in all_chunks]
print(f"  Chunk size range: {min(chunk_word_counts)} to {max(chunk_word_counts)} words")
Fixed token chunking results:
  Total chunks created: 914
  Average chunks per paper: 45.7
  Average words per chunk: 266
  Chunk size range: 16 to 438 words

We created 914 chunks from our 20 papers. Each paper produced about 46 chunks, averaging 266 words each. Notice the wide range: 16 to 438 words. This happens because tokens don't map exactly to words, and our stopping condition creates a small final chunk for some papers.
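If you want more detail than just the min and max, the full size distribution is one line with pandas:

# Summary statistics for chunk sizes (in words)
print(pd.Series(chunk_word_counts).describe().round(1))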

Edge Cases and Real-World Behavior

That 16-word chunk? It's not a bug. It's what happens when the final portion of a paper contains fewer tokens than our chunk size. In production, you might choose to:

  • Merge tiny final chunks with the previous chunk
  • Set a minimum chunk size threshold
  • Accept them as is (they're rare and often don't hurt retrieval)

We're keeping them to show real-world chunking behavior. Perfect uniformity isn't always necessary or beneficial.
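If you do decide to merge tiny trailing chunks, a minimal post-processing helper might look like this; the 50-word threshold is an arbitrary choice, not something we tuned:

def merge_small_final_chunk(chunks, min_words=50):
    """Fold a very small final chunk into the previous chunk."""
    if len(chunks) >= 2 and len(chunks[-1].split()) < min_words:
        chunks = chunks[:-2] + [chunks[-2] + " " + chunks[-1]]
    return chunks

# Example: on our sample paper, a tiny trailing chunk (if any) gets absorbed
merged_chunks = merge_small_final_chunk(list(sample_chunks))
print(f"{len(sample_chunks)} chunks before, {len(merged_chunks)} after")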

Generating Embeddings

Now we need to embed our 914 chunks using Cohere's API. This is where we need to be careful about rate limits:

from cohere import ClientV2
from dotenv import load_dotenv
import os
import time
import numpy as np

# Load API key
load_dotenv()
cohere_api_key = os.getenv('COHERE_API_KEY')
co = ClientV2(api_key=cohere_api_key)

# Configure batching to respect rate limits
# Cohere trial and free keys have strict rate limits.
# We'll use small batches and short pauses so we don't spam the API.
batch_size = 15  # Small batches to stay well under rate limits
wait_time = 15   # Seconds between batches

print("Generating embeddings for fixed token chunks...")
print(f"Total chunks: {len(all_chunks)}")
print(f"Batch size: {batch_size}")

all_embeddings = []
num_batches = (len(all_chunks) + batch_size - 1) // batch_size

for batch_idx in range(num_batches):
    start_idx = batch_idx * batch_size
    end_idx = min(start_idx + batch_size, len(all_chunks))
    batch = all_chunks[start_idx:end_idx]

    print(f"  Processing batch {batch_idx + 1}/{num_batches} (chunks {start_idx} to {end_idx})...")

    try:
        response = co.embed(
            texts=batch,
            model='embed-v4.0',
            input_type='search_document',
            embedding_types=['float']
        )
        all_embeddings.extend(response.embeddings.float_)

        # Wait between batches to avoid rate limits
        if batch_idx < num_batches - 1:
            time.sleep(wait_time)

    except Exception as e:
        print(f"  ⚠ Hit rate limit or error: {e}")
        print(f"  Waiting 60 seconds before retry...")
        time.sleep(60)

        # Retry the same batch
        response = co.embed(
            texts=batch,
            model='embed-v4.0',
            input_type='search_document',
            embedding_types=['float']
        )
        all_embeddings.extend(response.embeddings.float_)

        if batch_idx < num_batches - 1:
            time.sleep(wait_time)

print(f"✓ Generated {len(all_embeddings)} embeddings")

# Convert to numpy array for storage
embeddings_array = np.array(all_embeddings)
print(f"Embeddings shape: {embeddings_array.shape}")
Generating embeddings for fixed token chunks...
Total chunks: 914
Batch size: 15
  Processing batch 1/61 (chunks 0 to 15)...
  Processing batch 2/61 (chunks 15 to 30)...
  ...
  Processing batch 35/61 (chunks 510 to 525)...
  ⚠ Hit rate limit or error: Rate limit exceeded
  Waiting 60 seconds before retry...
  Processing batch 36/61 (chunks 525 to 540)...
  ...
✓ Generated 914 embeddings
Embeddings shape: (914, 1536)

Important note about rate limiting: We hit Cohere's rate limit during embedding generation. This isn't a failure or something to hide. It's a production reality. Our code handled it with a 60-second wait and retry. Good production code always anticipates and handles rate limits gracefully.

Exact limits depend on your plan and may change over time, so always check the provider docs and be ready to handle 429 "rate limit" errors.
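If you want something more reusable than the inline try/except above, a small retry helper with exponential backoff is a common pattern. This is just a sketch; the retry count and wait times are arbitrary choices, not Cohere recommendations:

def embed_with_retry(co, texts, max_retries=3, base_wait=30):
    """Call co.embed with simple exponential backoff on failure."""
    for attempt in range(max_retries):
        try:
            return co.embed(
                texts=texts,
                model='embed-v4.0',
                input_type='search_document',
                embedding_types=['float']
            )
        except Exception as e:
            if attempt == max_retries - 1:
                raise  # Give up after the final attempt
            wait = base_wait * (2 ** attempt)  # 30s, then 60s, then 120s...
            print(f"  ⚠ Embed call failed ({e}); retrying in {wait}s...")
            time.sleep(wait)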

Storing in ChromaDB

Now let's store our chunks in ChromaDB. Remember that ChromaDB won't let you create a collection that already exists. During development, you'll often regenerate chunks with different parameters, so we'll delete any existing collection first:

import chromadb

# Initialize ChromaDB client
client = chromadb.Client()  # In-memory client

# This in-memory client resets whenever you start a fresh Python session.
# Your collections and data will disappear when the script ends. Later tutorials
# will show you how to persist data across sessions using PersistentClient.

# Delete collection if it exists (useful for experimentation)
try:
    client.delete_collection(name="fixed_token_chunks")
    print("Deleted existing collection")
except Exception:
    pass  # Collection didn't exist, that's fine

# Create fresh collection
collection = client.create_collection(
    name="fixed_token_chunks",
    metadata={
        "description": "20 arXiv papers chunked with fixed token windows",
        "chunking_strategy": "fixed_token_windows",
        "chunk_size": 512,
        "overlap": 100
    }
)

# Prepare data for insertion
ids = [f"chunk_{i}" for i in range(len(all_chunks))]

print(f"Inserting {len(all_chunks)} chunks into ChromaDB...")

collection.add(
    ids=ids,
    embeddings=embeddings_array.tolist(),
    documents=all_chunks,
    metadatas=chunk_metadata
)

print(f"✓ Collection contains {collection.count()} chunks")
Deleted existing collection
Inserting 914 chunks into ChromaDB...
✓ Collection contains 914 chunks

Why delete and recreate? During development, you'll iterate on chunking strategies. Maybe you'll try different chunk sizes or overlap values. ChromaDB requires unique collection names, so the cleanest pattern is to delete the old version before creating the new one. This is standard practice while experimenting.
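As an aside, ChromaDB also offers get_or_create_collection, which returns an existing collection instead of raising an error. That's handy when you only want to append data, but when the chunks themselves change between runs, delete-and-recreate keeps old and new chunks from mixing:

# Alternative pattern: reuse the collection if it already exists
existing = client.get_or_create_collection(name="fixed_token_chunks")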

Our fixed token strategy is now complete: 914 chunks embedded and stored in ChromaDB.

Strategy 2: Sentence-Based Chunking

Let's implement our second approach: chunking based on sentence boundaries rather than arbitrary token positions.

The Concept

Instead of sliding a fixed window through tokens, sentence-based chunking respects the natural structure of language:

  1. Split text into sentences
  2. Group sentences together until reaching a target word count
  3. Never split a sentence in the middle
  4. Create a new chunk when adding the next sentence would exceed the target

This approach prioritizes semantic coherence over size consistency. A chunk might be 400 or 600 words, but it will always contain complete sentences that form a coherent thought.

Why sentence boundaries matter: Splitting mid-sentence destroys meaning. The sentence "Neural networks require careful tuning of hyperparameters to achieve optimal performance" loses critical context if split after "hyperparameters." Sentence-based chunking prevents this.

Implementation

We'll use NLTK for sentence tokenization:

import nltk

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

from nltk.tokenize import sent_tokenize

A quick note: Sentence tokenization on PDF-extracted text isn't always perfect, especially for technical papers with equations, citations, or unusual formatting. It works well enough for this tutorial, but if you experiment with your own papers, you might see occasional issues with sentences getting split or merged incorrectly.
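Here's a quick look at what sent_tokenize does with a toy string:

print(sent_tokenize("Neural networks need data. They also need careful tuning."))
# ['Neural networks need data.', 'They also need careful tuning.']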

def chunk_text_by_sentences(text, target_words=400, min_words=100):
    """
    Chunk text by grouping sentences until reaching target word count.

    Args:
        text: The document text to chunk
        target_words: Target words per chunk (default 400)
        min_words: Minimum words for a valid chunk (default 100)

    Returns:
        List of text chunks
    """
    # Split text into sentences
    sentences = sent_tokenize(text)

    chunks = []
    current_chunk = []
    current_word_count = 0

    for sentence in sentences:
        sentence_words = len(sentence.split())

        # If adding this sentence would exceed target, save current chunk
        if current_word_count > 0 and current_word_count + sentence_words > target_words:
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentence]
            current_word_count = sentence_words
        else:
            current_chunk.append(sentence)
            current_word_count += sentence_words

    # Don't forget the last chunk
    if current_chunk and current_word_count >= min_words:
        chunks.append(' '.join(current_chunk))

    return chunks

# Test on the same sample paper
sample_chunks_sent = chunk_text_by_sentences(sample_text, target_words=400)
print(f"Sample paper chunks (sentence-based): {len(sample_chunks_sent)}")
print(f"First chunk length: {len(sample_chunks_sent[0].split())} words")
print(f"First chunk preview: {sample_chunks_sent[0][:200]}...")
Sample paper chunks (sentence-based): 29
First chunk length: 392 words
First chunk preview: 1
Efficient Hyperdimensional Computing with
Modular Composite Representations
Marco Angioli, Christopher J. Kymn, Antonello Rosato, Amy L outfi, Mauro Olivieri, and Denis Kleyko
Abstract —The modular co...

The same paper that produced 51 fixed-token chunks now produces 29 sentence-based chunks. The first chunk is 392 words, close to our 400-word target but not exact.

Processing All Papers

Let's apply sentence-based chunking to all 20 papers:

# Process all papers with sentence-based chunking
all_chunks_sent = []
chunk_metadata_sent = []

for idx, row in df.iterrows():
    arxiv_id = row['arxiv_id']

    # Load paper text
    with open(papers_dir / f"{arxiv_id}.txt", 'r', encoding='utf-8') as f:
        text = f.read()

    # Chunk by sentences
    chunks = chunk_text_by_sentences(text, target_words=400, min_words=100)

    # Store each chunk with metadata
    for chunk_idx, chunk in enumerate(chunks):
        all_chunks_sent.append(chunk)
        chunk_metadata_sent.append({
            'arxiv_id': arxiv_id,
            'title': row['title'],
            'category': row['category'],
            'chunk_index': chunk_idx,
            'total_chunks': len(chunks),
            'chunking_strategy': 'sentence_based'
        })

print(f"Sentence-based chunking results:")
print(f"  Total chunks created: {len(all_chunks_sent)}")
print(f"  Average chunks per paper: {len(all_chunks_sent) / len(df):.1f}")
print(f"  Average words per chunk: {sum(len(c.split()) for c in all_chunks_sent) / len(all_chunks_sent):.0f}")

# Check chunk size distribution
chunk_word_counts_sent = [len(chunk.split()) for chunk in all_chunks_sent]
print(f"  Chunk size range: {min(chunk_word_counts_sent)} to {max(chunk_word_counts_sent)} words")
Sentence-based chunking results:
  Total chunks created: 513
  Average chunks per paper: 25.6
  Average words per chunk: 382
  Chunk size range: 110 to 548 words

Sentence-based chunking produced 513 chunks compared to fixed token's 914, about 44% fewer. Each chunk averages 382 words instead of 266. This isn't better or worse; it's a different tradeoff:

Fixed Token (914 chunks):

  • More chunks, smaller sizes
  • Consistent token counts
  • More embeddings to generate and store
  • Finer-grained retrieval granularity

Sentence-Based (513 chunks):

  • Fewer chunks, larger sizes
  • Variable sizes respecting sentences
  • Less storage and fewer embeddings
  • Preserves semantic coherence

Comparing Strategies Side-by-Side

Let's create a comparison table:

import pandas as pd

comparison_df = pd.DataFrame({
    'Metric': ['Total Chunks', 'Chunks per Paper', 'Avg Words per Chunk',
               'Min Words', 'Max Words'],
    'Fixed Token': [914, 45.7, 266, 16, 438],
    'Sentence-Based': [513, 25.6, 382, 110, 548]
})

print(comparison_df.to_string(index=False))
              Metric  Fixed Token  Sentence-Based
        Total Chunks          914             513
   Chunks per Paper          45.7            25.6
Avg Words per Chunk           266             382
           Min Words           16             110
           Max Words          438             548

Sentence-based chunking creates 44% fewer chunks. This means:

  • Lower costs: 44% fewer embeddings to generate
  • Less storage: 44% less data to store and query
  • Larger context: Each chunk contains more complete thoughts
  • Better coherence: Never splits mid-sentence

But remember, this isn't automatically "better." Smaller chunks can enable more precise retrieval. The choice depends on your use case.

Generating Embeddings for Sentence-Based Chunks

We'll use the same embedding process as before, with the same rate limiting pattern:

print("Generating embeddings for sentence-based chunks...")
print(f"Total chunks: {len(all_chunks_sent)}")

all_embeddings_sent = []
num_batches = (len(all_chunks_sent) + batch_size - 1) // batch_size

for batch_idx in range(num_batches):
    start_idx = batch_idx * batch_size
    end_idx = min(start_idx + batch_size, len(all_chunks_sent))
    batch = all_chunks_sent[start_idx:end_idx]

    print(f"  Processing batch {batch_idx + 1}/{num_batches} (chunks {start_idx} to {end_idx})...")

    try:
        response = co.embed(
            texts=batch,
            model='embed-v4.0',
            input_type='search_document',
            embedding_types=['float']
        )
        all_embeddings_sent.extend(response.embeddings.float_)

        if batch_idx < num_batches - 1:
            time.sleep(wait_time)

    except Exception as e:
        print(f"  ⚠ Hit rate limit: {e}")
        print(f"  Waiting 60 seconds...")
        time.sleep(60)

        response = co.embed(
            texts=batch,
            model='embed-v4.0',
            input_type='search_document',
            embedding_types=['float']
        )
        all_embeddings_sent.extend(response.embeddings.float_)

        if batch_idx < num_batches - 1:
            time.sleep(wait_time)

print(f"✓ Generated {len(all_embeddings_sent)} embeddings")

embeddings_array_sent = np.array(all_embeddings_sent)
print(f"Embeddings shape: {embeddings_array_sent.shape}")
Generating embeddings for sentence-based chunks...
Total chunks: 513
  Processing batch 1/35 (chunks 0 to 15)...
  ...
✓ Generated 513 embeddings
Embeddings shape: (513, 1536)

With 513 chunks instead of 914, embedding generation is faster and costs less. This is a concrete benefit of the sentence-based approach.

Storing Sentence-Based Chunks in ChromaDB

We'll create a separate collection for sentence-based chunks:

# Delete existing collection if present
try:
    client.delete_collection(name="sentence_chunks")
except Exception:
    pass  # Collection didn't exist, that's fine

# Create sentence-based collection
collection_sent = client.create_collection(
    name="sentence_chunks",
    metadata={
        "description": "20 arXiv papers chunked by sentences",
        "chunking_strategy": "sentence_based",
        "target_words": 400,
        "min_words": 100
    }
)

# Prepare and insert data
ids_sent = [f"chunk_{i}" for i in range(len(all_chunks_sent))]

print(f"Inserting {len(all_chunks_sent)} chunks into ChromaDB...")

collection_sent.add(
    ids=ids_sent,
    embeddings=embeddings_array_sent.tolist(),
    documents=all_chunks_sent,
    metadatas=chunk_metadata_sent
)

print(f"✓ Collection contains {collection_sent.count()} chunks")
Inserting 513 chunks into ChromaDB...
✓ Collection contains 513 chunks

Now we have two collections:

  • fixed_token_chunks with 914 chunks
  • sentence_chunks with 513 chunks

Both contain the same 20 papers, just chunked differently. Now comes the critical question: which strategy actually retrieves relevant content better?

Building an Evaluation Framework

We've created two chunking strategies and embedded all the chunks. But how do we know which one works better? We need a systematic way to measure retrieval quality.

The Evaluation Approach

Our evaluation framework works like this:

  1. Create test queries for specific papers we know should be retrieved
  2. Run each query against both chunking strategies
  3. Check if the expected paper appears in the top results
  4. Compare performance across strategies

The key is having ground truth: knowing which papers should match which queries.

Creating Good Test Queries

Here's something we learned the hard way during development: bad queries make any chunking strategy look bad.

When we first built this evaluation, we tried queries like "reinforcement learning optimization" for a paper that was actually about masked diffusion models. Both chunking strategies "failed" because we gave them an impossible task. The problem wasn't the chunking, it was our poor understanding of the documents.

The fix: Before creating queries, read the paper abstracts. Understand what each paper actually discusses. Then create queries that match real content.

Let's create five test queries based on actual paper content:

# Test queries designed from actual paper content
test_queries = [
    {
        "text": "knowledge editing in language models",
        "expected_paper": "2510.25798v1",  # MemEIC paper (cs.CL)
        "description": "Knowledge editing"
    },
    {
        "text": "masked diffusion models for inference optimization",
        "expected_paper": "2511.04647v2",  # Masked diffusion (cs.LG)
        "description": "Optimal inference schedules"
    },
    {
        "text": "robotic manipulation with spatial representations",
        "expected_paper": "2511.09555v1",  # SpatialActor (cs.CV)
        "description": "Robot manipulation"
    },
    {
        "text": "blockchain ledger technology for database integrity",
        "expected_paper": "2507.13932v1",  # Chain Table (cs.DB)
        "description": "Blockchain database integrity"
    },
    {
        "text": "automated test generation and oracle synthesis",
        "expected_paper": "2510.26423v1",  # Nexus (cs.SE)
        "description": "Multi-agent test oracles"
    }
]

print("Test queries:")
for i, query in enumerate(test_queries, 1):
    print(f"{i}. {query['text']}")
    print(f"   Expected paper: {query['expected_paper']}")
    print()
Test queries:
1. knowledge editing in language models
   Expected paper: 2510.25798v1

2. masked diffusion models for inference optimization
   Expected paper: 2511.04647v2

3. robotic manipulation with spatial representations
   Expected paper: 2511.09555v1

4. blockchain ledger technology for database integrity
   Expected paper: 2507.13932v1

5. automated test generation and oracle synthesis
   Expected paper: 2510.26423v1

These queries are specific enough to target particular papers but general enough to represent realistic search behavior. Each query matches actual content from its expected paper.

Implementing the Evaluation Loop

Now let's run these queries against both chunking strategies and compare results:

def evaluate_chunking_strategy(collection, test_queries, strategy_name):
    """
    Evaluate a chunking strategy using test queries.

    Returns:
        Dictionary with success rate and detailed results
    """
    results = []

    for query_info in test_queries:
        query_text = query_info['text']
        expected_paper = query_info['expected_paper']

        # Embed the query
        response = co.embed(
            texts=[query_text],
            model='embed-v4.0',
            input_type='search_query',
            embedding_types=['float']
        )
        query_embedding = np.array(response.embeddings.float_[0])

        # Search the collection
        search_results = collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=5
        )

        # Extract paper IDs from chunks
        retrieved_papers = []
        for metadata in search_results['metadatas'][0]:
            paper_id = metadata['arxiv_id']
            if paper_id not in retrieved_papers:
                retrieved_papers.append(paper_id)

        # Check if expected paper was found
        found = expected_paper in retrieved_papers
        position = retrieved_papers.index(expected_paper) + 1 if found else None
        best_distance = search_results['distances'][0][0]

        results.append({
            'query': query_text,
            'expected_paper': expected_paper,
            'found': found,
            'position': position,
            'best_distance': best_distance,
            'retrieved_papers': retrieved_papers[:3]  # Top 3 for comparison
        })

    # Calculate success rate
    success_rate = sum(1 for r in results if r['found']) / len(results)

    return {
        'strategy': strategy_name,
        'success_rate': success_rate,
        'results': results
    }

# Evaluate both strategies
print("Evaluating fixed token strategy...")
fixed_token_eval = evaluate_chunking_strategy(
    collection,
    test_queries,
    "Fixed Token Windows"
)

print("Evaluating sentence-based strategy...")
sentence_eval = evaluate_chunking_strategy(
    collection_sent,
    test_queries,
    "Sentence-Based"
)

print("\n" + "="*80)
print("EVALUATION RESULTS")
print("="*80)
Evaluating fixed token strategy...
Evaluating sentence-based strategy...

================================================================================
EVALUATION RESULTS
================================================================================

Comparing Results

Let's examine how each strategy performed:

def print_evaluation_results(eval_results):
    """Print evaluation results in a readable format"""
    strategy = eval_results['strategy']
    success_rate = eval_results['success_rate']
    results = eval_results['results']

    print(f"\n{strategy}")
    print("-" * 80)
    print(f"Success Rate: {len([r for r in results if r['found']])}/{len(results)} queries ({success_rate*100:.0f}%)")
    print()

    for i, result in enumerate(results, 1):
        status = "✓" if result['found'] else "✗"
        position = f"(position #{result['position']})" if result['found'] else ""

        print(f"{i}. {status} {result['query']}")
        print(f"   Expected: {result['expected_paper']}")
        print(f"   Found: {result['found']} {position}")
        print(f"   Best match distance: {result['best_distance']:.4f}")
        print(f"   Top 3 papers: {', '.join(result['retrieved_papers'][:3])}")
        print()

# Print results for both strategies
print_evaluation_results(fixed_token_eval)
print_evaluation_results(sentence_eval)

# Compare directly
print("\n" + "="*80)
print("DIRECT COMPARISON")
print("="*80)
print(f"{'Query':<60} {'Fixed':<10} {'Sentence':<10}")
print("-" * 80)

for i in range(len(test_queries)):
    query = test_queries[i]['text'][:55]
    fixed_pos = fixed_token_eval['results'][i]['position']
    sent_pos = sentence_eval['results'][i]['position']

    fixed_str = f"#{fixed_pos}" if fixed_pos else "Not found"
    sent_str = f"#{sent_pos}" if sent_pos else "Not found"

    print(f"{query:<60} {fixed_str:<10} {sent_str:<10}")
Fixed Token Windows
--------------------------------------------------------------------------------
Success Rate: 5/5 queries (100%)

1. ✓ knowledge editing in language models
   Expected: 2510.25798v1
   Found: True (position #1)
   Best match distance: 0.8865
   Top 3 papers: 2510.25798v1

2. ✓ masked diffusion models for inference optimization
   Expected: 2511.04647v2
   Found: True (position #1)
   Best match distance: 0.9526
   Top 3 papers: 2511.04647v2

3. ✓ robotic manipulation with spatial representations
   Expected: 2511.09555v1
   Found: True (position #1)
   Best match distance: 0.9209
   Top 3 papers: 2511.09555v1

4. ✓ blockchain ledger technology for database integrity
   Expected: 2507.13932v1
   Found: True (position #1)
   Best match distance: 0.6678
   Top 3 papers: 2507.13932v1

5. ✓ automated test generation and oracle synthesis
   Expected: 2510.26423v1
   Found: True (position #1)
   Best match distance: 0.9395
   Top 3 papers: 2510.26423v1

Sentence-Based
--------------------------------------------------------------------------------
Success Rate: 5/5 queries (100%)

1. ✓ knowledge editing in language models
   Expected: 2510.25798v1
   Found: True (position #1)
   Best match distance: 0.8831
   Top 3 papers: 2510.25798v1

2. ✓ masked diffusion models for inference optimization
   Expected: 2511.04647v2
   Found: True (position #1)
   Best match distance: 0.9586
   Top 3 papers: 2511.04647v2, 2511.07930v1

3. ✓ robotic manipulation with spatial representations
   Expected: 2511.09555v1
   Found: True (position #1)
   Best match distance: 0.8863
   Top 3 papers: 2511.09555v1

4. ✓ blockchain ledger technology for database integrity
   Expected: 2507.13932v1
   Found: True (position #1)
   Best match distance: 0.6746
   Top 3 papers: 2507.13932v1

5. ✓ automated test generation and oracle synthesis
   Expected: 2510.26423v1
   Found: True (position #1)
   Best match distance: 0.9320
   Top 3 papers: 2510.26423v1

================================================================================
DIRECT COMPARISON
================================================================================
Query                                                        Fixed      Sentence  
--------------------------------------------------------------------------------
knowledge editing in language models                         #1         #1        
masked diffusion models for inference optimization           #1         #1        
robotic manipulation with spatial representations            #1         #1        
blockchain ledger technology for database integrity          #1         #1        
automated test generation and oracle synthesis               #1         #1       

Understanding the Results

Let's break down what these results tell us:

Key Finding 1: Both Strategies Work Well

Both chunking strategies achieved 100% success rate. Every test query successfully retrieved its expected paper at position #1. This is the most important result.

When you have good queries that match actual document content, chunking strategy matters less than you might think. Both approaches work because they both preserve the semantic meaning of the content, just in slightly different ways.

Key Finding 2: Sentence-Based Has Better Distances

Look at the distance values. ChromaDB uses squared Euclidean distance by default, where lower values indicate higher similarity:

Query 1 (knowledge editing):

  • Fixed token: 0.8865
  • Sentence-based: 0.8831 (better)

Query 3 (robotic manipulation):

  • Fixed token: 0.9209
  • Sentence-based: 0.8863 (better)

Query 5 (automated test generation):

  • Fixed token: 0.9395
  • Sentence-based: 0.9320 (better)

In 3 out of 5 queries, sentence-based chunking produced lower distances, meaning its best-matching chunk sat closer to the query in embedding space. This suggests that preserving sentence boundaries helps maintain semantic coherence, producing embeddings that better capture document meaning.
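For reference, the distance ChromaDB reports with its default 'l2' space is the squared Euclidean distance between the query embedding and the chunk embedding. Computing it by hand is a few lines of numpy:

def squared_l2(a, b):
    """Squared Euclidean distance, matching ChromaDB's default 'l2' space."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.sum((a - b) ** 2))

# Tiny example with made-up vectors: (0.5**2 + 0.5**2) == 0.5
print(squared_l2([1.0, 2.0], [1.5, 2.5]))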

Key Finding 3: Low Agreement in Secondary Results

While both strategies found the right paper at #1, look at the papers in positions #2 and #3. They often differ between strategies:

Query 1: Both found the same top 3 papers
Query 2: Top paper matches, but #2 and #3 differ
Query 5: Only the top paper matches; #2 and #3 are completely different

This happens because chunk size and boundaries affect which papers surface as similar. Neither is "wrong"; they simply surface different secondary candidates. The important thing is that both got the most relevant paper right.
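If you'd rather quantify that agreement than eyeball it, you can compare the retrieved paper lists we already stored during evaluation:

# How many of the same papers appear in each strategy's top results?
for i, query_info in enumerate(test_queries):
    fixed_top = set(fixed_token_eval['results'][i]['retrieved_papers'])
    sent_top = set(sentence_eval['results'][i]['retrieved_papers'])
    print(f"Query {i + 1}: {len(fixed_top & sent_top)} shared paper(s) "
          f"out of {len(fixed_top | sent_top)} retrieved")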

What This Means for Your Projects

So which chunking strategy should you choose? The answer is: it depends on your constraints and priorities.

Choose Fixed Token Windows when:

  • You need consistent chunk sizes for batch processing or downstream tasks
  • Storage isn't a concern and you want finer-grained retrieval
  • Your documents lack clear sentence structure (logs, code, transcripts)
  • You're working with multilingual content where sentence detection is unreliable

Choose Sentence-Based Chunking when:

  • You want to minimize storage costs (44% fewer chunks)
  • Semantic coherence is more important than size consistency
  • Your documents have clear sentence boundaries (articles, papers, documentation)
  • You want better similarity scores (as our results suggest)

The honest truth: Both strategies work well. If you implement either one properly, you'll get good retrieval results. The choice is less about "which is better" and more about which tradeoffs align with your project constraints.

Beyond These Two Strategies

We've implemented two practical chunking strategies, but there's a third approach worth knowing about: structure-aware chunking.

The Concept

Instead of chunking based on arbitrary token boundaries or sentence groupings, structure-aware chunking respects the logical organization of documents:

  • Academic papers have sections: Introduction, Methods, Results, Discussion
  • Technical documentation has headers, code blocks, and lists
  • Web pages have HTML structure: headings, paragraphs, articles
  • Markdown files have explicit hierarchy markers

Structure-aware chunking says: "Don't just group words or sentences. Recognize that this is an Introduction section, and this is a Methods section, and keep them separate."

Simple Implementation Example

Here's what structure-aware chunking might look like for markdown documents:

def chunk_by_markdown_sections(text, min_words=100):
    """
    Chunk text by markdown section headers.
    Each section becomes one chunk (or multiple if very long).
    """
    chunks = []
    current_section = []

    for line in text.split('\n'):
        # Detect section headers (lines starting with #)
        if line.startswith('#'):
            # Save previous section if it exists
            if current_section:
                section_text = '\n'.join(current_section)
                if len(section_text.split()) >= min_words:
                    chunks.append(section_text)
            # Start new section
            current_section = [line]
        else:
            current_section.append(line)

    # Don't forget the last section
    if current_section:
        section_text = '\n'.join(current_section)
        if len(section_text.split()) >= min_words:
            chunks.append(section_text)

    return chunks

This is a deliberately minimal implementation, but it illustrates the concept: identify structure markers and use them to define chunk boundaries.
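For example, applied to any markdown file you have on hand (the filename here is hypothetical):

from pathlib import Path

readme_text = Path('README.md').read_text(encoding='utf-8')  # Hypothetical file
sections = chunk_by_markdown_sections(readme_text, min_words=50)
print(f"{len(sections)} section-level chunks")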

When Structure-Aware Chunking Helps

Structure-aware chunking excels when:

  • Document structure matches query patterns: If users search for "Methods," they probably want the Methods section, not a random 512-token window that happens to include some methods
  • Context boundaries are important: Code with comments, FAQs with Q&A pairs, API documentation with endpoints
  • Sections have distinct topics: A paper discussing both neural networks and patient privacy should keep those sections separate

Why We Didn't Implement It Fully

The evaluation framework we built works for any chunking strategy. You have all the tools needed to implement and test structure-aware chunking yourself:

  1. Write a chunking function that respects document structure
  2. Generate embeddings for your chunks
  3. Store them in ChromaDB
  4. Use our evaluation framework to compare against the strategies we built

The process is identical. The only difference is how you define chunk boundaries.

Hyperparameter Tuning Guidance

We made specific choices for our chunking parameters:

  • Fixed token: 512 tokens with 100-token overlap (20%)
  • Sentence-based: 400-word target with 100-word minimum

Are these optimal? Maybe, maybe not. They're reasonable defaults that worked well for academic papers. But your documents might benefit from different values.

When to Experiment with Different Parameters

Try smaller chunks (256 tokens or 200 words) when:

  • Queries target specific facts rather than broad concepts
  • Precision matters more than context
  • Storage costs aren't a constraint

Try larger chunks (1024 tokens or 600 words) when:

  • Context matters more than precision
  • Your queries are conceptual rather than factual
  • You want to reduce the total number of embeddings

Adjust overlap when:

  • Concepts frequently span chunk boundaries (increase overlap to 30-40%)
  • Storage costs are critical (reduce overlap to 10%)
  • You notice important information getting split

How to Experiment Systematically

The evaluation framework we built makes experimentation straightforward:

  1. Modify chunking parameters
  2. Generate new chunks and embeddings
  3. Store in a new ChromaDB collection
  4. Run your test queries
  5. Compare results

Don't spend hours tuning parameters before you know if chunking helps at all. Start with reasonable defaults (like ours), measure performance, then tune if needed. Most projects never need aggressive parameter tuning.
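As a cheap first step, you can see how different parameters change the chunk counts before spending any embedding budget; the size/overlap pairs below are just examples:

# Compare chunk counts across a few parameter settings (no API calls needed)
for size, ov in [(256, 50), (512, 100), (1024, 200)]:
    total = 0
    for arxiv_id in df['arxiv_id']:
        text = (papers_dir / f"{arxiv_id}.txt").read_text(encoding='utf-8')
        total += len(chunk_text_fixed_tokens(text, chunk_size=size, overlap=ov))
    print(f"chunk_size={size}, overlap={ov}: {total} total chunks")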

Practical Exercise

Now it's your turn to experiment. Here are some variations to try:

Option 1: Modify Fixed Token Strategy

Change the chunk size to 256 or 1024 tokens. How does this affect:

  • Total number of chunks?
  • Success rate on test queries?
  • Average similarity distances?
# Try this
chunks_small = chunk_text_fixed_tokens(sample_text, chunk_size=256, overlap=50)
chunks_large = chunk_text_fixed_tokens(sample_text, chunk_size=1024, overlap=200)

Option 2: Modify Sentence-Based Strategy

Adjust the target word count to 200 or 600 words:

# Try this
chunks_small_sent = chunk_text_by_sentences(sample_text, target_words=200)
chunks_large_sent = chunk_text_by_sentences(sample_text, target_words=600)

Option 3: Implement Structure-Aware Chunking

If your papers have clear section markers, try implementing a structure-aware chunker. Use the evaluation framework to compare it against our two strategies.

Reflection Questions

As you experiment, consider:

  • When would you choose fixed token over sentence-based chunking?
  • How would you chunk code documentation? Chat logs? News articles?
  • What chunk size makes sense for a chatbot knowledge base? For legal documents?
  • How does overlap affect retrieval quality in your tests?

Summary and Next Steps

We've built and evaluated two complete chunking strategies for vector databases. Here's what we accomplished:

Core Skills Gained

Implementation:

  • Fixed token window chunking with overlap (914 chunks from 20 papers)
  • Sentence-based chunking respecting linguistic boundaries (513 chunks)
  • Batch processing with rate limit handling
  • ChromaDB collection management for experimentation

Evaluation:

  • Systematic evaluation framework with ground truth queries
  • Measuring success rate and ranking position
  • Comparing strategies quantitatively using real performance data
  • Understanding that query quality matters more than chunking strategy

Key Takeaways

  • No Universal "Best" Chunking Strategy: Both strategies achieved 100% success when given good queries. The choice depends on your constraints (storage, semantic coherence, document structure) rather than one approach being objectively better.
  • Query Quality Matters Most: Bad queries make any chunking strategy look bad. Before evaluating chunking, understand your documents and create queries that match actual content. This lesson applies to all retrieval systems, not just chunking.
  • Sentence-Based Provides Better Distances: In 3 out of 5 test queries, sentence-based chunking had lower distances (higher similarity). Preserving sentence boundaries helps maintain semantic coherence in embeddings.
  • Tradeoffs Are Real: Fixed token creates about 1.8x as many chunks as sentence-based (914 vs 513). That means more storage and more embeddings to generate, which gets expensive at scale. In exchange, you get finer retrieval granularity. Neither is wrong; they optimize for different things. And remember that with overlap you pay for every chunk: smaller chunks plus overlap means significantly higher API costs when embedding large document collections.
  • Edge Cases Are Normal: That 16-word chunk from fixed token chunking? The 548-word chunk from sentence-based? These are real-world behaviors, not bugs. Production systems handle imperfect inputs gracefully.

Looking Ahead

We now know how to chunk documents and store them in ChromaDB. But what if we want to enhance our searches? What if we need to filter results by publication year? Search only computer vision papers? Combine semantic similarity with traditional keyword matching?

An upcoming tutorial will teach:

  • Designing metadata schemas for effective filtering
  • Combining vector similarity with metadata constraints
  • Implementing hybrid search (BM25 + vector similarity)
  • Understanding performance tradeoffs of different filtering approaches
  • Making metadata work at scale

Before moving on, make sure you understand:

  • How fixed token and sentence-based chunking differ
  • When to choose each strategy based on project needs
  • How to evaluate chunking systematically with test queries
  • Why query quality matters more than chunking strategy
  • How to handle rate limits and ChromaDB collection management

When you're comfortable with these chunking fundamentals, you're ready to enhance your vector search with metadata and hybrid approaches.


Appendix: Dataset Preparation Code

This appendix provides the complete code we used to prepare the dataset for this tutorial. You don't need to run this code to complete the tutorial, but it's here if you want to:

  • Understand how we selected and downloaded papers from arXiv
  • Extract text from your own PDF files
  • Extend the dataset with different papers or categories

Downloading Papers from arXiv

We selected 20 papers (4 from each category) from the 5,000-paper dataset used in the previous tutorial. Here's how we downloaded the PDFs:

import urllib.request
import pandas as pd
from pathlib import Path
import time

def download_arxiv_pdf(arxiv_id, save_dir):
    """
    Download a paper PDF from arXiv.

    Args:
        arxiv_id: The arXiv ID (e.g., '2510.25798v1')
        save_dir: Directory to save the PDF

    Returns:
        Path to downloaded PDF or None if failed
    """
    # Create save directory if it doesn't exist
    save_dir = Path(save_dir)
    save_dir.mkdir(exist_ok=True)

    # Construct arXiv PDF URL
    # arXiv URLs follow pattern: https://arxiv.org/pdf/{id}.pdf
    pdf_url = f"https://arxiv.org/pdf/{arxiv_id}.pdf"
    save_path = save_dir / f"{arxiv_id}.pdf"

    try:
        print(f"Downloading {arxiv_id}...")
        urllib.request.urlretrieve(pdf_url, save_path)
        print(f"  ✓ Saved to {save_path}")
        return save_path
    except Exception as e:
        print(f"  ✗ Failed: {e}")
        return None

# Example: Download papers from our metadata file
df = pd.read_csv('arxiv_20papers_metadata.csv')

pdf_dir = Path('arxiv_pdfs')
for arxiv_id in df['arxiv_id']:
    download_arxiv_pdf(arxiv_id, pdf_dir)
    time.sleep(1)  # Be respectful to arXiv servers

Important: The code above respects arXiv's servers by adding a 1-second delay between downloads. For larger downloads, consider using arXiv's bulk data access or API.

Extracting Text from PDFs

Once we had the PDFs, we extracted text using PyPDF2:

import PyPDF2
from pathlib import Path

def extract_paper_text(pdf_path):
    """
    Extract text from a PDF file.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        Extracted text as a string
    """
    try:
        with open(pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)

            # Extract text from all pages
            text = ""
            for page in reader.pages:
                text += page.extract_text()

            return text
    except Exception as e:
        print(f"Error extracting {pdf_path}: {e}")
        return None

def extract_all_papers(pdf_dir, output_dir):
    """
    Extract text from all PDFs in a directory.

    Args:
        pdf_dir: Directory containing PDF files
        output_dir: Directory to save extracted text files
    """
    pdf_dir = Path(pdf_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)

    pdf_files = list(pdf_dir.glob('*.pdf'))
    print(f"Found {len(pdf_files)} PDF files")

    success_count = 0
    for pdf_path in pdf_files:
        print(f"Extracting {pdf_path.name}...")

        text = extract_paper_text(pdf_path)

        if text:
            # Save as text file with same name
            output_path = output_dir / f"{pdf_path.stem}.txt"
            with open(output_path, 'w', encoding='utf-8') as f:
                f.write(text)

            word_count = len(text.split())
            print(f"  ✓ Extracted {word_count:,} words")
            success_count += 1
        else:
            print(f"  ✗ Failed to extract")

    print(f"\nSuccessfully extracted {success_count}/{len(pdf_files)} papers")

# Example: Extract all papers
extract_all_papers('arxiv_pdfs', 'arxiv_fulltext_papers')

Paper Selection Process

We selected 20 papers from the 5,000-paper dataset used in the previous tutorial. The selection criteria were:

import pandas as pd
import numpy as np

# Load the original 5k dataset
df_5k = pd.read_csv('arxiv_papers_5k.csv')

# Select 4 papers from each category
categories = ['cs.CL', 'cs.CV', 'cs.DB', 'cs.LG', 'cs.SE']
selected_papers = []

np.random.seed(42)  # For reproducibility

for category in categories:
    # Get papers from this category
    category_papers = df_5k[df_5k['category'] == category]

    # Randomly sample 4 papers
    # In practice, we also checked that abstracts were substantial
    sampled = category_papers.sample(n=4, random_state=42)
    selected_papers.append(sampled)

# Combine all selected papers
df_selected = pd.concat(selected_papers, ignore_index=True)

# Save to new metadata file
df_selected.to_csv('arxiv_20papers_metadata.csv', index=False)
print(f"Selected {len(df_selected)} papers:")
print(df_selected['category'].value_counts().sort_index())

Text Quality Considerations

PDF extraction isn't perfect. Common issues include:

Formatting artifacts:

  • Extra spaces between words
  • Line breaks in unexpected places
  • Mathematical symbols rendered as Unicode
  • Headers/footers appearing in body text

Handling these issues:

def clean_extracted_text(text):
    """
    Basic cleaning for extracted PDF text.
    """
    # Remove excessive whitespace
    text = ' '.join(text.split())

    # Remove common artifacts (customize based on your PDFs)
    text = text.replace('ﬁ', 'fi')   # Normalize the "fi" ligature character
    text = text.replace('’', "'")    # Normalize curly apostrophes

    return text

# Apply cleaning when extracting
text = extract_paper_text(pdf_path)
if text:
    text = clean_extracted_text(text)
    # Now save cleaned text

We kept cleaning minimal for this tutorial to show realistic extraction results. In production, you might implement more aggressive cleaning depending on your PDF sources.

Why We Chose These 20 Papers

The 20 papers in this tutorial were selected to provide:

  1. Diversity across topics: 4 papers each from Machine Learning, Computer Vision, Computational Linguistics, Databases, and Software Engineering
  2. Variety in length: Papers range from 2,735 to 20,763 words
  3. Realistic content: Papers published in 2024-2025 with modern topics
  4. Successful extraction: All 20 papers extracted cleanly with readable text

This diversity ensures that chunking strategies are tested across different writing styles, document lengths, and technical domains rather than being optimized for a single type of content.


You now have all the code needed to prepare your own document chunking datasets. The same pattern works for any PDF collection: download, extract, clean, and chunk.

Key Reminders:

  • Both chunking strategies work well (100% success rate) with proper queries
  • Sentence-based requires 44% less storage (513 vs 914 chunks)
  • Sentence-based shows slightly better similarity distances
  • Fixed token provides more consistent sizes and finer granularity
  • Query quality matters more than chunking strategy
  • Rate limiting is normal production behavior, handle it gracefully
  • ChromaDB collection deletion is standard during experimentation
  • Edge cases (tiny chunks, variable sizes) are expected and usually fine
  • Evaluation frameworks transfer to any chunking strategy
  • Choose based on your constraints (storage, coherence, structure) not on "best practice"

About the author

Mike Levy

Mike is a life-long learner who is passionate about mathematics, coding, and teaching. When he's not sitting at the keyboard, he can be found in his garden or at a natural hot spring.