Document Chunking Strategies for Vector Databases
In the previous tutorial, we built a vector database with ChromaDB and ran semantic similarity searches across 5,000 arXiv papers. Our dataset consisted of paper abstracts, each about 200 words long. These abstracts were perfect for embedding as single units: short enough to fit comfortably in an embedding model's context window, yet long enough to capture meaningful semantic content.
But here's the challenge we didn't face yet: What happens when you need to search through full research papers, technical documentation, or long articles? A typical research paper contains 10,000 words. A comprehensive technical guide might have 50,000 words. These documents are far too long to embed as single vectors.
When documents are too long, you need to break them into chunks. This tutorial teaches you how to implement different chunking strategies, evaluate their performance systematically, and understand the tradeoffs between approaches. By the end, you'll know how to make informed decisions about chunking for your own projects.
Why Chunking Still Matters
You might be thinking: "Modern LLMs can handle massive amounts of data. Can't I just embed entire documents?"
There are three reasons why chunking remains essential:
1. Embedding Models Have Context Limits
Many embedding models still have much smaller context limits than modern chat models, and long inputs are also more expensive to embed. Even when a model can technically handle a whole paper, you usually don't want one giant vector: smaller chunks give you better retrieval and lower cost.
2. Retrieval Quality Depends on Granularity
Imagine someone searches for "robotic manipulation techniques." If you embedded an entire 10,000-word paper as a single vector, that search would match the whole paper, even if only one 400-word section actually discusses robotic manipulation. Chunking lets you retrieve the specific relevant section rather than forcing the user to read an entire paper.
3. Semantic Coherence Matters
A single document might cover multiple distinct topics. A paper about machine learning for healthcare might discuss neural network architectures in one section and patient privacy in another. These topics deserve separate embeddings so each can be retrieved independently when relevant.
The question isn't whether to chunk, but how to chunk intelligently. That's what we're going to figure out together.
What You'll Learn
By the end of this tutorial, you'll be able to:
- Understand why chunking strategies affect retrieval quality
- Implement two practical chunking approaches: fixed token windows and sentence-based chunking
- Generate embeddings for chunks and store them in ChromaDB
- Build a systematic evaluation framework to compare strategies
- Interpret real performance data showing when each strategy excels
- Make informed decisions about chunk size and strategy for your projects
- Recognize that query quality matters more than chunking strategy
Most importantly, you'll learn how to evaluate your chunking decisions using real measurements rather than guesses.
Dataset and Setup
For this tutorial, we're working with 20 full research papers from the same arXiv dataset we used previously. These papers are balanced across five computer science categories:
- cs.CL (Computational Linguistics): 4 papers
- cs.CV (Computer Vision): 4 papers
- cs.DB (Databases): 4 papers
- cs.LG (Machine Learning): 4 papers
- cs.SE (Software Engineering): 4 papers
We extracted the full text from these papers, and here's what makes them perfect for learning about chunking:
- Total corpus: 196,181 words
- Average paper length: 9,809 words (compared to 200-word abstracts)
- Range: 2,735 to 20,763 words per paper
- Content: Real academic papers with typical formatting artifacts
These papers are long enough to require chunking, diverse enough to test strategies across topics, and messy enough to reflect real-world document processing.
Required Files
Download arxiv_metadata_and_papers.zip and extract it to your working directory. This archive contains:
- arxiv_20papers_metadata.csv - Metadata including title, abstract, authors, published date, category, and arXiv ID for the 20 selected papers
- arxiv_fulltext_papers/ - Directory with the 20 text files (one per paper)
You'll also need the same Python environment from the previous tutorial, plus two additional packages:
# If you're starting fresh, create a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install required packages
# Python 3.12 with these versions:
# chromadb==1.3.4
# numpy==2.0.2
# pandas==2.2.2
# cohere==5.20.0
# python-dotenv==1.1.1
# nltk==3.9.1
# tiktoken==0.12.0
pip install chromadb numpy pandas cohere python-dotenv nltk tiktoken
Make sure you have a .env file with your Cohere API key:
COHERE_API_KEY=your_key_here
Loading the Papers
Let's load our papers and examine what we're working with:
import pandas as pd
from pathlib import Path
# Load paper metadata
df = pd.read_csv('arxiv_20papers_metadata.csv')
papers_dir = Path('arxiv_fulltext_papers')
print(f"Loaded {len(df)} papers")
print(f"\nPapers per category:")
print(df['category'].value_counts().sort_index())
# Calculate corpus statistics
total_words = 0
word_counts = []
for arxiv_id in df['arxiv_id']:
with open(papers_dir / f"{arxiv_id}.txt", 'r', encoding='utf-8') as f:
text = f.read()
words = len(text.split())
word_counts.append(words)
total_words += words
print(f"\nCorpus statistics:")
print(f" Total words: {total_words:,}")
print(f" Average words per paper: {sum(word_counts) / len(word_counts):.0f}")
print(f" Range: {min(word_counts):,} to {max(word_counts):,} words")
# Show a sample paper
sample_id = df['arxiv_id'].iloc[0]
with open(papers_dir / f"{sample_id}.txt", 'r', encoding='utf-8') as f:
text = f.read()
print(f"\nSample paper ({sample_id}):")
print(f" Title: {df[df['arxiv_id'] == sample_id]['title'].iloc[0]}")
print(f" Category: {df[df['arxiv_id'] == sample_id]['category'].iloc[0]}")
print(f" Length: {len(text.split()):,} words")
print(f" Preview: {text[:300]}...")
Loaded 20 papers
Papers per category:
category
cs.CL 4
cs.CV 4
cs.DB 4
cs.LG 4
cs.SE 4
Name: count, dtype: int64
Corpus statistics:
Total words: 196,181
Average words per paper: 9809
Range: 2,735 to 20,763 words
Sample paper (2511.09708v1):
Title: Efficient Hyperdimensional Computing with Modular Composite Representations
Category: cs.LG
Length: 11,293 words
Preview: 1
Efficient Hyperdimensional Computing with Modular Composite Representations
Marco Angioli, Christopher J. Kymn, Antonello Rosato, Amy L outfi, Mauro Olivieri, and Denis Kleyko
Abstract —The modular composite representation (MCR) is
a computing model that represents information with high-
dimensional...
We have 20 papers averaging nearly 10,000 words each. Compare this to the 200-word abstracts we used previously, and you can see why chunking becomes necessary. A 10,000-word paper cannot be embedded as a single unit without losing the ability to retrieve specific relevant sections.
A Note About Paper Extraction
The papers you're working with were extracted from PDFs using PyPDF2. We've provided the extracted text files so you can focus on chunking strategies rather than PDF processing. The extraction process is straightforward but involves details that aren't central to learning about chunking.
If you're curious about how we downloaded the PDFs and extracted the text, or if you want to extend this work with different papers, you'll find the complete code in the Appendix at the end of this tutorial. For now, just know that we:
- Downloaded 20 papers from arXiv (4 from each category)
- Extracted text from each PDF using PyPDF2
- Saved the extracted text to individual files
The extracted text has minor formatting artifacts like extra spaces or split words, but that's realistic. Real-world document processing always involves some noise. The chunking strategies we'll implement handle this gracefully.
Strategy 1: Fixed Token Windows with Overlap
Let's start with the most common chunking approach in production systems: sliding a fixed-size window across the document with some overlap.
The Concept
Imagine reading a book through a window that shows exactly 500 words at a time. When you finish one window, you slide it forward by 400 words, creating a 100-word overlap with the previous window. This continues until you reach the end of the book.
Fixed token windows work the same way:
- Choose a chunk size (we'll use 512 tokens)
- Choose an overlap (we'll use 100 tokens, about 20%)
- Slide the window through the document
- Each window becomes one chunk
Why overlap? Concepts often span boundaries between chunks. If we chunk without overlap, we might split a crucial sentence or paragraph, losing semantic coherence. The 20% overlap ensures that even if something gets split, it appears complete in at least one chunk.
Implementation
Let's implement this strategy. We'll use tiktoken for accurate token counting:
import tiktoken
def chunk_text_fixed_tokens(text, chunk_size=512, overlap=100):
"""
Chunk text using fixed token windows with overlap.
Args:
text: The document text to chunk
chunk_size: Number of tokens per chunk (default 512)
overlap: Number of tokens to overlap between chunks (default 100)
Returns:
List of text chunks
"""
# We'll use tiktoken just to approximate token lengths.
# In production, you'd usually match the tokenizer to your embedding model.
encoding = tiktoken.get_encoding("cl100k_base")
# Tokenize the entire text
tokens = encoding.encode(text)
chunks = []
start_idx = 0
while start_idx < len(tokens):
# Get chunk_size tokens starting from start_idx
end_idx = start_idx + chunk_size
chunk_tokens = tokens[start_idx:end_idx]
# Decode tokens back to text
chunk_text = encoding.decode(chunk_tokens)
chunks.append(chunk_text)
# Move start_idx forward by (chunk_size - overlap)
# This creates the overlap between consecutive chunks
start_idx += (chunk_size - overlap)
# Stop if we've reached the end
if end_idx >= len(tokens):
break
return chunks
# Test on a sample paper
sample_id = df['arxiv_id'].iloc[0]
with open(papers_dir / f"{sample_id}.txt", 'r', encoding='utf-8') as f:
sample_text = f.read()
sample_chunks = chunk_text_fixed_tokens(sample_text)
print(f"Sample paper chunks: {len(sample_chunks)}")
print(f"First chunk length: {len(sample_chunks[0].split())} words")
print(f"First chunk preview: {sample_chunks[0][:200]}...")
Sample paper chunks: 51
First chunk length: 323 words
First chunk preview: 1 Efficient Hyperdimensional Computing with Modular Composite Representations
Marco Angioli, Christopher J. Kymn, Antonello Rosato, Amy L outfi, Mauro Olivieri, and Denis Kleyko
Abstract —The modular co...
Our sample paper produced 51 chunks with the first chunk containing 323 words. The implementation is working as expected.
Processing All Papers
Now let's apply this chunking strategy to all 20 papers:
# Process all papers and collect chunks
all_chunks = []
chunk_metadata = []
for idx, row in df.iterrows():
arxiv_id = row['arxiv_id']
# Load paper text
with open(papers_dir / f"{arxiv_id}.txt", 'r', encoding='utf-8') as f:
text = f.read()
# Chunk the paper
chunks = chunk_text_fixed_tokens(text, chunk_size=512, overlap=100)
# Store each chunk with metadata
for chunk_idx, chunk in enumerate(chunks):
all_chunks.append(chunk)
chunk_metadata.append({
'arxiv_id': arxiv_id,
'title': row['title'],
'category': row['category'],
'chunk_index': chunk_idx,
'total_chunks': len(chunks),
'chunking_strategy': 'fixed_token_windows'
})
print(f"Fixed token chunking results:")
print(f" Total chunks created: {len(all_chunks)}")
print(f" Average chunks per paper: {len(all_chunks) / len(df):.1f}")
print(f" Average words per chunk: {sum(len(c.split()) for c in all_chunks) / len(all_chunks):.0f}")
# Check chunk size distribution
chunk_word_counts = [len(chunk.split()) for chunk in all_chunks]
print(f" Chunk size range: {min(chunk_word_counts)} to {max(chunk_word_counts)} words")
Fixed token chunking results:
Total chunks created: 914
Average chunks per paper: 45.7
Average words per chunk: 266
Chunk size range: 16 to 438 words
We created 914 chunks from our 20 papers. Each paper produced about 46 chunks, averaging 266 words each. Notice the wide range: 16 to 438 words. This happens because tokens don't map exactly to words, and our stopping condition creates a small final chunk for some papers.
Edge Cases and Real-World Behavior
That 16-word chunk? It's not a bug. It's what happens when the final portion of a paper contains fewer tokens than our chunk size. In production, you might choose to:
- Merge tiny final chunks with the previous chunk
- Set a minimum chunk size threshold
- Accept them as is (they're rare and often don't hurt retrieval)
We're keeping them to show real-world chunking behavior. Perfect uniformity isn't always necessary or beneficial.
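If you'd rather clean these up, folding a tiny trailing chunk into the previous one takes only a few lines. Here's a minimal sketch; the helper name and the 50-word threshold are our own choices, and because consecutive chunks overlap, the merged chunk will repeat a little text:
def merge_small_final_chunk(chunks, min_words=50):
    """If the last chunk is tiny, append it to the previous chunk (illustrative helper)."""
    if len(chunks) >= 2 and len(chunks[-1].split()) < min_words:
        return chunks[:-2] + [chunks[-2] + " " + chunks[-1]]
    return chunks
# Example on the sample paper's chunks (it may already end in a full-sized chunk)
merged = merge_small_final_chunk(sample_chunks)
print(f"Chunks before: {len(sample_chunks)}, after merging: {len(merged)}")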
Generating Embeddings
Now we need to embed our 914 chunks using Cohere's API. This is where we need to be careful about rate limits:
from cohere import ClientV2
from dotenv import load_dotenv
import os
import time
import numpy as np
# Load API key
load_dotenv()
cohere_api_key = os.getenv('COHERE_API_KEY')
co = ClientV2(api_key=cohere_api_key)
# Configure batching to respect rate limits
# Cohere trial and free keys have strict rate limits.
# We'll use small batches and short pauses so we don't spam the API.
batch_size = 15 # Small batches to stay well under rate limits
wait_time = 15 # Seconds between batches
print("Generating embeddings for fixed token chunks...")
print(f"Total chunks: {len(all_chunks)}")
print(f"Batch size: {batch_size}")
all_embeddings = []
num_batches = (len(all_chunks) + batch_size - 1) // batch_size
for batch_idx in range(num_batches):
start_idx = batch_idx * batch_size
end_idx = min(start_idx + batch_size, len(all_chunks))
batch = all_chunks[start_idx:end_idx]
print(f" Processing batch {batch_idx + 1}/{num_batches} (chunks {start_idx} to {end_idx})...")
try:
response = co.embed(
texts=batch,
model='embed-v4.0',
input_type='search_document',
embedding_types=['float']
)
all_embeddings.extend(response.embeddings.float_)
# Wait between batches to avoid rate limits
if batch_idx < num_batches - 1:
time.sleep(wait_time)
except Exception as e:
print(f" ⚠ Hit rate limit or error: {e}")
print(f" Waiting 60 seconds before retry...")
time.sleep(60)
# Retry the same batch
response = co.embed(
texts=batch,
model='embed-v4.0',
input_type='search_document',
embedding_types=['float']
)
all_embeddings.extend(response.embeddings.float_)
if batch_idx < num_batches - 1:
time.sleep(wait_time)
print(f"✓ Generated {len(all_embeddings)} embeddings")
# Convert to numpy array for storage
embeddings_array = np.array(all_embeddings)
print(f"Embeddings shape: {embeddings_array.shape}")
Generating embeddings for fixed token chunks...
Total chunks: 914
Batch size: 15
Processing batch 1/61 (chunks 0 to 15)...
Processing batch 2/61 (chunks 15 to 30)...
...
Processing batch 35/61 (chunks 510 to 525)...
⚠ Hit rate limit or error: Rate limit exceeded
Waiting 60 seconds before retry...
Processing batch 36/61 (chunks 525 to 540)...
...
✓ Generated 914 embeddings
Embeddings shape: (914, 1536)
Important note about rate limiting: We hit Cohere's rate limit during embedding generation. This isn't a failure or something to hide. It's a production reality. Our code handled it with a 60-second wait and retry. Good production code always anticipates and handles rate limits gracefully.
Exact limits depend on your plan and may change over time, so always check the provider docs and be ready to handle 429 "rate limit" errors.
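If you want something more reusable than the inline try/except above, a small wrapper with exponential backoff is a common pattern. This is only a sketch; the function name, retry count, and wait times are our own choices, not values from Cohere's documentation:
def embed_with_retry(texts, input_type='search_document', max_retries=3, base_wait=30):
    """Call co.embed and back off exponentially on failures such as rate limits (sketch)."""
    for attempt in range(max_retries):
        try:
            response = co.embed(
                texts=texts,
                model='embed-v4.0',
                input_type=input_type,
                embedding_types=['float']
            )
            return response.embeddings.float_
        except Exception as e:
            wait = base_wait * (2 ** attempt)  # 30s, 60s, 120s, ...
            print(f"  ⚠ Embed call failed ({e}), waiting {wait}s before retrying...")
            time.sleep(wait)
    raise RuntimeError(f"Embedding still failing after {max_retries} retries")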
Storing in ChromaDB
Now let's store our chunks in ChromaDB. Remember that ChromaDB won't let you create a collection that already exists. During development, you'll often regenerate chunks with different parameters, so we'll delete any existing collection first:
import chromadb
# Initialize ChromaDB client
client = chromadb.Client() # In-memory client
# This in-memory client resets whenever you start a fresh Python session.
# Your collections and data will disappear when the script ends. Later tutorials
# will show you how to persist data across sessions using PersistentClient.
# Delete collection if it exists (useful for experimentation)
try:
client.delete_collection(name="fixed_token_chunks")
print("Deleted existing collection")
except:
pass # Collection didn't exist, that's fine
# Create fresh collection
collection = client.create_collection(
name="fixed_token_chunks",
metadata={
"description": "20 arXiv papers chunked with fixed token windows",
"chunking_strategy": "fixed_token_windows",
"chunk_size": 512,
"overlap": 100
}
)
# Prepare data for insertion
ids = [f"chunk_{i}" for i in range(len(all_chunks))]
print(f"Inserting {len(all_chunks)} chunks into ChromaDB...")
collection.add(
ids=ids,
embeddings=embeddings_array.tolist(),
documents=all_chunks,
metadatas=chunk_metadata
)
print(f"✓ Collection contains {collection.count()} chunks")
Deleted existing collection
Inserting 914 chunks into ChromaDB...
✓ Collection contains 914 chunks
Why delete and recreate? During development, you'll iterate on chunking strategies. Maybe you'll try different chunk sizes or overlap values. ChromaDB requires unique collection names, so the cleanest pattern is to delete the old version before creating the new one. This is standard practice while experimenting.
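Since we'll do the same delete-and-recreate dance again for the sentence-based collection, you could wrap the pattern in a small helper. This is purely a convenience sketch; the function name is ours:
def reset_collection(client, name, metadata=None):
    """Delete the collection if it exists, then create it fresh (convenience sketch)."""
    try:
        client.delete_collection(name=name)
        print(f"Deleted existing collection '{name}'")
    except Exception:
        pass  # Collection didn't exist yet
    return client.create_collection(name=name, metadata=metadata)
We'll keep writing the steps out explicitly below so each one stays visible, but the helper is handy once you start iterating on chunk sizes and overlaps.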
Our fixed token strategy is now complete: 914 chunks embedded and stored in ChromaDB.
Strategy 2: Sentence-Based Chunking
Let's implement our second approach: chunking based on sentence boundaries rather than arbitrary token positions.
The Concept
Instead of sliding a fixed window through tokens, sentence-based chunking respects the natural structure of language:
- Split text into sentences
- Group sentences together until reaching a target word count
- Never split a sentence in the middle
- Create a new chunk when adding the next sentence would exceed the target
This approach prioritizes semantic coherence over size consistency. A chunk might be 400 or 600 words, but it will always contain complete sentences that form a coherent thought.
Why sentence boundaries matter: Splitting mid-sentence destroys meaning. The sentence "Neural networks require careful tuning of hyperparameters to achieve optimal performance" loses critical context if split after "hyperparameters." Sentence-based chunking prevents this.
Implementation
We'll use NLTK for sentence tokenization:
import nltk
# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
from nltk.tokenize import sent_tokenize
A quick note: Sentence tokenization on PDF-extracted text isn't always perfect, especially for technical papers with equations, citations, or unusual formatting. It works well enough for this tutorial, but if you experiment with your own papers, you might see occasional issues with sentences getting split or merged incorrectly.
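To get a feel for what the tokenizer does, try it on a small snippet first. The example text below is made up, and whether abbreviations like "et al." trigger an extra split depends on the Punkt model's abbreviation list, so don't be surprised by either outcome:
example = ("We evaluate the method on two benchmarks (Smith et al. 2024). "
           "Accuracy improves in both settings. See Section 4.2 for details.")
print(sent_tokenize(example))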
def chunk_text_by_sentences(text, target_words=400, min_words=100):
"""
Chunk text by grouping sentences until reaching target word count.
Args:
text: The document text to chunk
target_words: Target words per chunk (default 400)
min_words: Minimum words for a valid chunk (default 100)
Returns:
List of text chunks
"""
# Split text into sentences
sentences = sent_tokenize(text)
chunks = []
current_chunk = []
current_word_count = 0
for sentence in sentences:
sentence_words = len(sentence.split())
# If adding this sentence would exceed target, save current chunk
if current_word_count > 0 and current_word_count + sentence_words > target_words:
chunks.append(' '.join(current_chunk))
current_chunk = [sentence]
current_word_count = sentence_words
else:
current_chunk.append(sentence)
current_word_count += sentence_words
# Don't forget the last chunk
if current_chunk and current_word_count >= min_words:
chunks.append(' '.join(current_chunk))
return chunks
# Test on the same sample paper
sample_chunks_sent = chunk_text_by_sentences(sample_text, target_words=400)
print(f"Sample paper chunks (sentence-based): {len(sample_chunks_sent)}")
print(f"First chunk length: {len(sample_chunks_sent[0].split())} words")
print(f"First chunk preview: {sample_chunks_sent[0][:200]}...")
Sample paper chunks (sentence-based): 29
First chunk length: 392 words
First chunk preview: 1
Efficient Hyperdimensional Computing with
Modular Composite Representations
Marco Angioli, Christopher J. Kymn, Antonello Rosato, Amy L outfi, Mauro Olivieri, and Denis Kleyko
Abstract —The modular co...
The same paper that produced 51 fixed-token chunks now produces 29 sentence-based chunks. The first chunk is 392 words, close to our 400-word target but not exact.
Processing All Papers
Let's apply sentence-based chunking to all 20 papers:
# Process all papers with sentence-based chunking
all_chunks_sent = []
chunk_metadata_sent = []
for idx, row in df.iterrows():
arxiv_id = row['arxiv_id']
# Load paper text
with open(papers_dir / f"{arxiv_id}.txt", 'r', encoding='utf-8') as f:
text = f.read()
# Chunk by sentences
chunks = chunk_text_by_sentences(text, target_words=400, min_words=100)
# Store each chunk with metadata
for chunk_idx, chunk in enumerate(chunks):
all_chunks_sent.append(chunk)
chunk_metadata_sent.append({
'arxiv_id': arxiv_id,
'title': row['title'],
'category': row['category'],
'chunk_index': chunk_idx,
'total_chunks': len(chunks),
'chunking_strategy': 'sentence_based'
})
print(f"Sentence-based chunking results:")
print(f" Total chunks created: {len(all_chunks_sent)}")
print(f" Average chunks per paper: {len(all_chunks_sent) / len(df):.1f}")
print(f" Average words per chunk: {sum(len(c.split()) for c in all_chunks_sent) / len(all_chunks_sent):.0f}")
# Check chunk size distribution
chunk_word_counts_sent = [len(chunk.split()) for chunk in all_chunks_sent]
print(f" Chunk size range: {min(chunk_word_counts_sent)} to {max(chunk_word_counts_sent)} words")
Sentence-based chunking results:
Total chunks created: 513
Average chunks per paper: 25.6
Average words per chunk: 382
Chunk size range: 110 to 548 words
Sentence-based chunking produced 513 chunks compared to fixed token's 914. That's about 44% fewer chunks. Each chunk averages 382 words instead of 266. This isn't better or worse; it's a different tradeoff:
Fixed Token (914 chunks):
- More chunks, smaller sizes
- Consistent token counts
- More embeddings to generate and store
- Finer-grained retrieval granularity
Sentence-Based (513 chunks):
- Fewer chunks, larger sizes
- Variable sizes respecting sentences
- Less storage and fewer embeddings
- Preserves semantic coherence
Comparing Strategies Side-by-Side
Let's create a comparison table:
import pandas as pd
comparison_df = pd.DataFrame({
'Metric': ['Total Chunks', 'Chunks per Paper', 'Avg Words per Chunk',
'Min Words', 'Max Words'],
'Fixed Token': [914, 45.7, 266, 16, 438],
'Sentence-Based': [513, 25.6, 382, 110, 548]
})
print(comparison_df.to_string(index=False))
Metric Fixed Token Sentence-Based
Total Chunks 914 513
Chunks per Paper 45.7 25.6
Avg Words per Chunk 266 382
Min Words 16 110
Max Words 438 548
Sentence-based chunking creates 44% fewer chunks. This means:
- Lower costs: 44% fewer embeddings to generate
- Less storage: 44% less data to store and query
- Larger context: Each chunk contains more complete thoughts
- Better coherence: Never splits mid-sentence
But remember, this isn't automatically "better." Smaller chunks can enable more precise retrieval. The choice depends on your use case.
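To put a rough number on the cost side of that tradeoff, compare how many tokens each strategy actually sends to the embedding API. This sketch uses tiktoken's cl100k_base encoding as an approximation; your provider's tokenizer and pricing will differ, so treat the comparison as relative rather than exact:
encoding = tiktoken.get_encoding("cl100k_base")

def approx_total_tokens(chunks):
    """Approximate total tokens across all chunks (cl100k_base is only a proxy)."""
    return sum(len(encoding.encode(chunk)) for chunk in chunks)

fixed_total = approx_total_tokens(all_chunks)
sentence_total = approx_total_tokens(all_chunks_sent)
print(f"Fixed token strategy:    ~{fixed_total:,} tokens to embed")
print(f"Sentence-based strategy: ~{sentence_total:,} tokens to embed")
print(f"Fixed/sentence ratio:    {fixed_total / sentence_total:.2f}x")
Because of the 100-token overlap, the fixed strategy embeds some text twice, so the gap is larger than the chunk counts alone suggest.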
Generating Embeddings for Sentence-Based Chunks
We'll use the same embedding process as before, with the same rate limiting pattern:
print("Generating embeddings for sentence-based chunks...")
print(f"Total chunks: {len(all_chunks_sent)}")
all_embeddings_sent = []
num_batches = (len(all_chunks_sent) + batch_size - 1) // batch_size
for batch_idx in range(num_batches):
start_idx = batch_idx * batch_size
end_idx = min(start_idx + batch_size, len(all_chunks_sent))
batch = all_chunks_sent[start_idx:end_idx]
print(f" Processing batch {batch_idx + 1}/{num_batches} (chunks {start_idx} to {end_idx})...")
try:
response = co.embed(
texts=batch,
model='embed-v4.0',
input_type='search_document',
embedding_types=['float']
)
all_embeddings_sent.extend(response.embeddings.float_)
if batch_idx < num_batches - 1:
time.sleep(wait_time)
except Exception as e:
print(f" ⚠ Hit rate limit: {e}")
print(f" Waiting 60 seconds...")
time.sleep(60)
response = co.embed(
texts=batch,
model='embed-v4.0',
input_type='search_document',
embedding_types=['float']
)
all_embeddings_sent.extend(response.embeddings.float_)
if batch_idx < num_batches - 1:
time.sleep(wait_time)
print(f"✓ Generated {len(all_embeddings_sent)} embeddings")
embeddings_array_sent = np.array(all_embeddings_sent)
print(f"Embeddings shape: {embeddings_array_sent.shape}")
Generating embeddings for sentence-based chunks...
Total chunks: 513
Processing batch 1/35 (chunks 0 to 15)...
...
✓ Generated 513 embeddings
Embeddings shape: (513, 1536)
With 513 chunks instead of 914, embedding generation is faster and costs less. This is a concrete benefit of the sentence-based approach.
Storing Sentence-Based Chunks in ChromaDB
We'll create a separate collection for sentence-based chunks:
# Delete existing collection if present
try:
client.delete_collection(name="sentence_chunks")
except:
pass
# Create sentence-based collection
collection_sent = client.create_collection(
name="sentence_chunks",
metadata={
"description": "20 arXiv papers chunked by sentences",
"chunking_strategy": "sentence_based",
"target_words": 400,
"min_words": 100
}
)
# Prepare and insert data
ids_sent = [f"chunk_{i}" for i in range(len(all_chunks_sent))]
print(f"Inserting {len(all_chunks_sent)} chunks into ChromaDB...")
collection_sent.add(
ids=ids_sent,
embeddings=embeddings_array_sent.tolist(),
documents=all_chunks_sent,
metadatas=chunk_metadata_sent
)
print(f"✓ Collection contains {collection_sent.count()} chunks")
Inserting 513 chunks into ChromaDB...
✓ Collection contains 513 chunks
Now we have two collections:
- fixed_token_chunks with 914 chunks
- sentence_chunks with 513 chunks
Both contain the same 20 papers, just chunked differently. Now comes the critical question: which strategy actually retrieves relevant content better?
Building an Evaluation Framework
We've created two chunking strategies and embedded all the chunks. But how do we know which one works better? We need a systematic way to measure retrieval quality.
The Evaluation Approach
Our evaluation framework works like this:
- Create test queries for specific papers we know should be retrieved
- Run each query against both chunking strategies
- Check if the expected paper appears in the top results
- Compare performance across strategies
The key is having ground truth: knowing which papers should match which queries.
Creating Good Test Queries
Here's something we learned the hard way during development: bad queries make any chunking strategy look bad.
When we first built this evaluation, we tried queries like "reinforcement learning optimization" for a paper that was actually about masked diffusion models. Both chunking strategies "failed" because we gave them an impossible task. The problem wasn't the chunking; it was our poor understanding of the documents.
The fix: Before creating queries, read the paper abstracts. Understand what each paper actually discusses. Then create queries that match real content.
Let's create five test queries based on actual paper content:
# Test queries designed from actual paper content
test_queries = [
{
"text": "knowledge editing in language models",
"expected_paper": "2510.25798v1", # MemEIC paper (cs.CL)
"description": "Knowledge editing"
},
{
"text": "masked diffusion models for inference optimization",
"expected_paper": "2511.04647v2", # Masked diffusion (cs.LG)
"description": "Optimal inference schedules"
},
{
"text": "robotic manipulation with spatial representations",
"expected_paper": "2511.09555v1", # SpatialActor (cs.CV)
"description": "Robot manipulation"
},
{
"text": "blockchain ledger technology for database integrity",
"expected_paper": "2507.13932v1", # Chain Table (cs.DB)
"description": "Blockchain database integrity"
},
{
"text": "automated test generation and oracle synthesis",
"expected_paper": "2510.26423v1", # Nexus (cs.SE)
"description": "Multi-agent test oracles"
}
]
print("Test queries:")
for i, query in enumerate(test_queries, 1):
print(f"{i}. {query['text']}")
print(f" Expected paper: {query['expected_paper']}")
print()
Test queries:
1. knowledge editing in language models
Expected paper: 2510.25798v1
2. masked diffusion models for inference optimization
Expected paper: 2511.04647v2
3. robotic manipulation with spatial representations
Expected paper: 2511.09555v1
4. blockchain ledger technology for database integrity
Expected paper: 2507.13932v1
5. automated test generation and oracle synthesis
Expected paper: 2510.26423v1
These queries are specific enough to target particular papers but general enough to represent realistic search behavior. Each query matches actual content from its expected paper.
Implementing the Evaluation Loop
Now let's run these queries against both chunking strategies and compare results:
def evaluate_chunking_strategy(collection, test_queries, strategy_name):
"""
Evaluate a chunking strategy using test queries.
Returns:
Dictionary with success rate and detailed results
"""
results = []
for query_info in test_queries:
query_text = query_info['text']
expected_paper = query_info['expected_paper']
# Embed the query
response = co.embed(
texts=[query_text],
model='embed-v4.0',
input_type='search_query',
embedding_types=['float']
)
query_embedding = np.array(response.embeddings.float_[0])
# Search the collection
search_results = collection.query(
query_embeddings=[query_embedding.tolist()],
n_results=5
)
# Extract paper IDs from chunks
retrieved_papers = []
for metadata in search_results['metadatas'][0]:
paper_id = metadata['arxiv_id']
if paper_id not in retrieved_papers:
retrieved_papers.append(paper_id)
# Check if expected paper was found
found = expected_paper in retrieved_papers
position = retrieved_papers.index(expected_paper) + 1 if found else None
best_distance = search_results['distances'][0][0]
results.append({
'query': query_text,
'expected_paper': expected_paper,
'found': found,
'position': position,
'best_distance': best_distance,
'retrieved_papers': retrieved_papers[:3] # Top 3 for comparison
})
# Calculate success rate
success_rate = sum(1 for r in results if r['found']) / len(results)
return {
'strategy': strategy_name,
'success_rate': success_rate,
'results': results
}
# Evaluate both strategies
print("Evaluating fixed token strategy...")
fixed_token_eval = evaluate_chunking_strategy(
collection,
test_queries,
"Fixed Token Windows"
)
print("Evaluating sentence-based strategy...")
sentence_eval = evaluate_chunking_strategy(
collection_sent,
test_queries,
"Sentence-Based"
)
print("\n" + "="*80)
print("EVALUATION RESULTS")
print("="*80)
Evaluating fixed token strategy...
Evaluating sentence-based strategy...
================================================================================
EVALUATION RESULTS
================================================================================
Comparing Results
Let's examine how each strategy performed:
def print_evaluation_results(eval_results):
"""Print evaluation results in a readable format"""
strategy = eval_results['strategy']
success_rate = eval_results['success_rate']
results = eval_results['results']
print(f"\n{strategy}")
print("-" * 80)
print(f"Success Rate: {len([r for r in results if r['found']])}/{len(results)} queries ({success_rate*100:.0f}%)")
print()
for i, result in enumerate(results, 1):
status = "✓" if result['found'] else "✗"
position = f"(position #{result['position']})" if result['found'] else ""
print(f"{i}. {status} {result['query']}")
print(f" Expected: {result['expected_paper']}")
print(f" Found: {result['found']} {position}")
print(f" Best match distance: {result['best_distance']:.4f}")
print(f" Top 3 papers: {', '.join(result['retrieved_papers'][:3])}")
print()
# Print results for both strategies
print_evaluation_results(fixed_token_eval)
print_evaluation_results(sentence_eval)
# Compare directly
print("\n" + "="*80)
print("DIRECT COMPARISON")
print("="*80)
print(f"{'Query':<60} {'Fixed':<10} {'Sentence':<10}")
print("-" * 80)
for i in range(len(test_queries)):
query = test_queries[i]['text'][:55]
fixed_pos = fixed_token_eval['results'][i]['position']
sent_pos = sentence_eval['results'][i]['position']
fixed_str = f"#{fixed_pos}" if fixed_pos else "Not found"
sent_str = f"#{sent_pos}" if sent_pos else "Not found"
print(f"{query:<60} {fixed_str:<10} {sent_str:<10}")
Fixed Token Windows
--------------------------------------------------------------------------------
Success Rate: 5/5 queries (100%)
1. ✓ knowledge editing in language models
Expected: 2510.25798v1
Found: True (position #1)
Best match distance: 0.8865
Top 3 papers: 2510.25798v1
2. ✓ masked diffusion models for inference optimization
Expected: 2511.04647v2
Found: True (position #1)
Best match distance: 0.9526
Top 3 papers: 2511.04647v2
3. ✓ robotic manipulation with spatial representations
Expected: 2511.09555v1
Found: True (position #1)
Best match distance: 0.9209
Top 3 papers: 2511.09555v1
4. ✓ blockchain ledger technology for database integrity
Expected: 2507.13932v1
Found: True (position #1)
Best match distance: 0.6678
Top 3 papers: 2507.13932v1
5. ✓ automated test generation and oracle synthesis
Expected: 2510.26423v1
Found: True (position #1)
Best match distance: 0.9395
Top 3 papers: 2510.26423v1
Sentence-Based
--------------------------------------------------------------------------------
Success Rate: 5/5 queries (100%)
1. ✓ knowledge editing in language models
Expected: 2510.25798v1
Found: True (position #1)
Best match distance: 0.8831
Top 3 papers: 2510.25798v1
2. ✓ masked diffusion models for inference optimization
Expected: 2511.04647v2
Found: True (position #1)
Best match distance: 0.9586
Top 3 papers: 2511.04647v2, 2511.07930v1
3. ✓ robotic manipulation with spatial representations
Expected: 2511.09555v1
Found: True (position #1)
Best match distance: 0.8863
Top 3 papers: 2511.09555v1
4. ✓ blockchain ledger technology for database integrity
Expected: 2507.13932v1
Found: True (position #1)
Best match distance: 0.6746
Top 3 papers: 2507.13932v1
5. ✓ automated test generation and oracle synthesis
Expected: 2510.26423v1
Found: True (position #1)
Best match distance: 0.9320
Top 3 papers: 2510.26423v1
================================================================================
DIRECT COMPARISON
================================================================================
Query Fixed Sentence
--------------------------------------------------------------------------------
knowledge editing in language models #1 #1
masked diffusion models for inference optimization #1 #1
robotic manipulation with spatial representations #1 #1
blockchain ledger technology for database integrity #1 #1
automated test generation and oracle synthesis #1 #1
Understanding the Results
Let's break down what these results tell us:
Key Finding 1: Both Strategies Work Well
Both chunking strategies achieved 100% success rate. Every test query successfully retrieved its expected paper at position #1. This is the most important result.
When you have good queries that match actual document content, chunking strategy matters less than you might think. Both approaches work because they both preserve the semantic meaning of the content, just in slightly different ways.
Key Finding 2: Sentence-Based Has Better Distances
Look at the distance values. ChromaDB uses squared Euclidean distance by default, where lower values indicate higher similarity:
Query 1 (knowledge editing):
- Fixed token: 0.8865
- Sentence-based: 0.8831 (better)
Query 3 (robotic manipulation):
- Fixed token: 0.9209
- Sentence-based: 0.8863 (better)
Query 5 (automated test generation):
- Fixed token: 0.9395
- Sentence-based: 0.9320 (better)
In 3 out of 5 queries, sentence-based chunking produced lower distances, meaning higher similarity scores. This suggests that preserving sentence boundaries helps maintain semantic coherence, resulting in embeddings that better capture document meaning.
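If you want to verify what those numbers mean, you can recompute a distance by hand. This is a minimal sketch assuming the collections use ChromaDB's default "l2" space, which reports squared Euclidean distance; we re-embed one query, pull back the top chunk's stored embedding, and compare:
# Re-embed one of the test queries
response = co.embed(
    texts=["knowledge editing in language models"],
    model='embed-v4.0',
    input_type='search_query',
    embedding_types=['float']
)
query_vec = np.array(response.embeddings.float_[0])

# Ask the sentence-based collection for its best chunk, including the stored embedding
top = collection_sent.query(
    query_embeddings=[query_vec.tolist()],
    n_results=1,
    include=['embeddings', 'distances']
)
chunk_vec = np.array(top['embeddings'][0][0])

# Squared Euclidean distance: lower means more similar
squared_l2 = float(np.sum((query_vec - chunk_vec) ** 2))
print(f"Distance reported by ChromaDB:  {top['distances'][0][0]:.4f}")
print(f"Distance recomputed with NumPy: {squared_l2:.4f}")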
Key Finding 3: Low Agreement in Secondary Results
While both strategies found the right paper at #1, look at the papers in positions #2 and #3. They often differ between strategies:
Query 1: Both found the same top 3 papers
Query 2: Top paper matches, but #2 and #3 differ
Query 5: Only the top paper matches; #2 and #3 are completely different
This happens because chunk size affects which papers surface as similar. Neither is "wrong"; they just have different perspectives on what else might be relevant. The important thing is they both got the most relevant paper right.
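You can put a rough number on that agreement using the evaluation results we already collected. Here's a small sketch; the overlap count is an ad hoc measure, not a standard IR metric:
print("Top-3 paper overlap between strategies:")
for i, query_info in enumerate(test_queries, 1):
    fixed_top = set(fixed_token_eval['results'][i - 1]['retrieved_papers'])
    sent_top = set(sentence_eval['results'][i - 1]['retrieved_papers'])
    shared = fixed_top & sent_top
    print(f"{i}. {query_info['text'][:50]}")
    print(f"   Papers in both top-3 lists: {len(shared)} of "
          f"{len(fixed_top | sent_top)} unique papers retrieved")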
What This Means for Your Projects
So which chunking strategy should you choose? The answer is: it depends on your constraints and priorities.
Choose Fixed Token Windows when:
- You need consistent chunk sizes for batch processing or downstream tasks
- Storage isn't a concern and you want finer-grained retrieval
- Your documents lack clear sentence structure (logs, code, transcripts)
- You're working with multilingual content where sentence detection is unreliable
Choose Sentence-Based Chunking when:
- You want to minimize storage costs (44% fewer chunks)
- Semantic coherence is more important than size consistency
- Your documents have clear sentence boundaries (articles, papers, documentation)
- You want better similarity scores (as our results suggest)
The honest truth: Both strategies work well. If you implement either one properly, you'll get good retrieval results. The choice is less about "which is better" and more about which tradeoffs align with your project constraints.
Beyond These Two Strategies
We've implemented two practical chunking strategies, but there's a third approach worth knowing about: structure-aware chunking.
The Concept
Instead of chunking based on arbitrary token boundaries or sentence groupings, structure-aware chunking respects the logical organization of documents:
- Academic papers have sections: Introduction, Methods, Results, Discussion
- Technical documentation has headers, code blocks, and lists
- Web pages have HTML structure: headings, paragraphs, articles
- Markdown files have explicit hierarchy markers
Structure-aware chunking says: "Don't just group words or sentences. Recognize that this is an Introduction section, and this is a Methods section, and keep them separate."
Simple Implementation Example
Here's what structure-aware chunking might look like for markdown documents:
def chunk_by_markdown_sections(text, min_words=100):
"""
Chunk text by markdown section headers.
Each section becomes one chunk; sections shorter than min_words are dropped.
"""
chunks = []
current_section = []
for line in text.split('\n'):
# Detect section headers (lines starting with #)
if line.startswith('#'):
# Save previous section if it exists
if current_section:
section_text = '\n'.join(current_section)
if len(section_text.split()) >= min_words:
chunks.append(section_text)
# Start new section
current_section = [line]
else:
current_section.append(line)
# Don't forget the last section
if current_section:
section_text = '\n'.join(current_section)
if len(section_text.split()) >= min_words:
chunks.append(section_text)
return chunks
This is pseudocode-level simplicity, but it illustrates the concept: identify structure markers, use them to define chunk boundaries.
When Structure-Aware Chunking Helps
Structure-aware chunking excels when:
- Document structure matches query patterns: If users search for "Methods," they probably want the Methods section, not a random 512-token window that happens to include some methods
- Context boundaries are important: Code with comments, FAQs with Q&A pairs, API documentation with endpoints
- Sections have distinct topics: A paper discussing both neural networks and patient privacy should keep those sections separate
Why We Didn't Implement It Fully
The evaluation framework we built works for any chunking strategy. You have all the tools needed to implement and test structure-aware chunking yourself:
- Write a chunking function that respects document structure
- Generate embeddings for your chunks
- Store them in ChromaDB
- Use our evaluation framework to compare against the strategies we built
The process is identical. The only difference is how you define chunk boundaries.
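For the extracted arXiv text, section headers aren't markdown, but many papers use numbered headings like "3 Methods" or "3.1 Experimental Setup". Below is a rough sketch of a detector for that case; the regex is an assumption about formatting and will miss or mis-fire on noisy extractions, which is exactly why you'd want to run the result through the evaluation framework:
import re

def chunk_by_numbered_sections(text, min_words=100):
    """Split plain-text papers on numbered headings (sketch; the regex is an assumption)."""
    # Matches lines like "3 Methods", "3. Methods", or "3.1 Experimental Setup"
    heading_pattern = re.compile(r'^\s*\d+(\.\d+)*\.?\s+[A-Z][A-Za-z ]{2,60}\s*$')
    chunks = []
    current_section = []
    for line in text.split('\n'):
        if heading_pattern.match(line) and current_section:
            section_text = '\n'.join(current_section)
            if len(section_text.split()) >= min_words:
                chunks.append(section_text)
            current_section = [line]
        else:
            current_section.append(line)
    # Don't forget the final section
    if current_section:
        section_text = '\n'.join(current_section)
        if len(section_text.split()) >= min_words:
            chunks.append(section_text)
    return chunks

# Try it on the sample paper and compare the chunk count with the other strategies
section_chunks = chunk_by_numbered_sections(sample_text)
print(f"Section-based chunks for the sample paper: {len(section_chunks)}")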
Hyperparameter Tuning Guidance
We made specific choices for our chunking parameters:
- Fixed token: 512 tokens with 100-token overlap (20%)
- Sentence-based: 400-word target with 100-word minimum
Are these optimal? Maybe, maybe not. They're reasonable defaults that worked well for academic papers. But your documents might benefit from different values.
When to Experiment with Different Parameters
Try smaller chunks (256 tokens or 200 words) when:
- Queries target specific facts rather than broad concepts
- Precision matters more than context
- Storage costs aren't a constraint
Try larger chunks (1024 tokens or 600 words) when:
- Context matters more than precision
- Your queries are conceptual rather than factual
- You want to reduce the total number of embeddings
Adjust overlap when:
- Concepts frequently span chunk boundaries (increase overlap to 30-40%)
- Storage costs are critical (reduce overlap to 10%)
- You notice important information getting split
How to Experiment Systematically
The evaluation framework we built makes experimentation straightforward:
- Modify chunking parameters
- Generate new chunks and embeddings
- Store in a new ChromaDB collection
- Run your test queries
- Compare results
Don't spend hours tuning parameters before you know if chunking helps at all. Start with reasonable defaults (like ours), measure performance, then tune if needed. Most projects never need aggressive parameter tuning.
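A cheap way to start is to sweep chunking parameters without calling the embedding API at all, just to see how chunk counts and sizes respond. This sketch reuses the functions from earlier; only the parameter grid is our own choice:
print("Fixed token sweep (chunk_size, overlap -> chunks, avg words):")
for chunk_size, overlap in [(256, 50), (512, 100), (1024, 200)]:
    chunks = chunk_text_fixed_tokens(sample_text, chunk_size=chunk_size, overlap=overlap)
    avg_words = sum(len(c.split()) for c in chunks) / len(chunks)
    print(f"  {chunk_size:>4} tokens / {overlap:>3} overlap -> {len(chunks):>3} chunks, "
          f"{avg_words:.0f} words avg")

print("\nSentence-based sweep (target_words -> chunks, avg words):")
for target in [200, 400, 600]:
    chunks = chunk_text_by_sentences(sample_text, target_words=target)
    avg_words = sum(len(c.split()) for c in chunks) / len(chunks)
    print(f"  {target:>3} words target -> {len(chunks):>3} chunks, {avg_words:.0f} words avg")
Once a configuration looks promising, embed it and run the test queries against it; there's no need to pay for embeddings across the whole grid.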
Practical Exercise
Now it's your turn to experiment. Here are some variations to try:
Option 1: Modify Fixed Token Strategy
Change the chunk size to 256 or 1024 tokens. How does this affect:
- Total number of chunks?
- Success rate on test queries?
- Average similarity distances?
# Try this
chunks_small = chunk_text_fixed_tokens(sample_text, chunk_size=256, overlap=50)
chunks_large = chunk_text_fixed_tokens(sample_text, chunk_size=1024, overlap=200)
Option 2: Modify Sentence-Based Strategy
Adjust the target word count to 200 or 600 words:
# Try this
chunks_small_sent = chunk_text_by_sentences(sample_text, target_words=200)
chunks_large_sent = chunk_text_by_sentences(sample_text, target_words=600)
Option 3: Implement Structure-Aware Chunking
If your papers have clear section markers, try implementing a structure-aware chunker. Use the evaluation framework to compare it against our two strategies.
Reflection Questions
As you experiment, consider:
- When would you choose fixed token over sentence-based chunking?
- How would you chunk code documentation? Chat logs? News articles?
- What chunk size makes sense for a chatbot knowledge base? For legal documents?
- How does overlap affect retrieval quality in your tests?
Summary and Next Steps
We've built and evaluated two complete chunking strategies for vector databases. Here's what we accomplished:
Core Skills Gained
Implementation:
- Fixed token window chunking with overlap (914 chunks from 20 papers)
- Sentence-based chunking respecting linguistic boundaries (513 chunks)
- Batch processing with rate limit handling
- ChromaDB collection management for experimentation
Evaluation:
- Systematic evaluation framework with ground truth queries
- Measuring success rate and ranking position
- Comparing strategies quantitatively using real performance data
- Understanding that query quality matters more than chunking strategy
Key Takeaways
- No Universal "Best" Chunking Strategy: Both strategies achieved 100% success when given good queries. The choice depends on your constraints (storage, semantic coherence, document structure) rather than one approach being objectively better.
- Query Quality Matters Most: Bad queries make any chunking strategy look bad. Before evaluating chunking, understand your documents and create queries that match actual content. This lesson applies to all retrieval systems, not just chunking.
- Sentence-Based Provides Better Distances: In 3 out of 5 test queries, sentence-based chunking had lower distances (higher similarity). Preserving sentence boundaries helps maintain semantic coherence in embeddings.
- Tradeoffs Are Real: Fixed token creates 1.8x more chunks than sentence-based (914 vs 513). This means more storage and more embeddings to generate (which gets expensive at scale). But you get finer retrieval granularity. Neither is wrong; they optimize for different things. Remember that with overlap, you're paying for every chunk: smaller chunks plus overlap means significantly higher API costs when embedding large document collections.
- Edge Cases Are Normal: That 16-word chunk from fixed token chunking? The 548-word chunk from sentence-based? These are real-world behaviors, not bugs. Production systems handle imperfect inputs gracefully.
Looking Ahead
We now know how to chunk documents and store them in ChromaDB. But what if we want to enhance our searches? What if we need to filter results by publication year? Search only computer vision papers? Combine semantic similarity with traditional keyword matching?
An upcoming tutorial will teach:
- Designing metadata schemas for effective filtering
- Combining vector similarity with metadata constraints
- Implementing hybrid search (BM25 + vector similarity)
- Understanding performance tradeoffs of different filtering approaches
- Making metadata work at scale
Before moving on, make sure you understand:
- How fixed token and sentence-based chunking differ
- When to choose each strategy based on project needs
- How to evaluate chunking systematically with test queries
- Why query quality matters more than chunking strategy
- How to handle rate limits and ChromaDB collection management
When you're comfortable with these chunking fundamentals, you're ready to enhance your vector search with metadata and hybrid approaches.
Appendix: Dataset Preparation Code
This appendix provides the complete code we used to prepare the dataset for this tutorial. You don't need to run this code to complete the tutorial, but it's here if you want to:
- Understand how we selected and downloaded papers from arXiv
- Extract text from your own PDF files
- Extend the dataset with different papers or categories
Downloading Papers from arXiv
We selected 20 papers (4 from each category) from the 5,000-paper dataset used in the previous tutorial. Here's how we downloaded the PDFs:
import urllib.request
import pandas as pd
from pathlib import Path
import time
def download_arxiv_pdf(arxiv_id, save_dir):
"""
Download a paper PDF from arXiv.
Args:
arxiv_id: The arXiv ID (e.g., '2510.25798v1')
save_dir: Directory to save the PDF
Returns:
Path to downloaded PDF or None if failed
"""
# Create save directory if it doesn't exist
save_dir = Path(save_dir)
save_dir.mkdir(exist_ok=True)
# Construct arXiv PDF URL
# arXiv URLs follow pattern: https://arxiv.org/pdf/{id}.pdf
pdf_url = f"https://arxiv.org/pdf/{arxiv_id}.pdf"
save_path = save_dir / f"{arxiv_id}.pdf"
try:
print(f"Downloading {arxiv_id}...")
urllib.request.urlretrieve(pdf_url, save_path)
print(f" ✓ Saved to {save_path}")
return save_path
except Exception as e:
print(f" ✗ Failed: {e}")
return None
# Example: Download papers from our metadata file
df = pd.read_csv('arxiv_20papers_metadata.csv')
pdf_dir = Path('arxiv_pdfs')
for arxiv_id in df['arxiv_id']:
download_arxiv_pdf(arxiv_id, pdf_dir)
time.sleep(1) # Be respectful to arXiv servers
Important: The code above respects arXiv's servers by adding a 1-second delay between downloads. For larger downloads, consider using arXiv's bulk data access or API.
Extracting Text from PDFs
Once we had the PDFs, we extracted text using PyPDF2:
import PyPDF2
from pathlib import Path
def extract_paper_text(pdf_path):
"""
Extract text from a PDF file.
Args:
pdf_path: Path to the PDF file
Returns:
Extracted text as a string
"""
try:
with open(pdf_path, 'rb') as file:
reader = PyPDF2.PdfReader(file)
# Extract text from all pages
text = ""
for page in reader.pages:
text += page.extract_text()
return text
except Exception as e:
print(f"Error extracting {pdf_path}: {e}")
return None
def extract_all_papers(pdf_dir, output_dir):
"""
Extract text from all PDFs in a directory.
Args:
pdf_dir: Directory containing PDF files
output_dir: Directory to save extracted text files
"""
pdf_dir = Path(pdf_dir)
output_dir = Path(output_dir)
output_dir.mkdir(exist_ok=True)
pdf_files = list(pdf_dir.glob('*.pdf'))
print(f"Found {len(pdf_files)} PDF files")
success_count = 0
for pdf_path in pdf_files:
print(f"Extracting {pdf_path.name}...")
text = extract_paper_text(pdf_path)
if text:
# Save as text file with same name
output_path = output_dir / f"{pdf_path.stem}.txt"
with open(output_path, 'w', encoding='utf-8') as f:
f.write(text)
word_count = len(text.split())
print(f" ✓ Extracted {word_count:,} words")
success_count += 1
else:
print(f" ✗ Failed to extract")
print(f"\nSuccessfully extracted {success_count}/{len(pdf_files)} papers")
# Example: Extract all papers
extract_all_papers('arxiv_pdfs', 'arxiv_fulltext_papers')
Paper Selection Process
We selected 20 papers from the 5,000-paper dataset used in the previous tutorial. The selection criteria were:
import pandas as pd
import numpy as np
# Load the original 5k dataset
df_5k = pd.read_csv('arxiv_papers_5k.csv')
# Select 4 papers from each category
categories = ['cs.CL', 'cs.CV', 'cs.DB', 'cs.LG', 'cs.SE']
selected_papers = []
np.random.seed(42) # For reproducibility
for category in categories:
# Get papers from this category
category_papers = df_5k[df_5k['category'] == category]
# Randomly sample 4 papers
# In practice, we also checked that abstracts were substantial
sampled = category_papers.sample(n=4, random_state=42)
selected_papers.append(sampled)
# Combine all selected papers
df_selected = pd.concat(selected_papers, ignore_index=True)
# Save to new metadata file
df_selected.to_csv('arxiv_20papers_metadata.csv', index=False)
print(f"Selected {len(df_selected)} papers:")
print(df_selected['category'].value_counts().sort_index())
Text Quality Considerations
PDF extraction isn't perfect. Common issues include:
Formatting artifacts:
- Extra spaces between words
- Line breaks in unexpected places
- Mathematical symbols rendered as Unicode
- Headers/footers appearing in body text
Handling these issues:
def clean_extracted_text(text):
"""
Basic cleaning for extracted PDF text.
"""
# Remove excessive whitespace
text = ' '.join(text.split())
# Remove common artifacts (customize based on your PDFs)
text = text.replace('\ufb01', 'fi') # Common ligature issue ('fi' ligature character)
text = text.replace('\u2019', "'") # Apostrophe encoding issue (curly right quote)
return text
# Apply cleaning when extracting
text = extract_paper_text(pdf_path)
if text:
text = clean_extracted_text(text)
# Now save cleaned text
We kept cleaning minimal for this tutorial to show realistic extraction results. In production, you might implement more aggressive cleaning depending on your PDF sources.
Why We Chose These 20 Papers
The 20 papers in this tutorial were selected to provide:
- Diversity across topics: 4 papers each from Machine Learning, Computer Vision, Computational Linguistics, Databases, and Software Engineering
- Variety in length: Papers range from 2,735 to 20,763 words
- Realistic content: Papers published in 2024-2025 with modern topics
- Successful extraction: All 20 papers extracted cleanly with readable text
This diversity ensures that chunking strategies are tested across different writing styles, document lengths, and technical domains rather than being optimized for a single type of content.
You now have all the code needed to prepare your own document chunking datasets. The same pattern works for any PDF collection: download, extract, clean, and chunk.
Key Reminders:
- Both chunking strategies work well (100% success rate) with proper queries
- Sentence-based requires 44% less storage (513 vs 914 chunks)
- Sentence-based shows slightly better similarity distances
- Fixed token provides more consistent sizes and finer granularity
- Query quality matters more than chunking strategy
- Rate limiting is normal production behavior, handle it gracefully
- ChromaDB collection deletion is standard during experimentation
- Edge cases (tiny chunks, variable sizes) are expected and usually fine
- Evaluation frameworks transfer to any chunking strategy
- Choose based on your constraints (storage, coherence, structure) not on "best practice"