Semantic Caching and Memory Patterns for Vector Databases
Over the past few tutorials, we've built a complete paper search system. We learned how to store embeddings efficiently in ChromaDB, chunk documents for better retrieval, filter results with metadata and hybrid search, and choose the right production database for different scenarios. The result is a system that retrieves the papers most relevant to a query based on embedding similarity.
This tutorial focuses on two optimization techniques that become valuable when you connect vector databases to language models. We'll add a simple LLM synthesis step to our paper search system because it provides an expensive operation worth optimizing. Then we'll build semantic caching to avoid redundant API calls when users ask similar questions, and we'll implement conversation memory so the system can handle follow-up queries that reference earlier exchanges.
To be clear, we're demonstrating caching and memory mechanics here using straightforward synthesis as our example. This is not a comprehensive treatment of retrieval-augmented generation (RAG) systems. Production RAG systems involve query expansion, reranking strategies, citation handling, evaluation frameworks, failure mode detection, and deployment patterns that are beyond the scope of this tutorial. Think of this as learning caching and memory techniques that happen to use synthesis as an example.
By the end of this tutorial, you'll understand how to use vector databases for semantic caching and conversation memory, two techniques that reduce costs and improve multi-turn interactions in LLM applications.
Prerequisites and Setup
This tutorial builds directly on the arXiv paper search system from previous tutorials. You'll need the same dataset and embeddings we've been working with, plus a few additional packages for interacting with the Cohere API.
Required packages:
We'll be using these package versions for this tutorial:
- chromadb==1.3.7 (vector database)
- cohere==5.20.1 (LLM API client)
- numpy==2.0.2 (array operations)
- pandas==2.2.2 (data handling)
- python-dotenv==1.2.1 (environment variable management)
Install them with pip if you haven't already:
pip install chromadb cohere numpy pandas python-dotenv
API Key Setup:
You'll need a Cohere API key for this tutorial. Sign up for a free account at cohere.com if you don't have one. The free tier provides plenty of API calls for this tutorial.
Create a .env file in your working directory and add your API key:
COHERE_API_KEY=your-api-key-here
This keeps your API key secure and separate from your code. We'll load it using python-dotenv:
from dotenv import load_dotenv
import os
load_dotenv()
cohere_api_key = os.getenv('COHERE_API_KEY')
Dataset:
We're using the same 5,000 arXiv papers from previous tutorials. Download both files if you haven't already:
- arxiv_papers_5k.csv (7.7 MB, metadata)
- embeddings_cohere_5k.npy (61.4 MB, pre-generated embeddings)
Place both files in your working directory before proceeding.
Part 1: Semantic Caching
Adding an Expensive Operation
To demonstrate semantic caching effectively, we need an LLM operation that's expensive enough to make caching worthwhile. We'll add a simple synthesis step to our paper search system where we retrieve papers and then ask an LLM to generate an answer based on those papers. This gives us realistic API costs and latency to optimize.
The flow becomes: retrieve papers from ChromaDB, send the papers plus the query to Cohere's LLM, and get back a synthesized response. Each synthesis call processes thousands of tokens and takes a couple of seconds. That's exactly the kind of expensive operation where caching provides measurable value.
Baseline Performance Without Caching
Let's build a simple version of this synthesis system and measure its costs. The code below loads our paper collection, processes a query through retrieval and synthesis, and times each step:
import os
import time
import chromadb
import cohere
import numpy as np
import pandas as pd
from dotenv import load_dotenv
# Load environment variables
load_dotenv()
# -----------------------------
# Global config and helpers
# -----------------------------
EMBED_MODEL = "embed-v4.0"
CHAT_MODEL = "command-a-03-2025"
CACHE_NAMESPACE = f"{CHAT_MODEL}|temp=default"
def embed_query(client: cohere.ClientV2, text: str) -> list[float]:
"""
Generate an embedding for a query using Cohere ClientV2.
Returns a list[float] suitable for ChromaDB.
"""
resp = client.embed(
model=EMBED_MODEL,
input_type="search_query",
texts=[text],
embedding_types=["float"],
)
return resp.embeddings.float[0]
def extract_text(resp) -> str:
"""
Extract generated text from Cohere chat responses.
Works across Cohere response shapes.
"""
return resp.text if hasattr(resp, "text") else resp.message.content[0].text
def reset_collection(client: chromadb.Client, name: str) -> None:
"""
Delete a ChromaDB collection if it exists so callers can recreate it cleanly.
Useful in notebooks to avoid duplicate ID errors on re-runs.
"""
try:
client.delete_collection(name)
except Exception:
pass
# -----------------------------
# Initialize clients
# -----------------------------
api_key = os.getenv("COHERE_API_KEY")
if not api_key:
raise ValueError(
"Missing COHERE_API_KEY. Add it to your .env file (COHERE_API_KEY=...) "
"or set it as an environment variable before running this notebook."
)
chroma_client = chromadb.Client()
co = cohere.ClientV2(api_key)
# -----------------------------
# Load dataset
# -----------------------------
# If you're continuing from the previous lesson using the Docker lab,
# these files are located in the `data/` directory inside the container.
# Adjust the paths below if needed (for example, 'data/arxiv_papers_5k.csv').
papers_df = pd.read_csv("arxiv_papers_5k.csv")
embeddings = np.load("embeddings_cohere_5k.npy")
# -----------------------------
# Create ChromaDB collection and add papers
# -----------------------------
# Reset collection to avoid duplicate ID errors if re-running this cell
reset_collection(chroma_client, "arxiv_papers")
collection = chroma_client.get_or_create_collection(
name="arxiv_papers",
metadata={"hnsw:space": "cosine"},
)
# For this tutorial (5,000 papers), inserting everything in one call usually works,
# so batch_size matches the dataset size and the loop below runs once. Some environments
# impose request-size limits; if yours does, lower batch_size (e.g., 1000) to split the inserts.
batch_size = 5000
for i in range(0, len(papers_df), batch_size):
batch_end = min(i + batch_size, len(papers_df))
collection.add(
ids=[str(idx) for idx in range(i, batch_end)],
embeddings=embeddings[i:batch_end].tolist(),
documents=papers_df["abstract"].iloc[i:batch_end].tolist(),
metadatas=papers_df[["title", "category", "published"]]
.iloc[i:batch_end]
.to_dict("records"),
)
# -----------------------------
# Baseline performance test
# -----------------------------
query = "What are attention mechanisms in transformers?"
# Time the embedding step
start = time.time()
query_embedding = embed_query(co, query)
embedding_time = (time.time() - start) * 1000
# Time the retrieval step
start = time.time()
results = collection.query(
query_embeddings=[query_embedding],
n_results=5,
include=["documents", "metadatas", "distances"],
)
retrieval_time = (time.time() - start) * 1000
# Time the synthesis step
start = time.time()
papers_text = "\n\n".join([
f"Paper {i+1}: {results['metadatas'][0][i]['title']}\n{results['documents'][0][i][:500]}..."
for i in range(len(results["documents"][0]))
])
prompt = f"""Based on these research papers, answer the question: {query}
{papers_text}
Provide a clear, synthesized answer based on the papers above.
"""
resp = co.chat(
model=CHAT_MODEL,
messages=[{"role": "user", "content": prompt}],
)
_ = extract_text(resp)
synthesis_time = (time.time() - start) * 1000
total_time = embedding_time + retrieval_time + synthesis_time
print(f"Query: {query}")
print("\nTiming breakdown:")
print(f" Embedding: {embedding_time:.1f}ms")
print(f" Retrieval: {retrieval_time:.1f}ms")
print(f" LLM Synthesis: {synthesis_time:.1f}ms")
print(f" Total: {total_time:.1f}ms")
print(f"\nBottleneck: LLM synthesis is {synthesis_time/retrieval_time:.1f}x slower than retrieval")
Query: What are attention mechanisms in transformers?
Timing breakdown:
Embedding: 207.6ms
Retrieval: 4.8ms
LLM Synthesis: 5926.5ms
Total: 6139.0ms
Bottleneck: LLM synthesis is 1228.6x slower than retrieval
Note: chromadb.Client() creates an in-memory database by default. This is convenient for learning, but it means collections reset when your Python process restarts. In production or longer experiments, you would use a persistent client.
In this run, embedding took 208ms and retrieval was ~5ms, but the synthesis step took ~6 seconds, or roughly 1,200× slower than retrieval. That’s why caching can create large wins: it avoids repeating the most expensive part of the pipeline.
Your exact timings will vary depending on model latency, network speed, and machine performance. The key takeaway is that synthesis is orders of magnitude slower than vector search.
It's also an API cost issue. The Cohere API charges per token, and each synthesis call processes thousands of tokens between the input papers and the generated response. LLM costs scale with input and output tokens. A rough estimate is:
cost ≈ (input_tokens + output_tokens) × price_per_token
The exact cost depends on the model and response length.
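To make the formula concrete, here's a tiny back-of-the-envelope helper. The per-token price below is a made-up placeholder, not Cohere's actual pricing; check your provider's pricing page for real numbers.
def estimate_synthesis_cost(input_tokens: int, output_tokens: int,
                            price_per_token: float = 0.000002) -> float:
    """Rough per-call cost estimate: total tokens times a per-token price (placeholder)."""
    return (input_tokens + output_tokens) * price_per_token

# Example: ~3,000 input tokens (papers + prompt) and ~500 output tokens
print(f"Estimated cost per synthesis call: ${estimate_synthesis_cost(3000, 500):.4f}")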
Many queries are semantically similar even when worded differently. If someone asks "What are attention mechanisms in transformers?" and then someone else asks "How do attention mechanisms work in transformer models?", those are essentially the same question. We shouldn't pay to synthesize the same answer twice. Semantic caching solves this problem.
Two-Tier Cache Architecture
The solution is a two-tier cache that catches both exact matches and semantic similarities. Here's how it works:
Layer 1: Exact Match Cache
This is a simple Python dictionary that maps query strings to responses. If someone asks the exact same question twice, we return the cached response immediately. Dictionary lookups are typically well under 1 millisecond, which is essentially free compared to LLM synthesis.
Layer 2: Semantic Match Cache
This uses ChromaDB to find semantically similar queries. When a new query comes in, we embed it and search for similar cached queries. If we find a match above a certain similarity threshold, we return that cached response instead of calling the LLM again.
The key insight is that these two layers catch different patterns. The exact match cache is perfect for when users literally ask the same question. The semantic cache handles the much more common case where users rephrase questions naturally.
Distance vs similarity:
ChromaDB returns distance, not similarity. When using cosine distance, lower is better.
A common pattern is to convert it to an approximate similarity score using the formula below. Because cosine distance ranges from 0 to 2, this converted “similarity” can be negative. That simply means the queries are very dissimilar.
similarity ≈ 1 - distance
That’s what we’re doing here so the threshold (e.g., 0.90) is easier to reason about.
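As a quick sanity check of that conversion, here are a few made-up distances run through the same threshold logic the cache will use:
# Converting cosine distances to approximate similarities and applying a 0.90 threshold
for distance in (0.06, 0.45, 1.30):
    similarity = 1 - distance
    verdict = "cache hit" if similarity >= 0.90 else "cache miss"
    print(f"distance={distance:.2f} -> similarity={similarity:.2f} ({verdict})")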
Let's implement this cache system:
import hashlib
import time
class SemanticCache:
def __init__(self, chroma_client, cohere_client):
self.co = cohere_client
self.cache_namespace = CACHE_NAMESPACE
# Layer 1: Exact-match cache
self.exact_cache = {}
# Layer 2: Semantic cache
self.semantic_cache = chroma_client.get_or_create_collection(
name="query_cache",
metadata={"hnsw:space": "cosine"},
)
self.cache_count = 0
def _hash_query(self, query: str) -> str:
base = f"{self.cache_namespace}:{query.lower().strip()}"
return hashlib.md5(base.encode("utf-8")).hexdigest()
def get(self, query: str, similarity_threshold: float = 0.90):
# Layer 1: exact match
query_hash = self._hash_query(query)
if query_hash in self.exact_cache:
return self.exact_cache[query_hash], "exact"
# Layer 2: semantic match
query_embedding = embed_query(self.co, query)
results = self.semantic_cache.query(
query_embeddings=[query_embedding],
n_results=1,
include=["documents", "metadatas", "distances"],
)
if not results.get("documents") or not results["documents"][0]:
return None, None
distance = results["distances"][0][0]
similarity = 1 - distance
if similarity >= similarity_threshold:
cached_response = results["metadatas"][0][0]["response"]
return cached_response, "semantic"
return None, None
def put(self, query: str, response: str) -> None:
query_hash = self._hash_query(query)
self.exact_cache[query_hash] = response
query_embedding = embed_query(self.co, query)
self.semantic_cache.add(
ids=[f"cache_{self.cache_count}"],
embeddings=[query_embedding],
documents=[query],
metadatas=[{"response": response}],
)
self.cache_count += 1
def answer_query_with_cache(
query: str,
cache: SemanticCache,
collection,
similarity_threshold: float = 0.90,
):
start = time.time()
cached_response, cache_type = cache.get(query, similarity_threshold=similarity_threshold)
if cached_response is not None:
elapsed = (time.time() - start) * 1000
print(f"Cache hit ({cache_type}): {elapsed:.1f}ms")
return cached_response, cache_type
print("Cache miss - running full pipeline...")
query_embedding = embed_query(cache.co, query)
results = collection.query(
query_embeddings=[query_embedding],
n_results=5,
include=["documents", "metadatas", "distances"],
)
papers_text = "\n\n".join([
f"Paper {i+1}: {results['metadatas'][0][i]['title']}\n{results['documents'][0][i][:500]}..."
for i in range(len(results["documents"][0]))
])
prompt = f"""Based on these research papers, answer the question: {query}
{papers_text}
Provide a clear, synthesized answer based on the papers above.
"""
resp = cache.co.chat(
model=CHAT_MODEL,
messages=[{"role": "user", "content": prompt}],
)
answer = extract_text(resp)
cache.put(query, answer)
elapsed = (time.time() - start) * 1000
print(f"Full pipeline: {elapsed:.1f}ms")
return answer, "miss"
try:
chroma_client.delete_collection("query_cache")
except Exception:
pass
cache = SemanticCache(chroma_client, co)
query1 = "What are attention mechanisms in transformers?"
answer1, _ = answer_query_with_cache(query1, cache, collection)
print(f"\nAnswer: {answer1[:200]}...")
Cache miss - running full pipeline...
Full pipeline: 5741.9ms
Answer: Attention mechanisms in transformers are a core component that enable
the model to focus on relevant parts of the input sequence when making
predictions. They are responsible for capturing dependencie...
The first call took the full 5,742 milliseconds because nothing was cached yet. Now let's try asking the exact same question again.
# Ask the exact same question
answer2, _ = answer_query_with_cache(query1, cache, collection)
print(f"\nSame answer returned: {answer1 == answer2}")
Cache hit (exact): 0.1ms
Same answer returned: True
The second call took well under a millisecond instead of several seconds because it was just a Python dictionary lookup. That’s orders of magnitude faster than running embedding + retrieval + LLM synthesis. That said, semantic hits still require embedding the query and searching the cache, so they are usually 100 to 300 ms total. The big savings comes from avoiding the LLM call.
But exact matches aren't that interesting because users rarely ask identical questions. The real power comes from semantic matching. Let's try a rephrased version:
# Ask a semantically similar question
query2 = "How do attention mechanisms work in transformer models?"
answer3, _ = answer_query_with_cache(query2, cache, collection, similarity_threshold=0.90)
print(f"\nGot cached response: {answer1 == answer3}")
Cache hit (semantic): 185.9ms
Got cached response: True
This is the value of semantic caching. The user asked a different question with different words, but the meaning was similar enough that we returned the cached answer. The query took about 186 milliseconds instead of several seconds because we only needed to embed the query and search the cache, not call the LLM again. That is still much faster than the full pipeline, and it saves the cost of an API call.
The ~186 milliseconds breaks down into embedding time plus a fast ChromaDB search. Semantic cache lookups are not free, but they are far cheaper than running retrieval plus LLM synthesis.
Why include model settings in the cache key? If you change the LLM model (or generation settings), cached answers may no longer match what the system would generate today. Including the model (and optionally temperature or a prompt version) prevents confusing situations where you “upgrade the model” but keep getting old cached answers.
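For example, if you later pin a temperature and version your prompt template, you might widen the namespace like this (the temperature value and prompt version here are hypothetical):
# Hypothetical: fold generation settings and a prompt version into the cache namespace
TEMPERATURE = 0.3
PROMPT_VERSION = "v2"
CACHE_NAMESPACE = f"{CHAT_MODEL}|temp={TEMPERATURE}|prompt={PROMPT_VERSION}"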
Now we need to understand how to tune that similarity threshold properly. Set it too high and you'll miss legitimate cache hits. Set it too low and you'll return wrong answers to questions that just happen to have similar embeddings.
Understanding Similarity Thresholds
The threshold parameter controls how similar two queries need to be before we consider them the same for caching purposes. We set it to 0.90 in our example, but where does that number come from? Let's investigate with real similarity scores.
We'll take our base query about attention mechanisms and compare it against several other queries. Some are legitimate rephrasing that should hit the cache. Others are different questions that shouldn't.
from numpy import dot
from numpy.linalg import norm
# Base query that's already cached
base_query = "What are attention mechanisms in transformers?"
# Test queries with different intents
test_queries = [
# Natural rephrasing (SHOULD cache)
"How do attention mechanisms work in transformer models?",
"Explain attention in transformers",
"What is the purpose of attention mechanisms?",
"How does self-attention work?",
# Different intent (should NOT cache)
"How expensive are transformer models to train?",
"What are transformer limitations?",
"Why attention instead of RNNs?",
"What datasets train transformers?",
]
# Embed queries using the same helper function we used earlier
base_embedding = embed_query(co, base_query)
test_embeddings = [embed_query(co, q) for q in test_queries]
print("Similarity scores to base query:")
print(f"Base: {base_query}\n")
for i, (query, emb) in enumerate(zip(test_queries, test_embeddings)):
similarity = dot(base_embedding, emb) / (norm(base_embedding) * norm(emb))
intent = "SHOULD cache" if i < 4 else "should NOT cache"
print(f"{similarity:.4f} - {query} ({intent})")
Similarity scores to base query:
Base: What are attention mechanisms in transformers?
0.9446 - How do attention mechanisms work in transformer models? (SHOULD cache)
0.8410 - Explain attention in transformers (SHOULD cache)
0.7845 - What is the purpose of attention mechanisms? (SHOULD cache)
0.6004 - How does self-attention work? (SHOULD cache)
0.3594 - How expensive are transformer models to train? (should NOT cache)
0.3868 - What are transformer limitations? (should NOT cache)
0.5216 - Why attention instead of RNNs? (should NOT cache)
0.4559 - What datasets train transformers? (should NOT cache)
The pattern is clear. Natural rephrasing of the same question produces similarities between 0.84 and 0.94. Questions about different topics, even when they mention transformers, produce similarities between 0.36 and 0.52. There's a meaningful gap between legitimate rephrasing and different questions.
The one interesting case is "How does self-attention work?" at 0.6004. That's asking about a specific component rather than attention mechanisms broadly. It's borderline, which shows why threshold tuning matters. If you set your threshold at 0.85, you'd correctly reject this as different enough to warrant a new answer. At 0.50, you'd incorrectly return the cached general answer to this more specific question.
Choosing Your Threshold
Based on these similarity patterns, here are practical threshold recommendations:
0.95 (Conservative)
Use when wrong answers are costly. This catches only very close paraphrasing. You'll miss some legitimate cache hits, but you'll never return incorrect cached answers. Good for applications where accuracy matters more than cost savings.
0.90 (Balanced - Recommended)
This is the sweet spot for most applications. It catches natural rephrasing while avoiding false positives. In our testing, this threshold distinguished between rephrased questions (0.84 to 0.94) and different questions (0.36 to 0.52) with zero false positives.
0.85 (Aggressive)
Use when cost savings are critical. This catches more variations but approaches the danger zone where different questions might incorrectly hit the cache. Monitor your cache hits carefully at this threshold.
Similarity scores often form loose clusters, but they are not perfectly separated. In this run, close paraphrases landed above 0.84, while clearly different questions landed below 0.52. But some legitimate follow-ups, like questions about self-attention, scored much lower. That is why thresholds are a tradeoff, not a guaranteed boundary.
Realistic Cache Performance
Let's test the cache with a more realistic workload. We'll simulate 22 queries that mix exact repeats, natural variations, and different questions.
# Simulate a realistic query workload
realistic_queries = [
"What are attention mechanisms in transformers?",
"How do attention mechanisms work in transformer models?", # Variation
"What are attention mechanisms in transformers?", # Exact repeat
"Explain the transformer architecture",
"What is the transformer architecture?", # Variation
"How do transformers handle long sequences?",
"What are the limitations of transformer models?",
"Explain attention in transformers", # Variation of query 1
"How expensive are transformers to train?",
"What datasets are used for training transformers?",
"How do transformers compare to RNNs?",
"What is self-attention?",
"Explain the transformer architecture", # Exact repeat
"What are positional encodings in transformers?",
"How do attention mechanisms work in transformer models?", # Exact repeat
"What are the computational costs of transformers?",
"How do transformers handle variable length sequences?",
"What are the key innovations in transformers?",
"Explain attention in transformers", # Exact repeat
"What are attention mechanisms in transformers?", # Exact repeat
"How do transformers process text?",
"What makes transformers effective for NLP?",
]
# Reset cache collection + cache object for a clean workload test
try:
chroma_client.delete_collection("query_cache")
except Exception:
pass
cache = SemanticCache(chroma_client, co)
# Track metrics
total_queries = len(realistic_queries)
exact_hits = 0
semantic_hits = 0
cache_misses = 0
total_time = 0
print("Processing realistic workload...\n")
for query in realistic_queries:
start = time.time()
# Try cache
cached_response, cache_type = cache.get(query, similarity_threshold=0.90)
if cached_response is not None:
if cache_type == "exact":
exact_hits += 1
else:
semantic_hits += 1
elapsed = (time.time() - start) * 1000
total_time += elapsed
continue
# Cache miss: run full pipeline
cache_misses += 1
query_embedding = embed_query(co, query)
results = collection.query(
query_embeddings=[query_embedding],
n_results=5,
include=["documents", "metadatas", "distances"],
)
papers_text = "\n\n".join([
f"Paper {i+1}: {results['metadatas'][0][i]['title']}\n{results['documents'][0][i][:500]}..."
for i in range(len(results["documents"][0]))
])
prompt = f"""Based on these research papers, answer the question: {query}
{papers_text}
Provide a clear, synthesized answer based on the papers above.
"""
resp = co.chat(
model=CHAT_MODEL,
messages=[{"role": "user", "content": prompt}],
)
answer = extract_text(resp)
cache.put(query, answer)
elapsed = (time.time() - start) * 1000
total_time += elapsed
# Calculate metrics
hit_rate = ((exact_hits + semantic_hits) / total_queries) * 100
avg_time = total_time / total_queries
print(f"Workload Results (threshold=0.90):")
print(f" Total queries: {total_queries}")
print(f" Exact hits: {exact_hits} ({(exact_hits/total_queries)*100:.1f}%)")
print(f" Semantic hits: {semantic_hits} ({(semantic_hits/total_queries)*100:.1f}%)")
print(f" Cache misses: {cache_misses} ({(cache_misses/total_queries)*100:.1f}%)")
print(f" Overall hit rate: {hit_rate:.1f}%")
print(f"\n Average latency: {avg_time:.0f}ms per query")
# LLM calls avoided
llm_calls_avoided = total_queries - cache_misses
print(f"\n LLM calls avoided: {llm_calls_avoided} ({(llm_calls_avoided/total_queries)*100:.1f}%)")
Processing realistic workload...
Workload Results (threshold=0.90):
Total queries: 22
Exact hits: 7 (31.8%)
Semantic hits: 2 (9.1%)
Cache misses: 13 (59.1%)
Overall hit rate: 40.9%
Average latency: 1803ms per query
LLM calls avoided: 9 (40.9%)
These numbers tell an honest story about caching performance. With a realistic workload and a 0.90 threshold, we achieved a 40.9% cache hit rate. That means 9 out of 22 queries avoided a full LLM synthesis call. Most of those hits came from exact repeats, with a smaller contribution from semantic matching. That is typical in exploratory research workloads where users bounce between topics instead of repeatedly rephrasing the same question.
If you lowered the threshold to 0.85, you'd catch one or two additional semantic matches (bringing the hit rate to around 45%), but you'd risk false positives. The 0.90 threshold provides reliable savings without the danger of returning wrong answers.
Cache Guardrails: When NOT to Cache
Semantic caching works best when the “right” answer for a query stays stable over time and does not depend on who is asking. The problem is that embeddings cannot reliably detect certain kinds of hidden context, so two queries can look identical to the cache even though they require different answers.
Here are the two most common cases where caching can produce incorrect results:
- Time-sensitive queries
Queries that depend on “now” or “recent” change meaning over time. Even if the query text is identical, the correct answer can change from week to week or month to month as new papers are published.
- User-specific queries
Queries that depend on a user’s personal history, preferences, or saved items should not be cached globally. Two different users can submit the same query string and legitimately require different results.
Because of this, production systems typically implement guardrails that decide whether a query should be cached at all. The simplest approach is a lightweight rule-based filter that bypasses caching when queries contain time-sensitive language or user-specific language. More advanced systems use intent classification or query parsing, but the core principle is the same:
If the answer depends on time, user identity, or session context, do not reuse cached responses across requests.
This tutorial does not implement guardrails directly in code, but the approach is straightforward: add a should_cache(query) check before calling cache.get() or before writing with cache.put(). If should_cache returns False, skip the cache and run the full pipeline.
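A minimal rule-based sketch of that check might look like the following. The keyword lists are illustrative only; tune them for your own domain.
# Minimal guardrail sketch: skip caching for time-sensitive or user-specific queries.
TIME_SENSITIVE_TERMS = ("latest", "recent", "this year", "newest", "today")
USER_SPECIFIC_TERMS = ("my ", "i saved", "my library", "my reading list")

def should_cache(query: str) -> bool:
    """Return False when the correct answer likely depends on time or user context."""
    q = query.lower()
    if any(term in q for term in TIME_SENSITIVE_TERMS):
        return False
    if any(term in q for term in USER_SPECIFIC_TERMS):
        return False
    return True

# Usage: only consult or populate the cache for "safe" queries
# if should_cache(query):
#     cached_response, cache_type = cache.get(query)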
Time To Live (TTL) Strategies
Even “safe” queries can become stale over time. A cached answer that was correct yesterday might not be correct next month, especially when your underlying dataset changes or your model behavior evolves.
That is why production caching systems often use a time to live (TTL) policy. Each cached entry has an expiration window, and once it expires, the system recomputes the response and replaces the cached value.
A practical way to think about TTL is:
- Use longer TTLs for stable concepts and evergreen explanations.
- Use shorter TTLs for anything that references “recent” work, trending topics, or fast-moving domains.
- Use very short TTLs (or no caching) for queries that depend on live or user-specific data.
This tutorial does not implement automatic expiration or cache eviction, since that requires background cleanup and persistence strategies. For learning purposes, the key takeaway is to match your caching strategy to how quickly the correct answer can change.
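If you do want to experiment, lazy expiration is the simplest starting point: store a timestamp with each cached entry and treat expired entries as misses at read time. The sketch below is not wired into the SemanticCache class above; it only shows the shape of the check, and the one-week TTL is an arbitrary example.
import time

TTL_SECONDS = 7 * 24 * 3600  # example: expire cached answers after one week

# When writing, store the creation time alongside the response:
#   metadatas=[{"response": response, "cached_at": time.time()}]

def is_expired(cached_at: float, ttl_seconds: float = TTL_SECONDS) -> bool:
    """Treat entries older than the TTL as cache misses so they get recomputed."""
    return (time.time() - cached_at) > ttl_seconds

# When reading, check the timestamp before returning a cached response:
#   if is_expired(results["metadatas"][0][0]["cached_at"]):
#       return None, None  # fall through to the full pipeline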
Part 2: Conversation Memory
Semantic caching solves one optimization problem. Now let's tackle another challenge that emerges when systems handle multi-turn conversations. The issue is context.
The Multi-Turn Problem
Consider this natural research conversation:
Turn 1: "What are attention mechanisms?"
→ System retrieves papers about attention mechanisms
Turn 2: "How do they compare to RNNs?"
→ Without memory: "they" = ??? (system has no idea)
→ With memory: "they" = attention mechanisms from Turn 1
Turn 3: "Show me efficient implementations"
→ Without memory: Implementations of WHAT?
→ With memory: Efficient attention mechanisms
The follow-up questions only make sense with context from earlier turns. Without memory, "they" and "implementations" are ambiguous. The system would either fail to answer or retrieve irrelevant papers. With memory, these pronouns and implicit references become meaningful.
This is fundamentally different from caching. Caching avoids recomputing the same answer. Memory makes new questions answerable by providing context from past exchanges.
Does Memory Actually Help?
Let's test this with a real example. We'll start a conversation about attention mechanisms, then ask an ambiguous follow-up question and see what happens with and without memory.
# First turn: Establish context
turn1_query = "What are attention mechanisms in transformers?"
# Retrieve papers to establish context (using our shared embedding helper)
turn1_embedding = embed_query(co, turn1_query)
turn1_results = collection.query(
query_embeddings=[turn1_embedding],
n_results=5,
include=["metadatas"],
)
print("Turn 1 - Retrieved papers:")
for i, meta in enumerate(turn1_results["metadatas"][0]):
print(f"{i+1}. {meta['title'][:80]}...")
# Second turn: Ambiguous follow-up
turn2_query = "Show me efficient implementations"
# WITHOUT MEMORY: search using the ambiguous query alone
print("\n\nTurn 2 WITHOUT MEMORY:")
print(f"Query: {turn2_query}")
turn2_embedding_no_memory = embed_query(co, turn2_query)
results_no_memory = collection.query(
query_embeddings=[turn2_embedding_no_memory],
n_results=5,
include=["metadatas"],
)
print("Retrieved papers:")
for i, meta in enumerate(results_no_memory["metadatas"][0]):
print(f"{i+1}. {meta['title'][:80]}...")
Turn 1 - Retrieved papers:
1. $π$-Attention: Periodic Sparse Transformers for Efficient Long-Context Modeling...
2. How Particle-System Random Batch Methods Enhance Graph Transformer: Memory Effic...
3. Multistability of Self-Attention Dynamics in Transformers...
4. A Unified Geometric Field Theory Framework for Transformers: From Manifold Embed...
5. Fractional neural attention for efficient multiscale sequence processing...
Turn 2 WITHOUT MEMORY:
Query: Show me efficient implementations
Retrieved papers:
1. Indexing Strings with Utilities...
2. Attention and Compression is all you need for Controllably Efficient Language Mo...
3. MossNet: Mixture of State-Space Experts is a Multi-Head Attention...
4. Hidden Sketch: A Space-Efficient Reversible Sketch for Tracking Frequent Items i...
5. Inferring the Most Similar Variable-length Subsequences between Multidimensional...
Without memory, we still get some relevant results, but the retrieval is often inconsistent. You may see one or two papers that clearly match the intended meaning (for example, efficient attention methods), mixed with other papers that are only loosely related to the phrase “efficient implementations.” This happens because the query is too vague on its own. The system does not know what kind of implementation we mean until we provide context from the previous turn.
Now let's add memory:
# WITH MEMORY: Include context from Turn 1
print("\n\nTurn 2 WITH MEMORY:")
print(f"Query: {turn2_query}")
# Create context-aware query by combining Turn 1 and Turn 2
memory_enhanced_query = f"{turn1_query} {turn2_query}"
print(f"Context-aware query: {memory_enhanced_query}")
turn2_embedding_with_memory = embed_query(co, memory_enhanced_query)
results_with_memory = collection.query(
query_embeddings=[turn2_embedding_with_memory],
n_results=5,
include=["metadatas"],
)
print("\nRetrieved papers:")
for i, meta in enumerate(results_with_memory["metadatas"][0]):
print(f"{i+1}. {meta['title'][:80]}...")
# Compare how many papers changed (using titles since IDs may not always be returned)
without_titles = {m["title"] for m in results_no_memory["metadatas"][0]}
with_titles = {m["title"] for m in results_with_memory["metadatas"][0]}
papers_changed = len(without_titles - with_titles)
print(f"\nPapers changed: {papers_changed} out of 5 ({(papers_changed/5)*100:.0f}%)")
Turn 2 WITH MEMORY:
Query: Show me efficient implementations
Context-aware query: What are attention mechanisms in transformers? Show me efficient implementations
Retrieved papers:
1. Attention and Compression is all you need for Controllably Efficient Language Mo...
2. $π$-Attention: Periodic Sparse Transformers for Efficient Long-Context Modeling...
3. FlashEVA: Accelerating LLM inference via Efficient Attention...
4. How Particle-System Random Batch Methods Enhance Graph Transformer: Memory Effic...
5. Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-of...
Papers changed: 4 out of 5 (80%)
Note: Naively concatenating earlier turns can sometimes pull retrieval toward whatever you talked about previously. That is fine for learning, but production systems usually limit memory to recent turns or store summarized context instead.
With memory, 4 out of 5 papers changed. The results are now consistently focused on efficient attention mechanisms and transformer optimization work. The context from Turn 1 clarified what “efficient implementations” referred to, so the retrieval step became much more relevant and less noisy.
This is the value of conversation memory. The follow-up query “Show me efficient implementations” is ambiguous on its own. Once the system includes prior conversation context, that vague query becomes a clear search intent, and the retrieved papers reflect what the user actually meant.
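As noted above, naive concatenation can drag in too much history. A simple way to cap that during prototyping is a recency window over past queries; this sketch is separate from the memory class we build next, which retrieves relevant turns semantically instead.
# Sketch: build a context-aware query from only the most recent user turns
def build_contextual_query(history: list[str], current_query: str, max_turns: int = 2) -> str:
    """Concatenate the last few user queries with the current one."""
    recent = history[-max_turns:]
    return " ".join(recent + [current_query])

# Example
history = ["What are attention mechanisms in transformers?"]
print(build_contextual_query(history, "Show me efficient implementations"))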
Implementing Conversation Memory
Here's a simple conversation memory system using ChromaDB to store and retrieve relevant context:
class ConversationMemory:
def __init__(self, chroma_client, cohere_client):
self.co = cohere_client
self.turn_counter = 0
# Create separate collection for memory
self.memory_collection = chroma_client.get_or_create_collection(
name="conversation_memory",
metadata={"hnsw:space": "cosine"},
)
def add_turn(self, user_query: str, assistant_response: str) -> None:
"""Store a conversation turn in memory."""
# Combine query and response for context
# We use the first 200 chars of response to keep it manageable
turn_text = f"User asked: {user_query}\nSystem answered: {assistant_response[:200]}..."
# Embed and store (search_document is appropriate for stored content)
embedding = self.co.embed(
model=EMBED_MODEL,
input_type="search_document",
texts=[turn_text],
embedding_types=["float"],
).embeddings.float[0]
self.memory_collection.add(
ids=[f"turn_{self.turn_counter}"],
embeddings=[embedding],
documents=[turn_text],
metadatas=[{
"turn": self.turn_counter,
"user_query": user_query,
}],
)
self.turn_counter += 1
def get_relevant_context(self, current_query: str, n_results: int = 2) -> str:
"""Retrieve relevant past turns for the current query."""
if self.turn_counter == 0:
return "" # No history yet
# Embed current query using the shared query embedding helper
query_embedding = embed_query(self.co, current_query)
results = self.memory_collection.query(
query_embeddings=[query_embedding],
n_results=min(n_results, self.turn_counter),
include=["documents", "metadatas", "distances"],
)
if not results.get("documents") or not results["documents"][0]:
return ""
# Format context from past turns
context = "Previous conversation:\n"
for doc in results["documents"][0]:
context += f"{doc}\n\n"
return context
# Test the memory system
memory = ConversationMemory(chroma_client, co)
print("Starting conversation with memory...\n")
# Turn 1
query1 = "What are attention mechanisms in transformers?"
response1 = "Attention mechanisms allow transformers to weigh different parts of the input..."
memory.add_turn(query1, response1)
print(f"Turn 1: {query1}")
print(f"Response: {response1[:50]}...\n")
# Turn 2
query2 = "How do they compare to RNNs?"
context = memory.get_relevant_context(query2)
print(f"Turn 2: {query2}")
print(f"Retrieved context:\n{context}")
Starting conversation with memory...
Turn 1: What are attention mechanisms in transformers?
Response: Attention mechanisms allow transformers to weigh...
Turn 2: How do they compare to RNNs?
Retrieved context:
Previous conversation:
User asked: What are attention mechanisms in transformers?
System answered: Attention mechanisms allow transformers to weigh different parts of the input...
The memory system stores each turn and retrieves relevant context when needed. When the user asks "How do they compare to RNNs?", the system retrieves the previous turn about attention mechanisms. Now "they" is no longer ambiguous.
The chunking approach we used here (user query plus first 200 chars of response) is straightforward but not systematically tested. Production systems might experiment with alternatives like storing just the user query, storing full responses, or storing LLM-generated summaries of turns. What matters is the pattern of storing conversation turns as searchable embeddings and retrieving relevant context for new queries.
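As one illustration of the summarization alternative, you could compress each turn with a short LLM call before storing it. This sketch reuses CHAT_MODEL and extract_text from earlier; the prompt wording is arbitrary, and note that it adds an extra API call per turn.
def summarize_turn(co_client, user_query: str, assistant_response: str) -> str:
    """Ask the LLM for a one-sentence summary of a turn before storing it in memory."""
    prompt = (
        "Summarize this exchange in one sentence for later retrieval:\n"
        f"User: {user_query}\nAssistant: {assistant_response}"
    )
    resp = co_client.chat(
        model=CHAT_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return extract_text(resp)

# Then store the summary in place of the truncated raw response:
# memory.add_turn(query, summarize_turn(co, query, response))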
When Memory Matters Most
Memory provides the most value in specific scenarios.
Multi-turn research conversations where users progressively refine their exploration. "What are transformers?" followed by "What about vision transformers?" followed by "Show me recent papers."
Follow-up questions with pronouns like "they", "it", "those" that reference earlier topics. Without memory, these are ambiguous. With memory, they're clear.
Progressive refinement where each question builds on the previous answer. "What is attention?" then "What about multi-head attention?" then "Show me efficient implementations."
Context-dependent queries like "Show me related work" where "related to what?" depends on earlier conversation.
Memory is less critical for standalone queries where each question is self-contained and independent. If users jump between unrelated topics, memory won't help much. But for the natural flow of research conversations where topics evolve and build on each other, memory transforms the experience.
Taking This to Production
The techniques we've built in this tutorial work well for learning and prototyping. When you're ready to move to production, there are specialized tools and services designed specifically for semantic caching and conversation memory at scale.
Production Caching Solutions
What we built: Python dict (exact match) plus ChromaDB (semantic match)
- Works great for learning and understanding the fundamentals
- Good enough for prototypes and small-scale applications
- Limitation: the in-memory dict isn't persistent, and ChromaDB isn't optimized specifically for caching workloads
Production alternatives to consider:
GPTCache is a modular semantic caching framework by Zilliz. It integrates with LangChain and LlamaIndex, supports multiple vector stores (Milvus, Qdrant, FAISS), and provides battle-tested caching patterns. Good choice when building serious production systems where you want fine-grained control.
Upstash Semantic Cache is a fully managed service built on Upstash Vector. It offers a simple API with automatic scaling and zero infrastructure management. Good choice when you want to focus on your application rather than cache operations, though it does mean vendor lock-in and costs that scale with usage.
Redis plus a vector database combines Redis for exact matching (sub-millisecond lookups, persistent storage) with a separate vector DB (pgvector, Qdrant) for semantic matching. This gives you production-grade speed and durability but requires wiring two systems together. Good choice when you're already using Redis and want fine-grained control over both layers.
AI gateway services like Portkey provide caching as part of a broader observability platform. They act as a proxy layer that handles caching automatically while also providing rate limiting, fallbacks, and monitoring. Good choice when you want comprehensive observability plus caching in one service.
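For reference, the exact-match half of the Redis pattern above is just a keyed lookup with built-in expiration; the semantic layer stays the same as what we built. This sketch assumes a local Redis instance and the redis-py client (pip install redis).
import hashlib
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def redis_cache_get(query: str) -> str | None:
    key = "cache:" + hashlib.md5(query.lower().strip().encode("utf-8")).hexdigest()
    return r.get(key)  # None on a miss

def redis_cache_put(query: str, response: str, ttl_seconds: int = 7 * 24 * 3600) -> None:
    key = "cache:" + hashlib.md5(query.lower().strip().encode("utf-8")).hexdigest()
    r.set(key, response, ex=ttl_seconds)  # Redis handles expiration for us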
Production Memory Solutions
What we built: ChromaDB for storing conversation turns
- Simple and consistent with the rest of this tutorial series
- Works for learning and prototyping
- Limitation: Not optimized for high-write workloads typical of conversation logging
Production alternatives to consider:
pgvector (from our earlier tutorial) lets you store conversations in PostgreSQL with a vector column. This gives you persistence, transactional guarantees, and easy integration with your existing user database. Good choice when you're already using PostgreSQL and need durable conversation storage.
LangChain ConversationBufferMemory is a simple in-memory buffer of the last N messages. It's built-in, requires zero setup, and works great for prototyping. The limitation is no persistence and no semantic retrieval, just chronological buffering. Good enough for simple chatbots with ephemeral sessions.
LangChain ConversationSummaryMemory uses an LLM to summarize conversation history automatically. This handles long conversations elegantly and reduces token usage, but it costs money (LLM calls for summarization) and is lossy compression. Good choice when conversations get very long and token limits matter.
Redis for session state plus vector DB for semantic retrieval stores raw conversation JSON in Redis (fast access, persistent) while maintaining a parallel vector index for semantic search of past turns. This requires managing two systems but gives you both fast session access and semantic retrieval when needed. Good choice for high-traffic production systems.
Multi-User Considerations
Everything we built in this tutorial assumes a single user. Production systems serving multiple users need additional considerations.
Session isolation: User A's cache shouldn't serve answers to User B. User A's conversation history shouldn't be visible to User B. This requires adding user_id or session_id to all cache keys and metadata.
Implementation pattern:
cache_key = f"{user_id}:{query_hash}"
memory_id = f"{user_id}_turn_{turn_number}"
Every cache operation and memory storage needs these identifiers to maintain proper isolation. This isn't complicated technically, but it's critical for privacy and correctness.
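A minimal sketch of that isolation, building on the SemanticCache class from Part 1: tag every cached entry with the user's ID on write, and filter on it at read time. The user_id plumbing here is hypothetical, not part of the class we built.
# Sketch: scope semantic-cache reads to a single user via metadata filtering
def user_scoped_get(cache: SemanticCache, user_id: str, query: str, threshold: float = 0.90):
    query_hash = cache._hash_query(f"{user_id}:{query}")
    if query_hash in cache.exact_cache:
        return cache.exact_cache[query_hash], "exact"
    query_embedding = embed_query(cache.co, query)
    results = cache.semantic_cache.query(
        query_embeddings=[query_embedding],
        n_results=1,
        where={"user_id": user_id},  # only consider this user's cached entries
        include=["metadatas", "distances"],
    )
    if results["distances"][0] and 1 - results["distances"][0][0] >= threshold:
        return results["metadatas"][0][0]["response"], "semantic"
    return None, None

# On writes, include "user_id" in the metadata so the filter above can find the entry:
#   metadatas=[{"response": response, "user_id": user_id}]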
Monitoring and Observability
Production systems need metrics to validate that caching and memory actually provide value. Track cache hit rate over time (both exact and semantic), latency distribution at p50, p95, and p99, cost savings (API calls avoided multiplied by cost per call), and false positive rate (wrong answers served from cache).
For memory systems, measure how often retrieved context actually helps answer queries and track the quality of multi-turn conversations. Tools like Prometheus and Grafana work well for metrics dashboards, while LangSmith and similar services provide LLM-specific observability.
Without metrics, you're flying blind. You might think your cache is helping when it's actually serving stale or wrong answers. Measure, monitor, and adjust based on what you observe.
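A lightweight starting point, before reaching for Prometheus or a hosted observability tool, is to count cache events in process and compute the basic numbers yourself. A sketch:
from dataclasses import dataclass, field

@dataclass
class CacheMetrics:
    exact_hits: int = 0
    semantic_hits: int = 0
    misses: int = 0
    latencies_ms: list = field(default_factory=list)

    def record(self, cache_type, elapsed_ms: float) -> None:
        """Record one query: cache_type is 'exact', 'semantic', or None for a miss."""
        self.latencies_ms.append(elapsed_ms)
        if cache_type == "exact":
            self.exact_hits += 1
        elif cache_type == "semantic":
            self.semantic_hits += 1
        else:
            self.misses += 1

    def hit_rate(self) -> float:
        total = self.exact_hits + self.semantic_hits + self.misses
        return (self.exact_hits + self.semantic_hits) / total if total else 0.0

    def p95_latency_ms(self) -> float:
        if not self.latencies_ms:
            return 0.0
        ordered = sorted(self.latencies_ms)
        return ordered[int(0.95 * (len(ordered) - 1))]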
Alternative LLM Providers
We used Cohere throughout this tutorial for consistency with earlier tutorials in the series. The caching and memory patterns we've built work identically with any LLM provider. Just swap the API client.
OpenAI (GPT-3.5-turbo, GPT-4):
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
Anthropic (Claude):
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,  # the Messages API requires max_tokens
    messages=[{"role": "user", "content": prompt}],
)
Local models (Ollama, LM Studio):
import ollama

response = ollama.chat(
    model="llama2",
    messages=[{"role": "user", "content": prompt}],
)
The caching logic, similarity thresholds, guardrails, and memory patterns remain identical. Only the API client call changes. Local models eliminate API costs entirely but typically run slower than cloud-hosted options.
What We Didn’t Test
This tutorial focuses on mechanics, not evaluation. We used real retrieval outputs and timing measurements to demonstrate the caching and memory patterns, but we did not run a full quality evaluation.
What we tested with real data:
- Baseline timing breakdown for embedding, retrieval, and synthesis
- Exact cache hits and semantic cache hits with real queries
- Similarity score behavior across several “should cache” and “should not cache” queries
- Cache hit rate for a small realistic workload
- Retrieval differences with and without conversation memory (80% of results changed in our example)
What we simplified for learning:
- We did not evaluate answer quality, only timing and retrieval behavior
- We did not implement TTL enforcement or eviction logic
- We did not implement guardrails in code, only described the approach
- We did not support multi-user session isolation
- We did not test performance under high concurrency or large cache sizes
Think of this tutorial as a working baseline. Production systems should add evaluation, monitoring, expiration strategies, and user/session isolation before relying on caching or memory for correctness.
What's Next
You now have two powerful optimization techniques for LLM applications. Semantic caching reduces costs and latency by recognizing similar questions. Conversation memory makes multi-turn exchanges natural by providing context from past turns. Both techniques use vector databases in ways that build directly on everything you've learned in this series.
The techniques we built work for learning and small-scale applications. When you're ready to scale up, the production alternatives section gives you clear paths forward. GPTCache, Upstash, Redis patterns, and specialized memory solutions all implement these same core concepts we've explored. The fundamentals stay constant even as the tools change.
Here's how to apply what you've learned:
Start with the basics. Use the two-tier cache pattern (exact plus semantic) we built here. Measure your hit rates, costs, and latency. Understand your actual query patterns before adding complexity.
Tune for your use case. The 0.90 threshold worked for our research queries, but your queries might cluster differently. Test with your actual data. Measure false positive rates. Adjust based on whether accuracy or cost savings matter more.
Add guardrails from day one. Time-sensitive and user-specific queries will break your cache if you don't block them. Start with pattern matching like we showed. Refine based on what you observe in production.
Measure everything. Track cache hit rates, latency distributions, cost savings, and false positives. For memory systems, measure how often context actually helps. Without metrics, you won't know if your optimizations are working.
Scale when needed, not before. The Python dict plus ChromaDB cache works fine for prototypes and small applications. Don't jump to GPTCache or Upstash until you've validated the patterns with simple implementations first. Premature optimization wastes time.
The vector database skills you've built across this series all come together here. You learned to store embeddings efficiently, chunk documents intelligently, filter and search effectively, and choose production databases appropriately. Now you can optimize those systems with caching and memory.
When you're ready to build:
- Pick a domain and data source. Not arXiv papers. Something relevant to your interests or work.
- Implement a simple retrieval system with synthesis. Measure baseline costs and latency.
- Add semantic caching. Track your hit rates and cost savings.
- Test multi-turn conversations. See where memory helps and where it doesn't.
- Measure, adjust, repeat.
The best way to solidify these concepts is to build something real. Use your own data, serve your own queries, encounter your own challenges. The patterns we've covered will guide you, but hands-on experience teaches more than any tutorial can.
Key Takeaways
- LLM synthesis is the bottleneck: Embedding and vector retrieval are fast, but synthesis takes seconds, making it the most valuable target for optimization.
- Two-tier caching works well in practice: Exact match caching is nearly free and catches repeated queries. Semantic caching is slower than exact match but still far cheaper than calling the LLM again.
- Semantic thresholds are a tradeoff: High thresholds reduce wrong cache hits but miss legitimate rephrases. Low thresholds increase hit rates but risk incorrect reuse.
- Most savings often come from exact repeats: In exploratory research workloads, many cache hits come from users repeating the same question, not just paraphrasing it.
- Guardrails matter for correctness: Queries that depend on time, user identity, or session context should bypass caching to avoid incorrect responses.
- Conversation memory improves retrieval: Follow-up queries like “Show me efficient implementations” become meaningfully searchable when you include context from earlier turns.
- Start simple and measure: Use basic patterns first, then refine thresholds, guardrails, and memory strategies based on real usage metrics.