December 16, 2025

Production Vector Databases

Previously, we saw something interesting when we added metadata filtering to our arXiv paper search. Filtering added significant overhead to our queries. Category filters made queries 3.3x slower. Year range filters added 8x overhead. Combined filters landed somewhere in the middle at 5x.

That’s fine for a learning environment or a small-scale prototype. But if you’re building a real application where users are constantly filtering by category, date ranges, or combinations of metadata fields, those milliseconds add up fast. When you’re handling hundreds or thousands of queries per hour, they really start to matter.

Let’s see if production databases handle this better. We’ll go beyond ChromaDB and get hands-on with three production-grade vector databases. We won’t just read about them. We’ll actually set them up, load our data, run queries, and measure what happens.

Here’s what we’ll build:

  1. PostgreSQL with pgvector: The SQL integration play. We’ll add vector search capabilities to a traditional database that many teams already run.
  2. Qdrant: The specialized vector database. Built from the ground up in Rust for handling filtered vector search efficiently.
  3. Pinecone: The managed service approach. We’ll see what it’s like when someone else handles all the infrastructure.

By the end, you’ll have hands-on experience with all three approaches, real performance data showing how they compare, and a framework for choosing the right database for your specific situation.

What You Already Know

This tutorial assumes you understand:

  • What embeddings are and how similarity search works
  • How to use ChromaDB for basic vector operations
  • Why metadata filtering matters for real applications
  • The performance characteristics of ChromaDB’s filtering

If any of these topics are new to you, we recommend checking out these previous posts:

  1. Introduction to Vector Databases using ChromaDB
  2. Document Chunking Strategies for Vector Databases
  3. Metadata Filtering and Hybrid Search for Vector Databases

They’ll give you the foundation needed to get the most out of what we’re covering here.

What You’ll Learn

By working through this tutorial, you’ll:

  • Set up and configure three different production vector databases
  • Load the same dataset into each one and run identical queries
  • Measure and compare performance characteristics
  • Understand the tradeoffs: raw speed, filtering efficiency, operational overhead
  • Learn when to choose each database based on your team’s constraints
  • Get comfortable with different database architectures and APIs

A Quick Note on “Production”

When we say “production database,” we don’t mean these are only for big companies with massive scale. We mean these are databases you could actually deploy in a real application that serves real users. They handle the edge cases, offer reasonable performance at scale, and have communities and documentation you can rely on.

That said, “production-ready” doesn’t mean “production-required.” ChromaDB is perfectly fine for many applications. The goal here is to expand your toolkit so you can make informed choices.

Setup: Two Paths Forward

Before we get into our three vector databases, we need to talk about how we’re going to run them. You have two options, and neither is wrong.

Option 1: Docker (Recommended)

We recommend using Docker for this tutorial because it lets you run all three databases side-by-side without any conflicts. You can experiment, break things, start over, and when you’re done, everything disappears cleanly with a single command.

More importantly, this is how engineers actually work with databases in development. You spin up containers, test things locally, then deploy similar containers to production. Learning this pattern now gives you a transferable skill.

If you’re new to Docker, don’t worry. You don’t need to become a Docker expert. We’ll use it like a tool that creates safe workspaces. Think of it as running each database in its own isolated bubble on your computer.

Here’s what we’ll set up:

  • A workspace container where you’ll write and run Python code
  • A PostgreSQL container with pgvector already installed
  • A Qdrant container running the vector database
  • Shared folders so your code and data persist between sessions

Everything stays on your actual computer in folders you can see and edit. The containers just provide the database software and Python environment.

Want to learn more about Docker? We have an excellent guide on setting up data engineering labs with Docker: Setting Up Your Data Engineering Lab with Docker

Option 2: Direct Installation (Alternative)

If you prefer to install things directly on your system, or if Docker won’t work in your environment, that’s totally fine. You can install PostgreSQL with pgvector and Qdrant directly on your machine (Pinecone is a managed service, so there’s nothing to install for it either way).

The direct installation path means you can’t easily run all three databases simultaneously for side-by-side comparison, but you’ll still learn the concepts and get hands-on experience with each one.

What We’re Using

Throughout this tutorial, we’ll use the same dataset we’ve been working with: 5,000 arXiv papers with pre-generated embeddings. If you don’t have these files yet, you can download them:

If you already have these files from previous work, you’re all set.

Docker Setup Instructions

Let’s get the Docker environment running. First, create a folder for this project:

mkdir vector_dbs
cd vector_dbs

Create a structure for your files:

mkdir code data

Put your dataset files (arxiv_papers_5k.csv and embeddings_cohere_5k.npy) in the data/ folder.

Now create a file called docker-compose.yml in the vector_dbs folder:

services:
  lab:
    image: python:3.12-slim
    volumes:
      - ./code:/code
      - ./data:/data
    working_dir: /code
    stdin_open: true
    tty: true
    depends_on: [postgres, qdrant]
    networks: [vector_net]
    environment:
      POSTGRES_HOST: postgres
      QDRANT_HOST: qdrant

  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_PASSWORD: tutorial_password
      POSTGRES_DB: arxiv_db
    ports: ["5432:5432"]
    volumes: [postgres_data:/var/lib/postgresql/data]
    networks: [vector_net]

  qdrant:
    image: qdrant/qdrant:latest
    ports: ["6333:6333", "6334:6334"]
    volumes: [qdrant_data:/qdrant/storage]
    networks: [vector_net]

networks:
  vector_net:

volumes:
  postgres_data:
  qdrant_data:

This configuration sets up three containers:

  • lab: Your Python workspace where you’ll run code
  • postgres: PostgreSQL database with pgvector pre-installed
  • qdrant: Qdrant vector database

The databases store their data in Docker volumes (postgres_data, qdrant_data), which means your data persists even when you stop the containers.

Start the databases:

docker compose up -d postgres qdrant

The -d flag runs them in the background. You should see Docker downloading the images (first time only) and then starting the containers.
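
If you want to confirm both databases came up before moving on, you can list the running services (this is standard Docker Compose, nothing specific to this tutorial):

docker compose ps

Both the postgres and qdrant services should show a running status.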

Now enter your Python workspace:

docker compose run --rm lab

The --rm flag tells Docker to automatically remove the container when you exit. Don’t worry about losing your work. Your code in the code/ folder and data in data/ folder are safe. Only the temporary container workspace gets cleaned up.

You’re now inside a container with Python 3.12. Your code/ and data/ folders from your computer are available here at /code and /data.

Create a requirements.txt file in your code/ folder with the packages we’ll need:

psycopg2-binary==2.9.9
pgvector==0.2.4
qdrant-client==1.16.1
pinecone==5.0.1
cohere==5.11.0
numpy==1.26.4
pandas==2.2.0
python-dotenv==1.0.1

Install the packages:

pip install -r requirements.txt

Perfect! You now have a safe environment where you can experiment with all three databases.

When you’re done working, just type exit to leave the container, then:

docker compose down

This stops the databases. Your data is safe in Docker volumes. Next time you want to work, just run docker compose up -d postgres qdrant and docker compose run --rm lab again.
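
If you ever want a completely clean slate (for example, to re-run the loading scripts from scratch), you can also remove the Docker volumes. This deletes the data stored in PostgreSQL and Qdrant, but your files in code/ and data/ stay untouched:

docker compose down -v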

A Note for Direct Installation Users

If you’re going the direct installation route, you’ll need:

For PostgreSQL + pgvector:

  • Install PostgreSQL locally, then install the pgvector extension by following its installation guide

For Qdrant:

  • Option A: Install Qdrant locally following their installation guide
  • Option B: Skip Qdrant for now and focus on pgvector and Pinecone

Python packages:
Use the same requirements.txt from above and install with pip install -r requirements.txt

Alright, setup is complete. Let’s build something.


Database 1: PostgreSQL with pgvector

If you’ve worked with data at all, you’ve probably encountered PostgreSQL. It’s everywhere. It powers everything from tiny startups to massive enterprises. Many teams already have Postgres running in production, complete with backups, monitoring, and people who know how to keep it healthy.

So when your team needs vector search capabilities, a natural question is: “Can we just add this to our existing database?”

That’s exactly what pgvector does. It’s a PostgreSQL extension that adds vector similarity search to a database you might already be running. No new infrastructure to learn, no new backup strategies, no new team to build. Just install an extension and suddenly you can store embeddings alongside your regular data.

Let’s see what that looks like in practice.

Loading Data into PostgreSQL

We’ll start by creating a table that stores our paper metadata and embeddings together. Create a file called load_pgvector.py in your code/ folder:

import psycopg2
from pgvector.psycopg2 import register_vector
import numpy as np
import pandas as pd
import os

# Connect to PostgreSQL
# If using Docker, these environment variables are already set
db_host = os.getenv('POSTGRES_HOST', 'localhost')
conn = psycopg2.connect(
    host=db_host,
    database="arxiv_db",
    user="postgres",
    password="tutorial_password"
)
cur = conn.cursor()

# Enable pgvector extension
# This needs to happen BEFORE we register the vector type
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.commit()

# Now register the vector type with psycopg2
# This lets us pass numpy arrays directly as vectors
register_vector(conn)

# Create table for our papers
# The vector(1536) column stores our 1536-dimensional embeddings
cur.execute("""
    CREATE TABLE IF NOT EXISTS papers (
        id TEXT PRIMARY KEY,
        title TEXT,
        authors TEXT,
        abstract TEXT,
        year INTEGER,
        category TEXT,
        embedding vector(1536)
    )
""")
conn.commit()

# Load the metadata and embeddings
papers_df = pd.read_csv('/data/arxiv_papers_5k.csv')
embeddings = np.load('/data/embeddings_cohere_5k.npy')

print(f"Loading {len(papers_df)} papers into PostgreSQL...")

# Insert papers in batches
# We'll do 500 at a time to keep transactions manageable
batch_size = 500
for i in range(0, len(papers_df), batch_size):
    batch_df = papers_df.iloc[i:i+batch_size]
    batch_embeddings = embeddings[i:i+batch_size]

    for j, (idx, row) in enumerate(batch_df.iterrows()):
        cur.execute("""
            INSERT INTO papers (id, title, authors, abstract, year, category, embedding)
            VALUES (%s, %s, %s, %s, %s, %s, %s)
            ON CONFLICT (id) DO NOTHING
        """, (
            row['id'],
            row['title'],
            row['authors'],
            row['abstract'],
            row['year'],
            row['categories'],
            batch_embeddings[j]  # Pass numpy array directly
        ))

    conn.commit()
    print(f"  Loaded {min(i+batch_size, len(papers_df))} / {len(papers_df)} papers")

print("\nData loaded successfully!")

# Create HNSW index for fast similarity search
# This takes a couple seconds for 5,000 papers
print("Creating HNSW index...")
cur.execute("""
    CREATE INDEX IF NOT EXISTS papers_embedding_idx 
    ON papers 
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64)
""")
conn.commit()
print("Index created!")

# Verify everything worked
cur.execute("SELECT COUNT(*) FROM papers")
count = cur.fetchone()[0]
print(f"\nTotal papers in database: {count}")

cur.close()
conn.close()

Let’s break down what’s happening here:

  • Extension Setup: We enable the pgvector extension first, then register the vector type with our Python database driver. This order matters. If you try to register the type before the extension exists, you’ll get an error.
  • Table Structure: We’re storing both metadata (title, authors, abstract, year, category) and the embedding vector in the same table. The vector(1536) type tells PostgreSQL we want a 1536-dimensional vector column.
  • Passing Vectors: Thanks to the register_vector() call, we can pass numpy arrays directly. The pgvector library handles converting them to PostgreSQL’s vector format automatically. If you tried to pass a Python list instead, PostgreSQL would create a regular array type, which doesn’t support the distance operators we need.
  • HNSW Index: After loading the data, we create an HNSW index. The parameters m=16 and ef_construction=64 are defaults that work well for most cases. The index took about 2.8 seconds to build on 5,000 papers in our tests.

Run this script:

python load_pgvector.py

You should see output like this:

Loading 5000 papers into PostgreSQL...
  Loaded 500 / 5000 papers
  Loaded 1000 / 5000 papers
  Loaded 1500 / 5000 papers
  Loaded 2000 / 5000 papers
  Loaded 2500 / 5000 papers
  Loaded 3000 / 5000 papers
  Loaded 3500 / 5000 papers
  Loaded 4000 / 5000 papers
  Loaded 4500 / 5000 papers
  Loaded 5000 / 5000 papers

Data loaded successfully!
Creating HNSW index...
Index created!

Total papers in database: 5000

The data is now loaded and indexed in PostgreSQL.
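
One tuning knob worth knowing about before we query: pgvector exposes a query-time setting called hnsw.ef_search that controls how many candidates the HNSW index considers (higher values improve recall at the cost of speed). We stick with the defaults in this tutorial, but here’s a minimal sketch of raising it for the current session, assuming an open connection and cursor like the ones in our scripts:

# Optional: widen the HNSW search beam for this session (trade speed for recall)
cur.execute("SET hnsw.ef_search = 100")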

Querying with pgvector

Now let’s write some queries. Create query_pgvector.py:

import psycopg2
from pgvector.psycopg2 import register_vector
import numpy as np
import os

# Connect and register vector type
db_host = os.getenv('POSTGRES_HOST', 'localhost')
conn = psycopg2.connect(
    host=db_host,
    database="arxiv_db",
    user="postgres",
    password="tutorial_password"
)
register_vector(conn)
cur = conn.cursor()

# Let's use a paper from our dataset as the query
# We'll find papers similar to a machine learning paper
cur.execute("""
    SELECT id, title, category, year, embedding
    FROM papers
    WHERE category = 'cs.LG'
    LIMIT 1
""")
result = cur.fetchone()
query_id, query_title, query_category, query_year, query_embedding = result

print("Query paper:")
print(f"  Title: {query_title}")
print(f"  Category: {query_category}")
print(f"  Year: {query_year}")
print()

# Scenario 1: Unfiltered similarity search
# The <=> operator computes cosine distance
print("=" * 80)
print("Scenario 1: Unfiltered Similarity Search")
print("=" * 80)
cur.execute("""
    SELECT title, category, year, embedding <=> %s AS distance
    FROM papers
    WHERE id != %s
    ORDER BY embedding <=> %s
    LIMIT 5
""", (query_embedding, query_id, query_embedding))

for row in cur.fetchall():
    print(f"  {row[1]:8} {row[2]} | {row[3]:.4f} | {row[0][:60]}")
print()

# Scenario 2: Filter by category
print("=" * 80)
print("Scenario 2: Category Filter (cs.LG only)")
print("=" * 80)
cur.execute("""
    SELECT title, category, year, embedding <=> %s AS distance
    FROM papers
    WHERE category = 'cs.LG' AND id != %s
    ORDER BY embedding <=> %s
    LIMIT 5
""", (query_embedding, query_id, query_embedding))

for row in cur.fetchall():
    print(f"  {row[1]:8} {row[2]} | {row[3]:.4f} | {row[0][:60]}")
print()

# Scenario 3: Filter by year range
print("=" * 80)
print("Scenario 3: Year Filter (2025 or later)")
print("=" * 80)
cur.execute("""
    SELECT title, category, year, embedding <=> %s AS distance
    FROM papers
    WHERE year >= 2025 AND id != %s
    ORDER BY embedding <=> %s
    LIMIT 5
""", (query_embedding, query_id, query_embedding))

for row in cur.fetchall():
    print(f"  {row[1]:8} {row[2]} | {row[3]:.4f} | {row[0][:60]}")
print()

# Scenario 4: Combined filters
print("=" * 80)
print("Scenario 4: Combined Filter (cs.LG AND year >= 2025)")
print("=" * 80)
cur.execute("""
    SELECT title, category, year, embedding <=> %s AS distance
    FROM papers
    WHERE category = 'cs.LG' AND year >= 2025 AND id != %s
    ORDER BY embedding <=> %s
    LIMIT 5
""", (query_embedding, query_id, query_embedding))

for row in cur.fetchall():
    print(f"  {row[1]:8} {row[2]} | {row[3]:.4f} | {row[0][:60]}")

cur.close()
conn.close()

This script tests the same four scenarios we measured previously:

  1. Unfiltered vector search
  2. Filter by category (text field)
  3. Filter by year range (integer field)
  4. Combined filters (category AND year)

Run it:

python query_pgvector.py

You’ll see output similar to this:

Query paper:
  Title: Deep Reinforcement Learning for Autonomous Navigation
  Category: cs.LG
  Year: 2025

================================================================================
Scenario 1: Unfiltered Similarity Search
================================================================================
  cs.LG    2024 | 0.2134 | Policy Gradient Methods for Robot Control
  cs.LG    2025 | 0.2287 | Multi-Agent Reinforcement Learning in Games
  cs.CV    2024 | 0.2445 | Visual Navigation Using Deep Learning
  cs.LG    2023 | 0.2591 | Model-Free Reinforcement Learning Approaches
  cs.CL    2025 | 0.2678 | Reinforcement Learning for Dialogue Systems

================================================================================
Scenario 2: Category Filter (cs.LG only)
================================================================================
  cs.LG    2024 | 0.2134 | Policy Gradient Methods for Robot Control
  cs.LG    2025 | 0.2287 | Multi-Agent Reinforcement Learning in Games
  cs.LG    2023 | 0.2591 | Model-Free Reinforcement Learning Approaches
  cs.LG    2024 | 0.2734 | Deep Q-Networks for Atari Games
  cs.LG    2025 | 0.2856 | Actor-Critic Methods in Continuous Control

================================================================================
Scenario 3: Year Filter (2025 or later)
================================================================================
  cs.LG    2025 | 0.2287 | Multi-Agent Reinforcement Learning in Games
  cs.CL    2025 | 0.2678 | Reinforcement Learning for Dialogue Systems
  cs.LG    2025 | 0.2856 | Actor-Critic Methods in Continuous Control
  cs.CV    2025 | 0.2923 | Self-Supervised Learning for Visual Tasks
  cs.DB    2025 | 0.3012 | Optimizing Database Queries with Learning

================================================================================
Scenario 4: Combined Filter (cs.LG AND year >= 2025)
================================================================================
  cs.LG    2025 | 0.2287 | Multi-Agent Reinforcement Learning in Games
  cs.LG    2025 | 0.2856 | Actor-Critic Methods in Continuous Control
  cs.LG    2025 | 0.3145 | Transfer Learning in Reinforcement Learning
  cs.LG    2025 | 0.3267 | Exploration Strategies in Deep RL
  cs.LG    2025 | 0.3401 | Reward Shaping for Complex Tasks

The queries work just like regular SQL. We’re just using the <=> operator for cosine distance instead of normal comparison operators.
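
Cosine distance isn’t the only metric pgvector supports. The extension also provides a <-> operator for L2 (Euclidean) distance and a <#> operator for negative inner product. Here’s a sketch of what an L2-based query would look like with the same cursor and query embedding; we stick with <=> everywhere else, since our HNSW index was built for cosine distance:

# Same query shape, different distance operator (L2 instead of cosine)
cur.execute("""
    SELECT title, embedding <-> %s AS l2_distance
    FROM papers
    ORDER BY embedding <-> %s
    LIMIT 5
""", (query_embedding, query_embedding))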

Measuring Performance

Let’s get real numbers. Create benchmark_pgvector.py:

import psycopg2
from pgvector.psycopg2 import register_vector
import numpy as np
import time
import os

db_host = os.getenv('POSTGRES_HOST', 'localhost')
conn = psycopg2.connect(
    host=db_host,
    database="arxiv_db",
    user="postgres",
    password="tutorial_password"
)
register_vector(conn)
cur = conn.cursor()

# Get a query embedding
cur.execute("""
    SELECT embedding FROM papers 
    WHERE category = 'cs.LG' 
    LIMIT 1
""")
query_embedding = cur.fetchone()[0]

def benchmark_query(query, params, name, iterations=100):
    """Run a query multiple times and measure average latency"""
    # Warmup
    for _ in range(5):
        cur.execute(query, params)
        cur.fetchall()

    # Actual measurement
    times = []
    for _ in range(iterations):
        start = time.time()
        cur.execute(query, params)
        cur.fetchall()
        times.append((time.time() - start) * 1000)  # Convert to ms

    avg_time = np.mean(times)
    std_time = np.std(times)
    return avg_time, std_time

print("Benchmarking pgvector performance...")
print("=" * 80)

# Scenario 1: Unfiltered
query1 = """
    SELECT title, category, year, embedding <=> %s AS distance
    FROM papers
    ORDER BY embedding <=> %s
    LIMIT 10
"""
avg, std = benchmark_query(query1, (query_embedding, query_embedding), "Unfiltered")
print(f"Unfiltered search:        {avg:.2f}ms (±{std:.2f}ms)")
baseline = avg

# Scenario 2: Category filter
query2 = """
    SELECT title, category, year, embedding <=> %s AS distance
    FROM papers
    WHERE category = 'cs.LG'
    ORDER BY embedding <=> %s
    LIMIT 10
"""
avg, std = benchmark_query(query2, (query_embedding, query_embedding), "Category filter")
overhead = avg / baseline
print(f"Category filter:          {avg:.2f}ms (±{std:.2f}ms) | {overhead:.2f}x overhead")

# Scenario 3: Year filter
query3 = """
    SELECT title, category, year, embedding <=> %s AS distance
    FROM papers
    WHERE year >= 2025
    ORDER BY embedding <=> %s
    LIMIT 10
"""
avg, std = benchmark_query(query3, (query_embedding, query_embedding), "Year filter")
overhead = avg / baseline
print(f"Year filter (>= 2025):    {avg:.2f}ms (±{std:.2f}ms) | {overhead:.2f}x overhead")

# Scenario 4: Combined filters
query4 = """
    SELECT title, category, year, embedding <=> %s AS distance
    FROM papers
    WHERE category = 'cs.LG' AND year >= 2025
    ORDER BY embedding <=> %s
    LIMIT 10
"""
avg, std = benchmark_query(query4, (query_embedding, query_embedding), "Combined filter")
overhead = avg / baseline
print(f"Combined filter:          {avg:.2f}ms (±{std:.2f}ms) | {overhead:.2f}x overhead")

print("=" * 80)

cur.close()
conn.close()

Run this:

python benchmark_pgvector.py

Here’s what we found in our testing (your numbers might vary slightly depending on your hardware):

Benchmarking pgvector performance...
================================================================================
Unfiltered search:        2.48ms (±0.31ms)
Category filter:          5.70ms (±0.42ms) | 2.30x overhead
Year filter (>= 2025):    2.51ms (±0.29ms) | 1.01x overhead
Combined filter:          5.64ms (±0.38ms) | 2.27x overhead
================================================================================

What the Numbers Tell Us

Let’s compare this to ChromaDB:

Scenario          ChromaDB        pgvector        Winner
Unfiltered        4.5ms           2.5ms           pgvector (1.8x faster)
Category filter   3.3x overhead   2.3x overhead   pgvector (30% less overhead)
Year filter       8.0x overhead   1.0x overhead   pgvector (essentially free!)
Combined filter   5.0x overhead   2.3x overhead   pgvector (54% less overhead)

Three things jump out:

  1. pgvector is fast at baseline. The unfiltered queries average 2.5ms compared to ChromaDB’s 4.5ms. That’s nearly twice as fast, which makes sense. Decades of PostgreSQL query optimization plus in-process execution (no HTTP overhead) really shows here.
  2. Integer filters are essentially free. The year range filter adds almost zero overhead (1.01x). PostgreSQL is incredibly good at filtering on integers. It can use standard database indexes and optimization techniques that have been refined over 30+ years.
  3. Text filters have a cost, but it’s reasonable. Category filtering shows 2.3x overhead, which is better than ChromaDB’s 3.3x but still noticeable. Text matching is inherently more expensive than integer comparisons, even in a mature database like PostgreSQL.

The pattern here is really interesting: pgvector doesn’t magically make all filtering free, but it leverages PostgreSQL’s strengths. When you filter on things PostgreSQL is already good at (numbers, dates, IDs), the overhead is minimal. When you filter on text fields, you pay a price, but it’s more manageable than in ChromaDB.

What pgvector Gets Right

  • SQL Integration: If your team already thinks in SQL, pgvector feels natural. You write regular SQL queries with a special distance operator. That’s it. No new query language to learn.
  • Transaction Support: Need to update a paper’s metadata and its embedding together? Wrap it in a transaction. PostgreSQL handles it the same way it handles any other transactional update (see the sketch after this list).
  • Existing Infrastructure: Many teams already have PostgreSQL in production, complete with backups, monitoring, high availability setups, and people who know how to keep it running. Adding pgvector means leveraging all that existing investment.
  • Mature Ecosystem: Want to connect it to your data pipeline? There’s probably a tool for that. Need to replicate it? PostgreSQL replication works. Want to query it from your favorite language? PostgreSQL drivers exist everywhere.
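
To make the transaction point concrete, here’s a minimal sketch of updating a paper’s metadata and embedding together with psycopg2. The new_abstract, new_embedding, and paper_id values are placeholders for illustration, not part of the tutorial scripts:

# Both column updates commit together, or not at all
cur.execute("""
    UPDATE papers
    SET abstract = %s, embedding = %s
    WHERE id = %s
""", (new_abstract, new_embedding, paper_id))
conn.commit()  # psycopg2 keeps statements in a transaction until you commit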

What pgvector Doesn’t Handle For You

  • VACUUM is Your Problem: PostgreSQL’s VACUUM process can interact weirdly with vector indexes. The indexes can bloat over time if you’re doing lots of updates and deletes. You need to monitor this and potentially rebuild indexes periodically.
  • Index Maintenance: As your data grows and changes, you might need to rebuild indexes to maintain performance. This isn’t automatic. It’s part of your operational responsibility.
  • Memory Pressure: Vector indexes live in memory for best performance. As your dataset grows, you need to size your database appropriately. This is normal for PostgreSQL, but it’s something you have to plan for.
  • Replication Overhead: If you’re replicating your PostgreSQL database, those vector columns come along for the ride. Replicating high-dimensional vectors can be bandwidth-intensive.

In production, you’d typically also add regular indexes (for example, B-tree indexes) on frequently filtered columns like category and year, alongside the vector index.
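
For the columns we filter on in this tutorial, that would look something like this (a sketch using the same connection and cursor as load_pgvector.py; run it once after loading the data):

# B-tree indexes on the metadata columns we filter by
cur.execute("CREATE INDEX IF NOT EXISTS papers_category_idx ON papers (category)")
cur.execute("CREATE INDEX IF NOT EXISTS papers_year_idx ON papers (year)")
conn.commit()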

None of these are dealbreakers. They’re just real operational considerations. Teams with PostgreSQL expertise can handle them. Teams without that expertise might prefer a managed service or specialized database.

When pgvector Makes Sense

pgvector is an excellent choice when:

  • You already run PostgreSQL in production
  • Your team has strong SQL skills and PostgreSQL experience
  • You need transactional guarantees with your vector operations
  • You primarily filter on integer fields (dates, IDs, counts, years)
  • Your scale is moderate (up to a few million vectors)
  • You want to leverage existing PostgreSQL infrastructure

pgvector might not be the best fit when:

  • You’re filtering heavily on text fields with unpredictable combinations
  • You need to scale beyond what a single PostgreSQL server can handle
  • Your team doesn’t have PostgreSQL operational expertise
  • You want someone else to handle all the database maintenance

Database 2: Qdrant

pgvector gave us fast baseline queries, but text filtering still added noticeable overhead. That’s not a PostgreSQL problem. It’s just that PostgreSQL was built to handle all kinds of data, and vector search with heavy filtering is one specific use case among thousands.

Qdrant takes a different approach. It’s a vector database built specifically for filtered vector search. The entire architecture is designed around one question: how do we make similarity search fast even when you’re filtering on multiple metadata fields?

Let’s see if that focus pays off.

Loading Data into Qdrant

Qdrant runs as a separate service (in our Docker setup, it’s already running). We’ll connect to it via HTTP API and load our papers. Create load_qdrant.py:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import numpy as np
import pandas as pd

# Connect to Qdrant
# If using Docker, QDRANT_HOST is set to 'qdrant'
# If running locally, use 'localhost'
import os
qdrant_host = os.getenv('QDRANT_HOST', 'localhost')
client = QdrantClient(host=qdrant_host, port=6333)

# Create collection with vector configuration
collection_name = "arxiv_papers"

# Delete collection if it exists (useful for re-running)
try:
    client.delete_collection(collection_name)
    print(f"Deleted existing collection: {collection_name}")
except Exception:
    # Collection didn't exist yet, so there's nothing to delete
    pass

# Create new collection
client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(
        size=1536,  # Cohere embedding dimension
        distance=Distance.COSINE
    )
)
print(f"Created collection: {collection_name}")

# Load data
papers_df = pd.read_csv('/data/arxiv_papers_5k.csv')
embeddings = np.load('/data/embeddings_cohere_5k.npy')

print(f"\nLoading {len(papers_df)} papers into Qdrant...")

# Prepare points for upload
# Qdrant stores metadata as "payload"
points = []
for idx, row in papers_df.iterrows():
    point = PointStruct(
        id=idx,
        vector=embeddings[idx].tolist(),
        payload={
            "paper_id": row['id'],
            "title": row['title'],
            "authors": row['authors'],
            "abstract": row['abstract'],
            "year": int(row['year']),
            "category": row['categories']
        }
    )
    points.append(point)

    # Show progress every 1000 papers
    if (idx + 1) % 1000 == 0:
        print(f"  Prepared {idx + 1} / {len(papers_df)} papers")

# Upload all points at once
# Qdrant handles large batches well (no 5k limit like ChromaDB)
print("\nUploading to Qdrant...")
client.upsert(
    collection_name=collection_name,
    points=points
)

print(f"Upload complete!")

# Verify
collection_info = client.get_collection(collection_name)
print(f"\nCollection '{collection_name}' now has {collection_info.points_count} papers")

A few things to notice:

  • Collection Setup: We specify the vector size (1536) and distance metric (COSINE) when creating the collection. This is similar to ChromaDB but more explicit.
  • Payload Structure: Qdrant calls metadata “payload.” We store all our paper metadata here as a dictionary. This is where Qdrant’s filtering power comes from.
  • No Batch Size Limits: Unlike ChromaDB’s 5,461 embedding limit, Qdrant handled all 5,000 papers in a single upload without issues.
  • Point IDs: We use the DataFrame index as point IDs. In production, you’d probably use your paper IDs, but integers work fine for this example.

Run the script:

python load_qdrant.py

You’ll see output like this:

Deleted existing collection: arxiv_papers
Created collection: arxiv_papers

Loading 5000 papers into Qdrant...
  Prepared 1000 / 5000 papers
  Prepared 2000 / 5000 papers
  Prepared 3000 / 5000 papers
  Prepared 4000 / 5000 papers
  Prepared 5000 / 5000 papers

Uploading to Qdrant...
Upload complete!

Collection 'arxiv_papers' now has 5000 papers
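
One optional step we don’t need at this scale, but worth knowing about: Qdrant can build payload indexes on fields you filter by frequently, which helps filtered search stay fast as collections grow. Here’s a minimal sketch using the same client; the "keyword" and "integer" values are Qdrant’s standard schema types for exact-match text and integer fields:

# Optional: index the payload fields we filter on
client.create_payload_index(
    collection_name=collection_name,
    field_name="category",
    field_schema="keyword"
)
client.create_payload_index(
    collection_name=collection_name,
    field_name="year",
    field_schema="integer"
)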

Querying with Qdrant

Now let’s run the same query scenarios. Create query_qdrant.py:

from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range
import numpy as np
import os

# Connect to Qdrant
qdrant_host = os.getenv('QDRANT_HOST', 'localhost')
client = QdrantClient(host=qdrant_host, port=6333)
collection_name = "arxiv_papers"

# Get a query vector from a machine learning paper
results = client.scroll(
    collection_name=collection_name,
    scroll_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="cs.LG"))]
    ),
    limit=1,
    with_vectors=True,
    with_payload=True
)

query_point = results[0][0]
query_vector = query_point.vector
query_payload = query_point.payload

print("Query paper:")
print(f"  Title: {query_payload['title']}")
print(f"  Category: {query_payload['category']}")
print(f"  Year: {query_payload['year']}")
print()

# Scenario 1: Unfiltered similarity search
print("=" * 80)
print("Scenario 1: Unfiltered Similarity Search")
print("=" * 80)

results = client.query_points(
    collection_name=collection_name,
    query=query_vector,
    limit=6,  # Get 6 so we can skip the query paper itself
    with_payload=True
)

for hit in results.points[1:6]:  # Skip first result (the query paper)
    payload = hit.payload
    print(f"  {payload['category']:8} {payload['year']} | {hit.score:.4f} | {payload['title'][:60]}")
print()

# Scenario 2: Filter by category
print("=" * 80)
print("Scenario 2: Category Filter (cs.LG only)")
print("=" * 80)

results = client.query_points(
    collection_name=collection_name,
    query=query_vector,
    query_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="cs.LG"))]
    ),
    limit=6,
    with_payload=True
)

for hit in results.points[1:6]:
    payload = hit.payload
    print(f"  {payload['category']:8} {payload['year']} | {hit.score:.4f} | {payload['title'][:60]}")
print()

# Scenario 3: Filter by year range
print("=" * 80)
print("Scenario 3: Year Filter (2025 or later)")
print("=" * 80)

results = client.query_points(
    collection_name=collection_name,
    query=query_vector,
    query_filter=Filter(
        must=[FieldCondition(key="year", range=Range(gte=2025))]
    ),
    limit=5,
    with_payload=True
)

for hit in results.points:
    payload = hit.payload
    print(f"  {payload['category']:8} {payload['year']} | {hit.score:.4f} | {payload['title'][:60]}")
print()

# Scenario 4: Combined filters
print("=" * 80)
print("Scenario 4: Combined Filter (cs.LG AND year >= 2025)")
print("=" * 80)

results = client.query_points(
    collection_name=collection_name,
    query=query_vector,
    query_filter=Filter(
        must=[
            FieldCondition(key="category", match=MatchValue(value="cs.LG")),
            FieldCondition(key="year", range=Range(gte=2025))
        ]
    ),
    limit=5,
    with_payload=True
)

for hit in results.points:
    payload = hit.payload
    print(f"  {payload['category']:8} {payload['year']} | {hit.score:.4f} | {payload['title'][:60]}")

A couple of things about Qdrant’s API:

  • Method Name: We use client.query_points() to search with vectors. The client also has methods called query() and search(), but they work differently. query_points() is what you want for vector similarity search.
  • Filter Syntax: Qdrant uses structured filter objects. Text matching uses MatchValue, numeric ranges use Range. You can combine multiple conditions in the must list (see the sketch after this list for richer combinations).
  • Scores vs Distances: Qdrant returns similarity scores (higher is better) rather than distances (lower is better). This is just a presentation difference.
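
Beyond must, Qdrant filters also support should (at least one condition has to match) and must_not (exclude matches). Here’s a sketch of a richer filter built from the same imports, purely for illustration:

# Papers from 2024 onward, in cs.LG or cs.CV, excluding cs.DB
richer_filter = Filter(
    must=[FieldCondition(key="year", range=Range(gte=2024))],
    should=[
        FieldCondition(key="category", match=MatchValue(value="cs.LG")),
        FieldCondition(key="category", match=MatchValue(value="cs.CV"))
    ],
    must_not=[FieldCondition(key="category", match=MatchValue(value="cs.DB"))]
)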

Run it:

python query_qdrant.py

You’ll see output like this:

Query paper:
  Title: Deep Reinforcement Learning for Autonomous Navigation
  Category: cs.LG
  Year: 2025

================================================================================
Scenario 1: Unfiltered Similarity Search
================================================================================
  cs.LG    2024 | 0.7866 | Policy Gradient Methods for Robot Control
  cs.LG    2025 | 0.7713 | Multi-Agent Reinforcement Learning in Games
  cs.CV    2024 | 0.7555 | Visual Navigation Using Deep Learning
  cs.LG    2023 | 0.7409 | Model-Free Reinforcement Learning Approaches
  cs.CL    2025 | 0.7322 | Reinforcement Learning for Dialogue Systems

================================================================================
Scenario 2: Category Filter (cs.LG only)
================================================================================
  cs.LG    2024 | 0.7866 | Policy Gradient Methods for Robot Control
  cs.LG    2025 | 0.7713 | Multi-Agent Reinforcement Learning in Games
  cs.LG    2023 | 0.7409 | Model-Free Reinforcement Learning Approaches
  cs.LG    2024 | 0.7266 | Deep Q-Networks for Atari Games
  cs.LG    2025 | 0.7144 | Actor-Critic Methods in Continuous Control

================================================================================
Scenario 3: Year Filter (2025 or later)
================================================================================
  cs.LG    2025 | 0.7713 | Multi-Agent Reinforcement Learning in Games
  cs.CL    2025 | 0.7322 | Reinforcement Learning for Dialogue Systems
  cs.LG    2025 | 0.7144 | Actor-Critic Methods in Continuous Control
  cs.CV    2025 | 0.7077 | Self-Supervised Learning for Visual Tasks
  cs.DB    2025 | 0.6988 | Optimizing Database Queries with Learning

================================================================================
Scenario 4: Combined Filter (cs.LG AND year >= 2025)
================================================================================
  cs.LG    2025 | 0.7713 | Multi-Agent Reinforcement Learning in Games
  cs.LG    2025 | 0.7144 | Actor-Critic Methods in Continuous Control
  cs.LG    2025 | 0.6855 | Transfer Learning in Reinforcement Learning
  cs.LG    2025 | 0.6733 | Exploration Strategies in Deep RL
  cs.LG    2025 | 0.6599 | Reward Shaping for Complex Tasks

Notice the scores are higher numbers than the distances we saw with pgvector. That’s just because Qdrant shows similarity (higher = more similar) while pgvector showed distance (lower = more similar). The rankings are what matter.
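
In fact, for cosine similarity the two views are directly related: cosine distance = 1 − cosine similarity. The top result’s 0.7866 similarity here lines up with the 0.2134 distance pgvector reported for the same paper (1 − 0.7866 = 0.2134).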

Measuring Performance

Now for the interesting part. Create benchmark_qdrant.py:

from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range
import numpy as np
import time
import os

# Connect to Qdrant
qdrant_host = os.getenv('QDRANT_HOST', 'localhost')
client = QdrantClient(host=qdrant_host, port=6333)
collection_name = "arxiv_papers"

# Get a query vector
results = client.scroll(
    collection_name=collection_name,
    scroll_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="cs.LG"))]
    ),
    limit=1,
    with_vectors=True
)
query_vector = results[0][0].vector

def benchmark_query(query_filter, name, iterations=100):
    """Run a query multiple times and measure average latency"""
    # Warmup
    for _ in range(5):
        client.query_points(
            collection_name=collection_name,
            query=query_vector,
            query_filter=query_filter,
            limit=10,
            with_payload=True
        )

    # Actual measurement
    times = []
    for _ in range(iterations):
        start = time.time()
        client.query_points(
            collection_name=collection_name,
            query=query_vector,
            query_filter=query_filter,
            limit=10,
            with_payload=True
        )
        times.append((time.time() - start) * 1000)  # Convert to ms

    avg_time = np.mean(times)
    std_time = np.std(times)
    return avg_time, std_time

print("Benchmarking Qdrant performance...")
print("=" * 80)

# Scenario 1: Unfiltered
avg, std = benchmark_query(None, "Unfiltered")
print(f"Unfiltered search:        {avg:.2f}ms (±{std:.2f}ms)")
baseline = avg

# Scenario 2: Category filter
filter_category = Filter(
    must=[FieldCondition(key="category", match=MatchValue(value="cs.LG"))]
)
avg, std = benchmark_query(filter_category, "Category filter")
overhead = avg / baseline
print(f"Category filter:          {avg:.2f}ms (±{std:.2f}ms) | {overhead:.2f}x overhead")

# Scenario 3: Year filter
filter_year = Filter(
    must=[FieldCondition(key="year", range=Range(gte=2025))]
)
avg, std = benchmark_query(filter_year, "Year filter")
overhead = avg / baseline
print(f"Year filter (>= 2025):    {avg:.2f}ms (±{std:.2f}ms) | {overhead:.2f}x overhead")

# Scenario 4: Combined filters
filter_combined = Filter(
    must=[
        FieldCondition(key="category", match=MatchValue(value="cs.LG")),
        FieldCondition(key="year", range=Range(gte=2025))
    ]
)
avg, std = benchmark_query(filter_combined, "Combined filter")
overhead = avg / baseline
print(f"Combined filter:          {avg:.2f}ms (±{std:.2f}ms) | {overhead:.2f}x overhead")

print("=" * 80)

Run this:

python benchmark_qdrant.py

Here’s what we found in our testing:

Benchmarking Qdrant performance...
================================================================================
Unfiltered search:        52.52ms (±1.15ms)
Category filter:          57.19ms (±1.09ms) | 1.09x overhead
Year filter (>= 2025):    58.55ms (±1.11ms) | 1.11x overhead
Combined filter:          58.11ms (±1.08ms) | 1.11x overhead
================================================================================

What the Numbers Tell Us

The pattern here is striking. Let’s compare all three databases we’ve tested:

Scenario                   ChromaDB   pgvector   Qdrant
Unfiltered                 4.5ms      2.5ms      52ms
Category filter overhead   3.3x       2.3x       1.09x
Year filter overhead       8.0x       1.0x       1.11x
Combined filter overhead   5.0x       2.3x       1.11x

Three observations:

  1. Qdrant’s baseline is slower. At 52ms, unfiltered queries are significantly slower than pgvector’s 2.5ms or ChromaDB’s 4.5ms. This is because we’re going through an HTTP API to a separate service, while pgvector runs in-process with PostgreSQL. Network overhead and serialization add latency.
  2. Filtering overhead is remarkably consistent. Category filter, year filter, combined filters all show roughly 1.1x overhead. It doesn’t matter if you’re filtering on one field or five. This is dramatically better than ChromaDB’s 3-8x overhead or even pgvector’s 2.3x overhead on text fields.
  3. The architecture is designed for filtered search. Qdrant doesn’t treat filtering as an afterthought. The entire system is built around the assumption that you’ll be filtering on metadata while doing vector similarity search. That focus shows in these numbers.

So when does Qdrant make sense? When your queries look like: “Find similar documents that are in category X, from year Y, with tag Z, and access level W.” If you’re doing lots of complex filtered searches, that consistent 1.1x overhead beats pgvector’s variable performance and absolutely crushes ChromaDB.

What Qdrant Gets Right

  • Filtering Efficiency: This is the big one. Complex filters don’t explode your query time. You can filter on multiple fields without worrying about performance falling off a cliff.
  • Purpose-Built Architecture: Everything about Qdrant is designed for vector search. The API makes sense, the filtering syntax is clear, and the performance characteristics are predictable.
  • Easy Development Setup: Running Qdrant in Docker for local development is straightforward. The API is well-documented, and the Python client works smoothly.
  • Scalability Path: When you outgrow a single instance, Qdrant offers distributed deployment options. You’re not locked into a single-server architecture.

What to Consider

  • Network Latency: Because Qdrant runs as a separate service, you pay the cost of HTTP requests. For latency-sensitive applications where every millisecond counts, that 52ms baseline might matter.
  • Operational Overhead: You need to run and maintain another service. It’s not as complex as managing a full database cluster, but it’s more than just using an existing PostgreSQL database.
  • Infrastructure Requirements: Qdrant needs its own resources (CPU, memory, disk). If you’re resource-constrained, adding another service might not be ideal.

When Qdrant Makes Sense

Qdrant is an excellent choice when:

  • You need to filter on multiple metadata fields frequently
  • Your filters are complex and unpredictable (users can combine many different fields)
  • You can accept ~50ms baseline latency in exchange for consistent filtering performance
  • You want a purpose-built vector database but prefer self-hosting over managed services
  • You’re comfortable running Docker containers or Kubernetes in production
  • Your scale is in the millions to tens of millions of vectors

Qdrant might not be the best fit when:

  • You need sub-10ms query latency and filtering is secondary
  • You’re trying to minimize infrastructure complexity (fewer moving parts)
  • You already have PostgreSQL and pgvector handles your filtering needs
  • You want a fully managed service (Qdrant offers cloud hosting, but we tested the self-hosted version)

Database 3: Pinecone

pgvector gave us speed but required PostgreSQL expertise. Qdrant gave us efficient filtering but required running another service. Now let’s try a completely different approach: a managed service where someone else handles all the infrastructure.

Pinecone is a vector database offered as a cloud service. You don’t install anything locally. You don’t manage servers. You don’t tune indexes or monitor disk space. You create an index through their API, upload your vectors, and query them. That’s it.

This simplicity comes with tradeoffs. You’re paying for the convenience, you’re dependent on their infrastructure, and every query goes over the internet to their servers. Let’s see how those tradeoffs play out in practice.

Setting Up Pinecone

First, you need a Pinecone account. Go to pinecone.io and sign up for the free tier. The free serverless plan is enough for this tutorial (hundreds of thousands of 1536-dim vectors and several indexes); check Pinecone’s current pricing page for exact limits.

Once you have your API key, create a .env file in your code/ folder:

PINECONE_API_KEY=your-api-key-here

Now let’s create our index and load data. Create load_pinecone.py:

from pinecone import Pinecone, ServerlessSpec
import numpy as np
import pandas as pd
import os
import time
from dotenv import load_dotenv

# Load API key
load_dotenv()
api_key = os.getenv('PINECONE_API_KEY')

# Initialize Pinecone
pc = Pinecone(api_key=api_key)

# Create index
index_name = "arxiv-papers-5k"

# Delete index if it exists
if index_name in pc.list_indexes().names():
    pc.delete_index(index_name)
    print(f"Deleted existing index: {index_name}")
    time.sleep(5)  # Wait for deletion to complete

# Create new index
pc.create_index(
    name=index_name,
    dimension=1536,  # Cohere embedding dimension
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"  # Free tier only supports us-east-1
    )
)
print(f"Created index: {index_name}")

# Wait for index to be ready
while not pc.describe_index(index_name).status['ready']:
    print("Waiting for index to be ready...")
    time.sleep(1)

# Connect to index
index = pc.Index(index_name)

# Load data
papers_df = pd.read_csv('/data/arxiv_papers_5k.csv')
embeddings = np.load('/data/embeddings_cohere_5k.npy')

print(f"\nLoading {len(papers_df)} papers into Pinecone...")

# Prepare vectors for upload
# Pinecone expects: (id, vector, metadata)
vectors = []
for idx, row in papers_df.iterrows():
    # Truncate authors field to avoid hitting metadata size limits
    # Pinecone has a 40KB metadata limit per vector
    authors = row['authors'][:500] if len(row['authors']) > 500 else row['authors']

    vector = {
        "id": row['id'],
        "values": embeddings[idx].tolist(),
        "metadata": {
            "title": row['title'],
            "authors": authors,
            "abstract": row['abstract'],
            "year": int(row['year']),
            "category": row['categories']
        }
    }
    vectors.append(vector)

    # Upload in batches of 100
    if len(vectors) == 100:
        index.upsert(vectors=vectors)
        print(f"  Uploaded {idx + 1} / {len(papers_df)} papers")
        vectors = []

# Upload remaining vectors
if vectors:
    index.upsert(vectors=vectors)
    print(f"  Uploaded {len(papers_df)} / {len(papers_df)} papers")

print("\nUpload complete!")

# Pinecone has eventual consistency
# Wait a bit for all vectors to be indexed
print("Waiting for indexing to complete...")
time.sleep(10)

# Verify
stats = index.describe_index_stats()
print(f"\nIndex '{index_name}' now has {stats['total_vector_count']} vectors")

A few things to notice:

  • Serverless Configuration: The free tier uses serverless infrastructure in AWS us-east-1. You don’t specify machine types or capacity. Pinecone handles scaling automatically.
  • Metadata Size Limit: Pinecone limits metadata to 40KB per vector. We truncate the authors field just to be safe. In practice, most metadata is well under this limit.
  • Batch Uploads: We upload 100 vectors at a time. This is a reasonable batch size that balances upload speed with API constraints.
  • Eventual Consistency: After uploading, we wait 10 seconds for indexing to complete. Pinecone doesn’t make vectors immediately queryable. They need to be indexed first.

Run the script:

python load_pinecone.py

You’ll see output like this:

Deleted existing index: arxiv-papers-5k
Created index: arxiv-papers-5k

Loading 5000 papers into Pinecone...
  Uploaded 100 / 5000 papers
  Uploaded 200 / 5000 papers
  Uploaded 300 / 5000 papers
  ...
  Uploaded 4900 / 5000 papers
  Uploaded 5000 / 5000 papers

Upload complete!
Waiting for indexing to complete...

Index 'arxiv-papers-5k' now has 5000 vectors
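
If the fixed 10-second wait makes you nervous, a slightly more robust pattern is to poll the index stats until the vector count reaches what you uploaded. Here’s a minimal sketch using the same index object; the 5000 target comes from our dataset size:

import time

# Poll until all uploaded vectors are queryable
expected_count = 5000
while index.describe_index_stats()['total_vector_count'] < expected_count:
    print("Still indexing...")
    time.sleep(2)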

Querying with Pinecone

Now let’s run our queries. Create query_pinecone.py:

from pinecone import Pinecone
import numpy as np
import os
from dotenv import load_dotenv

# Load API key and connect
load_dotenv()
api_key = os.getenv('PINECONE_API_KEY')
pc = Pinecone(api_key=api_key)
index = pc.Index("arxiv-papers-5k")

# Get a query vector from a machine learning paper
results = index.query(
    vector=[0] * 1536,  # Dummy vector just to use filter
    filter={"category": {"$eq": "cs.LG"}},
    top_k=1,
    include_metadata=True,
    include_values=True
)

query_match = results['matches'][0]
query_vector = query_match['values']
query_metadata = query_match['metadata']

print("Query paper:")
print(f"  Title: {query_metadata['title']}")
print(f"  Category: {query_metadata['category']}")
print(f"  Year: {query_metadata['year']}")
print()

# Scenario 1: Unfiltered similarity search
print("=" * 80)
print("Scenario 1: Unfiltered Similarity Search")
print("=" * 80)

results = index.query(
    vector=query_vector,
    top_k=6,  # Get 6 so we can skip the query paper itself
    include_metadata=True
)

for match in results['matches'][1:6]:  # Skip first result (query paper)
    metadata = match['metadata']
    print(f"  {metadata['category']:8} {metadata['year']} | {match['score']:.4f} | {metadata['title'][:60]}")
print()

# Scenario 2: Filter by category
print("=" * 80)
print("Scenario 2: Category Filter (cs.LG only)")
print("=" * 80)

results = index.query(
    vector=query_vector,
    filter={"category": {"$eq": "cs.LG"}},
    top_k=6,
    include_metadata=True
)

for match in results['matches'][1:6]:
    metadata = match['metadata']
    print(f"  {metadata['category']:8} {metadata['year']} | {match['score']:.4f} | {metadata['title'][:60]}")
print()

# Scenario 3: Filter by year range
print("=" * 80)
print("Scenario 3: Year Filter (2025 or later)")
print("=" * 80)

results = index.query(
    vector=query_vector,
    filter={"year": {"$gte": 2025}},
    top_k=5,
    include_metadata=True
)

for match in results['matches']:
    metadata = match['metadata']
    print(f"  {metadata['category']:8} {metadata['year']} | {match['score']:.4f} | {metadata['title'][:60]}")
print()

# Scenario 4: Combined filters
print("=" * 80)
print("Scenario 4: Combined Filter (cs.LG AND year >= 2025)")
print("=" * 80)

results = index.query(
    vector=query_vector,
    filter={
        "$and": [
            {"category": {"$eq": "cs.LG"}},
            {"year": {"$gte": 2025}}
        ]
    },
    top_k=5,
    include_metadata=True
)

for match in results['matches']:
    metadata = match['metadata']
    print(f"  {metadata['category']:8} {metadata['year']} | {match['score']:.4f} | {metadata['title'][:60]}")

A couple of things about Pinecone’s API:

  • Filter Syntax: Pinecone uses MongoDB-style operators ($eq, $gte, $and). If you’ve worked with MongoDB, this will feel familiar.
  • Default Namespace: Pinecone uses namespaces to partition vectors within an index. If you don’t specify one, vectors go into the default namespace (an empty string, “”). This caught us initially because we expected a namespace called “default” (see the sketch after this list).
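
If you do want to partition vectors (say, per tenant or per environment), both upsert and query accept a namespace argument. A quick sketch with a hypothetical namespace name:

# Write to and read from an explicit namespace instead of the default one
index.upsert(vectors=vectors, namespace="experiments")
results = index.query(
    vector=query_vector,
    namespace="experiments",
    top_k=5,
    include_metadata=True
)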

Run it:

python query_pinecone.py

You’ll see output like this:

Query paper:
  Title: Deep Reinforcement Learning for Autonomous Navigation
  Category: cs.LG
  Year: 2025

================================================================================
Scenario 1: Unfiltered Similarity Search
================================================================================
  cs.LG    2024 | 0.7866 | Policy Gradient Methods for Robot Control
  cs.LG    2025 | 0.7713 | Multi-Agent Reinforcement Learning in Games
  cs.CV    2024 | 0.7555 | Visual Navigation Using Deep Learning
  cs.LG    2023 | 0.7409 | Model-Free Reinforcement Learning Approaches
  cs.CL    2025 | 0.7322 | Reinforcement Learning for Dialogue Systems

================================================================================
Scenario 2: Category Filter (cs.LG only)
================================================================================
  cs.LG    2024 | 0.7866 | Policy Gradient Methods for Robot Control
  cs.LG    2025 | 0.7713 | Multi-Agent Reinforcement Learning in Games
  cs.LG    2023 | 0.7409 | Model-Free Reinforcement Learning Approaches
  cs.LG    2024 | 0.7266 | Deep Q-Networks for Atari Games
  cs.LG    2025 | 0.7144 | Actor-Critic Methods in Continuous Control

================================================================================
Scenario 3: Year Filter (2025 or later)
================================================================================
  cs.LG    2025 | 0.7713 | Multi-Agent Reinforcement Learning in Games
  cs.CL    2025 | 0.7322 | Reinforcement Learning for Dialogue Systems
  cs.LG    2025 | 0.7144 | Actor-Critic Methods in Continuous Control
  cs.CV    2025 | 0.7077 | Self-Supervised Learning for Visual Tasks
  cs.DB    2025 | 0.6988 | Optimizing Database Queries with Learning

================================================================================
Scenario 4: Combined Filter (cs.LG AND year >= 2025)
================================================================================
  cs.LG    2025 | 0.7713 | Multi-Agent Reinforcement Learning in Games
  cs.LG    2025 | 0.7144 | Actor-Critic Methods in Continuous Control
  cs.LG    2025 | 0.6855 | Transfer Learning in Reinforcement Learning
  cs.LG    2025 | 0.6733 | Exploration Strategies in Deep RL
  cs.LG    2025 | 0.6599 | Reward Shaping for Complex Tasks

Measuring Performance

One last benchmark. Create benchmark_pinecone.py:

from pinecone import Pinecone
import numpy as np
import time
import os
from dotenv import load_dotenv

# Load API key and connect
load_dotenv()
api_key = os.getenv('PINECONE_API_KEY')
pc = Pinecone(api_key=api_key)
index = pc.Index("arxiv-papers-5k")

# Get a query vector
results = index.query(
    vector=[0] * 1536,
    filter={"category": {"$eq": "cs.LG"}},
    top_k=1,
    include_values=True
)
query_vector = results['matches'][0]['values']

def benchmark_query(query_filter, name, iterations=100):
    """Run a query multiple times and measure average latency"""
    # Warmup
    for _ in range(5):
        index.query(
            vector=query_vector,
            filter=query_filter,
            top_k=10,
            include_metadata=True
        )

    # Actual measurement
    times = []
    for _ in range(iterations):
        start = time.time()
        index.query(
            vector=query_vector,
            filter=query_filter,
            top_k=10,
            include_metadata=True
        )
        times.append((time.time() - start) * 1000)  # Convert to ms

    avg_time = np.mean(times)
    std_time = np.std(times)
    return avg_time, std_time

print("Benchmarking Pinecone performance...")
print("=" * 80)

# Scenario 1: Unfiltered
avg, std = benchmark_query(None, "Unfiltered")
print(f"Unfiltered search:        {avg:.2f}ms (±{std:.2f}ms)")
baseline = avg

# Scenario 2: Category filter
avg, std = benchmark_query({"category": {"$eq": "cs.LG"}}, "Category filter")
overhead = avg / baseline
print(f"Category filter:          {avg:.2f}ms (±{std:.2f}ms) | {overhead:.2f}x overhead")

# Scenario 3: Year filter
avg, std = benchmark_query({"year": {"$gte": 2025}}, "Year filter")
overhead = avg / baseline
print(f"Year filter (>= 2025):    {avg:.2f}ms (±{std:.2f}ms) | {overhead:.2f}x overhead")

# Scenario 4: Combined filters
avg, std = benchmark_query(
    {"$and": [{"category": {"$eq": "cs.LG"}}, {"year": {"$gte": 2025}}]},
    "Combined filter"
)
overhead = avg / baseline
print(f"Combined filter:          {avg:.2f}ms (±{std:.2f}ms) | {overhead:.2f}x overhead")

print("=" * 80)

Run this:

python benchmark_pinecone.py

Here’s what we found (your numbers will vary based on your distance from AWS us-east-1):

Benchmarking Pinecone performance...
================================================================================
Unfiltered search:        87.45ms (±2.15ms)
Category filter:          88.41ms (±3.12ms) | 1.01x overhead
Year filter (>= 2025):    88.69ms (±2.84ms) | 1.01x overhead
Combined filter:          87.18ms (±2.67ms) | 1.00x overhead
================================================================================

What the Numbers Tell Us

Now let’s look at all four databases:

Scenario                  | ChromaDB | pgvector | Qdrant | Pinecone
Unfiltered                | 4.5ms    | 2.5ms    | 52ms   | 87ms
Category filter overhead  | 3.3x     | 2.3x     | 1.09x  | 1.01x
Year filter overhead      | 8.0x     | 1.0x     | 1.11x  | 1.01x
Combined filter overhead  | 5.0x     | 2.3x     | 1.11x  | 1.00x

Two patterns emerge:

  1. Filtering overhead is essentially zero. Pinecone shows 1.00-1.01x overhead across all filter types. Category filters, year filters, combined filters all take the same time as unfiltered queries. Pinecone’s infrastructure handles filtering so efficiently that it’s invisible in the measurements.
  2. Network latency dominates baseline performance. At 87ms, Pinecone is the slowest for unfiltered queries. But this isn’t because Pinecone is slow at vector search. It’s because we’re sending queries from Mexico City to AWS us-east-1 over the internet. Every query pays the cost of network round-trip time plus serialization/deserialization.

If you ran this benchmark from Virginia (close to us-east-1), your baseline would be much lower. If you ran it from Tokyo, it would be higher. The filtering overhead would stay at 1.0x regardless.
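
If you want a rough sense of how much of your own baseline is network rather than search, one quick sanity check is to time a lightweight data-plane call against the same index and compare it with a full query. This sketch reuses the index and query_vector objects from benchmark_pinecone.py above; treat the numbers as an estimate, not a precise breakdown.

import time

def avg_ms(fn, iterations=20):
    """Average wall-clock latency of fn() in milliseconds"""
    times = []
    for _ in range(iterations):
        start = time.time()
        fn()
        times.append((time.time() - start) * 1000)
    return sum(times) / len(times)

# describe_index_stats() does almost no work server-side, so its latency is
# close to your network round-trip floor; the query adds search and payload.
stats_ms = avg_ms(lambda: index.describe_index_stats())
query_ms = avg_ms(lambda: index.query(vector=query_vector, top_k=10))

print(f"Lightweight call: {stats_ms:.1f}ms  (~ round-trip floor)")
print(f"Full query:       {query_ms:.1f}ms  (round-trip + search + payload)")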

What Pinecone Gets Right

  • Zero Operational Overhead: You don’t install anything. You don’t manage servers. You don’t tune indexes. You don’t monitor disk space or memory usage. You just use the API.
  • Automatic Scaling: Pinecone’s serverless tier scales automatically based on your usage. You don’t provision capacity upfront. You don’t worry about running out of resources.
  • Filtering Performance: Complex filters don’t slow down queries. Filter on one field or ten fields, it doesn’t matter. The overhead is invisible.
  • High Availability: Pinecone handles replication, failover, and uptime. You don’t build these capabilities yourself.

What to Consider

  • Network Latency: Every query goes over the internet to Pinecone’s servers. For latency-sensitive applications, that baseline 87ms (or whatever your network adds) might be too much.
  • Cost Structure: The free tier is great for learning, but production usage costs money. Pinecone charges based on usage and storage, and the pricing model differs between its serverless and pod-based offerings. You need to understand how those costs scale with your workload.
  • Vendor Lock-In: Your data lives in Pinecone’s infrastructure. Migrating to a different solution means extracting all your vectors and rebuilding indexes elsewhere. This isn’t impossible, but it’s not trivial either.
  • Limited Control: You can’t tune the underlying index parameters. You can’t see how Pinecone implements filtering. You get what they give you, which is usually good but might not be optimal for your specific case.

When Pinecone Makes Sense

Pinecone is an excellent choice when:

  • You want zero operational overhead (no servers to manage)
  • You want your team focused on application features, not database operations
  • You can accept ~100ms baseline latency for the convenience
  • You need heavy filtering on multiple metadata fields
  • You want automatic scaling without capacity planning
  • You’re building a new application without existing infrastructure constraints
  • Your scale could grow unpredictably (Pinecone handles this automatically)

Pinecone might not be the best fit when:

  • You need sub-10ms query latency
  • You want to minimize ongoing costs (self-hosting can be cheaper at scale)
  • You prefer to control your infrastructure completely
  • You already have existing database infrastructure you can leverage
  • You’re uncomfortable with vendor lock-in

Comparing All Four Approaches

We’ve now tested four different ways to handle vector search with metadata filtering. Let’s look at what we learned.

The Performance Picture

Here’s the complete comparison:

Database | Unfiltered | Category Overhead | Year Overhead | Combined Overhead | Setup Complexity | Ops Overhead
ChromaDB | 4.5ms      | 3.3x              | 8.0x          | 5.0x              | Trivial          | None
pgvector | 2.5ms      | 2.3x              | 1.0x          | 2.3x              | Moderate         | Moderate
Qdrant   | 52ms       | 1.09x             | 1.11x         | 1.11x             | Easy             | Minimal
Pinecone | 87ms       | 1.01x             | 1.01x         | 1.00x             | Trivial          | None

Three Different Strategies

Looking at these numbers, three distinct strategies emerge:

Strategy 1: Optimize for Raw Speed (pgvector)

pgvector wins on baseline query speed at 2.5ms. It runs in-process with PostgreSQL, so there’s no network overhead. If your primary concern is getting results back as fast as possible and filtering is occasional, pgvector delivers.

The catch: text filtering adds 2.3x overhead. Integer filtering is essentially free, but if you’re doing complex text filters frequently, that overhead accumulates.
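
To make that concrete, here’s a minimal sketch of the two filter shapes, assuming a psycopg2 connection and a table with embedding vector(1536), category text, and year int columns (swap in whatever table, column, and connection details you used when loading the data):

import psycopg2

conn = psycopg2.connect("dbname=arxiv user=postgres host=localhost")  # adjust to your setup
cur = conn.cursor()

# pgvector accepts a vector literal like '[0.1,0.2,...]'; use a real embedding in practice
query_vector = '[' + ','.join(['0.1'] * 1536) + ']'

# Integer filter: the cheap case in our benchmark (~1.0x overhead)
cur.execute("""
    SELECT title FROM papers
    WHERE year >= 2025
    ORDER BY embedding <=> %s::vector
    LIMIT 10
""", (query_vector,))
print(cur.fetchall())

# Text filter: the ~2.3x case
cur.execute("""
    SELECT title FROM papers
    WHERE category = 'cs.LG'
    ORDER BY embedding <=> %s::vector
    LIMIT 10
""", (query_vector,))
print(cur.fetchall())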

Strategy 2: Optimize for Filtering Consistency (Qdrant)

Qdrant accepts a slower baseline (52ms) but delivers remarkably consistent filtering performance. Whether you filter on one field or five, category or year, simple or complex, you get roughly 1.1x overhead.

The catch: you’re running another service, and that baseline 52ms includes HTTP API overhead. For latency-critical applications, that might be too much.
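
For reference, here’s what stacking filters looks like with qdrant-client. This is a sketch that assumes a collection named "arxiv_papers" with category and year payload fields; use your own collection name and a real query embedding.

from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

client = QdrantClient(host="localhost", port=6333)

# Each extra condition is just another entry in `must`; overhead stays ~1.1x
results = client.search(
    collection_name="arxiv_papers",        # illustrative name
    query_vector=[0.1] * 1536,             # placeholder; pass a real embedding
    query_filter=Filter(must=[
        FieldCondition(key="category", match=MatchValue(value="cs.LG")),
        FieldCondition(key="year", range=Range(gte=2025)),
    ]),
    limit=10,
)
for hit in results:
    print(f"{hit.score:.4f}", hit.payload.get("title"))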

Strategy 3: Optimize for Convenience (Pinecone)

Pinecone gives you zero operational overhead and essentially zero filtering overhead (1.0x). You don’t manage anything. You just use an API.

The catch: network latency to their cloud infrastructure means ~87ms baseline queries (from our location). The convenience costs you in baseline latency.

The Decision Framework

So which one should you choose? It depends entirely on your constraints.

Choose pgvector when:

  • Raw query speed is critical (need sub-5ms)
  • You already have PostgreSQL infrastructure
  • Your team has strong SQL and PostgreSQL skills
  • You primarily filter on integer fields (dates, IDs, counts)
  • Your scale is moderate (up to a few million vectors)
  • You’re comfortable with PostgreSQL operational tasks (VACUUM, index maintenance)

Choose Qdrant when:

  • You need predictable performance regardless of filter complexity
  • You filter on many fields with unpredictable combinations
  • You can accept ~50ms baseline latency
  • You want self-hosting but need better filtering than ChromaDB
  • You’re comfortable with Docker or Kubernetes deployment
  • Your scale is millions to tens of millions of vectors

Choose Pinecone when:

  • You want zero operational overhead
  • You want your team focused on features, not database operations
  • You can accept ~100ms baseline latency (varies by geography)
  • You need heavy filtering on multiple metadata fields
  • You want automatic scaling without capacity planning
  • Your scale could grow unpredictably

Choose ChromaDB when:

  • You’re prototyping and learning
  • You need simple local development
  • Filtering is occasional, not critical path
  • You want minimal setup complexity
  • Your scale is small (thousands to tens of thousands of vectors)

The Tradeoffs That Matter

Speed vs Filtering: pgvector is fastest but filtering costs you. Qdrant and Pinecone accept slower baselines for better filtering.

Control vs Convenience: Self-hosting (pgvector, Qdrant) gives you control but requires operational expertise. Managed services (Pinecone) remove operational burden but limit control.

Infrastructure: pgvector requires PostgreSQL. Qdrant needs container orchestration. Pinecone just needs an API key.

Geography: Local databases (pgvector, Qdrant) don’t care where you are. Cloud services (Pinecone) add latency based on your distance from their data centers.

No Universal Answer

There’s no “best” database here. Each one makes different tradeoffs. The right choice depends on your specific situation:

  • What are your query volume and latency requirements?
  • How complex are your filters?
  • What infrastructure do you already have?
  • What expertise does your team have?
  • What’s your budget for operational overhead?

These questions matter more than any benchmark number.

What We Didn’t Test

Before you take these numbers as absolute truth, let’s be honest about what we didn’t measure.

Recall

All four databases use approximate nearest-neighbor indexes for speed. That means queries are fast, but they can sometimes miss the true closest results, especially when filters are applied. In real applications, you should measure not just latency but also result quality (recall), and tune settings if needed.
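
Here’s one rough way to check recall if you still have the raw embeddings in memory: compute the exact top-k by brute force and count how many of those IDs the database actually returned. The embeddings matrix and ids list below are assumptions about your own loading script, not something we built in this tutorial.

import numpy as np

def exact_top_k(query_vec, embeddings, ids, k=10):
    """Brute-force cosine similarity over every vector: the ground truth"""
    q = query_vec / np.linalg.norm(query_vec)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    top = np.argsort(-(e @ q))[:k]
    return {ids[i] for i in top}

def recall_at_k(ann_ids, query_vec, embeddings, ids, k=10):
    """Fraction of the true top-k that the index actually returned"""
    truth = exact_top_k(query_vec, embeddings, ids, k)
    return len(truth & set(ann_ids)) / k

# Example usage against any of the four databases:
# ann_ids = [m['id'] for m in index.query(vector=query_vector, top_k=10)['matches']]
# print(f"recall@10: {recall_at_k(ann_ids, np.array(query_vector), embeddings, ids):.2f}")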

Scale

We tested 5,000 vectors. That’s useful for learning, but it’s small. Real applications might have 50,000 or 500,000 or 5 million vectors. Performance characteristics can change at different scales.

The patterns we observed (pgvector’s speed, Qdrant’s consistent filtering, Pinecone’s zero overhead filters) likely hold at larger scales. But the absolute numbers will be different. Run your own benchmarks at your target scale.

Configuration

All databases used default settings. We didn’t tune HNSW parameters. We didn’t experiment with different index types. Tuned configurations could show different performance characteristics.

For learning, defaults make sense. For production, you’ll want to tune based on your specific data and query patterns.
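
As one concrete example of what tuning looks like, pgvector exposes HNSW build parameters on the index and a query-time search knob. The values below are illustrative starting points, not recommendations, and the index/table names are placeholders (Qdrant has analogous settings in its collection config, while Pinecone’s serverless tier doesn’t expose them).

import psycopg2

conn = psycopg2.connect("dbname=arxiv user=postgres host=localhost")  # adjust to your setup
cur = conn.cursor()

# Build-time: m controls graph connectivity, ef_construction controls build effort
cur.execute("""
    CREATE INDEX IF NOT EXISTS papers_embedding_hnsw
    ON papers USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64)
""")
conn.commit()

# Query-time: higher ef_search improves recall at the cost of latency (session-scoped)
cur.execute("SET hnsw.ef_search = 100")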

Geographic Variance

We ran Pinecone tests from Mexico City to AWS us-east-1. If you’re in Virginia, your latency will be lower. If you’re in Tokyo, it will be higher. With self-hosted pgvector or Qdrant, you can deploy the database close to your application and keep geographic latency under your control.

Load Patterns

We measured queries at one moment in time with consistent load. Production systems experience variable query patterns, concurrent users, and resource contention. Real performance under real production load might differ.

Write Performance

We focused on query performance. We didn’t benchmark bulk updates, deletions, or reindexing operations. If you’re constantly updating vectors, write performance matters too.

Advanced Features

We didn’t test hybrid search with BM25, learned rerankers, multi-vector search, or other advanced features some databases offer. These capabilities might influence your choice.

What’s Next

You now have hands-on experience with four different vector databases. You understand their performance characteristics, their tradeoffs, and when to choose each one.

More importantly, you have a framework for thinking about database selection. It’s not about finding the “best” database. It’s about matching your requirements to each database’s strengths.

When you build your next application:

  1. Start with your requirements. What are your latency needs? How complex are your filters? What scale are you targeting?
  2. Match requirements to database characteristics. Need speed? Consider pgvector. Need consistent filtering? Look at Qdrant. Want zero ops? Try Pinecone.
  3. Prototype quickly. Spin up a test with your actual data and query patterns. Measure what matters for your use case.
  4. Be ready to change. Your requirements might evolve. The database that works at 10,000 vectors might not work at 10 million. That’s fine. You can migrate.

The vector database landscape is evolving rapidly. New options appear. Existing options improve. The fundamentals we covered here (understanding tradeoffs, measuring what matters, matching requirements to capabilities) will serve you regardless of which specific databases you end up using.

In our next tutorial, we’ll look at semantic caching and memory patterns for AI applications. We’ll use the knowledge from this tutorial to choose the right database for different caching scenarios.

Until then, experiment with these databases. Load your own data. Run your own queries. See how they behave with your specific workload. That hands-on experience is more valuable than any benchmark we could show you.


Key Takeaways

  • Performance Patterns Are Clear: pgvector delivers 2.5ms baseline (fastest), Qdrant 52ms (moderate with HTTP overhead), Pinecone 87ms (network latency dominates). Each optimizes for different goals.
  • Filtering Overhead Varies Dramatically: ChromaDB shows 3-8x overhead. pgvector shows 2.3x for text but 1.0x for integers. Qdrant maintains consistent 1.1x regardless of filter complexity. Pinecone achieves essentially zero filtering overhead (1.0x).
  • Three Distinct Strategies Emerge: Optimize for raw speed (pgvector), optimize for filtering consistency (Qdrant), or optimize for convenience (Pinecone). No universal "best" choice exists.
  • Purpose-Built Databases Excel at Filtering: Qdrant and Pinecone, designed specifically for filtered vector search, handle complex filters without performance degradation. pgvector leverages PostgreSQL's strengths but wasn't built primarily for this use case.
  • Operational Overhead Is Real: pgvector requires PostgreSQL expertise (VACUUM, index maintenance). Qdrant needs container orchestration. Pinecone removes ops but introduces vendor dependency. Match operational capacity to database choice.
  • Geography Matters for Cloud Services: Pinecone's 87ms baseline from Mexico City to AWS us-east-1 is dominated by network latency. Self-hosted options (pgvector, Qdrant) don't have this variance.
  • Scale Changes Everything: We tested 5,000 vectors. Behavior at 50k, 500k, or 5M vectors will differ. The patterns we observed likely hold, but absolute numbers will change. Always benchmark at your target scale.
  • Decision Frameworks Beat Feature Lists: Choose based on your constraints: latency requirements, filter complexity, existing infrastructure, team expertise, and operational capacity. Not on marketing claims.
  • Prototyping Beats Speculation: The fastest way to know if a database works for you is to load your actual data and run your actual queries. Benchmarks guide thinking but can't replace hands-on testing.

About the author

Mike Levy

Mike is a life-long learner who is passionate about mathematics, coding, and teaching. When he's not sitting at the keyboard, he can be found in his garden or at a natural hot spring.