Vector database benchmarks should measure more than raw speed. A useful benchmark shows whether the system finds the right neighbors, how quickly it returns them, how much concurrent load it can handle, and whether the results are semantically useful for the application.
The most common benchmark dimensions are recall, latency, throughput, and semantic search quality.
Short Answer
Benchmark vector databases by testing four things together:
- recall: whether approximate search finds the expected nearest neighbors
- latency: how long individual queries take
- throughput: how many queries per second the system handles under concurrency
- semantic search quality: whether returned results satisfy real user intent
A benchmark that measures only one of these can be misleading.
Why Vector Database Benchmarks Are Different
Traditional database benchmarks often focus on request latency, throughput, storage, and write performance.
Vector database benchmarks must also measure approximation quality. Most large vector search systems use approximate nearest neighbor indexes. These indexes trade a small amount of exactness for much better speed at scale.
That means a vector database can look fast simply because it is doing less search work. The benchmark must show what quality was achieved at that speed.
Recall
Recall measures what fraction of the expected neighbors appear in the returned results.
In ANN benchmarking, recall is usually measured by comparing approximate search results against exact brute-force nearest-neighbor results.
If exact search finds the true top 10 vectors and the ANN index returns 9 of them, the query has 90% Recall@10.
Recall at K
Recall is measured at a cutoff, written as Recall@K.
Recall@10 evaluates the top 10 results. Recall@100 evaluates the top 100 results.
The cutoff should match the application. A recommendation API returning 20 items should not be evaluated only with Recall@100. A reranking pipeline that retrieves 100 candidates should not be evaluated only with Recall@10.
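The Recall@K definition above can be sketched in a few lines. This is a minimal illustration, not any particular database's API: the function names recall_at_k and mean_recall_at_k are hypothetical, and results are assumed to be plain lists of vector IDs ordered from nearest to farthest.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k neighbors that also appear in the approximate top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    approx_top = set(approx_ids[:k])
    return len(exact_top & approx_top) / len(exact_top)


def mean_recall_at_k(approx_results, exact_results, k):
    """Average Recall@K over a query set (one ID list per query)."""
    scores = [recall_at_k(a, e, k) for a, e in zip(approx_results, exact_results)]
    return sum(scores) / len(scores)


# Exact search finds IDs 1..10; the ANN index returns 9 of them plus one miss.
exact = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
approx = [1, 2, 3, 4, 5, 6, 7, 8, 9, 42]
print(recall_at_k(approx, exact, 10))  # 0.9
```

Reporting the mean over the whole query set, at the same K the application uses, keeps a handful of easy queries from hiding poor recall elsewhere.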
Ground Truth
ANN recall needs a ground truth set.
For index-level benchmarks, ground truth is usually produced with exact search over the same vectors. This is slower, but it gives the benchmark a reference answer.
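As a rough sketch of how such a reference answer can be produced, the following scans every vector and keeps the k closest. It assumes Euclidean distance and uses list positions as IDs; exact_top_k and l2_distance are illustrative names, and a real benchmark would normally use a vectorized or batched implementation with the same metric as the index under test.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_top_k(query, vectors, k, distance=l2_distance):
    """Return the IDs (list positions) of the k nearest vectors by exhaustive scan."""
    scored = ((distance(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in heapq.nsmallest(k, scored)]


vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
ground_truth = exact_top_k([0.1, 0.0], vectors, k=2)
print(ground_truth)  # [0, 1]
```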
For semantic search quality, ground truth may come from human relevance labels, click logs, expert review, or task-specific answer sets.
Latency
Latency measures how long a request takes.
Vector database latency should include the work users actually wait for: query processing, index traversal, candidate scoring, object retrieval, filtering, reranking when used, and network overhead if the benchmark runs through the real service path.
Microbenchmarks that return only vector IDs can be useful for index research, but they do not always represent production user latency.
Mean Latency
Mean latency is the average response time.
It is easy to understand, but it can hide bad tail behavior. A system with excellent average latency may still produce slow requests under load.
Mean latency should be reported, but it should not be the only latency number.
P95 and P99 Latency
Percentile latency shows the slow side of the distribution.
P95 latency means 95% of requests completed at or below that time. P99 latency means 99% completed at or below that time.
P99 is especially important for production search because users and downstream applications notice tail latency.
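A small sketch makes the gap between mean and tail latency concrete. It assumes the nearest-rank percentile definition (other tools interpolate, so their numbers can differ slightly), and the sample latencies are invented for illustration rather than measured.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Mean plus the percentiles a benchmark report would typically include."""
    return {
        "mean": statistics.fmean(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }


# Illustrative data: 98 fast requests and 2 very slow ones.
latencies_ms = [10.0] * 98 + [500.0] * 2
summary = latency_summary(latencies_ms)
print(summary)  # mean 19.8, p50 10.0, p95 10.0, p99 500.0
```

In this made-up sample the mean and even P95 look healthy, while P99 exposes the slow requests.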
Throughput
Throughput measures how many queries the system can handle per second.
It is often reported as QPS, or queries per second.
Throughput must be measured under concurrency. A single-threaded latency test cannot reliably predict multi-user throughput because contention, locks, memory bandwidth, CPU scheduling, disk access, and network behavior can change under load.
Latency vs Throughput
Latency and throughput are related, but they are not the same.
Latency describes one request. Throughput describes the rate of many requests.
A system may have low latency at light load and much worse latency when pushed to high QPS. A good benchmark reports both the achieved QPS and the latency distribution at that QPS.
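One way to capture both numbers in the same run is a small closed-loop load generator, sketched below. It is a minimal illustration under several assumptions: run_load_test and fake_query are hypothetical names, the query function is a sleep standing in for a real client call, and Python threads are used because client-side work is assumed to be I/O-bound. A closed-loop driver like this sends a new request only when a worker is free, so it can understate latency under overload compared with an open-loop generator.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(query_fn, queries, concurrency):
    """Run all queries on a fixed number of worker threads; return QPS and per-query latencies."""
    def timed(q):
        start = time.perf_counter()
        query_fn(q)
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed, queries))
    wall = time.perf_counter() - wall_start
    return {
        "concurrency": concurrency,
        "qps": len(queries) / wall,
        "latencies_s": latencies,
    }


def fake_query(q):
    """Stand-in for a real search call; replace with the client under test."""
    time.sleep(0.005)


result = run_load_test(fake_query, list(range(40)), concurrency=8)
print(f"{result['qps']:.0f} QPS at concurrency {result['concurrency']}")
```

Repeating the run at several concurrency levels and recording the latency distribution at each one shows where latency starts to degrade as QPS rises.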
Semantic Search Quality
Semantic search quality asks whether results are actually useful.
ANN recall can be high while semantic relevance is poor if the embedding model, chunking strategy, filters, or ranking logic are wrong.
For user-facing search and RAG, semantic quality should be measured with relevance labels, nDCG, MRR, precision at K, recall at K, answer success, or human review.
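Two of these metrics, nDCG and MRR, can be sketched as follows. This is a simplified version under stated assumptions: relevance labels are a dict from document ID to a graded score (0 or missing means not relevant), the gain is linear rather than exponential, and the function names are illustrative.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_ids, relevance, k):
    """nDCG@K: DCG of the returned ranking divided by the DCG of the ideal ranking."""
    gains = [relevance.get(d, 0) for d in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def reciprocal_rank(ranked_ids, relevance):
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, d in enumerate(ranked_ids, start=1):
        if relevance.get(d, 0) > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings, relevances):
    """MRR over a query set: one ranking and one label dict per query."""
    scores = [reciprocal_rank(r, rel) for r, rel in zip(rankings, relevances)]
    return sum(scores) / len(scores)


labels = {"a": 3, "b": 2, "c": 1}
print(ndcg_at_k(["a", "b", "c"], labels, 3))   # 1.0
print(reciprocal_rank(["x", "a"], labels))     # 0.5
```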
Recall Is Not the Whole Quality Story
Index recall only says whether approximate search matched exact vector search.
It does not say whether exact vector search was aligned with user intent.
That is why production benchmarks should include both ANN recall and application-level relevance metrics.
Benchmark Inputs
A benchmark should specify the data and workload clearly.
- number of vectors
- vector dimensions
- distance metric
- embedding model
- result limit
- filter patterns
- query distribution
- concurrency level
- hardware profile
- index configuration
- whether compression is enabled
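One way to keep these inputs attached to every result is to record them in a single structure that is saved alongside the numbers. The sketch below is only an illustration: BenchmarkSpec and its field names are hypothetical, and every value shown (model name, hardware, index settings) is a placeholder rather than a recommendation.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class BenchmarkSpec:
    """Everything needed to interpret or reproduce one benchmark run."""
    num_vectors: int
    dimensions: int
    distance_metric: str
    embedding_model: str
    result_limit: int
    filter_patterns: tuple
    query_distribution: str
    concurrency: int
    hardware_profile: str
    index_config: dict = field(default_factory=dict)
    compression_enabled: bool = False


# All values below are placeholders for illustration.
spec = BenchmarkSpec(
    num_vectors=1_000_000,
    dimensions=768,
    distance_metric="cosine",
    embedding_model="example-embedding-model",
    result_limit=10,
    filter_patterns=("none", "tenant_id equals"),
    query_distribution="sampled from production logs",
    concurrency=16,
    hardware_profile="8 vCPU, 32 GiB RAM",
    index_config={"type": "graph", "build_effort": 128, "search_effort": 64},
)

print(json.dumps(asdict(spec), indent=2, sort_keys=True))
```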
Dataset Choice
Dataset choice strongly affects benchmark results.
Image embeddings, text embeddings, product embeddings, multilingual content, code embeddings, and dense RAG chunks can behave differently.
A public benchmark is useful for comparison, but a production decision should include a dataset close to the real corpus.
Result Limit
The result limit changes benchmark behavior.
Returning 10 results is not the same as returning 100 results. Larger limits can require broader search, more candidate scoring, more object retrieval, and more network transfer.
Benchmarks should test the limits the application will actually use.
Index Configuration
Index settings directly affect benchmark outcomes.
For graph indexes, build-time settings can change graph quality, memory usage, import time, and recall. Query-time settings can trade latency for recall.
A benchmark should report configuration values instead of only reporting final numbers.
Compression Settings
Compression can improve memory use and throughput, but it may affect recall.
Some compression strategies work best with over-fetching or rescoring. Benchmarks should state whether results are scored with compressed vectors, full vectors, or a two-stage process.
Compare compressed and uncompressed systems at similar recall targets, not just similar latency.
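The over-fetch and rescore idea can be illustrated with a toy example. The sketch below assumes a deliberately crude compression (snapping each component to a grid) as a stand-in for real techniques such as product or scalar quantization. The names quantize, top_k, and two_stage_search are hypothetical, and the vectors are chosen so that compression makes two of them indistinguishable.

```python
import heapq
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def quantize(v, step=0.5):
    """Toy compression: snap each component to a grid of the given step size."""
    return [round(x / step) * step for x in v]


def top_k(query, vectors, ids, k):
    """IDs of the k nearest vectors among the given candidate IDs."""
    return [i for _, i in heapq.nsmallest(k, ((l2(query, vectors[i]), i) for i in ids))]


def two_stage_search(query, full, compressed, k, overfetch):
    """Stage 1: fetch k * overfetch candidates using compressed vectors.
    Stage 2: rescore those candidates with full vectors and keep the best k."""
    candidates = top_k(query, compressed, range(len(compressed)), k * overfetch)
    return top_k(query, full, candidates, k)


full = [[0.24, 0.0], [0.2, 0.0], [0.9, 0.0], [3.0, 3.0]]
compressed = [quantize(v) for v in full]
query = [0.0, 0.0]

compressed_only = top_k(query, compressed, range(len(full)), 1)
rescored = two_stage_search(query, full, compressed, k=1, overfetch=3)
print(compressed_only, rescored)  # [0] [1]
```

Here the compressed-only search returns ID 0 because IDs 0 and 1 collapse to the same compressed vector, while rescoring the over-fetched candidates recovers the true nearest neighbor, ID 1. The two modes produce different recall at different cost, which is why the benchmark should state which one it measured.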
Filtering and Metadata
Many production vector searches include metadata filters.
Filters can change both recall and latency. A benchmark without filters may overstate performance for applications that filter by tenant, permissions, product, region, document type, or time.
Filtered and unfiltered queries should be measured separately.
Object Retrieval
Returning IDs is cheaper than returning full objects.
Production queries often retrieve text, metadata, scores, and enough content for a UI or RAG context window.
If the benchmark excludes object retrieval, it should say so clearly.
Import Time
Import time matters when indexes are large, frequently rebuilt, or updated continuously.
Stronger index build settings may improve recall but slow ingestion. Faster imports may produce weaker index quality.
Benchmarking only query performance can miss this operational trade-off.
Memory and Cost
Memory usage affects both performance and cost.
An index that achieves excellent latency by keeping all vectors and graph structures in memory may be expensive at scale. A disk-backed or compressed configuration may be cheaper but slower or more complex to tune.
Include memory footprint and hardware cost when comparing systems.
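A back-of-the-envelope estimate is often enough to see whether a configuration fits in memory. The sketch below assumes uncompressed float32 vectors (4 bytes per component) and a graph index that stores a fixed number of 4-byte neighbor IDs per vector. The links_per_node value is an assumption for illustration, and real indexes add further overhead that this estimate ignores.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_component=4):
    """Memory for the vectors themselves (float32 by default)."""
    return num_vectors * dimensions * bytes_per_component


def graph_link_bytes(num_vectors, links_per_node, bytes_per_link=4):
    """Rough estimate of neighbor-list storage; ignores layers and bookkeeping."""
    return num_vectors * links_per_node * bytes_per_link


n, d = 1_000_000, 768
vec = raw_vector_bytes(n, d)                      # 3_072_000_000 bytes
links = graph_link_bytes(n, links_per_node=64)    # 256_000_000 bytes
total_gib = (vec + links) / 2**30
print(f"{total_gib:.2f} GiB")  # 3.10 GiB
```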
How to Read Recall vs QPS Curves
Recall vs QPS curves show the trade-off between quality and throughput.
Each point on the curve is one configuration. Points further along the recall axis return more of the true neighbors. Points further along the QPS axis serve more queries per second.
The strongest configurations are those that improve recall at the same QPS, improve QPS at the same recall, or improve both.
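The "strongest configurations" described above are the points that no other point beats on both axes, often called the Pareto frontier. A minimal sketch follows; the configuration names and numbers are invented for illustration and are not measurements of any system.

```python
def pareto_frontier(points):
    """Keep configurations that are not dominated on (recall, qps).

    A point is dominated if another point is at least as good on both axes
    and strictly better on at least one.
    """
    frontier = []
    for p in points:
        dominated = any(
            q["recall"] >= p["recall"]
            and q["qps"] >= p["qps"]
            and (q["recall"] > p["recall"] or q["qps"] > p["qps"])
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p["recall"])


# Made-up example points.
configs = [
    {"name": "A", "recall": 0.90, "qps": 5000},
    {"name": "B", "recall": 0.95, "qps": 3000},
    {"name": "C", "recall": 0.94, "qps": 2500},  # worse than B on both axes
    {"name": "D", "recall": 0.99, "qps": 1200},
]
print([p["name"] for p in pareto_frontier(configs)])  # ['A', 'B', 'D']
```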
How to Build a Practical Benchmark
A practical benchmark follows this process:
- Choose a representative dataset.
- Create realistic query sets.
- Define result limits and filters.
- Build exact-search ground truth for ANN recall.
- Create relevance judgments for semantic quality.
- Run tests under realistic concurrency.
- Record recall, mean and tail latency (p95 and p99), QPS, memory, and import time.
- Compare configurations at equal quality targets.
Common Benchmarking Mistakes
Common mistakes include:
- comparing QPS without comparing recall
- using mean latency without p95 or p99 latency
- testing only unfiltered queries
- using result limits that differ from production
- benchmarking on a toy dataset whose size or vector dimensions differ from production
- ignoring object retrieval and network overhead
- treating ANN recall as semantic relevance
- omitting memory usage and import time
What a Good Result Looks Like
A good benchmark result is not simply the highest QPS or lowest latency.
A good result is the configuration that meets the required semantic quality and recall target while staying within latency, throughput, memory, and cost constraints.
For production systems, the best result is usually the fastest configuration that still meets the quality bar.
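That selection rule can be written down directly: filter out every configuration that misses a constraint, then take the fastest of what remains. The sketch below is illustrative only. The field names, thresholds, and result rows are hypothetical, and a real comparison would likely add cost and import-time constraints in the same way.

```python
def pick_configuration(results, min_recall, min_ndcg, max_p99_ms, max_memory_gib):
    """Return the highest-QPS configuration that satisfies every constraint, or None."""
    eligible = [
        r for r in results
        if r["recall"] >= min_recall
        and r["ndcg"] >= min_ndcg
        and r["p99_ms"] <= max_p99_ms
        and r["memory_gib"] <= max_memory_gib
    ]
    return max(eligible, key=lambda r: r["qps"]) if eligible else None


# Invented rows for illustration, not measurements.
results = [
    {"name": "fast", "recall": 0.90, "ndcg": 0.70, "p99_ms": 20, "qps": 5000, "memory_gib": 8},
    {"name": "balanced", "recall": 0.96, "ndcg": 0.78, "p99_ms": 45, "qps": 2800, "memory_gib": 10},
    {"name": "compressed", "recall": 0.95, "ndcg": 0.77, "p99_ms": 50, "qps": 3100, "memory_gib": 6},
    {"name": "accurate", "recall": 0.99, "ndcg": 0.79, "p99_ms": 120, "qps": 900, "memory_gib": 16},
]

best = pick_configuration(results, min_recall=0.95, min_ndcg=0.75, max_p99_ms=100, max_memory_gib=12)
print(best["name"])  # compressed
```

In this made-up table the highest-QPS row is rejected for missing the recall target and the highest-recall row is rejected for its P99, which leaves the fastest configuration that still meets the quality bar.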
Summary
Vector database benchmarks should measure recall, latency, throughput, and semantic search quality together.
Recall shows whether approximate search finds the expected neighbors. Latency shows how long users wait. Throughput shows how much concurrent load the system can serve. Semantic quality shows whether the results are actually useful.
The most reliable benchmark is the one that mirrors the production workload: real data, real query shapes, real filters, real result limits, realistic concurrency, and clear quality targets.