Vector Search Throughput Explained

Vector search throughput measures how many search requests a vector database can complete in a given amount of time.

It is usually reported as QPS, or queries per second. Throughput is related to latency, but it is not the same thing. Latency describes one request. Throughput describes how many requests the system can serve under load.

Short Answer

Vector search throughput is the rate at which a vector search system can answer queries while meeting a latency and quality target.

A system might serve 2,000 QPS at low recall, but only 800 QPS at higher recall. It might also meet its p50 latency target at high QPS while missing its p99 target.

A throughput number is only meaningful when it is reported with recall, result limits, concurrency, and percentile latency.

Throughput vs Latency

Latency is the time required to complete one query.

Throughput is the number of queries completed per second.

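A 10 ms single-request latency implies roughly 100 queries per second for one client sending requests back to back. It does not automatically mean the system can serve 100 QPS in production, and it does not cap the system at 100 QPS either. Real throughput depends on concurrency, CPU cores, memory bandwidth, disk access, locks, network overhead, and object retrieval.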

What QPS Means

QPS means queries per second.

If a vector database completes 5,000 search requests in 10 seconds, its throughput for that test is 500 QPS.
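
The arithmetic above can be written as a small helper. This is a minimal sketch; the function name qps is illustrative and does not belong to any vector database API.

```python
def qps(completed_requests: int, elapsed_seconds: float) -> float:
    """Return throughput as completed requests divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completed_requests / elapsed_seconds


# The example from the text: 5,000 requests completed in 10 seconds.
print(qps(5_000, 10.0))  # 500.0
```

Only completed requests count. Requests that timed out or returned errors should be reported separately rather than folded into the QPS figure.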

The number is only meaningful if the benchmark also states the workload, hardware, result limit, filters, latency distribution, and recall target.

Why Concurrency Matters

Throughput should be measured under concurrent load.

One user sending one query at a time measures single-request behavior. Many users sending queries at the same time measures system capacity.

Vector databases often use multiple CPU cores by serving multiple requests in parallel, even when one individual search path is not fully parallelized.

Single-Thread Estimates Can Mislead

Dividing 1 second by the mean latency gives a rough estimate of single-thread capacity. At 10 ms mean latency, that is about 100 QPS per thread.

But production systems are not that simple. Scaling across cores may be limited by locks, memory bandwidth, cache misses, disk reads, network overhead, and shared resource contention.

Actual multi-threaded measurement is more reliable than extrapolation.
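
One way to see the gap between extrapolation and measurement is to express measured QPS as a fraction of the naive linear estimate. The sketch below does that. The function names and the 1,100 QPS measurement are hypothetical and chosen only for illustration.

```python
def single_thread_qps(mean_latency_seconds: float) -> float:
    """Rough estimate for one worker issuing queries back to back."""
    return 1.0 / mean_latency_seconds


def scaling_efficiency(measured_qps: float, workers: int, mean_latency_seconds: float) -> float:
    """Measured QPS as a fraction of the naive linear extrapolation."""
    ideal = workers * single_thread_qps(mean_latency_seconds)
    return measured_qps / ideal


# 10 ms mean latency suggests about 100 QPS per worker.
per_worker = single_thread_qps(0.010)

# Hypothetical measurement: 16 workers delivered 1,100 QPS instead of 1,600.
efficiency = scaling_efficiency(1_100, 16, 0.010)
print(f"{per_worker:.0f} QPS per worker, {efficiency:.0%} of linear scaling")
```

An efficiency well below 100% is a hint that locks, memory bandwidth, or another shared resource is limiting scaling. It does not say which one.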

Throughput and Recall

Throughput must be compared at the same recall level.

A lower-recall configuration can serve more QPS because it searches fewer candidates. A higher-recall configuration usually explores more of the index and therefore serves fewer queries per second.

Do not compare two systems by QPS alone unless their recall targets are equivalent.
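
To compare at equal recall, recall has to be measured against exact results. A minimal sketch of Recall@k follows, assuming ground-truth neighbor IDs are available from a brute-force search. The function name and sample IDs are illustrative.

```python
def recall_at_k(approx_ids: list[int], exact_ids: list[int], k: int) -> float:
    """Fraction of the true top-k neighbors that the approximate search returned."""
    truth = set(exact_ids[:k])
    found = truth.intersection(approx_ids[:k])
    return len(found) / len(truth)


exact = [3, 7, 1, 9, 4, 8, 2, 6, 5, 0]     # ground truth from brute-force search
approx = [3, 7, 1, 9, 4, 8, 2, 6, 11, 12]  # approximate result missing two true neighbors
print(recall_at_k(approx, exact, k=10))  # 0.8
```

In practice the value is averaged over many queries, and the QPS figure is taken from the same run that produced it.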

Throughput and Result Limits

Result limits affect throughput.

A query that returns 10 results usually costs less than a query that returns 100 results. Larger limits can require broader search, more candidate scoring, more object retrieval, and larger response payloads.

Benchmarks should use the same result limits as production.

Throughput and Filters

Filters can change throughput significantly.

Some filters narrow the eligible set and reduce work. Others make search harder because the nearest vector candidates do not match the filter and the system must search farther.

Measure filtered and unfiltered throughput separately.

Throughput and P99 Latency

High throughput is not useful if tail latency becomes unacceptable.

A system may handle many requests per second on average while a small percentage of requests become very slow. P99 latency is the value that 99% of requests complete at or below; the slowest 1% take longer.

Capacity planning should specify both QPS and a percentile latency target.
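
The sketch below shows how a mean can hide a slow tail. It uses the nearest-rank percentile definition, which is one of several common conventions, and synthetic latency samples invented for the example.

```python
import math


def percentile(latencies_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 100 synthetic samples: 98 fast requests and 2 slow ones.
samples = [5.0] * 98 + [120.0, 250.0]

mean = sum(samples) / len(samples)
print(f"mean={mean:.1f} ms  p50={percentile(samples, 50)} ms  p99={percentile(samples, 99)} ms")
```

Here the mean stays under 10 ms and p50 is 5 ms, while p99 is 120 ms. A latency target stated only as an average would miss that.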

Example Capacity Statement

A useful capacity statement looks like this:

1,500 QPS at Recall@10 of 0.97 with p99 latency below 40 ms, limit=10, no filters, on 16 cores.

This is more meaningful than simply saying the database supports 1,500 QPS.
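
Keeping the statement as a structured record makes it harder to drop a field. The following is a sketch only: the class, its field names, and the recall tolerance used for comparison are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacityStatement:
    qps: float
    recall: float
    k: int          # the k in Recall@k
    p99_ms: float
    limit: int
    filters: str
    cores: int

    def describe(self) -> str:
        return (
            f"{self.qps:,.0f} QPS at Recall@{self.k} of {self.recall:.2f} "
            f"with p99 latency below {self.p99_ms:g} ms, limit={self.limit}, "
            f"{self.filters}, on {self.cores} cores."
        )

    def comparable_to(self, other: "CapacityStatement", recall_tolerance: float = 0.005) -> bool:
        """QPS figures are only comparable when the test conditions match."""
        return (
            self.k == other.k
            and self.limit == other.limit
            and self.filters == other.filters
            and abs(self.recall - other.recall) <= recall_tolerance
        )


statement = CapacityStatement(qps=1500, recall=0.97, k=10, p99_ms=40,
                              limit=10, filters="no filters", cores=16)
print(statement.describe())
```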

CPU Bottlenecks

Vector search can be CPU-bound.

Distance computations, graph traversal, filtering logic, decompression, reranking, and serialization all consume CPU. More cores can improve throughput when many searches run concurrently.

Adding CPU may not improve single-query latency as much as it improves total QPS.

Memory Bottlenecks

Memory affects throughput through cache behavior and random access speed.

If vectors, graph links, postings, or hot objects fit in memory, searches avoid slower storage reads. If the working set does not fit, throughput can drop and p99 latency can rise.

Memory bandwidth can also become a bottleneck when many requests read high-dimensional vectors concurrently.

Disk and SSD Bottlenecks

Disk-backed vector search depends on storage access patterns.

Fast SSDs can support high throughput when reads are bounded, batched, compressed, cached, or parallelized. They cannot make unbounded random reads cheap.

If throughput collapses under load, check read IOPS, queue depth, cache hit rate, and disk wait time.

Object Retrieval Cost

Vector search does not end when candidate IDs are found.

Most applications also retrieve object text, metadata, scores, and source fields. Returning full objects costs more than returning IDs.

Throughput tests should include the same payloads the application returns in production.

Compression and Throughput

Compression can improve throughput by reducing memory traffic and making candidate scoring cheaper.

Compressed vectors may allow more candidates to be stored in memory or read from disk per operation.

The trade-off is that compression can reduce recall unless the system uses enough candidate over-fetching and rescoring.

Rescoring and Throughput

Rescoring improves final quality but reduces throughput.

The system retrieves a larger candidate set, then recomputes more accurate distances for a subset of candidates. This adds CPU and sometimes disk reads.

The larger the rescoring window, the lower the maximum QPS is likely to be.
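
The trade-off can be sketched with a toy two-stage search. This is not how any particular database implements rescoring. Rounding to one decimal stands in for real vector compression, a full scan stands in for the ANN index, and all names are hypothetical.

```python
import random


def squared_l2(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_stage_search(query, vectors, coarse, k, window):
    """Rank all items by a cheap coarse distance, then rescore the top `window` exactly."""
    coarse_query = [round(x, 1) for x in query]
    by_coarse = sorted(range(len(coarse)), key=lambda i: squared_l2(coarse_query, coarse[i]))
    candidates = by_coarse[:window]
    # Exact distance computations grow with the window, which is the throughput cost.
    rescored = sorted(candidates, key=lambda i: squared_l2(query, vectors[i]))
    return rescored[:k]


random.seed(0)
vectors = [[random.random() for _ in range(8)] for _ in range(500)]
coarse = [[round(x, 1) for x in v] for v in vectors]  # crude stand-in for compression
query = [random.random() for _ in range(8)]

exact = sorted(range(len(vectors)), key=lambda i: squared_l2(query, vectors[i]))[:10]
for window in (10, 50, 500):
    hits = len(set(exact) & set(two_stage_search(query, vectors, coarse, 10, window)))
    print(f"window={window}: recall@10={hits / 10:.1f}, exact distance computations={window}")
```

Recall never decreases as the window grows in this sketch, and a window covering the whole dataset reproduces the exact result. The price is one full-precision distance computation per candidate in the window.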

Index Settings and Throughput

ANN index settings directly affect throughput.

Higher query-time search breadth improves recall but reduces QPS. Lower search breadth improves QPS but can miss neighbors.

Build-time settings can also affect throughput by changing graph quality, candidate routing, and memory layout.

Hybrid Search and Reranking

Hybrid search and reranking can improve result quality, but they add work.

Hybrid search may run vector and keyword retrieval together. Reranking may score each candidate with a heavier model.

Measure throughput for the complete retrieval pipeline, not only the vector stage.

Ingestion vs Query Throughput

Query throughput and ingestion throughput are different.

A system may serve searches quickly but ingest slowly because index construction is expensive. Another system may ingest quickly but provide weaker recall or slower queries.

Do not use query QPS as a proxy for import capacity.

How to Benchmark Throughput

A practical throughput benchmark should:

use representative queries
use production result limits
include filters when production uses filters
include object retrieval and response payloads
run under realistic concurrency
measure recall at the same time
report p50, p95, and p99 latency
record CPU, memory, disk, and network utilization
repeat warm-cache and cold-cache tests when relevant
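
A minimal load-generator sketch for the concurrency and latency items in the checklist is shown below. The fake_search function is a hypothetical stand-in that sleeps for about 2 ms; a real test would call the database client there. Recall measurement and resource monitoring are left out for brevity.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def fake_search(query: int) -> list[int]:
    """Hypothetical stand-in for a real client call; sleeps about 2 ms."""
    time.sleep(0.002)
    return [query]


def timed_call(search, query) -> float:
    """Run one query and return its latency in milliseconds."""
    start = time.perf_counter()
    search(query)
    return (time.perf_counter() - start) * 1000.0


def run_benchmark(search, queries, concurrency: int) -> dict:
    """Send all queries through `concurrency` worker threads and summarize the run."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda q: timed_call(search, q), queries))
    elapsed = time.perf_counter() - started

    def pct(p: int) -> float:
        # Nearest-rank percentile over the sorted latencies.
        return latencies[max(1, math.ceil(p * len(latencies) / 100)) - 1]

    return {
        "concurrency": concurrency,
        "qps": len(latencies) / elapsed,
        "p50_ms": pct(50),
        "p95_ms": pct(95),
        "p99_ms": pct(99),
    }


result = run_benchmark(fake_search, range(400), concurrency=8)
print(result)
```

Latency here is timed from the moment a worker starts a query, so time spent waiting for a free worker is not included. For CPU-heavy clients, a Python thread pool may itself become the bottleneck, in which case a multi-process or dedicated load-testing tool is a better fit.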

How to Find the Throughput Limit

Increase concurrency gradually.

At first, QPS should rise as more requests are issued. Eventually a bottleneck appears: latency rises, p99 worsens, errors appear, or QPS stops increasing.

The sustainable throughput limit is below the point where latency and error rates become unacceptable.
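
Given results from such a sweep, picking the limit is a filter followed by a maximum. The sketch below uses made-up numbers and hypothetical field names. The 40 ms p99 target and 0.1% error budget are assumptions for the example.

```python
def sustainable_limit(runs: list[dict], p99_target_ms: float, max_error_rate: float):
    """Return the run with the highest QPS that still meets both targets, or None."""
    acceptable = [
        r for r in runs
        if r["p99_ms"] <= p99_target_ms and r["error_rate"] <= max_error_rate
    ]
    return max(acceptable, key=lambda r: r["qps"], default=None)


# Hypothetical sweep results, one entry per concurrency level.
runs = [
    {"concurrency": 4,  "qps": 400,  "p99_ms": 12,  "error_rate": 0.0},
    {"concurrency": 8,  "qps": 780,  "p99_ms": 15,  "error_rate": 0.0},
    {"concurrency": 16, "qps": 1450, "p99_ms": 28,  "error_rate": 0.0},
    {"concurrency": 32, "qps": 1900, "p99_ms": 65,  "error_rate": 0.0},
    {"concurrency": 64, "qps": 1950, "p99_ms": 210, "error_rate": 0.02},
]

best = sustainable_limit(runs, p99_target_ms=40, max_error_rate=0.001)
print(best["concurrency"], best["qps"])  # 16 1450
```

The runs at concurrency 32 and 64 report higher QPS, but both miss the p99 target, so they do not count as sustainable.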

Capacity Planning

Capacity planning should leave headroom.

If a system can serve 2,000 QPS in a benchmark, that does not mean production should run it at 2,000 QPS continuously. Traffic spikes, background ingestion, cache shifts, noisy neighbors, and harder query mixes can reduce available capacity.

Plan for peak load plus a safety margin.
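
The headroom check is simple arithmetic. In the sketch below, the 30% headroom figure and the function names are assumptions for illustration, not a recommended value.

```python
def safe_operating_qps(benchmarked_qps: float, headroom_fraction: float) -> float:
    """QPS to plan for after reserving a fraction of benchmarked capacity."""
    if not 0 <= headroom_fraction < 1:
        raise ValueError("headroom_fraction must be in [0, 1)")
    return benchmarked_qps * (1 - headroom_fraction)


def fits_with_headroom(expected_peak_qps: float, benchmarked_qps: float, headroom_fraction: float) -> bool:
    """True when the expected peak stays within the reserved-capacity budget."""
    return expected_peak_qps <= safe_operating_qps(benchmarked_qps, headroom_fraction)


# Benchmarked at 2,000 QPS, reserving 30% for spikes and background work.
print(fits_with_headroom(1_200, 2_000, 0.30))  # True
print(fits_with_headroom(1_600, 2_000, 0.30))  # False
```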

Common Mistakes

Common mistakes include:

reporting QPS without recall
reporting QPS without p99 latency
testing only one concurrent client
benchmarking IDs only when production returns full objects
using smaller result limits than production
ignoring filters and access-control constraints
assuming throughput scales linearly with CPU cores
tuning QPS so aggressively that quality falls below the application target

Practical Rule

Optimize throughput only after defining the quality and latency boundary.

The goal is not the highest possible QPS. The goal is the highest sustainable QPS at the required recall, p95 or p99 latency, result limit, filter pattern, and payload size.

Summary

Vector search throughput is the number of search queries a system can complete per second under realistic load.

It depends on concurrency, CPU, memory, disk I/O, object retrieval, index settings, compression, rescoring, result limits, filters, and recall targets.

A useful throughput number always comes with context: QPS at a specific recall, latency percentile, concurrency level, workload, and hardware profile.