Balancing recall, latency, and memory in vector search means choosing how much quality, speed, and infrastructure cost your application needs.
You usually cannot maximize all three at once. Higher recall often requires more search work. Lower latency often requires less search work or more memory. Lower memory often requires compression, smaller vectors, disk-backed indexes, or different index settings.
Short Answer
Start with the minimum recall your application needs. Then tune for the lowest latency and memory footprint that still meets that recall target.
The main knobs are index type, query-time search breadth, build-time index quality, graph connectivity, vector dimensions, compression, rescoring, result limit, filtering strategy, and hardware placement.
The Three-Way Trade-Off
Recall measures what fraction of the expected nearest neighbors or relevant candidates the search actually returns.
Latency measures how long each query takes.
Memory measures how much RAM the vectors, graph, cache, and index structures require.
Improving one dimension can pressure the others.
Why Recall Costs Work
Higher recall usually requires the database to explore more candidates.
In graph search, that can mean a wider search. In cluster search, it can mean probing more clusters. In compressed search, it can mean over-fetching and rescoring more candidates.
More candidate exploration improves quality but increases CPU, memory access, disk access, and latency.
Why Low Latency Costs Quality
Low latency usually comes from doing less work per query, or from keeping more of the searched data in memory.
The database may traverse fewer graph nodes, probe fewer partitions, compare fewer vectors, fetch fewer objects, or skip expensive reranking.
If the search is narrowed too aggressively, true nearest neighbors may never enter the candidate set.
Why Low Memory Costs Something
Lower memory usage usually means storing less information in RAM.
That may involve smaller vectors, compressed vectors, fewer graph connections, smaller caches, disk-backed data, or more compact index structures.
These choices can reduce cost, but they may affect recall, latency, or both.
Start With a Quality Target
The safest tuning process starts with quality.
Define a minimum acceptable Recall@K or application-level relevance score. Then find configurations that meet that target.
After the quality floor is clear, optimize latency and memory within that boundary.
Choose the Right K
Measure recall at the cutoff your application uses.
If the UI shows 10 results, Recall@10 matters. If a RAG system retrieves 50 candidates for reranking, Recall@50 matters. If a recommender fills a carousel with 20 items, evaluate at that limit.
Using the wrong K can produce misleading tuning decisions.
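As a minimal sketch of the metric, assuming the ground truth comes from an exact (flat) search over the same data, Recall@K can be computed as the share of the true top K neighbors that the approximate search also returned in its top K. The function name and the result ids below are illustrative only.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k neighbors that appear in the approximate top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    found = truth.intersection(approx_ids[:k])
    return len(found) / len(truth)


# Hypothetical result ids for one query, best match first.
exact = [7, 3, 9, 1, 4, 8, 2, 6, 5, 0]
approx = [7, 3, 1, 9, 11, 8, 2, 6, 12, 13]

print(recall_at_k(approx, exact, 5))   # 0.8
print(recall_at_k(approx, exact, 10))  # 0.7
```

The same result list scores differently at different cutoffs, which is why the evaluation cutoff should match the one the application uses.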
Index Type
The index type shapes the trade-off.
Flat search is exact but becomes slow at large scale. Graph indexes are fast and high-recall but can use significant memory. Cluster indexes can reduce memory and search space but require careful probe tuning. Disk-backed indexes can reduce RAM needs but may add latency.
Choose the index family before fine-tuning individual parameters.
Query-Time Search Breadth
Query-time search breadth, such as how wide a graph search runs or how many clusters are probed, is the most direct recall-latency knob.
A lower search breadth reduces latency but may lower recall. A higher search breadth improves recall but increases query time.
This setting is useful because it can often be changed without rebuilding the index.
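The effect of search breadth can be illustrated with a toy cluster index, where breadth is the number of clusters scanned per query. This is a simplified sketch, not how any particular database implements it: the centroids are just sampled points, the data is random, and the name `nprobe` is an illustrative parameter name.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_clusters(vectors, n_clusters, rng):
    """Assign every vector to its nearest centroid (centroids are sampled points)."""
    centroids = rng.sample(vectors, n_clusters)
    members = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        nearest = min(range(n_clusters), key=lambda c: sq_dist(vec, centroids[c]))
        members[nearest].append(idx)
    return centroids, members


def search(query, vectors, centroids, members, k, nprobe):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in members[c]]
    candidates.sort(key=lambda i: sq_dist(query, vectors[i]))
    return candidates[:k], len(candidates)


rng = random.Random(0)
vectors = [[rng.random() for _ in range(8)] for _ in range(2000)]
centroids, members = build_clusters(vectors, 16, rng)
query = [rng.random() for _ in range(8)]
exact = sorted(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))[:10]

for nprobe in (1, 4, 16):
    ids, scanned = search(query, vectors, centroids, members, 10, nprobe)
    recall = len(set(ids) & set(exact)) / len(exact)
    print(f"nprobe={nprobe:2d} scanned={scanned:4d} recall@10={recall:.2f}")
```

Probing all 16 clusters scans every vector and is therefore exact. Smaller values scan fewer vectors and may miss some true neighbors. The index itself is built once and reused for every breadth setting.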
Build-Time Index Quality
Build-time settings affect the quality of the index structure.
Stronger construction can improve recall at query time and sometimes allow lower query-time breadth later. The cost is slower ingestion, higher build work, and sometimes more memory.
If recall is poor even at high query-time settings, build-time quality may be the problem.
Graph Connectivity
Graph indexes use connections between nearby vectors.
More connections can improve navigability and recall, but they consume memory and may increase search overhead. Fewer connections reduce memory but can make it harder for the search to find good paths.
Connectivity is a memory-quality trade-off.
Vector Dimensions
Vector dimensions have a direct memory cost.
Stored as 32-bit floats, a 1536-dimensional vector takes 6,144 bytes, four times the 1,536 bytes of a 384-dimensional vector. Higher dimensions may improve representation quality, but they also increase storage, memory bandwidth, and distance-computation cost.
Changing dimensions means changing the embedding model or representation, so evaluate retrieval quality carefully.
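The raw vector cost is simple arithmetic: number of vectors times dimensions times bytes per value. The sketch below assumes uncompressed 32-bit floats (4 bytes per value) and a hypothetical collection of one million vectors. It ignores every other index structure.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Bytes needed for uncompressed vectors (4 bytes per value assumes float32)."""
    return num_vectors * dimensions * bytes_per_value


GIB = 1024 ** 3
for dims in (384, 768, 1536):
    size = raw_vector_bytes(1_000_000, dims)
    print(f"{dims} dims: {size / GIB:.2f} GiB per million vectors")
```

Because the cost is linear in the dimension count, moving from 1536 to 384 dimensions cuts raw vector memory to a quarter, provided retrieval quality holds up.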
Compression
Compression reduces memory by storing compact vector representations.
Product quantization, scalar quantization, binary quantization, and rotational quantization all reduce vector footprint in different ways.
The benefit is lower memory and sometimes faster candidate scoring. The risk is recall loss from approximate vector representations.
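To make the trade concrete, here is a minimal sketch of scalar quantization, one of the methods named above. It maps each float to a single byte using a fixed value range. The range of -1.0 to 1.0 and the sample vector are assumptions for illustration. Real systems typically derive the range from the data.

```python
def quantize(vector, lo, hi):
    """Map each float in [lo, hi] to an integer code in 0..255 (one byte per value)."""
    scale = (hi - lo) / 255
    return [min(255, max(0, round((x - lo) / scale))) for x in vector]


def dequantize(codes, lo, hi):
    """Recover approximate floats from the 8-bit codes."""
    scale = (hi - lo) / 255
    return [lo + c * scale for c in codes]


vector = [0.12, -0.53, 0.98, -1.0, 0.0, 0.47]
lo, hi = -1.0, 1.0  # assumed value range for this example

codes = quantize(vector, lo, hi)
packed = bytes(codes)  # one byte per dimension
restored = dequantize(codes, lo, hi)
max_error = max(abs(a - b) for a, b in zip(vector, restored))

print(codes)
print(f"max reconstruction error: {max_error:.4f}")
print(f"bytes: {len(vector) * 4} as float32 -> {len(packed)} as 8-bit codes")
```

The footprint drops by a factor of four, and each value is off by at most half a quantization step. That small per-value error is what can reorder near-tied neighbors and reduce recall.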
Rescoring
Rescoring helps recover recall after approximate or compressed search.
The system over-fetches a candidate set, then recomputes final scores with full vectors or a more accurate representation.
Rescoring improves quality but adds latency, especially if full vectors must be fetched from storage.
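The over-fetch-and-rescore pattern can be sketched as follows. The "compressed" representation here is only a stand-in (values rounded to one decimal place), and the `overfetch` multiplier is a hypothetical parameter name. The point is the two-stage flow, not the specific compression.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def coarse(vector):
    """Stand-in for a compressed representation: keep one decimal place per value."""
    return [round(x, 1) for x in vector]


def search_with_rescoring(query, full_vectors, coarse_vectors, k, overfetch):
    """Rank by coarse distance, keep k * overfetch candidates, then rescore them exactly."""
    n_candidates = min(len(full_vectors), k * overfetch)
    by_coarse = sorted(range(len(coarse_vectors)),
                       key=lambda i: sq_dist(query, coarse_vectors[i]))
    candidates = by_coarse[:n_candidates]
    candidates.sort(key=lambda i: sq_dist(query, full_vectors[i]))
    return candidates[:k]


rng = random.Random(1)
full_vectors = [[rng.random() for _ in range(16)] for _ in range(1000)]
coarse_vectors = [coarse(v) for v in full_vectors]
query = [rng.random() for _ in range(16)]
exact = sorted(range(len(full_vectors)),
               key=lambda i: sq_dist(query, full_vectors[i]))[:10]

for overfetch in (1, 2, 5):
    ids = search_with_rescoring(query, full_vectors, coarse_vectors, 10, overfetch)
    recall = len(set(ids) & set(exact)) / 10
    print(f"overfetch={overfetch} recall@10={recall:.2f}")
```

A larger candidate pool can only add true neighbors to the rescored set, so recall never drops as `overfetch` grows. The price is more full-vector reads per query.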
Disk-Backed Storage
Disk-backed vector search reduces memory pressure by placing more data on SSD.
This can make very large collections cheaper to operate. The trade-off is that queries may have to read from disk, so the design needs bounded disk reads, compressed postings, effective caching, and careful p99 latency measurement.
Use disk-backed approaches when memory cost is the limiting factor and the application can tolerate slightly higher latency.
Result Limits
Result limits affect all three dimensions.
Higher limits may require broader search to maintain recall. They can also increase object retrieval, network payload, and reranking cost.
Do not tune with a limit of 10 if production uses 100.
Filters
Filters can improve relevance and reduce the number of eligible candidates, but they can also make high recall harder to reach.
If a filter is selective, the index may need to search farther to find enough valid results. Pre-filtering and filter-aware traversal can improve behavior, but filtered search must be benchmarked separately.
Unfiltered performance numbers often hide filtered-query issues.
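A brute-force sketch shows why selective filters need their own benchmark. It compares filtering after the search (post-filtering) with restricting the search to matching items first (pre-filtering). The data, the 1% filter, and the `fetch` sizes are made up for illustration. Real indexes use approximate search rather than the full sorts used here.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def post_filter_search(query, vectors, allowed, k, fetch):
    """Take the nearest `fetch` vectors first, then drop those that fail the filter."""
    nearest = sorted(range(len(vectors)),
                     key=lambda i: sq_dist(query, vectors[i]))[:fetch]
    return [i for i in nearest if i in allowed][:k]


def pre_filter_search(query, vectors, allowed, k):
    """Restrict the search to vectors that pass the filter, then rank them."""
    return sorted(allowed, key=lambda i: sq_dist(query, vectors[i]))[:k]


rng = random.Random(2)
vectors = [[rng.random() for _ in range(8)] for _ in range(5000)]
query = [rng.random() for _ in range(8)]
allowed = set(rng.sample(range(len(vectors)), 50))  # the filter keeps 1% of the data

truth = pre_filter_search(query, vectors, allowed, 10)
for fetch in (10, 100, 1000):
    hits = post_filter_search(query, vectors, allowed, 10, fetch)
    recall = len(set(hits) & set(truth)) / len(truth)
    print(f"fetch={fetch:4d} valid_results={len(hits):2d} filtered recall@10={recall:.2f}")
```

With a filter this selective, a post-filtered search has to fetch far more than 10 candidates before it can return 10 valid results. An unfiltered benchmark would never show that cost.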
Memory Planning
Memory planning should include more than raw vectors.
Account for vectors, graph edges, compressed codes, caches, metadata indexes, object payloads, query buffers, and operating-system cache.
A vector-only estimate is useful, but production sizing needs headroom.
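A rough sizing sketch might add graph links to the vector estimate and then apply a headroom factor. Every default below is an illustrative assumption rather than a value from any specific database: 32 links per node, 8 bytes per link, and 30% headroom standing in for caches, metadata indexes, payloads, and buffers.

```python
def estimate_memory_bytes(num_vectors, dimensions, bytes_per_value=4,
                          links_per_node=32, bytes_per_link=8, headroom=0.3):
    """Rough RAM estimate: vectors plus graph links, scaled up by a headroom factor.

    All defaults are illustrative assumptions; measure a real deployment to calibrate.
    """
    vector_bytes = num_vectors * dimensions * bytes_per_value
    graph_bytes = num_vectors * links_per_node * bytes_per_link
    subtotal = vector_bytes + graph_bytes
    return {
        "vectors": vector_bytes,
        "graph": graph_bytes,
        "with_headroom": int(subtotal * (1 + headroom)),
    }


estimate = estimate_memory_bytes(10_000_000, 768)
for name, size in estimate.items():
    print(f"{name:>14}: {size / 1024 ** 3:.1f} GiB")
```

An estimate like this is only a starting point. Measuring RAM on a loaded test instance is the more reliable way to set the headroom factor.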
Latency Planning
Latency planning should include p95 and p99, not only average latency.
Tail latency often reveals cache misses, disk reads, filtered-query difficulty, object retrieval cost, or contention under concurrency.
A configuration that looks good at p50 can still fail production service levels.
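The gap between average and tail latency is easy to see with a small percentile calculation. The sketch below uses the nearest-rank method and a made-up latency sample with a slow tail. Production monitoring tools may use a different interpolation.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-query latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [4] * 90 + [9] * 6 + [40] * 3 + [250]

mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean={mean:.2f} ms")  # mean=7.84 ms
for pct in (50, 95, 99):
    print(f"p{pct}={percentile(latencies_ms, pct)} ms")  # p50=4, p95=9, p99=40
```

In this sample the mean and p50 look healthy, while p99 is ten times the median and the slowest query takes 250 ms.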
Throughput Pressure
High QPS changes the trade-off.
A query setting that is acceptable for low traffic may consume too much CPU under high concurrency. A memory-saving configuration may become I/O-bound when many users query at once.
Benchmark the chosen recall target under realistic concurrency.
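A concurrency test can be sketched with a thread pool that issues queries in parallel and records per-query latency. Here `fake_query` is a placeholder that only sleeps, so it does not reproduce CPU or I/O contention. In a real benchmark it would be replaced with a call to the actual search client.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_query(_):
    """Stand-in for a real search call; it sleeps briefly and returns the elapsed time."""
    start = time.perf_counter()
    time.sleep(0.002)
    return time.perf_counter() - start


def run_load(query_fn, total_queries, concurrency):
    """Issue total_queries calls from a pool of worker threads; return QPS and sorted latencies."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(query_fn, range(total_queries)))
    elapsed = time.perf_counter() - started
    return total_queries / elapsed, sorted(latencies)


for concurrency in (1, 8):
    qps, latencies = run_load(fake_query, 200, concurrency)
    p99 = latencies[int(0.99 * (len(latencies) - 1))]
    print(f"concurrency={concurrency} qps={qps:.0f} p99={p99 * 1000:.1f} ms")
```

Running the same harness at several concurrency levels, with the recall target held fixed, shows where throughput stops scaling and tail latency starts to climb.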
Latency-First Configuration
A latency-first configuration usually uses:
- moderate recall target
- lower query-time search breadth
- smaller result limits
- limited rescoring
- in-memory hot indexes
- compression only if it improves speed without hurting quality too much
Recall-First Configuration
A recall-first configuration usually uses:
- higher query-time search breadth
- stronger build-time index settings
- more graph connections or cluster probes
- larger candidate pools
- less aggressive compression
- over-fetching and full-vector rescoring
Memory-First Configuration
A memory-first configuration usually uses:
- vector compression
- lower-dimensional embeddings when acceptable
- smaller graph connectivity
- disk-backed or hybrid storage
- careful cache sizing
- rescoring only for a bounded candidate set
Practical Tuning Order
A practical tuning order is:
- Define the required Recall@K or relevance target.
- Measure the default index configuration.
- Set result limits and filters to match production.
- Tune query-time breadth until recall is acceptable.
- If recall is still poor, adjust build-time quality or index type.
- Reduce memory with compression or dimensions, then remeasure recall.
- Add rescoring if compression hurts final quality.
- Run p95, p99, and QPS tests under concurrency.
- Choose the lowest-cost configuration that meets quality and latency requirements.
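The final selection step can be sketched as a filter followed by a minimum: discard configurations that miss the recall or latency target, then take the cheapest of the rest. The benchmark rows below are invented for illustration, and RAM is used as a stand-in for cost, which is an assumption.

```python
def cheapest_passing(configs, min_recall, max_p99_ms):
    """Return the lowest-memory configuration that meets both the recall and latency targets."""
    passing = [c for c in configs
               if c["recall"] >= min_recall and c["p99_ms"] <= max_p99_ms]
    if not passing:
        return None
    # Break ties on memory by preferring the lower p99 latency.
    return min(passing, key=lambda c: (c["ram_gb"], c["p99_ms"]))


# Hypothetical benchmark results; real numbers must come from your own measurements.
measured = [
    {"name": "graph, no compression", "recall": 0.99, "p99_ms": 12, "ram_gb": 64},
    {"name": "graph, 8-bit + rescore", "recall": 0.97, "p99_ms": 15, "ram_gb": 22},
    {"name": "graph, binary, no rescore", "recall": 0.88, "p99_ms": 6, "ram_gb": 9},
    {"name": "disk-backed + rescore", "recall": 0.96, "p99_ms": 45, "ram_gb": 8},
]

choice = cheapest_passing(measured, min_recall=0.95, max_p99_ms=20)
print(choice["name"] if choice else "no configuration meets the targets")
# graph, 8-bit + rescore
```

The fastest and the smallest configurations both lose here: one misses the recall floor and the other misses the latency ceiling.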
What to Measure Together
Track these metrics together:
- Recall@K
- semantic relevance or answer success
- p50 latency
- p95 latency
- p99 latency
- QPS under concurrency
- RAM usage
- disk reads and cache hit rate
- index build or import time
- cost per query
Common Mistakes
Common mistakes include:
- tuning latency without measuring recall
- reducing memory with compression without testing quality
- using average latency instead of p99 latency
- ignoring filtered queries
- changing result limits without retuning search breadth
- maximizing recall beyond what the application needs
- using stronger build settings without considering import time
- assuming more RAM is always cheaper than better tuning
Decision Framework
Ask three questions:
- What recall or relevance level is required for the task?
- What p95 or p99 latency must users experience?
- What memory and infrastructure cost can the system afford?
If recall is below target, increase search breadth, build quality, probes, or rescoring. If latency is too high, reduce search work or move more hot data into memory. If memory is too high, compress vectors, reduce dimensions, tune graph size, or consider disk-backed designs.
Summary
Balancing recall, latency, and memory in vector search is an engineering trade-off, not a single setting.
Higher recall usually costs more query work. Lower latency usually requires less work or more memory. Lower memory usually requires compression, smaller representations, or disk-backed storage.
The best configuration is the cheapest and fastest one that still meets the application’s recall and relevance target under production-like filters, result limits, and concurrency.