KV Cache Locality: How Load Balancing Drives Up LLM Serving Costs

Infra · 1 source · May 1

gpu inference inference-compute cost-monitoring

Summary

• KV cache hits reduce time-to-first-token from ~500ms to 18ms — a 28x gap on CodeLlama 13B
• Round-robin load balancing causes repeated prefill recomputation across GPUs, wasting compute and money
• Prefix-aware routing raises cache hit rate from 12.5% to 97.5% with zero hardware changes
• Throughput gains of 13–22% achievable for mid-size models; ~$1,200–$1,800/month saved per 8-GPU node
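As a rough reading of the latency and hit-rate figures above, the sketch below blends the hit and miss latencies by hit rate. It assumes the 18ms and ~500ms figures can be treated as per-request averages and that there is no queueing; the source makes neither claim. `expected_ttft_ms` is a hypothetical helper, not part of any serving library.

```python
HIT_TTFT_MS = 18.0    # from the summary: TTFT on a cache hit
MISS_TTFT_MS = 500.0  # from the summary: approximate TTFT on a cache miss


def expected_ttft_ms(hit_rate: float) -> float:
    """Blend hit and miss latency by hit rate (assumption: no queueing)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * HIT_TTFT_MS + (1.0 - hit_rate) * MISS_TTFT_MS


if __name__ == "__main__":
    # Hit rates reported for the two routing strategies.
    for label, rate in [("round-robin", 0.125), ("prefix-aware", 0.975)]:
        print(f"{label}: {expected_ttft_ms(rate):.1f} ms")
```

Under these assumptions the blended TTFT drops from roughly 440ms to roughly 30ms. Measured tail latencies will differ, because queueing behind long prefills is not modeled here.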

Details

#	Type	Key Point	Context
1	Tech Info	Transformer inference splits into expensive prefill and cheap decode phases	Prefill computes key-value pairs for all input tokens and is compute-bound on the GPU. Decode reuses those pairs to generate output tokens one at a time. The KV cache stores prefill results so identical token prefixes skip recomputation entirely.
2	Stat	Cache hit on CodeLlama 13B: 18ms P50; cache miss: ~500ms (a 28x TTFT gap)	This time-to-first-token (TTFT) difference is the core reason KV cache locality matters operationally. A miss forces full prefill recomputation. The penalty grows with model size: on Llama 3.1 70B at half precision, prefilling a 4,000-token system prompt takes over a second per request.
3	Insight	Standard load balancers are blind to KV cache state: they route by connections, not tokens	Round-robin and connection-count load balancers have no visibility into which GPU holds a cached prefix. They treat every request identically, so the same prefix ends up being prefilled redundantly on every GPU in a multi-GPU cluster. The source frames this as an orchestration-layer design gap, not a configuration error.
4	Stat	Round-robin yields 12.5% cache hit rate and 36.3 req/s; prefix-aware routing yields 97.5% and 44.4 req/s	Benchmark scenario: RAG application, 4,000-token system prompt, 8 GPUs running CodeLlama 13B, 30 concurrent users. Same hardware, same model, same workload — only the routing strategy differed. P99 TTFT improved from 6,800ms to 1,000ms.
5	Financials	Round-robin routing wastes an estimated $1,200–$1,800/month in GPU-hours on a single 8-GPU node	Cost estimate derived from the 22.3% throughput gap between routing strategies on the 8-GPU benchmark, priced at ~$10/hr per node. The dollar figure scales linearly with cluster size, request volume, and continuous deployment duration.
6	Stat	Throughput gains are model-size-dependent: mid-size models benefit most, large/small see ~0%	Llama 3.1 8B: ~0% aggregate throughput gain (inference fast enough that routing overhead negates the cache benefit). CodeLlama 13B: +13.7% to +22.3% aggregate gain. Llama 3.1 70B: ~0% aggregate gain (GPUs already compute-saturated; routing helps latency but not throughput ceiling).
7	Infrastructure	vLLM already implements per-GPU KV caching; the optimization gap is at the load balancer layer	Serving engines like vLLM cache KV pairs and skip prefill on prefix matches. The problem is upstream: the load balancer must be made aware of prefix identity and GPU cache state to route correctly. This is an orchestration-layer problem, not a model-serving problem.
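The routing gap described in the table can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the benchmark's setup and not vLLM's API: each GPU is modeled as an LRU cache that holds a single prefix, eight distinct system prompts arrive in random order, and the prefix-aware router simply pins each new prefix to the worker with the fewest assigned prefixes. All class and function names (`GpuWorker`, `RoundRobinRouter`, `PrefixAwareRouter`, `make_requests`, `hit_rate`) are hypothetical.

```python
import random
from collections import OrderedDict


class GpuWorker:
    """Toy stand-in for one GPU's prefix cache (LRU holding `capacity` prefixes)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = 0

    def serve(self, prefix: str) -> None:
        if prefix in self.cache:
            self.cache.move_to_end(prefix)  # hit: prefill would be skipped
            self.hits += 1
            return
        self.cache[prefix] = True           # miss: prefill, then cache the prefix
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used prefix


class RoundRobinRouter:
    """Ignores the prefix and cycles through the workers."""

    def __init__(self, n_gpus: int):
        self.n_gpus = n_gpus
        self.next_gpu = 0

    def pick(self, prefix: str) -> int:
        gpu = self.next_gpu
        self.next_gpu = (gpu + 1) % self.n_gpus
        return gpu


class PrefixAwareRouter:
    """Pins each prefix to one worker (assumption: fewest-assigned wins)."""

    def __init__(self, n_gpus: int):
        self.owner = {}
        self.assigned = [0] * n_gpus

    def pick(self, prefix: str) -> int:
        if prefix not in self.owner:
            gpu = self.assigned.index(min(self.assigned))
            self.owner[prefix] = gpu
            self.assigned[gpu] += 1
        return self.owner[prefix]


def make_requests(n_requests=2000, n_prefixes=8, seed=0):
    """Random stream of requests, each identified by its shared prefix."""
    rng = random.Random(seed)
    prefixes = [f"system-prompt-{i}" for i in range(n_prefixes)]
    return [rng.choice(prefixes) for _ in range(n_requests)]


def hit_rate(router_cls, requests, n_gpus=8, capacity=1):
    """Fraction of requests that found their prefix already cached."""
    workers = [GpuWorker(capacity) for _ in range(n_gpus)]
    router = router_cls(n_gpus)
    for prefix in requests:
        workers[router.pick(prefix)].serve(prefix)
    return sum(w.hits for w in workers) / len(requests)


if __name__ == "__main__":
    reqs = make_requests()
    print("round-robin :", hit_rate(RoundRobinRouter, reqs))
    print("prefix-aware:", hit_rate(PrefixAwareRouter, reqs))
```

With eight prefixes, eight workers, and room for one prefix per worker, round-robin lands near a 1-in-8 hit rate by construction, while the pinned router misses only on each prefix's first request. The toy ignores load imbalance and cache eviction under memory pressure, both of which a production router would also have to weigh.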

Tech Info = how a technology works, Stat = quantified benchmark or measurement, Insight = analytical argument or framing, Financials = cost or revenue impact, Infrastructure = systems architecture consideration

What This Means

For teams running LLM inference at scale, the load balancer is a cost center that is easy to overlook. In the cited benchmark, switching from round-robin to prefix-aware routing recovered 13–22% throughput on a mid-size model (CodeLlama 13B) with no hardware spend, which translates to lower per-request GPU cost or more capacity headroom. This is especially material for RAG applications with long, shared system prompts, where the same prefill is otherwise recomputed on every GPU in a cluster. Engineers evaluating LLM serving costs should treat load balancer design as a first-class optimization target alongside batching strategy and quantization.
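The monthly dollar figure can be sanity-checked from the benchmark numbers (36.3 vs 44.4 req/s, ~$10/hr per node). The source does not spell out its derivation, so the sketch below shows two plausible readings of the throughput gap; the 730-hour month and all function names are assumptions introduced for illustration.

```python
NODE_COST_PER_HOUR = 10.0  # from the source: ~$10/hr per 8-GPU node
HOURS_PER_MONTH = 730.0    # assumption: 24 * 365 / 12, running continuously
ROUND_ROBIN_RPS = 36.3     # from the benchmark
PREFIX_AWARE_RPS = 44.4    # from the benchmark


def throughput_gain(baseline_rps: float, improved_rps: float) -> float:
    """Relative gain of the improved strategy over the baseline."""
    return improved_rps / baseline_rps - 1.0


def capacity_shortfall(baseline_rps: float, improved_rps: float) -> float:
    """Share of the improved strategy's throughput the baseline fails to deliver."""
    return 1.0 - baseline_rps / improved_rps


def monthly_waste_range(baseline_rps: float, improved_rps: float):
    """Return (low, high) monthly waste in dollars under the two readings."""
    node_cost = NODE_COST_PER_HOUR * HOURS_PER_MONTH
    low = capacity_shortfall(baseline_rps, improved_rps) * node_cost
    high = throughput_gain(baseline_rps, improved_rps) * node_cost
    return low, high


if __name__ == "__main__":
    gain = throughput_gain(ROUND_ROBIN_RPS, PREFIX_AWARE_RPS)
    low, high = monthly_waste_range(ROUND_ROBIN_RPS, PREFIX_AWARE_RPS)
    print(f"throughput gain: {gain:.1%}")
    print(f"estimated waste: ${low:,.0f} to ${high:,.0f} per month")
```

Both readings fall inside the $1,200–$1,800/month range quoted for a single 8-GPU node, which suggests the estimate is consistent with the benchmark figures even though its exact derivation is not given.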

Sources

KV Cache Locality: The Hidden Variable in Your LLM Serving Cost (Ranvier)
