KV Cache Locality: How Load Balancing Drives Up LLM Serving Costs
Summary
- KV cache hits reduce time-to-first-token from ~500ms to 18ms, a 28x gap on CodeLlama 13B
- Round-robin load balancing causes repeated prefill recomputation across GPUs, wasting compute and money
- Prefix-aware routing raises cache hit rate from 12.5% to 97.5% with zero hardware changes
- Throughput gains of 13–22% achievable for mid-size models; ~$1,200–$1,800/month saved per 8-GPU node
Details
Transformer inference splits into expensive prefill and cheap decode phases
Prefill computes key-value pairs for all input tokens and is compute-bound on the GPU. Decode reuses those pairs to generate output tokens one at a time. The KV cache stores prefill results so identical token prefixes skip recomputation entirely.
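A minimal sketch of the prefix-reuse idea, assuming a single GPU and exact matching on a fixed-length prefix. `ToyPrefixCache` and its fields are hypothetical names for illustration; real serving engines typically match at a finer granularity and store actual key-value tensors rather than a placeholder.

```python
from typing import Dict, Tuple


class ToyPrefixCache:
    """Illustrative stand-in for one GPU's KV cache, keyed by token prefix."""

    def __init__(self) -> None:
        # Maps a token prefix to a placeholder for its cached KV pairs.
        self._store: Dict[Tuple[int, ...], str] = {}
        # Running count of tokens that had to go through prefill.
        self.prefill_tokens = 0

    def prefill(self, tokens: Tuple[int, ...], prefix_len: int) -> bool:
        """Process a request; return True if its prefix was already cached."""
        prefix = tokens[:prefix_len]
        hit = prefix in self._store
        if hit:
            # Hit: only the tokens after the shared prefix need prefill.
            self.prefill_tokens += len(tokens) - len(prefix)
        else:
            # Miss: the whole prompt is prefilled, then the prefix is cached.
            self.prefill_tokens += len(tokens)
            self._store[prefix] = "kv-placeholder"
        return hit


cache = ToyPrefixCache()
system_prompt = tuple(range(4000))          # shared 4,000-token prefix
first = cache.prefill(system_prompt + (1, 2, 3), prefix_len=4000)
second = cache.prefill(system_prompt + (9, 8), prefix_len=4000)
```

In this toy run the first request pays for all 4,003 tokens, while the second pays only for its 2 new tokens because the 4,000-token prefix is reused.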
Cache hit on CodeLlama 13B: 18ms P50; cache miss: ~500ms — a 28x TTFT gap
This time-to-first-token (TTFT) difference is the core reason KV cache locality matters operationally. A miss forces full prefill recomputation; for a 4,000-token system prompt on Llama 3.1 70B at half precision, that takes over a second per request.
Standard load balancers are blind to KV cache state — they route by connections, not tokens
Round-robin and connection-count load balancers have no visibility into which GPU holds a cached prefix. They treat every request identically, so the same prefix ends up being prefilled redundantly on every GPU in a multi-GPU cluster. This is framed as an orchestration-layer design gap, not a configuration error.
Round-robin yields 12.5% cache hit rate and 36.3 req/s; prefix-aware routing yields 97.5% and 44.4 req/s
Benchmark scenario: RAG application, 4,000-token system prompt, 8 GPUs running CodeLlama 13B, 30 concurrent users. Same hardware, same model, same workload — only the routing strategy differed. P99 TTFT improved from 6,800ms to 1,000ms.
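The effect of routing on hit rate can be illustrated with a toy simulation. This is not a reproduction of the benchmark above: the number of distinct prefixes, the per-GPU cache capacity, the LRU eviction policy, and the `simulate` function itself are all assumptions chosen for illustration. It only shows why a cache-blind router spreads each prefix across every GPU while a prefix-aware one keeps it in one place.

```python
import random
from collections import OrderedDict


def simulate(num_gpus: int, num_prefixes: int, capacity: int,
             num_requests: int, prefix_aware: bool, seed: int = 0) -> float:
    """Return the cache hit rate for a stream of random-prefix requests."""
    rng = random.Random(seed)
    # One small LRU cache per GPU; keys are prefix ids.
    caches = [OrderedDict() for _ in range(num_gpus)]
    hits = 0
    for i in range(num_requests):
        prefix = rng.randrange(num_prefixes)
        if prefix_aware:
            gpu = prefix % num_gpus   # same prefix always lands on same GPU
        else:
            gpu = i % num_gpus        # round-robin ignores the prefix
        cache = caches[gpu]
        if prefix in cache:
            hits += 1
            cache.move_to_end(prefix)          # mark as most recently used
        else:
            cache[prefix] = True
            if len(cache) > capacity:
                cache.popitem(last=False)      # evict least recently used
    return hits / num_requests


# Hypothetical setup: 8 GPUs, 32 distinct prefixes, room for 4 per GPU.
rr_rate = simulate(8, 32, 4, 10_000, prefix_aware=False)
pa_rate = simulate(8, 32, 4, 10_000, prefix_aware=True)
print(f"round-robin hit rate:  {rr_rate:.1%}")
print(f"prefix-aware hit rate: {pa_rate:.1%}")
```

Under these assumptions, round-robin makes every GPU see all 32 prefixes with room for only 4, so its caches thrash. Prefix-aware routing gives each GPU exactly 4 prefixes, so only the first request per prefix misses.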
Round-robin routing wastes an estimated $1,200–$1,800/month in GPU-hours on a single 8-GPU node
Cost estimate derived from the 22.3% throughput gap between routing strategies on the 8-GPU benchmark, priced at ~$10/hr per node. The dollar figure scales linearly with cluster size, request volume, and continuous deployment duration.
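The arithmetic behind a figure in this range can be sketched as follows. The throughput numbers and the ~$10/hr node price come from the text; the 730-hour month and the two ways of converting a throughput gap into wasted spend are assumptions, since the source does not spell out its exact method.

```python
HOURS_PER_MONTH = 730          # assumption: 24 * 365 / 12
NODE_COST_PER_HOUR = 10.0      # ~$10/hr per 8-GPU node, from the text

ROUND_ROBIN_RPS = 36.3         # benchmark throughput, round-robin
PREFIX_AWARE_RPS = 44.4        # benchmark throughput, prefix-aware

monthly_node_cost = NODE_COST_PER_HOUR * HOURS_PER_MONTH   # $7,300

# Reading 1: share of capacity left unused relative to prefix-aware routing.
capacity_shortfall = 1 - ROUND_ROBIN_RPS / PREFIX_AWARE_RPS
# Reading 2: throughput gain relative to round-robin (the 22.3% figure).
throughput_gain = PREFIX_AWARE_RPS / ROUND_ROBIN_RPS - 1

wasted_low = monthly_node_cost * capacity_shortfall    # roughly $1,330
wasted_high = monthly_node_cost * throughput_gain      # roughly $1,630

print(f"monthly node cost: ${monthly_node_cost:,.0f}")
print(f"estimated waste:   ${wasted_low:,.0f} to ${wasted_high:,.0f}")
```

Both readings land inside the quoted $1,200–$1,800/month range for a node that runs continuously.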
Throughput gains are model-size-dependent: mid-size models benefit most, while the small and large models tested see ~0%
Llama 3.1 8B: ~0% aggregate throughput gain (inference fast enough that routing overhead negates the cache benefit). CodeLlama 13B: +13.7% to +22.3% aggregate gain. Llama 3.1 70B: ~0% aggregate gain (GPUs already compute-saturated; routing helps latency but not throughput ceiling).
vLLM already implements per-GPU KV caching; the optimization gap is at the load balancer layer
Serving engines like vLLM cache KV pairs and skip prefill on prefix matches. The problem is upstream: the load balancer must be made aware of prefix identity and GPU cache state to route correctly. This is an orchestration-layer problem, not a model-serving problem.
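One simple way to give a load balancer prefix identity is hash-based affinity: hash the leading tokens of each prompt and map the hash to a GPU. The sketch below is a stand-in technique, not the method described in the source, and it does not track actual GPU cache state or load. The function name `route_by_prefix` and the 256-token default for `prefix_len` are hypothetical.

```python
import hashlib
from typing import Sequence


def route_by_prefix(token_ids: Sequence[int], num_gpus: int,
                    prefix_len: int = 256) -> int:
    """Pick a GPU index from a stable hash of the prompt's leading tokens."""
    prefix = token_ids[:prefix_len]
    # Serialize the prefix deterministically so the hash is stable
    # across processes (Python's built-in hash() is not).
    key = ",".join(map(str, prefix)).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_gpus


# Two requests sharing a long system prompt but differing in the user turn.
system_prompt = list(range(4000))
gpu_a = route_by_prefix(system_prompt + [1, 2, 3], num_gpus=8)
gpu_b = route_by_prefix(system_prompt + [42], num_gpus=8)
```

Both requests map to the same GPU because their first 256 tokens match, so the second can reuse the first one's cached prefix. A production router would likely also need to account for load imbalance, cache eviction, and GPUs joining or leaving the pool, none of which this sketch handles.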
What This Means
For teams running LLM inference at scale, the load balancer is a hidden cost center that most infrastructure stacks ignore entirely. Switching from round-robin to prefix-aware routing can recover 13–22% throughput on mid-size models with no hardware spend — translating directly to lower per-request GPU cost or higher capacity headroom. This is especially material for RAG applications with long, shared system prompts, where the same prefill is being recomputed redundantly across every GPU in a cluster. Engineers evaluating LLM serving costs should treat load balancer design as a first-class optimization target alongside batching strategy and quantization.
