Summary
- Moonshot AI's PrfaaS enables LLM prefill and decode to run on separate hardware clusters across datacenters via commodity Ethernet
- Hybrid-attention models with smaller KV caches make cross-datacenter KV transfer bandwidth-feasible for the first time
- 1T-parameter case study shows 54% throughput gain over homogeneous baseline, 32% over naive heterogeneous
- Compatible with vLLM, SGLang, and Dynamo; enables NVIDIA Rubin CPX and Groq LPU to work together across sites
Details
PrfaaS decouples prefill and decode across datacenters using commodity Ethernet
Traditional prefill-decode (PD) disaggregation requires both stages to share a high-bandwidth RDMA fabric, which confines them to a single datacenter. PrfaaS removes this constraint by offloading only long-context prefill requests to a remote compute cluster and keeping shorter requests local, so the cross-datacenter link carries KV caches only when remote prefill is worth the transfer.
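The selective-offloading rule can be pictured as a length check at admission time. The following is a minimal sketch, not PrfaaS code: the `Request` type, the `choose_prefill_site` function, and the 32,000-token cutoff are all hypothetical, since the source does not say what length counts as "long-context".

```python
from dataclasses import dataclass

# Hypothetical cutoff: the source does not state what counts as "long-context".
LONG_CONTEXT_TOKENS = 32_000


@dataclass
class Request:
    request_id: str
    prompt_tokens: int


def choose_prefill_site(req: Request, threshold: int = LONG_CONTEXT_TOKENS) -> str:
    """Return "remote" for long prompts and "local" for everything else."""
    return "remote" if req.prompt_tokens >= threshold else "local"


# A short chat turn stays local; a long document goes to the remote prefill cluster.
requests = [Request("a", 1_200), Request("b", 128_000)]
placements = {r.request_id: choose_prefill_site(r) for r in requests}
print(placements)
```

A fixed threshold is the simplest possible policy; a production scheduler would presumably combine it with link and cache state.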
Hybrid-attention architectures are the key enabler — smaller KV caches make cross-datacenter transfer feasible
Standard full-attention models produce large KV caches that would overwhelm typical inter-datacenter links. Newer hybrid-attention designs substantially reduce cache size, shifting the cost-benefit calculation for cross-cluster transport. PrfaaS is explicitly designed to exploit this architectural trend.
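A back-of-the-envelope calculation shows why cache size decides feasibility. This sketch assumes a hybrid model keeps a growing KV cache only in a subset of its layers and ignores any fixed-size state held by the remaining layers. Every model shape and the 100 Gbps link speed below are illustrative assumptions, not figures from the PrfaaS work.

```python
def kv_cache_bytes(seq_len: int, attn_layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache for one request.

    The leading factor of 2 covers keys plus values; bytes_per_elem=2
    corresponds to 16-bit storage.
    """
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_elem


def transfer_seconds(num_bytes: int, link_gbps: float) -> float:
    """Ideal transfer time over a link of the given speed, ignoring overhead."""
    return num_bytes * 8 / (link_gbps * 1e9)


# Hypothetical shapes: 64 layers in total, of which the hybrid model
# keeps a full-attention KV cache in only 16.
full = kv_cache_bytes(seq_len=128_000, attn_layers=64, kv_heads=8, head_dim=128)
hybrid = kv_cache_bytes(seq_len=128_000, attn_layers=16, kv_heads=8, head_dim=128)

for name, size in [("full attention", full), ("hybrid attention", hybrid)]:
    print(f"{name}: {size / 1e9:.2f} GB, "
          f"{transfer_seconds(size, link_gbps=100):.2f} s at 100 Gbps")
```

Under these assumptions the hybrid cache is a quarter of the full-attention one, and the per-request transfer time shrinks by the same factor.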
Four core innovations: KV efficiency, selective offloading, bandwidth-aware scheduling, cache-aware placement
Bandwidth-aware scheduling dynamically decides whether to route requests to the remote prefill cluster based on current link utilization. Cache-aware placement routes requests to minimize redundant KV transfers when a cache already exists nearby. Together these minimize cross-datacenter traffic while maximizing hardware utilization.
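The two routing rules described above can be combined into a single decision function. This is a sketch under stated assumptions rather than the PrfaaS scheduler: the `Cluster` and `Link` types, the `route_prefill` function, and the 0.8 utilization ceiling are hypothetical names and values chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    cached_prefixes: set[str] = field(default_factory=set)


@dataclass
class Link:
    capacity_gbps: float
    used_gbps: float

    @property
    def utilization(self) -> float:
        return self.used_gbps / self.capacity_gbps


def route_prefill(prefix_id: str, local: Cluster, remote: Cluster,
                  link: Link, max_utilization: float = 0.8) -> str:
    """Pick the cluster that should run prefill for a request."""
    # Cache-aware placement: if the KV cache already sits next to decode,
    # running prefill there avoids a cross-datacenter transfer entirely.
    if prefix_id in local.cached_prefixes:
        return local.name
    # Bandwidth-aware scheduling: offload only while the link has headroom
    # (0.8 is an assumed ceiling, not a value from the source).
    if link.utilization < max_utilization:
        return remote.name
    return local.name


local = Cluster("decode-dc", cached_prefixes={"sys-prompt-v1"})
remote = Cluster("prefill-dc")
link = Link(capacity_gbps=100.0, used_gbps=35.0)

print(route_prefill("sys-prompt-v1", local, remote, link))  # cached locally
print(route_prefill("new-doc", local, remote, link))        # link has headroom
link.used_gbps = 95.0
print(route_prefill("new-doc", local, remote, link))        # link is saturated
```

Checking the cache before the link means a cache hit never consumes cross-datacenter bandwidth, which matches the stated goal of minimizing that traffic.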
54% throughput improvement over homogeneous baseline; 32% over naive heterogeneous — 1T-param case study
The benchmark used an internal Moonshot AI 1-trillion-parameter hybrid-attention model. The homogeneous baseline is a standard single-cluster PD setup. Cross-datacenter bandwidth usage remained modest in all tested scenarios.
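The two reported gains also imply how the naive heterogeneous setup compares with the homogeneous baseline. The arithmetic below is derived from the two percentages only, and assumes both were measured with the same throughput metric on the same workload; the implied figure is not reported in the source.

```python
# Reported gains of PrfaaS, expressed as multipliers.
gain_vs_homogeneous = 1.54   # 54% over the homogeneous baseline
gain_vs_naive_hetero = 1.32  # 32% over the naive heterogeneous setup

# If both gains refer to the same PrfaaS throughput, the naive heterogeneous
# setup relates to the homogeneous baseline by the ratio of the two.
naive_vs_homogeneous = gain_vs_homogeneous / gain_vs_naive_hetero
print(f"naive heterogeneous is about {naive_vs_homogeneous:.3f}x homogeneous")
```

On that reading, heterogeneous hardware alone accounts for roughly a 17% gain, and the rest comes from how requests are scheduled and placed.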
Compatible with vLLM, SGLang, and Dynamo inference frameworks out of the box
Framework compatibility means PrfaaS can be adopted without rewriting existing serving stacks, significantly lowering the integration barrier for teams already running production inference on any of these three platforms.
Enables NVIDIA Rubin CPX (prefill) and Groq LPU (decode) to operate as a unified system across sites
Rubin CPX targets high-throughput compute for prefill; Groq's LPU emphasizes memory bandwidth for decode. Today these cannot easily be paired in production because they sit in different clusters without shared RDMA. PrfaaS makes heterogeneous hardware pairings across datacenters economically and technically viable.
What This Means
For AI infrastructure teams, PrfaaS removes a hard constraint: inference clusters can use the best-suited hardware for each phase of generation without being confined to a single physical location or a shared RDMA fabric. As hybrid-attention models become more common and KV caches shrink further, the economics of cross-datacenter serving should keep improving. Teams running large-scale inference on vLLM, SGLang, or Dynamo should monitor this work closely, since it could reshape how prefill and decode capacity is planned, procured, and scaled.
