Moonshot AI Proposes Cross-Datacenter LLM Serving via Prefill-as-a-Service

Research · 1 source · Apr 20

moonshot nvidia groq inference inference-compute

Summary

• Moonshot AI's Prefill-as-a-Service (PrfaaS) lets LLM prefill and decode run on separate hardware clusters in different datacenters, connected by commodity Ethernet
• Hybrid-attention models with smaller KV caches make cross-datacenter KV transfer bandwidth-feasible for the first time
• 1T-parameter case study shows 54% throughput gain over homogeneous baseline, 32% over naive heterogeneous
• Compatible with vLLM, SGLang, and Dynamo; enables NVIDIA Rubin CPX and Groq LPU to work together across sites

Details

1. New Tech

PrfaaS decouples prefill and decode across datacenters using commodity Ethernet

Traditional prefill-decode (PD) disaggregation requires both stages to share a high-bandwidth RDMA fabric, constraining them to a single datacenter. PrfaaS removes this constraint by offloading only long-context prefill requests to remote compute clusters and keeping shorter requests local, which avoids unnecessary bandwidth consumption.
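The selective-offloading idea can be sketched as a simple length-based router. This is a minimal illustration, not the PrfaaS implementation: the `Request` type, the `choose_prefill_site` function, and the 32,000-token cutoff are all hypothetical, and the source does not state what threshold or policy Moonshot AI actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    request_id: str
    prompt_tokens: int


# Hypothetical cutoff: prompts at or above this length count as "long-context".
LONG_CONTEXT_THRESHOLD = 32_000


def choose_prefill_site(req: Request, threshold: int = LONG_CONTEXT_THRESHOLD) -> str:
    """Return "remote" for long-context prompts and "local" for everything else."""
    return "remote" if req.prompt_tokens >= threshold else "local"


if __name__ == "__main__":
    for req in (Request("chat-1", 1_200), Request("doc-qa-7", 120_000)):
        print(req.request_id, "->", choose_prefill_site(req))
```

Short prompts stay local because their prefill is cheap and shipping their KV cache across sites would consume link capacity for little gain.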

2. Research

Hybrid-attention architectures are the key enabler: smaller KV caches make cross-datacenter transfer feasible

Standard full-attention models produce large KV caches that would overwhelm typical inter-datacenter links. Newer hybrid-attention designs substantially reduce cache size, shifting the cost-benefit calculation for cross-cluster transport. PrfaaS is explicitly designed to exploit this architectural trend.
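A rough back-of-the-envelope calculation shows why cache size matters for link feasibility. The sketch below counts only the layers whose KV cache grows with sequence length and ignores any fixed-size state kept by the other layers. Every number in it (layer counts, head counts, context length, link speed) is a hypothetical chosen to make the arithmetic concrete; none comes from the paper.

```python
def kv_cache_bytes(tokens: int, growing_layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache size for the layers whose cache grows with sequence length."""
    # Factor of 2 covers one key vector and one value vector per head per token.
    return 2 * tokens * growing_layers * kv_heads * head_dim * bytes_per_elem


def transfer_seconds(num_bytes: int, link_gbps: float) -> float:
    """Ideal transfer time over a link of the given speed, ignoring protocol overhead."""
    return num_bytes * 8 / (link_gbps * 1e9)


if __name__ == "__main__":
    # All values below are hypothetical.
    tokens, kv_heads, head_dim = 128_000, 8, 128
    full = kv_cache_bytes(tokens, growing_layers=64, kv_heads=kv_heads, head_dim=head_dim)
    hybrid = kv_cache_bytes(tokens, growing_layers=16, kv_heads=kv_heads, head_dim=head_dim)
    for label, size in (("full attention", full), ("hybrid (1 in 4 layers)", hybrid)):
        print(f"{label}: {size / 1e9:.1f} GB, {transfer_seconds(size, 100):.2f} s at 100 Gbps")
```

Under these assumptions a 128,000-token request carries roughly 33.6 GB of KV cache with full attention and roughly 8.4 GB when only one layer in four keeps a growing cache, so the same link moves it in about a quarter of the time.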

3. Tech Info

Four core innovations: KV efficiency, selective offloading, bandwidth-aware scheduling, cache-aware placement

Bandwidth-aware scheduling dynamically decides whether to route requests to the remote prefill cluster based on current link utilization. Cache-aware placement routes requests to minimize redundant KV transfers when a cache already exists nearby. Together these minimize cross-datacenter traffic while maximizing hardware utilization.
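The two policies described above can be sketched as a pair of small decision functions. This is an assumption-laden illustration rather than the paper's algorithm: the `Cluster` type, the `should_offload` and `pick_cluster` helpers, and the 0.8 utilization ceiling are hypothetical names and values introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    # Hypothetical bookkeeping: cached prompt tokens per prefix id held by this cluster.
    cached_prefix_tokens: dict[str, int] = field(default_factory=dict)


def should_offload(link_utilization: float, max_utilization: float = 0.8) -> bool:
    """Bandwidth-aware check: offload only while the inter-site link has headroom."""
    return link_utilization < max_utilization


def pick_cluster(prefix_id: str, clusters: list[Cluster]) -> Cluster:
    """Cache-aware placement: prefer the cluster holding the longest cached prefix.

    Ties, including the case where no cluster has the prefix, go to the first cluster listed.
    """
    return max(clusters, key=lambda c: c.cached_prefix_tokens.get(prefix_id, 0))


if __name__ == "__main__":
    site_a = Cluster("site-a", {"system-prompt-v2": 4_000})
    site_b = Cluster("site-b", {"system-prompt-v2": 18_000})
    print(should_offload(0.55), should_offload(0.93))
    print(pick_cluster("system-prompt-v2", [site_a, site_b]).name)
```

A production scheduler would presumably weigh both signals together with queue depth and request length; the sketch keeps them separate so each rule is easy to read.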

4. Stat

54% throughput improvement over homogeneous baseline; 32% over naive heterogeneous (1T-parameter case study)

The benchmark used an internal Moonshot AI 1-trillion-parameter hybrid-attention model. The homogeneous baseline is a standard single-cluster PD setup. Cross-datacenter bandwidth usage remained modest in all tested scenarios.
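The two headline figures also imply how much of the gain comes from heterogeneity alone. Assuming both percentages are throughput gains for the same workload and metric (the summary does not spell this out), the naive heterogeneous setup's advantage over the homogeneous baseline can be derived as below; `implied_gain` is a helper name introduced for this illustration.

```python
def implied_gain(gain_vs_a: float, gain_vs_b: float) -> float:
    """Gain of baseline B over baseline A implied by one system's gains over each.

    If system = (1 + gain_vs_a) * A and system = (1 + gain_vs_b) * B,
    then B / A = (1 + gain_vs_a) / (1 + gain_vs_b).
    """
    return (1 + gain_vs_a) / (1 + gain_vs_b) - 1


if __name__ == "__main__":
    # 54% over homogeneous (A), 32% over naive heterogeneous (B).
    print(f"{implied_gain(0.54, 0.32):.1%}")  # prints 16.7%
```

On that reading, simply mixing hardware yields about a 17% gain, and the remaining improvement would be attributable to the scheduling and placement policies.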

5. Infrastructure

Compatible with vLLM, SGLang, and Dynamo inference frameworks out of the box

Framework compatibility means PrfaaS can be adopted without rewriting existing serving stacks, significantly lowering the integration barrier for teams already running production inference on any of these three platforms.

6. Market Impact

Enables NVIDIA Rubin CPX (prefill) and Groq LPU (decode) to operate as a unified system across sites

Rubin CPX targets high-throughput compute for prefill; Groq's LPU emphasizes memory bandwidth for decode. Today these cannot easily be paired in production because they sit in different clusters without shared RDMA. PrfaaS makes heterogeneous hardware pairings across datacenters economically and technically viable.

New Tech = novel system capability; Research = key enabling insight from paper; Tech Info = architectural detail; Stat = quantitative result; Infrastructure = deployment/tooling; Market Impact = hardware ecosystem implications

What This Means

For AI infrastructure teams, PrfaaS removes a significant constraint: each phase of generation can run on the hardware best suited to it, without confining both phases to a single physical location or a shared RDMA fabric. As hybrid-attention models become more common and KV caches shrink further, the economics of cross-datacenter serving should continue to improve. Teams running large-scale inference on vLLM, SGLang, or Dynamo should follow this work closely, since it could reshape how prefill and decode capacity is planned, procured, and scaled.

Sources

Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter (arXiv)
