AWS + llm-d Bring Disaggregated LLM Inference to SageMaker and EKS
Summary
- AWS and llm-d jointly launch disaggregated inference for large-scale LLM serving
- Prefill and decode phases now split across specialized hardware for better GPU use
- New AWS container integrates EFA networking and NIXL for multi-node inference
- Works on Amazon SageMaker HyperPod and Amazon EKS after extensive benchmarking
Details
AWS and the llm-d community announce a joint disaggregated inference effort
Several months of collaboration produced a new AWS-specific container (ghcr.io/llm-d/llm-d-aws) that bundles AWS networking libraries (EFA, libfabric) and integrates the NIXL library for multi-node inference. This is a formal joint engineering effort, not just a compatibility port.
LLM inference phases — prefill and decode — have fundamentally different hardware needs
Prefill is compute-bound and processes the entire input prompt in parallel to populate the KV cache. Decode is memory-bound and generates tokens one at a time, requiring high memory bandwidth. Mixing them on the same GPU cluster leads to underutilization of both compute and memory resources.
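The compute-bound/memory-bound split can be made concrete with a back-of-envelope arithmetic-intensity calculation. This is a sketch with assumed, illustrative numbers (a hypothetical 7B-parameter model in FP16 with a 2,048-token prompt), not measurements from llm-d:

```python
# Back-of-envelope arithmetic intensity (FLOPs per byte of weights read)
# for prefill vs decode. All numbers are illustrative assumptions.

PARAMS = 7e9          # hypothetical 7B-parameter model
DTYPE_BYTES = 2       # FP16 weights
PROMPT_TOKENS = 2048  # prefill processes the whole prompt in one pass

# A transformer forward pass costs roughly 2 FLOPs per parameter per token.
flops_per_token = 2 * PARAMS

# Prefill: weights are read once and reused across every prompt token,
# so work per byte of weight traffic scales with the prompt length.
prefill_intensity = (flops_per_token * PROMPT_TOKENS) / (PARAMS * DTYPE_BYTES)

# Decode: each generated token re-reads all weights for one token of work,
# so intensity stays flat regardless of how long generation runs.
decode_intensity = flops_per_token / (PARAMS * DTYPE_BYTES)

print(f"prefill: {prefill_intensity:.0f} FLOPs/byte (compute-bound)")
print(f"decode:  {decode_intensity:.0f} FLOPs/byte (memory-bound)")
```

Under these assumptions prefill does roughly three orders of magnitude more work per byte of weight traffic than decode, which is why a GPU sized for one phase sits idle on the other.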
llm-d separates prefill and decode onto specialized hardware nodes
Built on top of vLLM, llm-d extends it with disaggregated serving architecture, intelligent KV-cache-aware request scheduling, expert parallelism for mixture-of-experts models, and multi-node distribution via high-bandwidth interconnects. This allows each phase to run on hardware optimized for its workload profile.
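To illustrate what "KV-cache-aware request scheduling" means, here is a toy routing sketch: send a request to the decode worker that already holds the longest matching prompt prefix in its KV cache, so that prefix need not be recomputed or re-transferred. The worker names and data structures are hypothetical; this is not llm-d's actual scheduler API:

```python
# Toy KV-cache-aware router (hypothetical sketch, not llm-d's implementation).
# Each worker advertises the token prefix it currently has cached; the router
# prefers the worker with the longest shared prefix for the incoming prompt.

def common_prefix_len(a: list[str], b: list[str]) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_worker(prompt_tokens: list[str], workers: dict[str, list[str]]) -> str:
    """workers maps a worker name to its cached prompt-prefix tokens."""
    return max(workers, key=lambda w: common_prefix_len(prompt_tokens, workers[w]))

workers = {
    "decode-0": ["You", "are", "a", "helpful"],   # 4-token cached prefix
    "decode-1": ["Summarize", "the"],             # unrelated prefix
}
prompt = ["You", "are", "a", "helpful", "assistant", "."]
print(pick_worker(prompt, workers))  # → decode-0
```

A production scheduler also weighs load and cache eviction, but prefix matching is the core idea behind cache-aware routing.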
AWS-specific container adds EFA and libfabric for high-throughput multi-node communication
Elastic Fabric Adapter (EFA) is AWS's high-bandwidth, low-latency networking fabric for GPU clusters. Including EFA and libfabric directly in the llm-d container enables the cross-node KV cache transfers that disaggregated inference requires, without custom integration work by operators.
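The size of those cross-node KV cache transfers explains why high-bandwidth networking matters. The following estimate uses assumed values (a Llama-2-7B-like geometry in FP16 and an illustrative effective link bandwidth), not published llm-d or EFA benchmarks:

```python
# Rough estimate of the KV-cache payload a prefill node ships to a decode
# node per request. Model shape and link bandwidth are assumptions.

LAYERS = 32
KV_HEADS = 32
HEAD_DIM = 128
DTYPE_BYTES = 2          # FP16
PROMPT_TOKENS = 2048
LINK_GB_PER_S = 50       # assumed effective cross-node bandwidth

# Per token, each layer stores a K and a V tensor across all KV heads.
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES
total_bytes = bytes_per_token * PROMPT_TOKENS
transfer_ms = total_bytes / (LINK_GB_PER_S * 1e9) * 1e3

print(f"KV cache per token: {bytes_per_token / 1024:.0f} KiB")
print(f"Total for {PROMPT_TOKENS} tokens: {total_bytes / 2**30:.2f} GiB")
print(f"Transfer time at {LINK_GB_PER_S} GB/s: {transfer_ms:.1f} ms")
```

Even a single 2K-token prompt produces on the order of a gibibyte of KV state under these assumptions, so shaving milliseconds off each transfer via EFA and libfabric directly affects time-to-first-token on the decode side.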
Disaggregated inference validated on SageMaker HyperPod and Amazon EKS
Extensive benchmarking was completed before release to ensure stability on both managed ML infrastructure (HyperPod) and general Kubernetes clusters (EKS), giving enterprise teams a supported path to production deployment.
Agentic and reasoning workloads generate 10x more tokens than single-shot LLM calls
Complex reasoning chains and agentic workflows create highly variable, bursty inference demands that overwhelm traditional single-node inference setups. This makes efficient disaggregated serving increasingly important as enterprises move from prototypes to production AI systems.
Expert parallelism support extends disaggregation benefits to MoE model architectures
Mixture-of-experts models route tokens through sparse subsets of parameters. llm-d distributes expert computation across nodes, improving throughput for these increasingly popular architectures rather than forcing all experts onto a single node.
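The routing step that makes expert parallelism possible can be sketched in a few lines: a gate scores every expert per token, only the top-k experts run, and because each token touches just k experts, the full expert set can be sharded across nodes. This is a pure-Python illustration with made-up gate scores, not llm-d's implementation:

```python
# Toy MoE top-k gating: score all experts, run only the top-k, renormalize
# their weights. Illustrative sketch; real systems gate with a learned layer.
import math

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(gate_logits: list[float], k: int = 2) -> list[tuple[int, float]]:
    """Return (expert_index, weight) pairs for the top-k experts."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in top]

# One token's gate scores over 8 experts (e.g. 2 experts hosted per node).
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
print(route(logits, k=2))  # selects experts 1 and 4
```

Because each token activates only k of the experts, placing experts on separate nodes trades a small all-to-all token exchange for much higher aggregate expert capacity, which is the throughput win the section describes.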
What This Means
As LLM workloads shift from simple completions to complex agentic and reasoning tasks, the GPU inefficiency of treating prefill and decode as a single operation becomes a real cost and performance bottleneck. Disaggregated inference — splitting these phases across specialized hardware — is an emerging production pattern, and this AWS-llm-d collaboration gives enterprise teams a validated, Kubernetes-native path to implement it on AWS infrastructure. For organizations running large-scale inference on SageMaker HyperPod or EKS, this means better GPU utilization, lower serving costs, and the ability to handle the bursty demand patterns that agentic workflows create. It also signals that cloud providers are actively investing in open-source inference infrastructure rather than building proprietary-only solutions.
