llm-d: Kubernetes Framework for Scalable LLM Inference Donated to CNCF
IBM Research, Red Hat, and Google Cloud officially donated llm-d, an open-source distributed inference framework, to the Cloud Native Computing Foundation (CNCF) as a sandbox project on Tuesday at KubeCon Europe 2026 in Amsterdam. The move aims to address the challenges of serving large language models (LLMs) at scale within Kubernetes environments.
llm-d was initially developed to overcome limitations in traditional routing and autoscaling when applied to the demands of LLM inference. The framework provides a Kubernetes-native approach to distributed inference, enabling more efficient and scalable deployment of foundation models, according to project contributors. The donation is supported by founding collaborators NVIDIA and CoreWeave, as well as AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI.
“llm-d bridges the gap between traditional distributed systems and the emerging AI inference stack, making large-scale model serving a first-class, cloud-native workload,” said Carlos Costa, a Distinguished Engineer at IBM Research, during his keynote at KubeCon. The project’s goal is to provide a replicable blueprint for deploying inference stacks for any model, on any accelerator, and in any cloud environment.
The core functionality of llm-d centers on transforming LLM serving into a distributed system. It achieves this by disaggregating inference into prefill and decode phases, allowing each to run in separate pods and scale independently. llm-d introduces an LLM-aware routing and scheduling layer that uses KV-cache state, pod load, and hardware characteristics to optimize latency and throughput. The framework builds on vLLM as its inference engine and provides a modular stack, including an inference gateway, on top of Kubernetes.
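To make the routing idea concrete, here is a minimal, hypothetical sketch in Python (not llm-d's actual code or API) of a scorer that combines the three signals the article names: KV-cache overlap with the request prefix, pod load, and a relative hardware weight. All names and the scoring formula are illustrative assumptions.

```python
# Hypothetical sketch of LLM-aware routing, NOT the actual llm-d implementation.
# A pod scores higher when more of the prompt prefix is already in its KV cache,
# when its queue is shorter, and when its accelerator is relatively faster.
from dataclasses import dataclass

@dataclass
class PodState:
    name: str
    cached_prefix_tokens: int  # tokens of this prompt's prefix already cached
    queue_depth: int           # outstanding requests queued on the pod
    hw_weight: float           # relative accelerator throughput (assumed metric)

def score(pod: PodState, prompt_tokens: int) -> float:
    # Fraction of the prompt whose KV entries can be reused on this pod.
    cache_hit = pod.cached_prefix_tokens / max(prompt_tokens, 1)
    # Reward cache reuse and fast hardware; penalize queued work.
    return pod.hw_weight * (1.0 + cache_hit) / (1 + pod.queue_depth)

def route(pods: list[PodState], prompt_tokens: int) -> str:
    return max(pods, key=lambda p: score(p, prompt_tokens)).name

pods = [
    PodState("decode-a", cached_prefix_tokens=512, queue_depth=2, hw_weight=1.0),
    PodState("decode-b", cached_prefix_tokens=0, queue_depth=2, hw_weight=1.0),
]
print(route(pods, prompt_tokens=1024))  # decode-a: equal load, but a cache hit
```

A plain utilization-based balancer would treat these two pods as interchangeable; the point of an LLM-aware layer is that reusing cached prefix state avoids recomputing prefill work entirely.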
According to Brian Stevens, SVP and AI CTO at Red Hat, the project is designed to accommodate a wide range of hardware accelerators. “We do a lot of work bringing in new accelerators – TPUs, AMD, Nvidia, and a long tail of other accelerators. We really want to support them, have ways of getting in,” Stevens stated. “So that way, just like Linux, you can run any hardware, any application, any model, any accelerator.”
Early testing conducted by Google Cloud demonstrated a two-fold improvement in time-to-first-token for use cases like code completion when using llm-d. This improvement is attributed to the framework’s ability to address the specific requirements of stateful inference workloads, including efficient KV cache management and orchestration of prefill/decode phases across heterogeneous accelerators.
llm-d incorporates prefix-cache-aware routing and prefill/decode disaggregation, enabling independent scaling of inference phases. It also supports hierarchical cache offloading across GPU, CPU, and storage tiers, allowing for larger context windows without overwhelming accelerator memory. The framework’s traffic- and hardware-aware autoscaler dynamically adapts to workload patterns, moving beyond basic utilization metrics.
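The hierarchical offloading described above can be illustrated with a small, self-contained Python sketch. This is a hypothetical model, not llm-d's implementation: a tiered store spills least-recently-used KV-cache blocks from a capacity-limited GPU tier to CPU memory, then to storage, and promotes blocks back on access.

```python
# Hypothetical sketch of hierarchical KV-cache offloading, NOT llm-d's code.
# Tier 0 = GPU memory, tier 1 = CPU memory, tier 2 = storage (unbounded here).
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, gpu_blocks: int, cpu_blocks: int):
        self.tiers = [OrderedDict(), OrderedDict(), OrderedDict()]
        self.caps = [gpu_blocks, cpu_blocks, float("inf")]

    def put(self, block_id: str, data, tier: int = 0) -> None:
        self.tiers[tier][block_id] = data
        self.tiers[tier].move_to_end(block_id)  # mark as most recently used
        # Spill least-recently-used blocks down a tier when over capacity.
        while len(self.tiers[tier]) > self.caps[tier]:
            lru_id, lru_data = self.tiers[tier].popitem(last=False)
            self.put(lru_id, lru_data, tier + 1)

    def get(self, block_id: str):
        for store in self.tiers:
            if block_id in store:
                data = store.pop(block_id)
                self.put(block_id, data)  # promote hot block back to the GPU tier
                return data
        return None

cache = TieredKVCache(gpu_blocks=2, cpu_blocks=2)
for i in range(5):
    cache.put(f"blk{i}", f"kv{i}")
# Newest blocks stay on the GPU; older ones cascade to CPU, then storage.
```

The trade-off this models is the one the article describes: contexts larger than accelerator memory remain servable, at the cost of slower access to blocks that have been demoted to lower tiers.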
Priya Nagpurkar, VP of AI Platform at IBM Research, emphasized the need for operational maturity in LLM inference, stating, “You need the scale, distribution, and reliability of what Kubernetes provided for the previous era, while recognizing that this is an incredibly different workload.”
Looking ahead, development efforts will focus on expanding llm-d’s capabilities to support multi-modal workloads, multi-LoRA optimization within the Hugging Face ecosystem, and deeper integration with vLLM. Mistral AI is already contributing code to advance open standards around disaggregated serving.
IBM Research plans to continue exploring the intersection of inference and training, including reinforcement learning and self-optimizing AI infrastructure. As Costa noted, “Creating a common foundation stack lets the ecosystem focus on pushing AI forward instead of rebuilding the basics.”
