Abdelrahman Hosny
on 24 March 2026
Previewing at KubeCon 2026: Canonical welcomes NVIDIA’s donation of the GPU DRA driver to CNCF.
At KubeCon Europe in Amsterdam, NVIDIA announced that it will donate the GPU Dynamic Resource Allocation (DRA) Driver to the Cloud Native Computing Foundation (CNCF). This marks an important milestone for the Kubernetes ecosystem and for the future of AI infrastructure.
For years, GPUs have been central to modern machine learning and high-performance computing workloads, yet integrating them into Kubernetes has required specialized tooling and vendor-specific components. The donation of the DRA driver represents a shift toward deeper standardization of GPU orchestration in cloud-native environments. By bringing this technology into the CNCF ecosystem, NVIDIA is helping ensure that advanced GPU scheduling capabilities evolve in the open, alongside the broader Kubernetes community.
This contribution strengthens Kubernetes as the platform for large-scale AI workloads and provides a foundation for more flexible, programmable GPU resource management. To understand why this matters, it helps to look at the broader NVIDIA GPU ecosystem that powers AI workloads on Kubernetes.
The NVIDIA GPU ecosystem for Kubernetes
As of 2026, the NVIDIA GPU stack in Kubernetes is organized into three major layers: the GPU Operator, the Modern Resource Stack built around DRA, and advanced orchestration capabilities such as the Kubernetes AI (KAI) Scheduler. Together, these components transform GPUs from simple hardware accelerators into fully orchestrated infrastructure resources.
The GPU operator: automating GPU infrastructure
The NVIDIA GPU Operator automates the lifecycle management of the software required for GPUs to function inside a Kubernetes cluster. Instead of requiring administrators to manually configure drivers, runtimes, and monitoring tools, the operator deploys and manages these components automatically. This provides a consistent, production-ready environment for GPU workloads.
Typical components deployed by the operator include:
- NVIDIA Driver: The kernel modules and userspace libraries required for GPU operation are installed through a containerized driver manager.
- NVIDIA Container Toolkit: This component integrates GPUs with container runtimes such as containerd or CRI-O, allowing containers to access GPU hardware and CUDA libraries on the node.
- GPU Access Layer: Clusters traditionally used the NVIDIA device plugin to request GPUs as simple integer values. With the introduction of the DRA driver, clusters can adopt the new Kubernetes-native resource model instead. The GPU Operator will install and manage the DRA driver for GPUs in an upcoming release. The device plugin and the DRA driver are, and will remain, mutually exclusive within a single cluster.
- DCGM Exporter: Exports telemetry such as power usage, temperature, and utilization metrics to Prometheus for monitoring.
- GPU Feature Discovery (GFD): automatically labels Kubernetes nodes with GPU capabilities, such as memory size or CUDA support.
- NVIDIA MIG Manager: allows modern GPUs such as NVIDIA H100, NVIDIA H200, and NVIDIA Blackwell to be partitioned into multiple logical GPU instances using Multi-Instance GPU (MIG) technology.
The GPU Operator therefore acts as the operational backbone of GPU infrastructure in Kubernetes clusters.
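As a sketch of what deploying the operator looks like in practice, the commands below install it with Helm. The chart name and repository follow NVIDIA's published instructions, but exact chart versions and values vary by release, so treat this as illustrative rather than a definitive runbook.

```shell
# Add NVIDIA's Helm repository and install the GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# After the operator reconciles, GPU Feature Discovery labels the nodes;
# listing node labels is a quick way to confirm GPUs were detected
kubectl get nodes --show-labels | grep nvidia.com
```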
The DRA driver: a modern resource model for GPUs
The DRA driver represents the next generation of GPU resource management for Kubernetes. Historically, Kubernetes treated GPUs as simple integer resources. A workload would request something like nvidia.com/gpu:1. While effective, this model lacked the expressiveness needed for modern AI workloads.
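For context, the traditional integer-based request looks like the pod spec below: the workload asks for one GPU and has no way to express which GPU, how it is connected, or how it may be shared. The image tag is illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.0-base-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: 1   # opaque count; no way to describe the GPU itself
```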
DRA introduces a richer model based on ResourceClaims, enabling applications to request very specific hardware capabilities rather than just a count of GPUs. 
Examples include:
- Requesting GPUs connected through NVIDIA NVLink
- Requesting a specific GPU slice
- Allocating GPUs across nodes that share memory domains
This level of control becomes essential for modern training workloads, which often rely on tightly coupled GPU communication.
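The requests above can be sketched as a ResourceClaim plus a pod that consumes it. The API group and version depend on your Kubernetes release (DRA has moved through alpha and beta versions), and the device class name and attribute keys below follow the NVIDIA DRA driver's conventions; treat the selector expression as an illustrative example rather than an exhaustive reference.

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: single-a100
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com
      selectors:
      - cel:
          # Select a specific GPU model by its advertised attributes
          expression: device.attributes["gpu.nvidia.com"].productName == "NVIDIA A100-SXM4-40GB"
---
apiVersion: v1
kind: Pod
metadata:
  name: cuda-pod
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.0-base-ubuntu22.04
    resources:
      claims:
      - name: gpu          # reference the claim by name
  resourceClaims:
  - name: gpu
    resourceClaimName: single-a100
```

Unlike the integer model, the claim is a first-class API object: the scheduler can reason about device attributes, and the same claim shape extends to slices, NVLink topology, and shared devices.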
DRA also introduces several important capabilities:
- ComputeDomains: This abstraction enables multi-node NVIDIA NVLink communication. On systems such as the NVIDIA GB200, it allows workloads spanning multiple nodes to behave as if they were running on a single massive GPU.
- Container Device Interface (CDI): Instead of relying on environment variables such as NVIDIA_VISIBLE_DEVICES, CDI injects devices into containers through a standardized interface, improving reliability and portability.
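A ComputeDomain can be sketched as a custom resource managed by the DRA driver. The field names below follow the shape of the driver's CRD in recent releases, but the API is still evolving, so treat the exact group, version, and fields as assumptions to verify against the driver documentation for your version.

```yaml
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: training-domain
spec:
  numNodes: 4                      # nodes joined into one NVLink domain
  channel:
    resourceClaimTemplate:
      name: training-domain-channel  # claim template workloads reference
```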
With the DRA driver moving to the CNCF, these capabilities become part of a broader open ecosystem for accelerator orchestration.
The KAI scheduler: AI-aware scheduling
Running AI workloads efficiently requires more than just allocating GPUs. It requires scheduling decisions that understand how AI jobs behave. The KAI Scheduler adds a layer of intelligence on top of Kubernetes scheduling. It builds on top of the GPU Operator and the DRA driver to enable more advanced resource coordination. 
Key capabilities include:
- Fractional GPU allocation: Multiple workloads can share a GPU using memory partitioning or time slicing.
- Hierarchical queuing: Teams can be assigned GPU quotas, and the scheduler manages fairness and prioritization within those quotas.
- Gang scheduling for distributed training: Large training jobs often require dozens or hundreds of GPUs simultaneously. KAI ensures these jobs start only when the required resources are available, preventing partially allocated clusters that sit idle.
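The capabilities above are typically expressed through pod metadata. The sketch below follows the queue label, fraction annotation, and scheduler name used in the KAI Scheduler's published examples; these names may differ across versions, so treat them as illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-job
  labels:
    kai.scheduler/queue: team-a   # hierarchical queue (illustrative name)
  annotations:
    gpu-fraction: "0.5"           # request half a GPU's memory
spec:
  schedulerName: kai-scheduler    # opt in to KAI instead of the default scheduler
  containers:
  - name: trainer
    image: nvidia/cuda:12.4.0-base-ubuntu22.04
```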
These capabilities are critical for organizations running large-scale training pipelines or shared AI platforms.
Canonical Kubernetes: a platform for cloud-native AI infrastructure
Running modern AI workloads requires more than GPUs and schedulers. It requires a Kubernetes platform that is secure, easy to operate, and capable of supporting large-scale, hardware-accelerated workloads.
Canonical provides a Kubernetes distribution designed to deliver exactly that. Canonical Kubernetes is a lightweight, secure, and opinionated Kubernetes distribution that includes all the components required to deploy and operate a production-ready cluster. It bundles the essential services needed for Kubernetes clusters, including the container runtime, networking (CNI), DNS, ingress, and other operational components, so that teams can deploy and manage clusters with minimal operational overhead. 
By building directly on upstream Kubernetes, Canonical Kubernetes maintains compatibility with the broader cloud-native ecosystem while simplifying lifecycle management. Security updates and upstream Kubernetes releases are delivered in a streamlined way, allowing teams to stay current without the operational complexity typically associated with cluster maintenance. Canonical Kubernetes is designed to support deployments across a wide range of environments, from small clusters used for experimentation to large enterprise deployments operating across multiple regions. The platform integrates naturally with Canonical’s broader open infrastructure stack and benefits from the reliability and security of Ubuntu.
For organizations running AI workloads, this provides a stable foundation on which the NVIDIA GPU ecosystem can operate. Components such as the GPU Operator, the DRA driver, and advanced schedulers can be deployed on top of Canonical Kubernetes to enable GPU-accelerated machine learning pipelines, distributed training clusters, and scalable inference platforms.
Together, Canonical Kubernetes and the evolving NVIDIA AI infrastructure ecosystem provide the building blocks needed to run modern AI infrastructure using open, cloud-native technologies.
Why the CNCF donation matters
The donation of the DRA driver to the CNCF represents a significant step toward making advanced GPU orchestration a first-class citizen of the Kubernetes ecosystem. It accelerates the adoption of Kubernetes-native resource models for GPUs, encourages community-driven innovation, and strengthens the foundation for large-scale AI workloads. As AI infrastructure becomes increasingly central to modern platforms, open collaboration around core technologies like GPU scheduling and resource allocation will play a key role in shaping the next generation of cloud-native systems.
About Canonical
Canonical, the publisher of Ubuntu, provides security, support, and services. Our portfolio covers critical systems, from the smallest devices to the largest clouds, from the kernel to containers, from databases to AI. With customers that include top tech brands, emerging startups, governments and home users, Canonical delivers trusted open source for everyone.
Learn more at https://canonical.com/
Learn more
Visit us at KubeCon Amsterdam 2026 at Booth 220 in Hall 2.
Find out more about Canonical’s collaboration with NVIDIA.