ComputeDomain abstraction, which handles the machinery required to present IMEX channels as allocatable container resources through DRA.
Previously, CoreWeave provisioned IMEX channels transparently through the nvidia-imex DaemonSet. With the NVIDIA DRA driver, IMEX channels are allocated on demand to Pods that request them.
Provision ComputeDomains
A ComputeDomain defines a logical container for a set of Nodes that are permitted to share an IMEX channel allocation. You create a ComputeDomain in your namespace, and the controller generates a corresponding ResourceClaimTemplate that workloads can reference to obtain access to a shared channel.
Each independent workload should use its own ComputeDomain. Deploying multiple workloads into a single ComputeDomain works, but it may result in unintended memory sharing between them.
The generated ResourceClaimTemplate contains the same name and namespace.
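As a rough illustration of the step described above, a ComputeDomain manifest might look like the following sketch. The apiVersion and field names are assumptions based on the upstream NVIDIA DRA driver CRD, and the metadata values are hypothetical; check them against the CRD installed in your cluster before applying.

```yaml
# Sketch only. apiVersion and spec fields are assumed from the upstream
# NVIDIA DRA driver CRD; verify them with your installed driver version.
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: my-workload-domain      # hypothetical name
  namespace: my-namespace       # hypothetical namespace
spec:
  # Assumed value: how this field is interpreted depends on the driver version.
  numNodes: 1
  channel:
    resourceClaimTemplate:
      # Name of the ResourceClaimTemplate that the controller generates.
      name: "[TEMPLATE-NAME]"
```

Workloads then refer to the channel by the template name set under spec.channel.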
Claim IMEX channels from a ComputeDomain
A ComputeDomain follows the workload, and its Node membership depends on where Pods land. This means the validity of the resulting IMEX domain depends on scheduling. If Pods spread across Nodes that aren’t physically connected through NVLink, the workload may not function as expected. For this reason, workloads should always include affinity rules to constrain Pods to Nodes within the same rack.
To claim an IMEX channel, add a resourceClaims entry to your Pod specification that references the ResourceClaimTemplate for your rack. Each container that needs IMEX access must also declare the claim under resources.claims.
Minimal example
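The following single-Pod sketch shows the two places where the claim appears: once under spec.resourceClaims and once under the container's resources.claims. The Pod name, container name, image, and claim name are hypothetical and chosen only for illustration. The args assume that the driver exposes channels under /dev/nvidia-caps-imex-channels.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: imex-minimal                      # hypothetical name
spec:
  containers:
  - name: main
    image: ubuntu:22.04                   # placeholder image
    command: ["bash", "-c"]
    # List the injected IMEX channel devices (assumed path), then idle.
    args: ["ls -la /dev/nvidia-caps-imex-channels; sleep infinity"]
    resources:
      claims:
      - name: imex-channel                # must match an entry in spec.resourceClaims
  resourceClaims:
  - name: imex-channel                    # Pod-local name, chosen for illustration
    resourceClaimTemplateName: "[TEMPLATE-NAME]"
```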
Replace [TEMPLATE-NAME] with the name of the channel defined in your ComputeDomain.
Multi-node example: MPIJob across a full GB200 rack
For full-rack distributed workloads, the following example schedules an MPIJob across all Nodes of a GB200 rack using DRA IMEX.
This example requires the MPI Operator installed in your cluster.
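A minimal sketch of such an MPIJob is shown here, assuming the kubeflow.org/v2beta1 MPIJob API. The job name, the worker label, the claim name, and the [IMAGE] and [COMMAND] placeholders are hypothetical. The nvidia.com/gpu limit assumes GPUs are requested as an extended resource in your cluster.

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: imex-rack-job                     # hypothetical name
spec:
  slotsPerWorker: 4                       # one slot per GPU on each Node
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: "[IMAGE]"              # placeholder: image with MPI and your application
            command: ["mpirun"]
            # 18 workers x 4 slots = 72 ranks
            args: ["-np", "72", "[COMMAND]"]
    Worker:
      replicas: 18                        # all Nodes in one rack
      template:
        metadata:
          labels:
            imex-rack-job-role: worker    # hypothetical label used by the affinity rule
        spec:
          affinity:
            podAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels:
                    imex-rack-job-role: worker
                # Co-locate all workers in the same NVLink partition.
                topologyKey: nvidia.com/gpu.clique
          containers:
          - name: worker
            image: "[IMAGE]"
            resources:
              limits:
                nvidia.com/gpu: 4         # assumption: GPUs requested as an extended resource
              claims:
              - name: imex-channel
          resourceClaims:
          - name: imex-channel            # Pod-local name, chosen for illustration
            resourceClaimTemplateName: "[TEMPLATE-NAME]"
```

The affinity rule selects on a label set in the worker template itself, so the sketch does not depend on any labels the MPI Operator adds to Pods.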
- slotsPerWorker: 4 matches the 4 GPUs per Node on GB200 NVL72 systems.
- replicas: 18 covers all Nodes in a single GB200 rack.
- The topologyKey: nvidia.com/gpu.clique affinity ensures all worker Pods land on Nodes within the same NVLink partition, as identified by GPU Feature Discovery.
Verify resource allocation
After submitting a workload, verify that its ResourceClaims are in the allocated,reserved state by listing them in your namespace with kubectl get resourceclaims.
If the claims are not in that state, check that:
- The ComputeDomain for your rack is active: kubectl get computedomain -A.
- The resourceClaimTemplateName in your Pod spec exactly matches an available ResourceClaimTemplate.
- All Pods are scheduled on Nodes within the same nvidia.com/gpu.clique domain.