Enabling IMEX Compute Domains with Dynamic Resource Allocation
With Dynamic Resource Allocation (DRA), workloads running on rack-based instances such as the NVIDIA GB200 and GB300 can make better use of resources like IMEX compute domains.
IMEX with Dynamic Resource Allocation (DRA) is currently a Limited Availability feature in CKS and has the following limitations:
- Limited instance support: Currently only supported on rack-based instances such as the NVIDIA GB200 and GB300.
- Active development: IMEX and the imex-dra components are still in active development.
- Limited Kubernetes Version Support: Requires Kubernetes v1.30 or higher to support DRA for IMEX.
- Manual enablement required: This feature is not enabled by default.
If you would like to have this feature enabled for your rack-based instances, please contact your CoreWeave account manager or reach out to our sales team to learn more.
Rack-based instances
GB200/GB300 instances are composed of individual Nodes (which function as independent instances), yet they are deployed and scheduled in multiples of 18 Nodes per Rack. The key difference from traditional GPU instances is that their performance and throughput are tightly interconnected and optimized at the Rack level, so scheduling and orchestration layers often treat a Rack as a single, large-scale compute resource rather than 18 discrete, isolated instances.
The image below shows the list of Kubernetes Node objects that represent all 18 instances of an NVIDIA GB200 rack.
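To list these Node objects yourself, you can filter on the rack's NVLink-domain label. This is a sketch: the label key `ds.coreweave.com/nvlink.domain` and the rack name are taken from the example workload later on this page, so substitute the values for your own rack.

```shell
# List the 18 Nodes that belong to a single GB200 rack, selected by
# their NVLink domain label (rack name shown is an example).
kubectl get nodes -l ds.coreweave.com/nvlink.domain=S0-011-US-WEST-01A
```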
For rack-based instances, a ComputeDomain is automatically created when the rack joins the cluster.
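You can retrieve the ComputeDomain with kubectl; the namespace matches the one in the manifest shown on this page, and the rack name is an example.

```shell
# List ComputeDomains created for racks in the cluster.
kubectl get computedomains.resource.nvidia.com -n cw-nvidia-gpu-operator

# Inspect a specific ComputeDomain as YAML.
kubectl get computedomain s0-011-us-west-01a -n cw-nvidia-gpu-operator -o yaml
```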
```yaml
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  creationTimestamp: "2025-10-29T18:26:02Z"
  finalizers:
    - resource.nvidia.com/computeDomain
  generation: 1
  name: s0-011-us-west-01a
  namespace: cw-nvidia-gpu-operator
  resourceVersion: "107136503"
  uid: e6d3a22d-12a8-47f7-b55b-fdcbcf4e35d8
spec:
  channel:
    allocationMode: Single
    resourceClaimTemplate:
      name: imex-channel-s0-011-us-west-01a
  numNodes: 0
status:
  status: Ready
```
Scheduling workloads with rack-based instances
Note that scheduling with IMEX and DRA is currently supported only on GB200 and GB300 instances.
Once a rack has been delivered to your cluster and the ComputeDomain created, you can schedule workloads to use IMEX with DRA by creating a ResourceClaim for the IMEX resource.
A ResourceClaimTemplate will exist for the ComputeDomain created for the rack-based instance, which can be used to create ResourceClaims for workloads.
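You can find the template's name, which the workload's `resourceClaims` stanza references, with kubectl; the namespace is the one shown in the manifests on this page.

```shell
# ResourceClaimTemplates for IMEX channels live alongside the ComputeDomain.
kubectl get resourceclaimtemplates -n cw-nvidia-gpu-operator
```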
```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  creationTimestamp: "2025-10-29T18:26:02Z"
  finalizers:
    - resource.nvidia.com/computeDomain
  labels:
    resource.nvidia.com/computeDomain: e6d3a22d-12a8-47f7-b55b-fdcbcf4e35d8
    resource.nvidia.com/computeDomainTarget: Workload
  name: imex-channel-s0-011-us-west-01a
  namespace: cw-nvidia-gpu-operator
  resourceVersion: "107136502"
  uid: e9e2590e-f362-434a-a9a8-d48a864ca672
spec:
  metadata:
    creationTimestamp: null
  spec:
    devices:
      config:
        - opaque:
            driver: compute-domain.nvidia.com
            parameters:
              allocationMode: Single
              apiVersion: resource.nvidia.com/v1beta1
              domainID: e6d3a22d-12a8-47f7-b55b-fdcbcf4e35d8
              kind: ComputeDomainChannelConfig
          requests:
            - channel
      requests:
        - allocationMode: ExactCount
          count: 1
          deviceClassName: compute-domain-default-channel.nvidia.com
          name: channel
```
With the ResourceClaimTemplate in place, workloads can reference it to obtain ResourceClaims for IMEX channels from the ComputeDomain.
```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: dra-example-gb200-4x
spec:
  slotsPerWorker: 4
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: ds.coreweave.com/nvlink.domain
                        operator: In
                        values:
                          - S0-011-US-WEST-01A
          containers:
            - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
              name: mpi-launcher
              securityContext:
                runAsUser: 1000
              command: ["/bin/bash", "-c"]
              args:
                - |
                  sleep infinity;
              resources:
                requests:
                  cpu: 2
                  memory: 128Mi
    Worker:
      replicas: 18
      template:
        metadata:
          labels:
            app: nvbandwidth-test-worker
        spec:
          containers:
            - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
              name: nccl
              securityContext:
                privileged: false
              resources:
                claims:
                  - name: imex-channel-0
                requests:
                  cpu: 110
                  memory: 960Gi
                  nvidia.com/gpu: 4
                limits:
                  memory: 960Gi
                  nvidia.com/gpu: 4
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
          affinity:
            podAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: app
                        operator: In
                        values:
                          - nvbandwidth-test-worker
                  topologyKey: nvidia.com/gpu.clique
          resourceClaims:
            - name: imex-channel-0
              resourceClaimTemplateName: imex-channel-s0-011-us-west-01a
          volumes:
            - emptyDir:
                medium: Memory
              name: dshm
```
After submitting the MPIJob, the following resources are created:
```shell
$ kubectl get computedomain,resourceclaim,resourceclaimtemplate,resourceslice
NAME                                                   AGE
computedomain.resource.nvidia.com/s0-011-us-west-01a   22h

NAME                                                                                        STATE                AGE
resourceclaim.resource.k8s.io/dra-example-gb200-4x-worker-0-imex-channel-0-5vz5j            allocated,reserved   4m3s
resourceclaim.resource.k8s.io/dra-example-gb200-4x-worker-1-imex-channel-0-szpkc            allocated,reserved   4m3s
resourceclaim.resource.k8s.io/dra-example-gb200-4x-worker-10-imex-channel-0-57xxh           allocated,reserved   4m2s
resourceclaim.resource.k8s.io/dra-example-gb200-4x-worker-11-imex-channel-0-qrzqq           allocated,reserved   4m2s
resourceclaim.resource.k8s.io/dra-example-gb200-4x-worker-12-imex-channel-0-qz6ld           allocated,reserved   4m2s
resourceclaim.resource.k8s.io/dra-example-gb200-4x-worker-13-imex-channel-0-gnrst           allocated,reserved   4m2s
resourceclaim.resource.k8s.io/dra-example-gb200-4x-worker-14-imex-channel-0-r2ccv           allocated,reserved   4m1s
resourceclaim.resource.k8s.io/dra-example-gb200-4x-worker-15-imex-channel-0-sffsc           allocated,reserved   4m1s
resourceclaim.resource.k8s.io/dra-example-gb200-4x-worker-16-imex-channel-0-l4l2w           allocated,reserved   4m1s
resourceclaim.resource.k8s.io/dra-example-gb200-4x-worker-17-imex-channel-0-5445b           allocated,reserved   4m1s
resourceclaim.resource.k8s.io/dra-example-gb200-4x-worker-2-imex-channel-0-zdwrt            allocated,reserved   4m3s
resourceclaim.resource.k8s.io/dra-example-gb200-4x-worker-3-imex-channel-0-g62bj            allocated,reserved   4m3s
resourceclaim.resource.k8s.io/dra-example-gb200-4x-worker-4-imex-channel-0-nm2r8            allocated,reserved   4m3s
resourceclaim.resource.k8s.io/dra-example-gb200-4x-worker-5-imex-channel-0-sw48w            allocated,reserved   4m3s
resourceclaim.resource.k8s.io/dra-example-gb200-4x-worker-6-imex-channel-0-7xq25            allocated,reserved   4m3s
resourceclaim.resource.k8s.io/dra-example-gb200-4x-worker-7-imex-channel-0-n8kxc            allocated,reserved   4m3s
resourceclaim.resource.k8s.io/dra-example-gb200-4x-worker-8-imex-channel-0-snk6s            allocated,reserved   4m3s
resourceclaim.resource.k8s.io/dra-example-gb200-4x-worker-9-imex-channel-0-68vvh            allocated,reserved   4m2s
resourceclaim.resource.k8s.io/s0-011-us-west-01a-ws9p4-2mhfk-compute-domain-daemon-hv8p9    allocated,reserved   4m2s
resourceclaim.resource.k8s.io/s0-011-us-west-01a-ws9p4-46bj5-compute-domain-daemon-2plxp    allocated,reserved   4m2s
resourceclaim.resource.k8s.io/s0-011-us-west-01a-ws9p4-6cxtq-compute-domain-daemon-9z6tz    allocated,reserved   4m2s
resourceclaim.resource.k8s.io/s0-011-us-west-01a-ws9p4-7dj8z-compute-domain-daemon-gqznf    allocated,reserved   4m1s
resourceclaim.resource.k8s.io/s0-011-us-west-01a-ws9p4-d46g2-compute-domain-daemon-s6lxj    allocated,reserved   4m2s
resourceclaim.resource.k8s.io/s0-011-us-west-01a-ws9p4-dclxf-compute-domain-daemon-f8vwt    allocated,reserved   4m2s
resourceclaim.resource.k8s.io/s0-011-us-west-01a-ws9p4-dh4vl-compute-domain-daemon-mqzgz    allocated,reserved   4m1s
resourceclaim.resource.k8s.io/s0-011-us-west-01a-ws9p4-dnfb7-compute-domain-daemon-xfp5s    allocated,reserved   4m2s
resourceclaim.resource.k8s.io/s0-011-us-west-01a-ws9p4-f2c64-compute-domain-daemon-ckbts    allocated,reserved   4m1s
resourceclaim.resource.k8s.io/s0-011-us-west-01a-ws9p4-jnbct-compute-domain-daemon-zw9jg    allocated,reserved   4m1s
resourceclaim.resource.k8s.io/s0-011-us-west-01a-ws9p4-l2lzs-compute-domain-daemon-dlj7s    allocated,reserved   74s
resourceclaim.resource.k8s.io/s0-011-us-west-01a-ws9p4-l5tr8-compute-domain-daemon-nkth6    allocated,reserved   4m
resourceclaim.resource.k8s.io/s0-011-us-west-01a-ws9p4-nmg8q-compute-domain-daemon-cslk6    allocated,reserved   4m2s
resourceclaim.resource.k8s.io/s0-011-us-west-01a-ws9p4-nvz88-compute-domain-daemon-7dxlp    allocated,reserved   4m2s
resourceclaim.resource.k8s.io/s0-011-us-west-01a-ws9p4-tpnkg-compute-domain-daemon-bzkpn    allocated,reserved   4m
resourceclaim.resource.k8s.io/s0-011-us-west-01a-ws9p4-wg82q-compute-domain-daemon-m7xz2    allocated,reserved   4m2s
resourceclaim.resource.k8s.io/s0-011-us-west-01a-ws9p4-zc2wf-compute-domain-daemon-r4t9t    allocated,reserved   4m
resourceclaim.resource.k8s.io/s0-011-us-west-01a-ws9p4-zsg4b-compute-domain-daemon-wmw9x    allocated,reserved   4m1s

NAME                                                                                   AGE
resourceclaimtemplate.resource.k8s.io/imex-channel-s0-011-us-west-01a                  22h
resourceclaimtemplate.resource.k8s.io/s0-011-us-west-01a-daemon-claim-template-nbscw   22h

NAME                                                                      NODE       DRIVER                      POOL       AGE
resourceslice.resource.k8s.io/s1vqxs64-compute-domain.nvidia.com-n42kz    s1vqxs64   compute-domain.nvidia.com   s1vqxs64   10m
resourceslice.resource.k8s.io/s2cqxs64-compute-domain.nvidia.com-2vpk9    s2cqxs64   compute-domain.nvidia.com   s2cqxs64   10m
resourceslice.resource.k8s.io/s3bqxs64-compute-domain.nvidia.com-6zpgz    s3bqxs64   compute-domain.nvidia.com   s3bqxs64   10m
resourceslice.resource.k8s.io/s58qxs64-compute-domain.nvidia.com-ggk54    s58qxs64   compute-domain.nvidia.com   s58qxs64   10m
resourceslice.resource.k8s.io/s5fqxs64-compute-domain.nvidia.com-h7bct    s5fqxs64   compute-domain.nvidia.com   s5fqxs64   10m
resourceslice.resource.k8s.io/s67qxs64-compute-domain.nvidia.com-m7c7k    s67qxs64   compute-domain.nvidia.com   s67qxs64   10m
resourceslice.resource.k8s.io/s70qxs64-compute-domain.nvidia.com-6mqlg    s70qxs64   compute-domain.nvidia.com   s70qxs64   10m
resourceslice.resource.k8s.io/s7dqxs64-compute-domain.nvidia.com-6m6f5    s7dqxs64   compute-domain.nvidia.com   s7dqxs64   10m
resourceslice.resource.k8s.io/s7vqxs64-compute-domain.nvidia.com-k6hd6    s7vqxs64   compute-domain.nvidia.com   s7vqxs64   10m
resourceslice.resource.k8s.io/s9lqxs64-compute-domain.nvidia.com-hd4gk    s9lqxs64   compute-domain.nvidia.com   s9lqxs64   10m
resourceslice.resource.k8s.io/s9sqxs64-compute-domain.nvidia.com-7bt68    s9sqxs64   compute-domain.nvidia.com   s9sqxs64   10m
resourceslice.resource.k8s.io/sbdqxs64-compute-domain.nvidia.com-shkz9    sbdqxs64   compute-domain.nvidia.com   sbdqxs64   10m
resourceslice.resource.k8s.io/sf9qxs64-compute-domain.nvidia.com-tscd4    sf9qxs64   compute-domain.nvidia.com   sf9qxs64   10m
resourceslice.resource.k8s.io/sg9qxs64-compute-domain.nvidia.com-zg89f    sg9qxs64   compute-domain.nvidia.com   sg9qxs64   10m
resourceslice.resource.k8s.io/sgcqxs64-compute-domain.nvidia.com-xf4hx    sgcqxs64   compute-domain.nvidia.com   sgcqxs64   10m
resourceslice.resource.k8s.io/sh0qxs64-compute-domain.nvidia.com-xt6ll    sh0qxs64   compute-domain.nvidia.com   sh0qxs64   75s
resourceslice.resource.k8s.io/sh7qxs64-compute-domain.nvidia.com-ckfns    sh7qxs64   compute-domain.nvidia.com   sh7qxs64   10m
resourceslice.resource.k8s.io/shrqxs64-compute-domain.nvidia.com-h4nkt    shrqxs64   compute-domain.nvidia.com   shrqxs64   10m
```
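To confirm that a worker Pod actually received its IMEX channel, you can exec into it and list the channel devices. This is a sketch: the `/dev/nvidia-caps-imex-channels` path follows NVIDIA's convention for injected IMEX channel devices, and the Pod name should be adjusted to match your job.

```shell
# Check that an IMEX channel device was injected into a worker Pod.
# A device such as "channel0" indicates the ResourceClaim was satisfied.
kubectl exec dra-example-gb200-4x-worker-0 -- ls /dev/nvidia-caps-imex-channels
```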
Next steps
If you are interested in leveraging DRA for IMEX, please contact your CoreWeave account manager or reach out to our sales team to have this feature enabled on your CKS cluster.