
# IMEX with Dynamic Resource Allocation

> Schedule workloads using IMEX channels via Kubernetes Dynamic Resource Allocation (DRA) on rack-based instances

This page explains how to schedule multi-node GPU workloads that require IMEX channels on CoreWeave NVL72-powered instances by using Kubernetes Dynamic Resource Allocation (DRA). It's intended for cluster operators and workload authors who need cross-Node GPU memory access within an NVLink Domain.

For background on what IMEX is and how it relates to NVLink domains on CoreWeave, start with [IMEX overview](/products/cks/clusters/scheduling/imex-overview).

[IMEX (Internode Memory Exchange)](https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html) is NVIDIA's mechanism that enables direct GPU-to-GPU memory access across Nodes within the same NVLink Domain. On [NVL72-powered instances](/platform/instances/nvl72), IMEX is required for multi-node workloads that rely on high-bandwidth communication across all Nodes in a rack.

<Warning>
  **Limited availability**

  DRA IMEX is in limited availability on [NVL72-powered instances](/platform/instances/nvl72) and is under active development.
  To enable this feature for your cluster, contact your CoreWeave account manager or [reach out to our sales team](https://www.coreweave.com/contact-us).
</Warning>

CoreWeave uses the [NVIDIA DRA Driver](https://github.com/kubernetes-sigs/dra-driver-nvidia-gpu) as its mechanism for assigning IMEX channels to workloads. It provides the `ComputeDomain` abstraction, which handles the machinery required to present IMEX channels as allocatable container resources through DRA.

<Info>
  Previously, CoreWeave provisioned IMEX channels transparently through the `nvidia-imex` DaemonSet. With the NVIDIA DRA driver, IMEX channels are allocated on demand to the Pods that request them.
</Info>

## Provision ComputeDomains

A `ComputeDomain` defines a logical container for a set of Nodes that are permitted to share an IMEX channel allocation. You create a `ComputeDomain` in your namespace, and the controller generates a corresponding `ResourceClaimTemplate` that workloads can reference to obtain access to a shared channel.

<Note>
  Each independent workload should use its own `ComputeDomain`. Deploying multiple workloads into a single `ComputeDomain` works, but it may result in unintended memory sharing between them.
</Note>

```yaml theme={"system"}
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: [YOUR-COMPUTE-DOMAIN-NAME]
  namespace: [YOUR-NAMESPACE]
spec:
  channel:
    resourceClaimTemplate:
      name: imex-channel-0
```

The generated `ResourceClaimTemplate` takes the name set in `spec.channel.resourceClaimTemplate.name` and is created in the same namespace as the `ComputeDomain`.

```yaml theme={"system"}
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: imex-channel-0
  namespace: [YOUR-NAMESPACE]
  # ...
```
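The manifest above can be templated from the shell when you create one `ComputeDomain` per workload. The following is a minimal sketch, not a CoreWeave-provided tool: `render_compute_domain` is a hypothetical helper name, and the example arguments (`my-domain`, `my-namespace`) are placeholders.

```bash
#!/usr/bin/env bash

# Hypothetical helper: prints a ComputeDomain manifest for the given
# ComputeDomain name, namespace, and ResourceClaimTemplate name.
render_compute_domain() {
  local name="$1" namespace="$2" template="$3"
  cat <<EOF
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: ${name}
  namespace: ${namespace}
spec:
  channel:
    resourceClaimTemplate:
      name: ${template}
EOF
}

# Example usage against a cluster (not run by this snippet):
#   render_compute_domain my-domain my-namespace imex-channel-0 | kubectl apply -f -
#   kubectl get resourceclaimtemplate imex-channel-0 -n my-namespace
```

The second command in the usage comment lists the generated `ResourceClaimTemplate`, which is a quick way to confirm that the controller has processed the `ComputeDomain`.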

## Claim IMEX channels from a ComputeDomain

A `ComputeDomain` follows the workload, and its Node membership depends on where Pods land. This means the validity of the resulting IMEX domain depends on scheduling. If Pods spread across Nodes that aren't physically connected through NVLink, the workload may not function as expected. For this reason, workloads should always include affinity rules to constrain Pods to Nodes within the same rack.

To claim an IMEX channel, add a `resourceClaims` entry to your Pod specification that references the `ResourceClaimTemplate` generated by your `ComputeDomain`. Each container that needs IMEX access must also declare the claim under `resources.claims`.

### Minimal example

<Note>
  Replace `[TEMPLATE-NAME]` with the `ResourceClaimTemplate` name defined in your `ComputeDomain` under `spec.channel.resourceClaimTemplate.name`, for example `imex-channel-0`.
</Note>

```yaml theme={"system"}
apiVersion: v1
kind: Pod
metadata:
  name: imex-workload
  labels:
    app: imex-workload
spec:
  resourceClaims:
  - name: imex-channel
    resourceClaimTemplateName: [TEMPLATE-NAME]
  containers:
  - name: workload
    image: [YOUR-IMAGE]
    resources:
      claims:
      - name: imex-channel
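      requests:
        nvidia.com/gpu: 4
      limits:
        # Extended resources such as nvidia.com/gpu need a limit equal to the request.
        nvidia.com/gpu: 4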
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - imex-workload
        topologyKey: nvidia.com/gpu.clique
```
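Because the affinity rule keys on `nvidia.com/gpu.clique`, it can help to check how many Nodes each clique contains before you submit a workload. The following is a minimal sketch: `count_nodes_per_clique` is a hypothetical helper that assumes the default `kubectl get nodes` column layout (five columns, plus one label column added by `-L`).

```bash
#!/usr/bin/env bash

# Hypothetical helper: reads the output of
#   kubectl get nodes --no-headers -L nvidia.com/gpu.clique
# on stdin and prints "<clique> <node-count>" for each clique, sorted by clique.
# Nodes without the label have only five columns and are skipped.
count_nodes_per_clique() {
  awk 'NF >= 6 { count[$6]++ } END { for (c in count) print c, count[c] }' | sort
}

# Example usage against a cluster (not run by this snippet):
#   kubectl get nodes --no-headers -L nvidia.com/gpu.clique | count_nodes_per_clique
```

If a workload runs one Pod per Node, the target clique needs at least as many schedulable Nodes as the workload has Pods.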

### Multi-node example: MPIJob across a full GB200 rack

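For full-rack distributed workloads, the following example schedules an MPIJob across all Nodes of a GB200 rack using DRA IMEX. The launcher container runs `sleep infinity`, so the job stays up until you run a command in the launcher Pod.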

<Note>
  This example requires the [MPI Operator](https://github.com/kubeflow/mpi-operator) installed in your cluster.
</Note>

* `slotsPerWorker: 4` matches the 4 GPUs per Node on GB200 NVL72 systems.
* `replicas: 18` covers all Nodes in a single GB200 rack (18 Nodes × 4 GPUs = 72 GPUs).
* The `topologyKey: nvidia.com/gpu.clique` affinity ensures all worker Pods land on Nodes within the same NVLink partition, as identified by GPU Feature Discovery.

```yaml theme={"system"}
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: dra-example-gb200-4x
spec:
  slotsPerWorker: 4
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
              name: mpi-launcher
              securityContext:
                runAsUser: 1000
              command: ["/bin/bash", "-c"]
              args:
                - sleep infinity
              resources:
                requests:
                  cpu: 2
                  memory: 128Mi
    Worker:
      replicas: 18
      template:
        metadata:
          labels:
            app: nvbandwidth-test-worker
        spec:
          resourceClaims:
          - name: imex-channel
            resourceClaimTemplateName: [IMEX-CHANNEL-TEMPLATE-NAME]
          containers:
            - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
              name: nccl
              securityContext:
                privileged: false
              resources:
                claims:
                - name: imex-channel
                requests:
                  cpu: 110
                  memory: 960Gi
                  nvidia.com/gpu: 4
                limits:
                  memory: 960Gi
                  nvidia.com/gpu: 4
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
          affinity:
            podAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: app
                    operator: In
                    values:
                    - nvbandwidth-test-worker
                topologyKey: nvidia.com/gpu.clique
          volumes:
            - emptyDir:
                medium: Memory
              name: dshm
```
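The launcher in this example idles, so a benchmark has to be started by hand. The following sketch shows how the rank count follows from the manifest (`replicas` × `slotsPerWorker`). `build_mpirun_command` is a hypothetical helper. The `mpirun` options and the `nvbandwidth` test case name are assumptions based on Open MPI and the nvbandwidth sample image, so confirm them against the image you run.

```bash
#!/usr/bin/env bash

# Hypothetical helper: prints an mpirun command line for the given number of
# workers and slots per worker. Total MPI ranks = workers * slots.
build_mpirun_command() {
  local workers="$1" slots="$2"
  local ranks=$(( workers * slots ))
  # Assumption: the test case name below exists in the nvbandwidth build you use.
  echo "mpirun --map-by ppr:${slots}:node -np ${ranks} nvbandwidth -t multinode_device_to_device_memcpy_read_ce"
}

# Example usage against a cluster (not run by this snippet):
#   kubectl exec -n [YOUR-NAMESPACE] [LAUNCHER-POD-NAME] -- \
#     bash -c "$(build_mpirun_command 18 4)"
```

For the manifest above, `build_mpirun_command 18 4` requests 72 ranks, one per GPU in the rack.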

### Verify resource allocation

After submitting a workload, verify that its `ResourceClaims` are in the `allocated,reserved` state by listing them in your namespace:

```bash theme={"system"}
kubectl get resourceclaim -n [YOUR-NAMESPACE]
```

Expected output:

```text theme={"system"}
NAME                                         STATE                AGE
dra-example-gb200-4x-worker-0-imex-...      allocated,reserved   30s
dra-example-gb200-4x-worker-1-imex-...      allocated,reserved   30s
```

If claims remain unallocated, verify that:

* The `ComputeDomain` for your workload exists and is active: `kubectl get computedomain -A`.
* The `resourceClaimTemplateName` in your Pod spec exactly matches an available `ResourceClaimTemplate`.
* All Pods are scheduled on Nodes within the same `nvidia.com/gpu.clique` domain.
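With many worker Pods, filtering the claim list is quicker than reading it by eye. The following is a minimal sketch: `list_unready_claims` is a hypothetical helper that assumes the three-column `NAME STATE AGE` output shown above.

```bash
#!/usr/bin/env bash

# Hypothetical helper: reads the output of
#   kubectl get resourceclaim --no-headers
# on stdin and prints the name of every claim whose STATE column is not
# exactly "allocated,reserved". Prints nothing when all claims are ready.
list_unready_claims() {
  awk 'NF >= 2 && $2 != "allocated,reserved" { print $1 }'
}

# Example usage against a cluster (not run by this snippet):
#   kubectl get resourceclaim -n [YOUR-NAMESPACE] --no-headers | list_unready_claims
#   kubectl describe resourceclaim [CLAIM-NAME] -n [YOUR-NAMESPACE]
```

Running `kubectl describe` on a claim that the helper reports shows its status and any recorded events, which is usually the next step when a claim stays unallocated.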
