
Select GPU driver versions in CKS Node Pools

Update the GPU driver for your Node Pools

Preview feature

GPU driver management is currently a preview feature, and has the following limitations:

  • No Cloud Console UI: Configuration must be done through Kubernetes manifests.
  • No automatic minor or patch version upgrades: Only major version selection is currently supported.
  • No release channels: The latest and stable release channels are not yet available for Node Pools.

Create a new Node Pool with a specific driver version

Driver versions are configured in the Node Pool manifest. To select a driver version, add the gpu section to your Node Pool manifest's spec section, specifying the desired major version without dots.

For example, for an H100 Node Pool that should run driver 570.172.08-0ubuntu1, you would specify the major version 570:

Example
apiVersion: compute.coreweave.com/v1alpha1
kind: NodePool
metadata:
  name: example-nodepool
spec:
  computeClass: default
  instanceType: gd-8xh100ib-i128
  targetNodes: 1
  gpu:
    version: "570" # Specify driver version by major version

If no driver is specified, the Node Pool automatically uses the latest available driver.
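
For reference, this is the same manifest with the gpu section omitted, which leaves driver selection to CKS:

Example
# No gpu section: the latest available driver is installed automatically
apiVersion: compute.coreweave.com/v1alpha1
kind: NodePool
metadata:
  name: example-nodepool
spec:
  computeClass: default
  instanceType: gd-8xh100ib-i128
  targetNodes: 1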

Update the driver version on an existing Node Pool

If a driver is currently specified on an existing Node Pool, you can update it to a new major version by modifying the existing Node Pool manifest.

Example
# Original Node Pool
apiVersion: compute.coreweave.com/v1alpha1
kind: NodePool
metadata:
  name: test-nodepool
spec:
  computeClass: default
  instanceType: gd-8xh100-i128
  targetNodes: 1
  gpu:
    version: "570"

The Node Pool manifest would be updated to:

Example
# Updated Node Pool with new driver version
apiVersion: compute.coreweave.com/v1alpha1
kind: NodePool
metadata:
  name: test-nodepool
spec:
  computeClass: default
  instanceType: gd-8xh100-i128
  targetNodes: 1
  gpu:
    version: "580"
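
One way to roll out the change is to apply the updated manifest with kubectl, or to patch the field directly. The filename nodepool.yaml below is a placeholder for wherever you keep the manifest:

Example
# Apply the updated manifest
$ kubectl apply -f nodepool.yaml
# Or patch the driver version in place
$ kubectl patch nodepool test-nodepool --type merge -p '{"spec":{"gpu":{"version":"580"}}}'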

Target driver versions using Node labels and selectors

Driver version information is exposed on Nodes through Kubernetes labels. You can use these labels to get information on current driver versions and to target specific driver versions in your workloads.

Example
# Check the current driver version label on nodes
$ kubectl get nodes --show-labels | grep driver-version

Node labels use the key gpu.coreweave.cloud/driver-version, and the value is the full driver version installed on the Node. For example, a Node with the label gpu.coreweave.cloud/driver-version=570.172.08-0ubuntu1 is running driver version 570.172.08-0ubuntu1, which belongs to major version 570.

Note

The gpu.coreweave.cloud/driver-version label is always applied to Nodes, even if no driver version is specified in the Node Pool manifest.
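
To view the driver version for every Node at a glance, you can also ask kubectl to show the label as its own column instead of grepping the full label list:

Example
# Show the driver-version label as a column for each Node
$ kubectl get nodes -L gpu.coreweave.cloud/driver-version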

Target specific driver versions in workloads

The gpu.coreweave.cloud/driver-version label allows you to target Nodes with exact driver version matches.

Info

For detailed information about scheduling workloads on Nodes with specific driver versions, see Scheduling Workloads. It is strongly recommended to avoid scheduling across multiple driver versions in a single Node Pool.

Scheduling workloads on Nodes with specific driver versions

For workloads that require a specific driver version, use an exact match with the nodeSelector field:

Example
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
    - name: gpu-container
      image: nvidia/cuda:11.8-base
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    gpu.coreweave.cloud/driver-version: "570.172.08-0ubuntu1"
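
Before deploying, you can confirm that at least one Node currently carries the version in the selector by filtering Nodes on the same label (the version value here matches the example above):

Example
# List Nodes whose driver-version label matches the selector
$ kubectl get nodes -l gpu.coreweave.cloud/driver-version=570.172.08-0ubuntu1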

Troubleshooting scheduling issues

If Pods fail to schedule due to driver version constraints, check the available driver versions in your cluster:

Example
# Check available driver versions
$ kubectl get nodes --show-labels | grep driver-version

# Check Pod events for scheduling failures
$ kubectl describe pod <pod-name> | grep -A 10 Events:

Common scheduling issues may include:

  • No Nodes available with the exact driver version specified
  • Nodes with the required driver version are unavailable due to resource constraints (see the check after this list)
  • Driver version constraints conflict with other scheduling requirements
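
To check whether resource constraints are the problem on a Node that does carry the required driver version, inspect its GPU capacity and current allocation (the Node name is a placeholder):

Example
# Check GPU capacity and allocatable counts on the Node
$ kubectl describe node <node-name> | grep -i nvidia.com/gpu
# Review overall resource allocation on the Node
$ kubectl describe node <node-name> | grep -A 8 "Allocated resources"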

Troubleshooting

Common error conditions

If you encounter issues with driver configuration, check the Node Pool status for error conditions:

Example
Status:
  Conditions:
    Last Transition Time:  2025-06-30T19:25:16Z
    Message:               unable to create configuration for NodePool
    Reason:                InternalError
    Status:                False
    Type:                  Validated

Node Pool errors

For more information about Node Pool events and possible error conditions, see Node Pool events.

Verify the driver version

To verify your Node Pool configuration and driver status, you can:

Describe the Node Pool:

Example
# Check Node Pool status
$ kubectl describe nodepool your-nodepool-name
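
If you only want the status conditions (like the error example above), one option is a jsonpath query:

Example
# Print only the Node Pool status conditions
$ kubectl get nodepool your-nodepool-name -o jsonpath='{.status.conditions}'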

Check the Node labels for driver version:

Example
# Check node labels for driver version
$ kubectl get nodes --show-labels | grep driver-version

Or, check the GPU driver information directly by running nvidia-smi in a Pod that is scheduled on the Node:

Example
# Check GPU driver information on nodes
$ kubectl exec -it <pod-name> -- nvidia-smi