Select GPU driver versions in CKS Node Pools
Update the GPU driver for your Node Pools
GPU driver management is currently a preview feature, and has the following limitations:
- No Cloud Console UI: Configuration must be done through Kubernetes manifests.
- No automatic minor or patch version upgrades: Only major version selection is currently supported.
- Release channels are not yet implemented: The `latest` and `stable` release channels are not yet implemented in Node Pools.
Create a new Node Pool with a specific driver version
Driver versions are configured in the Node Pool manifest. To select a driver version, add the `gpu` section to your Node Pool manifest's `spec` section, specifying the desired major version without dots. For example, for an H100 Node Pool running driver `570.172.08-0ubuntu1`, specify the major version `"570"`:
```yaml
apiVersion: compute.coreweave.com/v1alpha1
kind: NodePool
metadata:
  name: example-nodepool
spec:
  computeClass: default
  instanceType: gd-8xh100ib-i128
  targetNodes: 1
  gpu:
    version: "570" # Specify driver version by major version
```
If no driver is specified, the Node Pool automatically uses the latest available driver.
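For instance, a manifest like the following, with no `gpu` section, would receive the latest available driver automatically (the name here is illustrative):

```yaml
# No gpu section: the Node Pool uses the latest available driver
apiVersion: compute.coreweave.com/v1alpha1
kind: NodePool
metadata:
  name: example-nodepool-default # illustrative name
spec:
  computeClass: default
  instanceType: gd-8xh100ib-i128
  targetNodes: 1
```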
Update the driver version on an existing Node Pool
If a driver is currently specified on an existing Node Pool, you can update it to a new major version by modifying the existing Node Pool manifest.
```yaml
# Original Node Pool
apiVersion: compute.coreweave.com/v1alpha1
kind: NodePool
metadata:
  name: test-nodepool
spec:
  computeClass: default
  instanceType: gd-8xh100-i128
  targetNodes: 1
  gpu:
    version: "570"
```
The Node Pool manifest would be updated to:
```yaml
# Updated Node Pool with new driver version
apiVersion: compute.coreweave.com/v1alpha1
kind: NodePool
metadata:
  name: test-nodepool
spec:
  computeClass: default
  instanceType: gd-8xh100-i128
  targetNodes: 1
  gpu:
    version: "580"
```
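After editing the manifest, apply the change with `kubectl` (this assumes the manifest is saved locally as `nodepool.yaml`; adjust the filename to match your setup):

```shell
# Apply the updated Node Pool manifest
$ kubectl apply -f nodepool.yaml
```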
Target driver versions using Node labels and selectors
Driver version information is exposed on Nodes through Kubernetes labels. You can use these labels to get information on current driver versions and to target specific driver versions in your workloads.
```shell
# Check the current driver version label on nodes
$ kubectl get nodes --show-labels | grep driver-version
```
Node labels are in the format `gpu.coreweave.cloud/driver-version=<major>.<minor>.<patch>`, where the value represents the full driver version. For example, a Node with the label `gpu.coreweave.cloud/driver-version=570.172.08-0ubuntu1` is running driver version `570.172.08-0ubuntu1`.
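Because the label value carries the full version string, tooling that only cares about the major version can derive it by splitting the value. A minimal sketch (the label format is taken from above; the helper name is my own):

```python
def major_version(label_value: str) -> str:
    """Extract the major driver version from a
    gpu.coreweave.cloud/driver-version label value,
    e.g. "570.172.08-0ubuntu1" -> "570"."""
    return label_value.split(".", 1)[0]

print(major_version("570.172.08-0ubuntu1"))  # -> 570
```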
The `gpu.coreweave.cloud/driver-version` label is always applied to Nodes, even if no driver version is specified in the Node Pool manifest.
Target specific driver versions in workloads
The `gpu.coreweave.cloud/driver-version` label allows you to target Nodes with exact driver version matches.
For detailed information about scheduling workloads on Nodes with specific driver versions, see Scheduling Workloads. It is strongly recommended to avoid scheduling across multiple driver versions in a single Node Pool.
Scheduling workloads on Nodes with specific driver versions
For workloads that require a specific driver version, use an exact match with the `nodeSelector` field:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
    - name: gpu-container
      image: nvidia/cuda:11.8-base
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    gpu.coreweave.cloud/driver-version: "570.172.08-0ubuntu1"
```
Troubleshooting scheduling issues
If Pods fail to schedule due to driver version constraints, check the available driver versions in your cluster:
```shell
# Check available driver versions
$ kubectl get nodes --show-labels | grep driver-version

# Check Pod events for scheduling failures
$ kubectl describe pod <pod-name> | grep -A 10 Events:
```
Common scheduling issues include:
- No Nodes available with the exact driver version specified
- Nodes with the required driver version are unavailable due to resource constraints
- Driver version constraints conflict with other scheduling requirements
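Because `nodeSelector` requires an exact string match, a Pod that requests one full version string will never land on a Node labeled with a different patch level. The matching behavior can be sketched roughly as follows (the node names and the second version string are illustrative):

```python
def schedulable_nodes(nodes: dict[str, str], required: str) -> list[str]:
    """Return the Node names whose driver-version label exactly
    matches the required value, mirroring nodeSelector semantics."""
    return [name for name, version in nodes.items() if version == required]

nodes = {
    "node-a": "570.172.08-0ubuntu1",
    "node-b": "580.65.06-0ubuntu1",  # illustrative version string
}

print(schedulable_nodes(nodes, "570.172.08-0ubuntu1"))  # -> ['node-a']
# A bare major version does not match any full label value:
print(schedulable_nodes(nodes, "570"))  # -> []
```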
Troubleshooting
Common error conditions
If you encounter issues with driver configuration, check the Node Pool status for error conditions:
```yaml
Status:
  Conditions:
    Last Transition Time:  2025-06-30T19:25:16Z
    Message:               unable to create configuration for NodePool
    Reason:                InternalError
    Status:                False
    Type:                  Validated
```
For more information about Node Pool events and possible error conditions, see Node Pool events.
Verify the driver version
To verify your Node Pool configuration and driver status, you can:
Describe the Node Pool:
```shell
# Check Node Pool status
$ kubectl describe nodepool your-nodepool-name
```
Check the Node labels for driver version:
```shell
# Check node labels for driver version
$ kubectl get nodes --show-labels | grep driver-version
```
Or, check the GPU driver information on the Nodes by running `nvidia-smi` in a Pod running on the Node:
```shell
# Check GPU driver information on nodes
$ kubectl exec -it <pod-name> -- nvidia-smi
```