Limitations
GPU driver management in CKS has the following limitations:
- No Cloud Console support: Configuration must be done through Kubernetes manifests.
- Limited version updates: You can only select major version updates. When minor version updates are available, CKS generates a new pending configuration on the Node Pool. Pending configurations can be found in the Node Pool’s status.pendingNodeConfiguration field. See Manage Node Pool configuration for more details.
- Release channels are not supported: The latest and stable release channels are not supported in Node Pools.
Create a new Node Pool with a specific driver version
Driver versions are configured in the Node Pool manifest. To select a driver version, add the gpu section to the spec section of your Node Pool manifest, specifying the desired major version without dots.
For example, for an H100 Node Pool, specify the driver version as 570.
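A minimal sketch of such a manifest follows. Only the gpu section under spec and the major version 570 come from this page. The apiVersion, the instance type, the Node count, and the version key inside the gpu section are assumptions made for illustration, so check the exact field names against the Node Pool reference before applying it.

```yaml
# Illustrative Node Pool manifest. Everything except the presence of the
# gpu section under spec and the 570 major version is an assumption.
apiVersion: compute.coreweave.com/v1alpha1   # assumed API group and version
kind: NodePool
metadata:
  name: example-nodepool                     # hypothetical name
spec:
  instanceType: gd-8xh100ib-i128             # assumed H100 instance type
  targetNodes: 2                             # hypothetical Node count
  gpu:
    version: "570"                           # major version only, no dots; key name is assumed
```

Quoting the version keeps it a string rather than an integer, which is the safer choice if the field is typed as a string.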
Update the driver version on an existing Node Pool
This section shows how to change the driver version on a Node Pool that already specifies one. If a driver is currently specified on an existing Node Pool, you can update it to a new major version by modifying the existing Node Pool manifest.
Apply GPU driver updates
With the default node configuration update strategy, OnSpecUpdate, updating the driver version automatically stages the new configuration onto the Node Pool. You can then reconfigure-reboot existing Nodes to apply the change. For more information about configuration management, see Manage Node Pool configuration.
Target driver versions using Node labels and selectors
Once your Node Pool is configured with a driver version, you can identify and target Nodes by their driver version from within Kubernetes. Driver version information is exposed on Nodes through Kubernetes labels. You can use these labels to get information on current driver versions and to target specific driver versions in your workloads.
The label has the form gpu.coreweave.cloud/driver-version=[DRIVER-VERSION], where [DRIVER-VERSION] is the full driver version string. For example, a Node with the label gpu.coreweave.cloud/driver-version=570.172.08-0ubuntu1 is running driver version 570.172.08-0ubuntu1.
The gpu.coreweave.cloud/driver-version label is always applied to Nodes, even if no driver version is specified in the Node Pool manifest.
Target specific driver versions in workloads
The gpu.coreweave.cloud/driver-version label lets you target Nodes with exact driver version matches.
For detailed information about scheduling workloads on Nodes with specific driver versions, see Scheduling Workloads. Avoid scheduling across multiple driver versions in a single Node Pool.
Schedule workloads on Nodes with specific driver versions
For workloads that require a specific driver version, use an exact match in the nodeSelector field.
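As a minimal sketch, a Pod pinned to one driver version might look like the following. The label key and the example version string come from this page. The Pod name, container image, and the nvidia.com/gpu resource name are assumptions for illustration and may differ in your cluster.

```yaml
# Illustrative Pod that schedules only onto Nodes whose driver version
# label exactly matches the given string.
apiVersion: v1
kind: Pod
metadata:
  name: driver-pinned-example                      # hypothetical name
spec:
  nodeSelector:
    # Exact match on the full driver version string, not the major version.
    gpu.coreweave.cloud/driver-version: "570.172.08-0ubuntu1"
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1                        # assumed GPU resource name
```

Because nodeSelector requires an exact match, the Pod stays Pending if no Node carries that precise label value, for example after a minor version update changes the full version string.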
Troubleshoot scheduling issues
If Pods fail to schedule due to driver version constraints, check the available driver versions in your cluster. Replace [POD-NAME] with the name of your Pod.
Common causes of scheduling failures include:
- No Nodes available with the exact driver version specified.
- Nodes with the required driver version are unavailable due to resource constraints.
- Driver version constraints conflict with other scheduling requirements.
Troubleshooting
This section covers common error conditions and how to verify the active driver version on your Nodes.
Common error conditions
If you encounter issues with driver configuration, check the Node Pool status for error conditions.
Node Pool errors
For more information about Node Pool events and possible error conditions, see Node Pool events.
Verify the driver version
To verify your Node Pool configuration and driver status, use any of the following methods:
- Describe the Node Pool. Replace [NODE-POOL-NAME] with the name of your Node Pool.
- Run nvidia-smi on a Pod running on the Node. Replace [POD-NAME] with the name of your Pod.