This guide shows you how to set up Ray with Kueue on CKS. It covers the following:
- How to install Ray and Kueue
- How to create Ray clusters that can queue jobs efficiently
- How to use helper scripts to manage distributed computing workloads
Prerequisites
Before beginning, we recommend creating a PVC for shared use. A shared PVC makes it easier for ML engineers working together to maintain persistent directories across ephemeral or autoscaling clusters. For information on creating a PVC, see Create PVCs. Note that if you want to connect to your Pods using SSH, you need to build your Ray image with an SSH server.
Install Ray
To install Ray, complete the following steps:
- Run the following command to add the KubeRay Helm repo:
  You should see output similar to the following:
- Install KubeRay on your CKS cluster by running the following command:
  You should see output similar to the following:
Install Kueue
To install Kueue, follow the Kueue documentation. After installing Kueue, be sure to deploy the Kueue configuration file with kubectl apply -f <kueue-config-file>.
Currently, Kueue is configured to find the local queue in the default namespace. Work within the default namespace for this guide.
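As a rough illustration of what a local queue in the default namespace looks like, the sketch below builds a Kueue LocalQueue manifest as a Python dictionary and prints it as JSON, which kubectl apply -f also accepts. The names user-queue and cluster-queue are hypothetical placeholders, and the v1beta1 API version is an assumption. Neither is taken from the Kueue configuration file mentioned above, so check both against your own installation.

```python
import json


def local_queue_manifest(name, cluster_queue, namespace="default"):
    """Build a Kueue LocalQueue manifest that points at a ClusterQueue."""
    return {
        # Assumed API version; confirm which version your Kueue install serves.
        "apiVersion": "kueue.x-k8s.io/v1beta1",
        "kind": "LocalQueue",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"clusterQueue": cluster_queue},
    }


# Hypothetical queue names, for illustration only.
manifest = local_queue_manifest("user-queue", "cluster-queue")
print(json.dumps(manifest, indent=2))
```

Workloads submitted in the default namespace refer to the LocalQueue by name, and the LocalQueue in turn draws quota from the ClusterQueue it points at.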
Create a Ray cluster
To make it easier to create and manage a Ray cluster, we recommend using the following script: create_ray_box.
To use the script, complete the following steps:
- Download the script using curl:
- Make the script executable:
- Modify the lines related to GPUs, CPUs, and memory:
Run ./create_ray_box without arguments to see its options.
Examples
The following two examples show how to create a Ray cluster:
- Using the provided script
- Using a YAML file
The following example creates a Ray dev box with four worker Nodes and one head Node.
Note the following:
- --nodes: The number of worker Nodes, in addition to the head Node. This example creates a Ray dev box with a total of four worker Nodes and one head Node. All Nodes can run jobs. If there are insufficient resources, the cluster is queued.
- --image: The Ray Docker image. We highly recommend building a custom Ray image using CoreWeave’s nccl-tests as a base image for InfiniBand support. If you don’t include the --image option, the script uses the rayproject/ray-ml:2.9.0-gpu image.
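For the YAML-file approach, a manifest along the following lines is one possibility. This is a minimal sketch, not the output of create_ray_box: it builds a RayCluster manifest as a Python dictionary and prints it as JSON, which kubectl apply -f accepts. The cluster name ray-dev-box, the queue name user-queue, and the figure of eight GPUs per Node are assumptions chosen for illustration. The kueue.x-k8s.io/queue-name label is how Kueue associates a workload with a local queue.

```python
import json


def ray_cluster_manifest(name, queue_name, image, workers, gpus_per_node):
    """Build a minimal RayCluster manifest that Kueue can queue."""

    def pod_template(container_name):
        # Head and workers share the same image and GPU limit in this sketch.
        return {
            "spec": {
                "containers": [
                    {
                        "name": container_name,
                        "image": image,
                        "resources": {
                            "limits": {"nvidia.com/gpu": str(gpus_per_node)}
                        },
                    }
                ]
            }
        }

    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayCluster",
        "metadata": {
            "name": name,
            "namespace": "default",
            # Kueue reads this label to choose the LocalQueue.
            "labels": {"kueue.x-k8s.io/queue-name": queue_name},
        },
        "spec": {
            "headGroupSpec": {
                "rayStartParams": {},
                "template": pod_template("ray-head"),
            },
            "workerGroupSpecs": [
                {
                    "groupName": "workers",
                    "replicas": workers,
                    "minReplicas": workers,
                    "maxReplicas": workers,
                    "rayStartParams": {},
                    "template": pod_template("ray-worker"),
                }
            ],
        },
    }


# Hypothetical values: four workers plus one head, default image from the text.
manifest = ray_cluster_manifest(
    name="ray-dev-box",
    queue_name="user-queue",
    image="rayproject/ray-ml:2.9.0-gpu",
    workers=4,
    gpus_per_node=8,
)
print(json.dumps(manifest, indent=2))
```

Redirect the output to a file and apply it with kubectl apply -f. A real manifest would usually also set CPU and memory requests, and would mount the shared PVC.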
Working with Ray Clusters
| Task | Command | Notes |
|---|---|---|
| List Ray clusters | kubectl get raycluster | Shows all clusters in the namespace. |
| List Pods | kubectl get pods | Lists the head and worker Pods. |
| Log into head Pod | kubectl exec -it <cluster-name>-head -- /bin/bash | Replace <cluster-name> with your cluster’s name. |
| Log into worker Pod | kubectl exec -it <worker-pod-name> -- /bin/bash | Replace <worker-pod-name> with the actual Pod name. |
| List jobs in queue | kubectl get queue | View pending and admitted workloads. |
| Get queue details | kubectl describe queue | Shows detailed information about the queue and jobs. |
| Delete a Ray cluster | kubectl delete raycluster <cluster-name> | Replace <cluster-name> with your cluster’s name. |
| Shared storage path | /mnt/vast | Where shared storage is configured. |
Testing NCCL
To test that InfiniBand is correctly configured with your container and cluster, you can run the script nccl-test/all_reduce_ray.py.
Additional helper scripts
In the CoreWeave/reference-architecture repository, you can find helper scripts for working with Ray clusters. These utility Python scripts make it easier to work with a small team on a Ray cluster. For example, to view the Ray cluster capacity, navigate to the ray-kueue directory in the reference architecture repository and run the following command:
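Independently of the repository's scripts, cluster capacity can also be read through Ray's Python API. The following is a minimal sketch, not the helper script itself. It assumes that it runs inside the head Pod, where ray.init(address="auto") can find the running cluster. The function names summarize_capacity and fetch_capacity are hypothetical, and the example numbers are made up.

```python
def summarize_capacity(total, available):
    """Return {resource: (used, total)} for CPU and GPU."""
    summary = {}
    for resource in ("CPU", "GPU"):
        total_amount = total.get(resource, 0.0)
        used = total_amount - available.get(resource, 0.0)
        summary[resource] = (used, total_amount)
    return summary


def fetch_capacity():
    """Query a running Ray cluster; intended to be called from the head Pod."""
    import ray  # imported lazily so the summary logic works without Ray installed

    ray.init(address="auto")
    try:
        return summarize_capacity(
            ray.cluster_resources(), ray.available_resources()
        )
    finally:
        ray.shutdown()


# Made-up numbers shaped like the dictionaries Ray returns.
example = summarize_capacity(
    {"CPU": 64.0, "GPU": 8.0},
    {"CPU": 48.0, "GPU": 2.0},
)
for resource, (used, total_amount) in example.items():
    print(f"{resource}: {used:g} of {total_amount:g} in use")
```

On a live cluster, call fetch_capacity() in place of the made-up example to get the same summary from ray.cluster_resources() and ray.available_resources().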