- Install Ray and Kueue.
- Create Ray clusters that can queue jobs.
- Use helper scripts to manage distributed computing workloads.
Prerequisites
Before you begin, we recommend creating a PersistentVolumeClaim (PVC) for shared use. A shared PVC lets ML engineers who work together keep persistent directories across ephemeral or autoscaling clusters. For information on creating a PVC, see Create PVCs. If you want to connect to your Pods using SSH, you need to build your Ray image with an SSH server.
Install Ray
Ray provides the distributed compute runtime that executes your jobs across the cluster. Install the KubeRay operator on your CKS cluster to manage Ray clusters as Kubernetes resources. To install Ray, complete the following steps:
- Add the KubeRay Helm repository:
  You should see output similar to the following:
- Install KubeRay on your CKS cluster:
  You should see output similar to the following:
Install Kueue
With the KubeRay operator running, the next step is to install Kueue, which handles job queuing and quota-aware scheduling for the Ray clusters you create later in this guide. To install Kueue, follow the Kueue documentation. After you install Kueue, be sure to deploy the Kueue configuration file with kubectl apply -f [KUEUE-CONFIG-FILE].
Kueue is configured to find the local queue in the default namespace. Work within the default namespace for this guide.
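The Kueue configuration file itself is not reproduced in this guide. As a rough illustration of what such a file usually contains, the following sketch generates a ResourceFlavor, a ClusterQueue, and a LocalQueue in the default namespace, and prints them as a JSON List that kubectl apply -f can read. The object names and quota values are hypothetical placeholders, and the kueue.x-k8s.io/v1beta1 API version is an assumption that you should check against the Kueue release you installed.

```python
import json

KUEUE_API = "kueue.x-k8s.io/v1beta1"  # assumption: check your installed Kueue version


def kueue_config(cluster_queue="cluster-queue", local_queue="local-queue",
                 flavor="default-flavor", cpus="64", memory="512Gi", gpus="8"):
    """Build a minimal Kueue configuration. All names and quotas are placeholders."""
    resource_flavor = {
        "apiVersion": KUEUE_API,
        "kind": "ResourceFlavor",
        "metadata": {"name": flavor},
    }
    cluster_queue_obj = {
        "apiVersion": KUEUE_API,
        "kind": "ClusterQueue",
        "metadata": {"name": cluster_queue},
        "spec": {
            "namespaceSelector": {},  # an empty selector admits all namespaces
            "resourceGroups": [{
                "coveredResources": ["cpu", "memory", "nvidia.com/gpu"],
                "flavors": [{
                    "name": flavor,
                    "resources": [
                        {"name": "cpu", "nominalQuota": cpus},
                        {"name": "memory", "nominalQuota": memory},
                        {"name": "nvidia.com/gpu", "nominalQuota": gpus},
                    ],
                }],
            }],
        },
    }
    local_queue_obj = {
        "apiVersion": KUEUE_API,
        "kind": "LocalQueue",
        # This guide works in the default namespace.
        "metadata": {"name": local_queue, "namespace": "default"},
        "spec": {"clusterQueue": cluster_queue},
    }
    return {
        "apiVersion": "v1",
        "kind": "List",
        "items": [resource_flavor, cluster_queue_obj, local_queue_obj],
    }


if __name__ == "__main__":
    print(json.dumps(kueue_config(), indent=2))
```

Redirect the output to a file and pass that file to kubectl apply -f. Set the quotas to the capacity you actually want Kueue to hand out.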
Create a Ray cluster
With Ray and Kueue installed, you can now create a Ray cluster that submits its workloads through Kueue. To make it easier to create and manage a Ray cluster, we recommend using the following script: create_ray_box.
To use the script, complete the following steps:
- Download the script using curl:
- Make the script executable:
- Modify the lines related to GPUs, CPUs, and memory so the Ray head Node’s resource requests match the capacity of your Node Pool:
Run ./create_ray_box without arguments to see its options.
Examples
The following two examples show how to create a Ray cluster.
- Using the provided script
- Using a YAML file
The following example creates a Ray dev box with four worker Nodes and one head Node. Replace [RAY-DOCKER-IMAGE] with the Ray Docker image you want to use.
The command uses the following options:
- --nodes: The number of worker Nodes, not counting the head Node. This example creates a Ray dev box with four worker Nodes and one head Node. All Nodes can run jobs. If resources are insufficient, the cluster is queued.
- --image: The Ray Docker image. We recommend building a custom Ray image using CoreWeave’s nccl-tests as a base image for InfiniBand support. If you don’t include the --image option, the script uses the rayproject/ray-ml:2.9.0-gpu image.
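For the YAML-file route, the manifest is not shown here, so the following is only a sketch of the general shape of a RayCluster that Kueue can queue. It builds the manifest as a Python dictionary and prints it as JSON, which kubectl apply -f also accepts. The cluster name, queue name, and resource figures are hypothetical. The kueue.x-k8s.io/queue-name label is how Kueue associates a workload with a LocalQueue. The sketch is not the output of create_ray_box and leaves out storage mounts and SSH setup.

```python
import json

DEFAULT_IMAGE = "rayproject/ray-ml:2.9.0-gpu"  # default image named in this guide


def ray_pod_template(container_name, image, cpus, memory, gpus):
    """Pod template for one Ray Node. The resource figures are placeholders."""
    resources = {"cpu": cpus, "memory": memory, "nvidia.com/gpu": gpus}
    return {
        "spec": {
            "containers": [{
                "name": container_name,
                "image": image,
                "resources": {"requests": dict(resources), "limits": dict(resources)},
            }]
        }
    }


def ray_cluster(name, nodes, image=DEFAULT_IMAGE, queue="local-queue",
                cpus="64", memory="512Gi", gpus="8"):
    """Build a RayCluster with one head Node and `nodes` worker Nodes."""
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayCluster",
        "metadata": {
            "name": name,
            "namespace": "default",
            # Kueue reads this label to pick the LocalQueue (placeholder name).
            "labels": {"kueue.x-k8s.io/queue-name": queue},
        },
        "spec": {
            "headGroupSpec": {
                "rayStartParams": {},
                "template": ray_pod_template("ray-head", image, cpus, memory, gpus),
            },
            "workerGroupSpecs": [{
                "groupName": "workers",
                "replicas": nodes,
                "minReplicas": nodes,
                "maxReplicas": nodes,
                "rayStartParams": {},
                "template": ray_pod_template("ray-worker", image, cpus, memory, gpus),
            }],
        },
    }


if __name__ == "__main__":
    # Four workers plus one head Node, matching the example above.
    print(json.dumps(ray_cluster("dev-box", nodes=4), indent=2))
```

Set the queue argument to the LocalQueue name from your own Kueue configuration, and adjust the CPU, memory, and GPU values to fit your Node Pool.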
Work with Ray clusters
After your Ray cluster is up, use the following commands to inspect clusters, log into Pods, and manage queued jobs.

| Task | Command | Notes |
|---|---|---|
| List Ray clusters | kubectl get raycluster | Shows all clusters in the namespace. |
| List Pods | kubectl get pods | Lists the head and worker Pods. |
| Log into head Pod | kubectl exec -it [CLUSTER-NAME]-head -- /bin/bash | Replace [CLUSTER-NAME] with your cluster’s name. |
| Log into worker Pod | kubectl exec -it [WORKER-POD-NAME] -- /bin/bash | Replace [WORKER-POD-NAME] with the actual Pod name. |
| List jobs in queue | kubectl get queue | View pending and admitted workloads. |
| Get queue details | kubectl describe queue | Shows detailed information about the queue and jobs. |
| Delete a Ray cluster | kubectl delete raycluster [CLUSTER-NAME] | Replace [CLUSTER-NAME] with your cluster’s name. |
| Shared storage path | /mnt/vast | Where shared storage is configured. |
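If you run these commands often, a thin wrapper can save typing. The sketch below only assembles the kubectl invocations from the table. The function names are made up for illustration, and it defaults to a dry run that prints each command instead of executing it.

```python
import shlex
import subprocess


def head_shell_cmd(cluster_name):
    """kubectl command that opens a shell on the head Pod, as listed in the table."""
    return ["kubectl", "exec", "-it", f"{cluster_name}-head", "--", "/bin/bash"]


def delete_cluster_cmd(cluster_name):
    """kubectl command that deletes a Ray cluster, as listed in the table."""
    return ["kubectl", "delete", "raycluster", cluster_name]


def run(cmd, dry_run=True):
    """Print the command, then execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if dry_run:
        return None
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(head_shell_cmd("dev-box"))      # "dev-box" is a placeholder cluster name
    run(delete_cluster_cmd("dev-box"))
```

Pass dry_run=False once the printed commands look right.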
Test NCCL
Before running production workloads, we recommend verifying that NCCL communication over InfiniBand works correctly across your Ray Nodes. To test that InfiniBand is correctly configured with your container and cluster, you can run the script nccl-test/all_reduce_ray.py.
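The output format of all_reduce_ray.py is not described here. If you end up with a raw timing for an all-reduce, the nccl-tests convention for converting it to bandwidth is algorithm bandwidth = bytes / time and bus bandwidth = algorithm bandwidth × 2(n − 1)/n for n ranks. The helper below is a small sketch of that arithmetic, and the input values in the example are made up, not measurements.

```python
def allreduce_bandwidth(size_bytes, seconds, ranks):
    """Return (algbw, busbw) in GB/s using the nccl-tests all-reduce convention."""
    if ranks < 2:
        raise ValueError("an all-reduce needs at least two ranks")
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    algbw = size_bytes / seconds / 1e9          # GB/s, with 1 GB = 1e9 bytes
    busbw = algbw * 2 * (ranks - 1) / ranks     # correction factor for all-reduce
    return algbw, busbw


if __name__ == "__main__":
    # Illustrative inputs only: 4 GB reduced in 50 ms across 16 ranks.
    algbw, busbw = allreduce_bandwidth(4e9, 0.05, 16)
    print(f"algbw={algbw:.1f} GB/s busbw={busbw:.1f} GB/s")
```

Bus bandwidth is the figure to compare against the link speed of your interconnect, because it does not depend on the number of ranks.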
Additional helper scripts
In the CoreWeave/reference-architecture repository, you can find helper scripts for working with Ray clusters. These utility Python scripts make it easier for a small team to share a Ray cluster. For example, to view the Ray cluster capacity, navigate to the ray-kueue directory in the reference architecture repository and run the capacity script.
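The repository's capacity script is not reproduced here. To give a sense of what a capacity check involves, the following sketch uses Ray's own ray.cluster_resources() and ray.available_resources() calls to report total, available, and used CPUs and GPUs. The summarize and main functions are hypothetical names and are not taken from the repository.

```python
def summarize(total, available, keys=("CPU", "GPU")):
    """Combine Ray's total and available resource maps into per-resource rows."""
    rows = {}
    for key in keys:
        t = total.get(key, 0.0)
        a = available.get(key, 0.0)
        rows[key] = {"total": t, "available": a, "used": t - a}
    return rows


def main():
    # Imported lazily so summarize() can be used where Ray is not installed.
    import ray

    ray.init(address="auto")  # connect to the Ray cluster already running on this Pod
    rows = summarize(ray.cluster_resources(), ray.available_resources())
    for key, row in rows.items():
        print(f"{key}: {row['used']:g} used / {row['total']:g} total")
```

Call main() from a Python session on the head Pod, where a running Ray cluster is available to connect to.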