Run Ray with Kueue
Learn how to run Ray with Kueue
This guide shows you how to set up Ray with Kueue on CKS. It covers the following:
- Install Ray and Kueue
- Create Ray clusters that can queue jobs efficiently
- Use helper scripts to manage distributed computing workloads
The result is a distributed compute environment where Ray executes jobs and Kueue ensures those jobs are efficiently scheduled and queued inside your CKS cluster.
Prerequisites
Before beginning, we recommend creating a PVC for shared use. This can make it easier for ML engineers working together to maintain persistent directories across ephemeral or autoscaling clusters. For information on creating a PVC, see Create PVCs.
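If you'd rather write the claim yourself, the following is a minimal sketch of what a shared PVC might look like. The claim name, storage class, and size are assumptions for illustration only; use the values appropriate for your CKS cluster.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-pvc               # Hypothetical name; pick one your team agrees on.
  namespace: default             # This guide works in the default namespace.
spec:
  accessModes:
    - ReadWriteMany              # Shared read/write access from the head and all worker Pods.
  storageClassName: shared-vast  # Assumed storage class; substitute one available in your cluster.
  resources:
    requests:
      storage: 1Ti               # Illustrative size; adjust for your datasets and checkpoints.
```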
Note that if you want to connect to your Pods using SSH, you need to build your Ray image with an SSH server.
Install Ray
To install Ray, complete the following steps:
1. Run the following command to add the KubeRay Helm repo:

   ```
   $ helm repo add kuberay https://ray-project.github.io/kuberay-helm/
   ```

   You should see output similar to the following:

   ```
   "kuberay" has been added to your repositories
   ```

2. Install KubeRay on your CKS cluster by running the following command:

   ```
   $ helm install kuberay-operator kuberay/kuberay-operator --version 1.4.0
   ```

   You should see output similar to the following:

   ```
   NAME: kuberay-operator
   LAST DEPLOYED: Tue Aug 19 14:46:56 2025
   NAMESPACE: default
   STATUS: deployed
   REVISION: 1
   TEST SUITE: None
   ```
Install Kueue
To install Kueue, follow the Kueue documentation.
After installing Kueue, be sure to deploy the Kueue configuration file with `kubectl apply -f <kueue-config-file>`.
Note: Currently Kueue is configured to find the local queue in the default namespace. Work within the default namespace for this guide.
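If you don't yet have a Kueue configuration file to apply, the following is a minimal single-queue sketch. The flavor and queue names and the quota figures are assumptions for illustration; size the quotas to match your Node Pools, and keep the LocalQueue in the default namespace as noted above.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor                 # Assumed flavor name.
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue                  # Assumed ClusterQueue name.
spec:
  namespaceSelector: {}                # Admit workloads from any namespace.
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "rdma/ib"]
      flavors:
        - name: default-flavor
          resources:
            - name: "cpu"
              nominalQuota: 600        # Example quota: 5 Nodes x 120 CPUs.
            - name: "memory"
              nominalQuota: 10000G     # Example quota: 5 Nodes x 2000G.
            - name: "nvidia.com/gpu"
              nominalQuota: 40         # Example quota: 5 Nodes x 8 GPUs.
            - name: "rdma/ib"
              nominalQuota: 5          # Cover the rdma/ib resource if your Pods request it.
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: local-queue                    # Assumed LocalQueue name.
  namespace: default                   # This guide expects the local queue in the default namespace.
spec:
  clusterQueue: cluster-queue
```

Workloads point at the LocalQueue by name (for example, through a `kueue.x-k8s.io/queue-name` label), so whatever names you choose here must match the Ray cluster manifests you apply later.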
Create a Ray cluster
To make it easier to create and manage a Ray cluster, we recommend using the following script: `create_ray_box`.
To use the script, complete the following steps:
1. Download the script using curl:

   ```
   $ curl -o create_ray_box https://raw.githubusercontent.com/coreweave/reference-architecture/main/ray-kueue/scripts/create_ray_box
   ```

2. Make the script executable:

   ```
   $ chmod +x create_ray_box
   ```

3. Modify the lines related to GPUs, CPUs, and memory:

   ```yaml
   # Set resource requests and limits for the Ray head node.
   resources:
     limits:
       # Modify CPU based on the CPUs in your Node Pool.
       cpu: "120"
       # Modify memory based on the memory in your Node Pool.
       memory: "2000G"
       # Modify the number of GPUs based on the GPUs in your Node Pool.
       nvidia.com/gpu: 8
       rdma/ib: "1"
     requests:
       # For production use cases, we recommend specifying integer CPU requests and limits.
       # We also recommend setting requests equal to limits for both CPU and memory.
       # Modify CPU based on the CPUs in your Node Pool.
       cpu: "120"
       # Modify memory based on the memory in your Node Pool.
       memory: "2000G"
       # Modify the number of GPUs based on the GPUs in your Node Pool.
       nvidia.com/gpu: 8
       rdma/ib: "1"
   ```
You can now run `./create_ray_box` without arguments to see its options.
Examples
The following two examples show how to create a Ray cluster: one using the provided script, and one using only YAML files.
The following example creates a Ray dev box with four worker Nodes and one head Node.

```
$ ./create_ray_box --nodes 4 --name mydevbox --image <ray-docker-image>
```
Note the following:
- `--nodes`: The number of worker Nodes to create (in addition to the head Node). This example creates a Ray dev box with a total of four worker Nodes and one head Node. All Nodes can run jobs. If there are insufficient resources, the cluster is queued.
- `--image`: The Ray Docker image. We highly recommend building a custom Ray image using CoreWeave's nccl-tests as a base image for InfiniBand support. If you don't include the `--image` option, the script uses the `rayproject/ray-ml:2.9.0-gpu` image.
You can also create a Ray cluster with only YAML files:
The following command creates a shared PVC. You don't need to run this command if you've already created a PVC.
```
$ kubectl apply -f https://raw.githubusercontent.com/coreweave/reference-architecture/refs/heads/main/ray-kueue/yamls/pvc.yaml
```
The following command uses the ray-cluster-sample manifest to create a Ray cluster:

```
$ kubectl apply -f https://raw.githubusercontent.com/coreweave/reference-architecture/refs/heads/main/ray-kueue/yamls/ray-cluster-sample.yaml
```
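The linked ray-cluster-sample.yaml is the authoritative manifest; the sketch below only illustrates the general shape of a KubeRay RayCluster that Kueue can queue. The cluster name, queue name, image, and resource figures are assumptions; the key detail is the `kueue.x-k8s.io/queue-name` label, which tells Kueue which LocalQueue should admit the cluster.

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: mydevbox                             # Hypothetical cluster name.
  labels:
    kueue.x-k8s.io/queue-name: local-queue   # Assumed LocalQueue name; must match your Kueue config.
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray-ml:2.9.0-gpu   # Replace with your custom InfiniBand-enabled image.
            resources:
              requests:
                cpu: "120"
                memory: 2000G
                nvidia.com/gpu: 8
                rdma/ib: "1"
              limits:
                cpu: "120"
                memory: 2000G
                nvidia.com/gpu: 8
                rdma/ib: "1"
  workerGroupSpecs:
    - groupName: gpu-workers
      replicas: 4                              # Four worker Nodes, as in the script example above.
      minReplicas: 4
      maxReplicas: 4
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray-ml:2.9.0-gpu
              resources:
                requests:
                  cpu: "120"
                  memory: 2000G
                  nvidia.com/gpu: 8
                  rdma/ib: "1"
                limits:
                  cpu: "120"
                  memory: 2000G
                  nvidia.com/gpu: 8
                  rdma/ib: "1"
```

Until Kueue admits the workload, the cluster simply waits in the queue; once quota is available, KubeRay creates the head and worker Pods.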
Working with Ray clusters
| Task | Command | Notes |
|---|---|---|
| List Ray clusters | `kubectl get raycluster` | Shows all clusters in the namespace. |
| List Pods | `kubectl get pods` | Lists the head and worker Pods. |
| Log into the head Pod | `kubectl exec -it <cluster-name>-head -- /bin/bash` | Replace `<cluster-name>` with your cluster's name. |
| Log into a worker Pod | `kubectl exec -it <worker-pod-name> -- /bin/bash` | Replace `<worker-pod-name>` with the actual Pod name. |
| List jobs in the queue | `kubectl get queue` | View pending and admitted workloads. |
| Get queue details | `kubectl describe queue` | Shows detailed information about the queue and jobs. |
| Delete a Ray cluster | `kubectl delete raycluster <cluster-name>` | Replace `<cluster-name>` with your cluster's name. |
| Shared storage path | `/mnt/vast` | The path where shared storage is mounted. |
Testing NCCL
To test that InfiniBand is correctly configured with your container and cluster, you can run the script `nccl-test/all_reduce_ray.py`.
Additional helper scripts
In the CoreWeave/reference-architecture repository, you can find helper scripts for working with Ray clusters. These utility Python scripts make it easier for a small team to work together on a Ray cluster. For example, to view the Ray cluster capacity, navigate to the `ray-kueue` directory in the reference architecture repository and run the following command:
```
$ python3 scripts/capacity
```