Run Ray with Kueue

Learn how to run Ray with Kueue

This guide shows you how to set up Ray with Kueue on CKS. It covers the following:

  • Install Ray and Kueue
  • Create Ray clusters that can queue jobs efficiently
  • Use helper scripts to manage distributed computing workloads

The result is a distributed compute environment in which Ray executes jobs and Kueue ensures those jobs are efficiently scheduled and queued inside your CKS cluster.

Prerequisites

Before beginning, we recommend creating a PVC for shared use. This can make it easier for ML engineers working together to maintain persistent directories across ephemeral or autoscaling clusters. For information on creating a PVC, see Create PVCs.
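
For reference, the following is a minimal sketch of a shared PVC manifest that you could apply with kubectl apply -f. The claim name, storage class, and size are illustrative assumptions; use values appropriate for your CKS cluster and the Create PVCs documentation.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-workspace            # Illustrative name; choose your own.
  namespace: default
spec:
  accessModes:
    - ReadWriteMany                 # Shared read/write access across head and worker Pods.
  storageClassName: <storage-class-name>   # Use a storage class available in your CKS cluster.
  resources:
    requests:
      storage: 1Ti                  # Illustrative size; adjust for your team's needs.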

Note that if you want to connect to your Pods using SSH, you need to build your Ray image with an SSH server.
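
As a starting point, here is a hedged sketch of a Dockerfile that layers an SSH server on top of a Ray image. The base image tag, package name, and user switch are assumptions; adapt them to the image you actually build from.

# Sketch only; assumes a Debian/Ubuntu-based Ray image such as the default used later in this guide.
FROM rayproject/ray-ml:2.9.0-gpu

# Switch to root to install packages, then install and prepare the OpenSSH server.
USER root
RUN apt-get update && \
    apt-get install -y --no-install-recommends openssh-server && \
    rm -rf /var/lib/apt/lists/* && \
    mkdir -p /var/run/sshd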

Install Ray

To install Ray, complete the following steps:

  1. Run the following command to add the KubeRay Helm repo:

    Example
    $
    helm repo add kuberay https://ray-project.github.io/kuberay-helm/

    You should see output similar to the following:

    "kuberay" has been added to your repositories
  2. Install KubeRay on your CKS cluster by running the following command:

    Example
    $
    helm install kuberay-operator kuberay/kuberay-operator --version 1.4.0

    You should see output similar to the following:

    NAME: kuberay-operator
    LAST DEPLOYED: Tue Aug 19 14:46:56 2025
    NAMESPACE: default
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
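
Before moving on, you can optionally confirm that the operator is running by checking its Deployment. The Deployment name below assumes the kuberay-operator release name used above.

Example
$
kubectl get deployment kuberay-operator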

Install Kueue

To install Kueue, follow the Kueue documentation.

After installing Kueue, be sure to deploy the Kueue configuration file with kubectl apply -f <kueue-config-file>.

Use default namespace

Note: Currently Kueue is configured to find the local queue in the default namespace. Work within the default namespace for this guide.
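
For reference, a minimal <kueue-config-file> typically defines a ResourceFlavor, a ClusterQueue, and a LocalQueue in the default namespace. The sketch below is illustrative only; the queue names and quotas are assumptions that you should size to your Node Pool.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}             # Admit workloads from any namespace.
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: "cpu"
              nominalQuota: 480     # Illustrative; match your Node Pool capacity.
            - name: "memory"
              nominalQuota: 8000G
            - name: "nvidia.com/gpu"
              nominalQuota: 32
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: user-queue                  # Illustrative name.
  namespace: default                # The local queue must be in the default namespace for this guide.
spec:
  clusterQueue: cluster-queue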

Create a Ray cluster

To make it easier to create and manage a Ray cluster, we recommend using the following script: create_ray_box.

To use the script, complete the following steps:

  1. Download the script using curl:

    Example
    $
    curl -o create_ray_box https://raw.githubusercontent.com/coreweave/reference-architecture/main/ray-kueue/scripts/create_ray_box
  2. Make the script executable:

    Example
    $
    chmod +x create_ray_box
  3. Modify the lines related to GPUs, CPUs, and memory:

    # Set resource requests and limits for the Ray head node.
    resources:
      limits:
        # Modify CPU based on the CPUs in your Node Pool.
        cpu: "120"
        # Modify memory based on the memory in your Node Pool.
        memory: "2000G"
        # Modify the number of GPUs based on the GPUs in your Node Pool.
        nvidia.com/gpu: 8
        rdma/ib: "1"
      requests:
        # For production use cases, we recommend specifying integer CPU requests and limits.
        # We also recommend setting requests equal to limits for both CPU and memory.
        # Modify CPU based on the CPUs in your Node Pool.
        cpu: "120"
        # Modify memory based on the memory in your Node Pool.
        memory: "2000G"
        # Modify the number of GPUs based on the GPUs in your Node Pool.
        nvidia.com/gpu: 8
        rdma/ib: "1"

You can now run ./create_ray_box without arguments to see its options.

Examples

The following example shows how to create a Ray cluster.

This example creates a Ray dev box with four worker Nodes and one head Node.

Example
$
./create_ray_box --nodes 4 --name mydevbox --image <ray-docker-image>

Note the following:

  • --nodes: The number of worker Nodes to create, in addition to the head Node. This example creates a Ray dev box with four worker Nodes and one head Node. All Nodes can run jobs. If there are insufficient resources, the cluster is queued until capacity is available, as shown in the check after this list.

  • --image: This is the Ray Docker image. We highly recommend building a custom Ray image using CoreWeave's nccl-tests as a base image for InfiniBand support. If you don't include the --image option, the script uses the rayproject/ray-ml:2.9.0-gpu image.
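
After creating a cluster, one way to see whether it has been admitted or is still queued is to inspect the Kueue Workload objects in the namespace. This assumes a standard Kueue installation, which registers the Workload resource.

Example
$
kubectl get workloads -n default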

Working with Ray clusters

| Task | Command | Notes |
| --- | --- | --- |
| List Ray clusters | kubectl get raycluster | Shows all clusters in the namespace. |
| List Pods | kubectl get pods | Lists the head and worker Pods. |
| Log into head Pod | kubectl exec -it <cluster-name>-head -- /bin/bash | Replace <cluster-name> with your cluster's name. |
| Log into worker Pod | kubectl exec -it <worker-pod-name> -- /bin/bash | Replace <worker-pod-name> with the actual Pod name. |
| List jobs in queue | kubectl get queue | View pending and admitted workloads. |
| Get queue details | kubectl describe queue | Shows detailed information about the queue and jobs. |
| Delete a Ray cluster | kubectl delete raycluster <cluster-name> | Replace <cluster-name> with your cluster's name. |
| Shared storage path | /mnt/vast | Where shared storage is configured. |

Testing NCCL

To test that InfiniBand is correctly configured with your container and cluster, you can run the script nccl-test/all_reduce_ray.py.
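
For example, assuming you have copied the ray-kueue directory from the reference-architecture repository into the head Pod (for instance, onto the shared storage path) and opened a shell there (see the table above), you could run the script from that directory:

Example
$
python nccl-test/all_reduce_ray.py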

Additional helper scripts

The CoreWeave/reference-architecture repository includes utility Python helper scripts that make it easier for a small team to work with a Ray cluster. For example, to view the Ray cluster's capacity, navigate to the ray-kueue directory in the repository and run the following command:

Example
$
python3 scripts/capacity