Run Ray with Kueue

Learn how to run Ray with Kueue

This guide shows you how to set up Ray with Kueue on CKS. It covers the following:

  • Install Ray and Kueue
  • Create Ray clusters that can queue jobs efficiently
  • Use helper scripts to manage distributed computing workloads

The result is a distributed compute environment in which Ray executes jobs and Kueue ensures those jobs are efficiently scheduled and queued inside your CKS cluster.

Prerequisites

Before beginning, we recommend creating a PVC for shared use. This can make it easier for ML engineers working together to maintain persistent directories across ephemeral or autoscaling clusters. For information on creating a PVC, see Create PVCs.
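
For reference, the following is a minimal sketch of a shared PVC manifest that you could apply with kubectl apply -f. The claim name, storage class, and size are illustrative assumptions; use values appropriate for your CKS cluster and the Create PVCs documentation.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-workspace            # Illustrative name; choose your own.
  namespace: default
spec:
  accessModes:
    - ReadWriteMany                 # Shared read/write access across head and worker Pods.
  storageClassName: <storage-class-name>   # Use a storage class available in your CKS cluster.
  resources:
    requests:
      storage: 1Ti                  # Illustrative size; adjust for your team's needs.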

Note that if you want to connect to your Pods using SSH, you need to build your Ray image with an SSH server.
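
As a starting point, here is a hedged sketch of a Dockerfile that layers an SSH server on top of a Ray image. The base image tag, package name, and user switch are assumptions; adapt them to the image you actually build from.

# Sketch only; assumes a Debian/Ubuntu-based Ray image such as the default used later in this guide.
FROM rayproject/ray-ml:2.9.0-gpu

# Switch to root to install packages, then install and prepare the OpenSSH server.
USER root
RUN apt-get update && \
    apt-get install -y --no-install-recommends openssh-server && \
    rm -rf /var/lib/apt/lists/* && \
    mkdir -p /var/run/sshd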

Install Ray

To install Ray, complete the following steps:

  1. Run the following command to add the KubeRay Helm repo:

    Example
    $
    helm repo add kuberay https://ray-project.github.io/kuberay-helm/

    You should see output similar to the following:

    "kuberay" has been added to your repositories
  2. Install KubeRay on your CKS cluster by running the following command:

    Example
    $
    helm install kuberay-operator kuberay/kuberay-operator --version 1.4.0

    You should see output similar to the following:

    NAME: kuberay-operator
    LAST DEPLOYED: Tue Aug 19 14:46:56 2025
    NAMESPACE: default
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
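
Before moving on, you can optionally confirm that the operator is running by checking its Deployment. The Deployment name below assumes the kuberay-operator release name used above.

Example
$
kubectl get deployment kuberay-operator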

Install Kueue

To install Kueue, follow the Kueue documentation.

After installing Kueue, be sure to deploy the Kueue configuration file with kubectl apply -f <kueue-config-file>.

Use default namespace

Note: Currently Kueue is configured to find the local queue in the default namespace. Work within the default namespace for this guide.
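
For reference, a minimal <kueue-config-file> typically defines a ResourceFlavor, a ClusterQueue, and a LocalQueue in the default namespace. The sketch below is illustrative only; the queue names and quotas are assumptions that you should size to your Node Pool.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}             # Admit workloads from any namespace.
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: "cpu"
              nominalQuota: 480     # Illustrative; match your Node Pool capacity.
            - name: "memory"
              nominalQuota: 8000G
            - name: "nvidia.com/gpu"
              nominalQuota: 32
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: user-queue                  # Illustrative name.
  namespace: default                # The local queue must be in the default namespace for this guide.
spec:
  clusterQueue: cluster-queue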

Create a Ray cluster

To make it easier to create and manage a Ray cluster, we recommend using the following script: create_ray_box.

To use the script, complete the following steps:

  1. Download the script using curl:

    Example
    $
    curl -o create_ray_box https://raw.githubusercontent.com/coreweave/reference-architecture/main/ray-kueue/scripts/create_ray_box
  2. Make the script executable:

    Example
    $
    chmod +x create_ray_box
  3. Modify the lines related to GPUs, CPUs, and memory:

    # Set resource requests and limits for the Ray head node.
    resources:
      limits:
        # Modify CPU based on the CPUs in your Node Pool.
        cpu: "120"
        # Modify memory based on the memory in your Node Pool.
        memory: "2000G"
        # Modify the number of GPUs based on the GPUs in your Node Pool.
        nvidia.com/gpu: 8
        rdma/ib: "1"
      requests:
        # For production use cases, we recommend specifying integer CPU requests and limits.
        # We also recommend setting requests equal to limits for both CPU and memory.
        # Modify CPU based on the CPUs in your Node Pool.
        cpu: "120"
        # Modify memory based on the memory in your Node Pool.
        memory: "2000G"
        # Modify the number of GPUs based on the GPUs in your Node Pool.
        nvidia.com/gpu: 8
        rdma/ib: "1"

You can now run ./create_ray_box without arguments to see its options.

Examples

The following example shows how to create a Ray cluster.

This example creates a Ray dev box with four worker Nodes and one head Node.

Example
$
./create_ray_box --nodes 4 --name mydevbox --image <ray-docker-image>

Note the following:

  • --nodes: The number of worker Nodes to create, in addition to the head Node. This example creates a Ray dev box with four worker Nodes and one head Node. All Nodes can run jobs. If there are insufficient resources, the cluster is queued until capacity is available, as shown in the check after this list.

  • --image: This is the Ray Docker image. We highly recommend building a custom Ray image using CoreWeave's nccl-tests as a base image for InfiniBand support. If you don't include the --image option, the script uses the rayproject/ray-ml:2.9.0-gpu image.
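
After creating a cluster, one way to see whether it has been admitted or is still queued is to inspect the Kueue Workload objects in the namespace. This assumes a standard Kueue installation, which registers the Workload resource.

Example
$
kubectl get workloads -n default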

Working with Ray clusters

| Task | Command | Notes |
| --- | --- | --- |
| List Ray clusters | kubectl get raycluster | Shows all clusters in the namespace. |
| List Pods | kubectl get pods | Lists the head and worker Pods. |
| Log into head Pod | kubectl exec -it <cluster-name>-head -- /bin/bash | Replace <cluster-name> with your cluster's name. |
| Log into worker Pod | kubectl exec -it <worker-pod-name> -- /bin/bash | Replace <worker-pod-name> with the actual Pod name. |
| List jobs in queue | kubectl get queue | View pending and admitted workloads. |
| Get queue details | kubectl describe queue | Shows detailed information about the queue and jobs. |
| Delete a Ray cluster | kubectl delete raycluster <cluster-name> | Replace <cluster-name> with your cluster's name. |
| Shared storage path | /mnt/vast | Where shared storage is configured. |

Testing NCCL

To test that InfiniBand is correctly configured with your container and cluster, you can run the script nccl-test/all_reduce_ray.py.
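
For example, assuming you have copied the ray-kueue directory from the reference-architecture repository into the head Pod (for instance, onto the shared storage path) and opened a shell there (see the table above), you could run the script from that directory:

Example
$
python nccl-test/all_reduce_ray.py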

Additional helper scripts

The CoreWeave/reference-architecture repository includes utility Python helper scripts that make it easier for a small team to work with a Ray cluster. For example, to view the Ray cluster's capacity, navigate to the ray-kueue directory in the repository and run the following command:

Example
$
python3 scripts/capacity