This guide shows ML engineers and platform operators how to set up Ray with Kueue on CKS so that Kueue can schedule and queue distributed compute workloads on shared GPU resources. It covers the following tasks:
  • Install Ray and Kueue.
  • Create Ray clusters that can queue jobs.
  • Use helper scripts to manage distributed computing workloads.
By the end of this guide, you will have a distributed compute environment where Ray executes jobs and Kueue schedules and queues those jobs inside your CKS cluster.

Prerequisites

Before you begin, we recommend creating a shared PVC. A shared PVC lets ML engineers who work together keep persistent directories across ephemeral or autoscaling clusters. For information on creating a PVC, see Create PVCs.
If you want to connect to your Pods using SSH, build your Ray image with an SSH server.

Install Ray

Ray provides the distributed compute runtime that executes your jobs across the cluster. Install the KubeRay operator on your CKS cluster to manage Ray clusters as Kubernetes resources. To install Ray, complete the following steps:
  1. Add the KubeRay Helm repository:
    helm repo add kuberay https://ray-project.github.io/kuberay-helm/
    
    You should see output similar to the following:
    "kuberay" has been added to your repositories
    
  2. Install KubeRay on your CKS cluster:
    helm install kuberay-operator kuberay/kuberay-operator --version 1.4.0
    
    You should see output similar to the following:
    NAME: kuberay-operator
    LAST DEPLOYED: Tue Aug 19 14:46:56 2025
    NAMESPACE: default
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
    

Install Kueue

With the KubeRay operator running, the next step is to install Kueue, which handles job queuing and quota-aware scheduling for the Ray clusters you create later in this guide. To install Kueue, follow the Kueue documentation. After you install Kueue, deploy the Kueue configuration file with kubectl apply -f [KUEUE-CONFIG-FILE]. Replace [KUEUE-CONFIG-FILE] with the path to your Kueue configuration file.
Kueue is configured to find the local queue in the default namespace. Work within the default namespace for this guide.
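This guide does not show the contents of the Kueue configuration file. As a rough illustration only, the following Python sketch generates a minimal configuration with one ResourceFlavor, one ClusterQueue, and one LocalQueue in the default namespace. The object names, the quota values, and the kueue.x-k8s.io/v1beta1 API version are assumptions, not values from this guide. Check them against the Kueue version you installed and the capacity of your Node Pool.

```python
import json

# Hypothetical names and quotas. Replace them with values for your cluster.
FLAVOR = "default-flavor"
CLUSTER_QUEUE = "cluster-queue"
LOCAL_QUEUE = "user-queue"
# The Ray Pods in this guide request cpu, memory, nvidia.com/gpu, and rdma/ib,
# so this sketch gives the ClusterQueue a quota for each of them.
QUOTAS = {"cpu": "600", "memory": "10000G", "nvidia.com/gpu": "40", "rdma/ib": "5"}


def kueue_objects(flavor, cluster_queue, local_queue, quotas, namespace="default"):
    """Return ResourceFlavor, ClusterQueue, and LocalQueue manifests as dicts."""
    api = "kueue.x-k8s.io/v1beta1"  # assumption: adjust to your Kueue release
    resource_flavor = {
        "apiVersion": api,
        "kind": "ResourceFlavor",
        "metadata": {"name": flavor},
    }
    cluster_q = {
        "apiVersion": api,
        "kind": "ClusterQueue",
        "metadata": {"name": cluster_queue},
        "spec": {
            "namespaceSelector": {},  # empty selector matches every namespace
            "resourceGroups": [{
                "coveredResources": list(quotas),
                "flavors": [{
                    "name": flavor,
                    "resources": [
                        {"name": name, "nominalQuota": quota}
                        for name, quota in quotas.items()
                    ],
                }],
            }],
        },
    }
    local_q = {
        "apiVersion": api,
        "kind": "LocalQueue",
        "metadata": {"name": local_queue, "namespace": namespace},
        "spec": {"clusterQueue": cluster_queue},
    }
    return [resource_flavor, cluster_q, local_q]


def as_kubectl_list(objects):
    """Wrap the manifests in a v1 List, which kubectl apply accepts as JSON."""
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": objects}, indent=2)


if __name__ == "__main__":
    print(as_kubectl_list(kueue_objects(FLAVOR, CLUSTER_QUEUE, LOCAL_QUEUE, QUOTAS)))
```

If you save the sketch under a name of your choice, such as make_kueue_config.py, you can redirect its output to a file and pass that file to kubectl apply -f in place of [KUEUE-CONFIG-FILE].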

Create a Ray cluster

With Ray and Kueue installed, you can now create a Ray cluster that submits its workloads through Kueue. To make it easier to create and manage a Ray cluster, we recommend using the following script: create_ray_box. To use the script, complete the following steps:
  1. Download the script using curl:
    curl -o create_ray_box https://raw.githubusercontent.com/coreweave/reference-architecture/main/ray-kueue/scripts/create_ray_box
    
  2. Make the script executable:
    chmod +x create_ray_box
    
  3. Modify the lines related to GPUs, CPUs, and memory so the Ray head Node’s resource requests match the capacity of your Node Pool:
    # Set resource requests and limits for the Ray head node.
    resources:
      limits:
        # Modify CPU based on the CPUs in your Node Pool.
        cpu: "120"
    
        # Modify memory based on the memory in your Node Pool.
        memory: "2000G"
    
        # Modify the number of GPUs based on the GPUs in your Node Pool.
        nvidia.com/gpu: 8
        rdma/ib: "1"
      requests:
        # For production use-cases, we recommend specifying integer CPU requests and limits.
        # We also recommend setting requests equal to limits for both CPU and memory.
    
        # Modify CPU based on the CPUs in your Node Pool.
        cpu: "120"
    
        # Modify memory based on the memory in your Node Pool.
        memory: "2000G"
    
        # Modify the number of GPUs based on the GPUs in your Node Pool.
        nvidia.com/gpu: 8
        rdma/ib: "1"
    
You can now run ./create_ray_box without arguments to see its options.
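After you edit the resource block, you may want to confirm that it still follows the recommendations in the script comments: whole-number CPU values, and requests equal to limits. The following Python sketch is one way to check a resources mapping. The check_resources function is a hypothetical helper and is not part of create_ray_box.

```python
def check_resources(resources):
    """Return a list of problems found in a container `resources` mapping."""
    limits = resources.get("limits", {})
    requests = resources.get("requests", {})
    problems = []

    # Compare every resource named in either section. Kubernetes requires
    # request == limit for extended resources such as nvidia.com/gpu; for cpu
    # and memory, equality is the recommendation from the script comments.
    for name in sorted(set(limits) | set(requests)):
        if str(limits.get(name)) != str(requests.get(name)):
            problems.append(
                f"{name}: request {requests.get(name)!r} != limit {limits.get(name)!r}"
            )

    # CPU values such as "500m" or "1.5" are not whole numbers of CPUs.
    for section, values in (("limits", limits), ("requests", requests)):
        cpu = str(values.get("cpu", ""))
        if not cpu.isdigit():
            problems.append(f"{section}.cpu {cpu!r} is not a whole number of CPUs")

    return problems


if __name__ == "__main__":
    # Values copied from the resource block shown in the previous step.
    section = {"cpu": "120", "memory": "2000G", "nvidia.com/gpu": 8, "rdma/ib": "1"}
    example = {"limits": dict(section), "requests": dict(section)}
    print(check_resources(example) or "No problems found")
```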

Examples

The following example creates a Ray dev box with four worker Nodes and one head Node. Replace [RAY-DOCKER-IMAGE] with the Ray Docker image you want to use.
./create_ray_box --nodes 4 --name mydevbox --image [RAY-DOCKER-IMAGE]
The command uses the following options:
  • --nodes: The number of worker Nodes, not counting the head Node. This example creates a Ray dev box with four worker Nodes and one head Node, for five Nodes in total. All Nodes can run jobs. If resources are insufficient, the cluster is queued.
  • --name: The name of the Ray cluster, mydevbox in this example.
  • --image: The Ray Docker image. We recommend building a custom Ray image using CoreWeave’s nccl-tests as a base image for InfiniBand support. If you don’t include the --image option, the script uses the rayproject/ray-ml:2.9.0-gpu image.
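To estimate whether a cluster is likely to start immediately or wait in the queue, you can add up what it requests and compare the total with the free quota. The following Python sketch is a simplified model. It assumes every worker Node requests the same resources as the head Node shown earlier, and that a cluster starts only when the whole request fits. Neither assumption is confirmed by this guide, and Kueue's actual admission logic is more involved.

```python
# Per-Node requests taken from the head Node resource block in this guide.
# Assumption: worker Nodes request the same amounts.
PER_NODE = {"cpu": 120, "nvidia.com/gpu": 8, "rdma/ib": 1}


def total_request(worker_nodes, per_node=PER_NODE):
    """Total resources for a cluster created with --nodes <worker_nodes>."""
    pods = worker_nodes + 1  # --nodes counts workers; the head Node is extra
    return {name: amount * pods for name, amount in per_node.items()}


def fits(request, free_quota):
    """True if every requested resource fits in the free quota."""
    return all(amount <= free_quota.get(name, 0) for name, amount in request.items())


if __name__ == "__main__":
    request = total_request(4)
    print(request)
    # Hypothetical free quota with only 32 GPUs left.
    print(fits(request, {"cpu": 600, "nvidia.com/gpu": 32, "rdma/ib": 5}))
```

With --nodes 4, the model requests five Nodes' worth of resources, so a queue with only 32 free GPUs cannot start the cluster right away.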

Work with Ray clusters

After your Ray cluster is up, use the following commands to inspect clusters, log into Pods, and manage queued jobs.
Task | Command | Notes
List Ray clusters | kubectl get raycluster | Shows all clusters in the namespace.
List Pods | kubectl get pods | Lists the head and worker Pods.
Log into head Pod | kubectl exec -it [CLUSTER-NAME]-head -- /bin/bash | Replace [CLUSTER-NAME] with your cluster’s name.
Log into worker Pod | kubectl exec -it [WORKER-POD-NAME] -- /bin/bash | Replace [WORKER-POD-NAME] with the actual Pod name.
List jobs in queue | kubectl get queue | View pending and admitted workloads.
Get queue details | kubectl describe queue | Shows detailed information about the queue and jobs.
Delete a Ray cluster | kubectl delete raycluster [CLUSTER-NAME] | Replace [CLUSTER-NAME] with your cluster’s name.
Shared storage path | /mnt/vast | Where shared storage is configured.
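If you run these commands often, a small wrapper can fill in the placeholders for you. The following Python sketch builds the same kubectl commands as argument lists. The function names are hypothetical, and the head Pod name follows the [CLUSTER-NAME]-head pattern used in the table above.

```python
import shlex
import subprocess


def head_shell(cluster_name):
    """Command that opens a shell in the head Pod of the named cluster."""
    return ["kubectl", "exec", "-it", f"{cluster_name}-head", "--", "/bin/bash"]


def worker_shell(pod_name):
    """Command that opens a shell in a worker Pod."""
    return ["kubectl", "exec", "-it", pod_name, "--", "/bin/bash"]


def delete_cluster(cluster_name):
    """Command that deletes the named Ray cluster."""
    return ["kubectl", "delete", "raycluster", cluster_name]


def run(argv):
    """Print a command, then run it and raise an error if it fails."""
    print("+", shlex.join(argv))
    return subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Print the commands only. Call run(...) to execute one against your cluster.
    for argv in (head_shell("mydevbox"), delete_cluster("mydevbox")):
        print(shlex.join(argv))
```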

Test NCCL

Before running production workloads, we recommend verifying that NCCL communication over InfiniBand works correctly across your Ray Nodes. To do so, run the nccl-test/all_reduce_ray.py script, which tests that InfiniBand is correctly configured with your container and cluster.
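This guide does not describe the output of all_reduce_ray.py. If you time an all-reduce yourself, the following Python sketch converts the measurement into the algorithm bandwidth and bus bandwidth figures that the nccl-tests project defines for all-reduce, where bus bandwidth is algorithm bandwidth multiplied by 2(n-1)/n for n ranks. The function name and the sample numbers are illustrative assumptions, not measurements.

```python
def allreduce_bandwidth(size_bytes, seconds, ranks):
    """Return (algbw, busbw) in GB/s for one all-reduce of size_bytes."""
    if ranks < 2 or seconds <= 0:
        raise ValueError("need at least 2 ranks and a positive duration")
    algbw = size_bytes / seconds / 1e9  # data size divided by elapsed time
    busbw = algbw * 2 * (ranks - 1) / ranks  # all-reduce correction factor
    return algbw, busbw


if __name__ == "__main__":
    # Hypothetical measurement: 1 GB reduced across 8 ranks in 0.5 seconds.
    algbw, busbw = allreduce_bandwidth(1_000_000_000, 0.5, 8)
    print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

Bus bandwidth is the figure to compare against the expected speed of your interconnect, because it does not depend on the number of ranks.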

Additional helper scripts

The CoreWeave/reference-architecture repository contains helper scripts for working with Ray clusters. These Python utilities make it easier for a small team to share a Ray cluster. For example, to view the Ray cluster capacity, navigate to the ray-kueue directory in the reference architecture repository and run the capacity script:
python3 scripts/capacity
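If you want a similar view from inside your own code, Ray exposes total and currently available resources through ray.cluster_resources() and ray.available_resources(). The following Python sketch is not the repository's capacity script. It is a minimal illustration, and the function names and output format are assumptions.

```python
def fetch_from_ray():
    """Return (total, available) resource dicts from a running Ray cluster."""
    import ray  # imported here so format_capacity works without Ray installed

    ray.init(address="auto")  # connect to the cluster this Pod belongs to
    return ray.cluster_resources(), ray.available_resources()


def format_capacity(total, available, names=("CPU", "GPU")):
    """Format free and total amounts for the selected resource names."""
    lines = []
    for name in names:
        if name in total:
            free = available.get(name, 0)
            lines.append(f"{name}: {free:g} free of {total[name]:g}")
    return "\n".join(lines)


# On a Pod in the Ray cluster, for example the head Pod:
#     print(format_capacity(*fetch_from_ray()))
```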
Last modified on June 10, 2026