
Three-layer orchestration model

Understand CoreWeave's three-layer orchestration model

"Orchestration" refers to the management of distinct but interconnected functionalities, each with their own requirements and best practices. Orchestration for complex, large-scale HPC clusters presents unique challenges and requires a thoughtful approach. CoreWeave uses a three-layer orchestration model to help simplify this process.

Understanding the orchestration model and the options associated with it allows you to craft full-stack solutions based on your specific requirements, while maximizing the benefits provided by each solution.

This article outlines CoreWeave's three-layer orchestration model and solutions using various combinations of Kubernetes, Slurm, and Ray.

The three layers of orchestration

CoreWeave's orchestration model defines the following layers:

| Layer | Purpose |
| --- | --- |
| Compute | Manages the physical resources available to customers |
| Workload | Manages the workloads running on the available resources |
| Process | Controls the code that executes across your resources |

Compute orchestration

Tip

Compute refers to the physical resources that CoreWeave makes available to customers.

The Compute layer manages the Node lifecycle. It takes action in the event of a hardware failure, preempts spot Nodes, and autoscales to provide additional compute during periods of high demand. A strong solution at this layer ensures on-demand access to the best-performing Nodes.

Workload orchestration

The Workload layer manages the lifecycle of the actual workloads running on the Nodes made available by the Compute layer. Workload lifecycle management includes scaling up replicas to meet increases in demand, gang scheduling, ordering jobs by priority within a queue, and assigning resources to jobs based on physical topologies.

Process orchestration

The process orchestrator controls the code that executes across your resources. This can address simple cases, such as single-GPU or single-Node workloads, or more complex, large-scale, distributed workloads, such as training LLMs.

A process orchestrator must be flexible and intuitive enough to handle complex dataflows, such as those seen in Reinforcement Learning jobs, while maximizing the Compute allocated to your workload.
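
For example, the same training script might run as a single process on one GPU or be fanned out across many Nodes by a launcher such as torchrun. The script name, Node counts, and rendezvous endpoint below are illustrative:

```bash
# Simple case: the process orchestrator starts one process on one GPU.
python train.py

# Distributed case: torchrun starts one process per GPU on every Node and
# gives them a shared rendezvous point so they can coordinate.
torchrun \
  --nnodes=4 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=head-node:29500 \
  train.py
```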

Example stacks

A single system can span multiple layers, or you can combine different systems across these layers, dedicating each system to the layer it is best suited for.

In the following examples, we discuss stacks built from Kubernetes, Slurm, and Ray, each of which can serve as the orchestrator for any of these three layers.

Full Kubernetes stack

Kubernetes is a popular choice for orchestration across each of the three layers.

A full Kubernetes stack is arranged as follows:

| Layer | Solution | Mechanism |
| --- | --- | --- |
| Compute allocation | Kubernetes | kubelet |
| Workload allocation | Kubernetes | kube-scheduler, Kueue, Volcano |
| Process allocation | Kubernetes | Pod.spec.command |

The Compute Nodes run kubelet, which makes their resources available for scheduling. Managed providers, such as CoreWeave Kubernetes Service (CKS), also come fully equipped with lifecycle controllers for triaging issues, along with systems for autoscaling, spot instances, and more.

Workload allocation is handled through the wide range of resources you can submit, each with its own lifecycle behavior. Third-party add-ons are also available to extend this functionality in various ways.

Process allocation is handled by the container image's entry point, or overridden by commands and arguments in the Kubernetes resource's configuration.

For example, a kind: JobSet for training ResNet might resemble the following:

Example
```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: pytorch
  labels:
    kueue.x-k8s.io/queue-name: training
spec:
  replicatedJobs:
    - name: workers
      template:
        spec:
          parallelism: 4
          completions: 4
          backoffLimit: 0
          template:
            spec:
              containers:
                - name: pytorch
                  image: gcr.io/k8s-staging-jobset/pytorch-resnet:latest
                  ports:
                    - containerPort: 3389
                  env:
                    - name: MASTER_ADDR
                      value: "pytorch-workers-0-0.pytorch"
                    - name: MASTER_PORT
                      value: "3389"
                  command:
                    - bash
                    - -xc
                    - |
                      torchrun --nproc_per_node=1 --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT resnet.py --backend=gloo
```

Once submitted, Kueue handles quota management, kube-scheduler assigns the resources to each of the four replicas, and the containers start training by running the specified torchrun command.
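
As a minimal sketch of that workflow, assuming the JobSet controller and Kueue are installed and the manifest above is saved as jobset.yaml (the label selector is an assumption about the labels the JobSet controller applies; a plain kubectl get pods works regardless):

```bash
# Submit the JobSet; Kueue admits it against the "training" queue once quota allows.
kubectl apply -f jobset.yaml

# Watch the JobSet, its child Jobs, and their Pods get scheduled.
kubectl get jobsets
kubectl get pods

# Stream training output from the worker Pods (assumed label key).
kubectl logs -f -l jobset.sigs.k8s.io/jobset-name=pytorch
```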

Many CoreWeave customers use a full Kubernetes orchestration stack to utilize large-scale Compute clusters. However, to help support large-scale training, we add several Kubernetes-native workload orchestration features (Kueue and JobSets, in this example) to the ecosystem.

Slurm on Kubernetes (SUNK)

Slurm is a strong solution for workload orchestration, particularly of batch jobs on large-scale Compute clusters, and has been the trusted choice of many of the largest HPC clusters in the world for years.

Although Slurm and Kubernetes are both standard orchestration solutions, they have typically been run as two separate systems rather than combined.

CoreWeave developed SUNK to address the needs of large-scale LLM training by harnessing the strengths of both Slurm and Kubernetes. Since SUNK's release, this use of Kubernetes for Compute and Slurm for workload orchestration has become a standard, and is recognized in industry reports like SemiAnalysis's ClusterMAX.

The SUNK stack is arranged as shown in the following table:

| Layer | Solution | Mechanism |
| --- | --- | --- |
| Compute allocation | Kubernetes | kubelet |
| Workload allocation | Slurm | slurmctld |
| Process allocation | Slurm | srun |

In this stack, a Kubernetes cluster orchestrates the Compute Nodes and runs slurmd on each Node as a Kubernetes resource.

Users submit their Slurm jobs to the Slurm controller, slurmctld, which handles the workload allocation.

Within each submitted Slurm job, the srun command launches the specified processes across the Compute Nodes assigned to that job.
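
A minimal sketch of such a job follows; the resource directives, script names, and torchrun arguments are illustrative rather than SUNK-specific:

```bash
#!/bin/bash
#SBATCH --job-name=resnet-train
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# srun launches one task per Node across the allocation slurmctld assigned to this job.
srun torchrun \
  --nnodes="$SLURM_JOB_NUM_NODES" \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1):29500" \
  resnet.py
```

Submitting this script with sbatch hands the workload to slurmctld, which queues it and allocates Nodes, after which srun starts the processes on those Nodes.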

Ray on Kubernetes

Anyscale's open-source Ray Core library provides an alternative to Slurm for process orchestration. It uses the Actor paradigm, implementing the actor model through RPC calls wrapped in Python abstractions.

With Ray as the process orchestrator, users submit a "driver" program that uses the Ray library to instantiate Actors across the Compute allocated to the workload and to trigger processes within each Actor.

With Ray on Kubernetes, the stack is arranged as follows:

| Layer | Solution | Mechanism |
| --- | --- | --- |
| Compute allocation | Kubernetes | kubelet |
| Workload allocation | Kubernetes | kube-scheduler |
| Process allocation | Ray | raylet |

Resources like JobSet schedule raylet across the Compute allocated to a workload, as demonstrated in the following pseudocode:

Example
```yaml
kind: JobSet
metadata:
  name: ray-cluster-example
spec:
  replicatedJobs:
    - name: ray-head
      template:
        spec:
          command: ["ray", "start", "--head"]
    - name: ray-workers
      replicas: 2
      template:
        spec:
          command: ["ray", "start", "--address", "ray-head-0.ray-cluster-example.default.svc:6379"]
```

Once the Ray cluster starts, a driver program, analogous to a Slurm batch script, is submitted to the Ray cluster, as shown in the example below:

Example
```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "ray",
# ]
# [tool.uv]
# exclude-newer = "2025-10-06T00:00:00Z"
# ///
import ray


@ray.remote
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

    def get_counter(self):
        return self.value


if __name__ == "__main__":
    ray.init()
    counter = Counter.remote()
    for _ in range(5):
        print(ray.get(counter.get_counter.remote()))
        counter.increment.remote()
```
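
One way to submit this driver is the Ray Jobs CLI. A minimal sketch, assuming the script is saved as counter.py and the head Pod's dashboard is reachable at the address shown:

```bash
# Submit the driver; the Ray Jobs API runs it against the cluster, and the
# Counter Actor's work executes on whichever raylet Ray places it on.
ray job submit \
  --address http://ray-head-0.ray-cluster-example.default.svc:8265 \
  --working-dir . \
  -- python counter.py
```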

Ray on SUNK

As discussed earlier, Slurm can take over the workload orchestration role from Kubernetes. This functionality can be combined with the Ray on Kubernetes stack described above.

Ray on SUNK is a stack in which a separate system handles each layer, with each system applying its strengths to the layer it serves.

With Ray on SUNK, the layers function as follows:

| Layer | Solution | Mechanism |
| --- | --- | --- |
| Compute allocation | Kubernetes | kubelet |
| Workload allocation | Slurm | slurmctld |
| Process allocation | Ray | raylet |

Instead of using a JobSet or a kind: RayCluster resource to orchestrate the creation of a Ray cluster, we use the job's sbatch script to start a Ray cluster within the job and then submit a driver program in the same fashion, as sketched below.
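
A rough sketch of that batch script follows; the Node counts, ports, and driver name are illustrative, and the linked guide below shows the complete version:

```bash
#!/bin/bash
#SBATCH --job-name=ray-on-sunk
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=1

# Start the Ray head on the first Node of the allocation.
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
srun --nodes=1 --ntasks=1 --nodelist="$head_node" \
  ray start --head --port=6379 --block &

# Start Ray workers on the remaining Nodes, pointing them at the head.
srun --nodes=$((SLURM_JOB_NUM_NODES - 1)) --ntasks=$((SLURM_JOB_NUM_NODES - 1)) \
  --exclude="$head_node" \
  ray start --address="$head_node:6379" --block &

# Give the cluster a moment to form, then run the driver against it.
sleep 15
RAY_ADDRESS="$head_node:6379" python counter.py
```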

For a full example of this stack, see our Run Ray on SUNK guide.