Syncer - CoreWeave Docs

The Syncer bidirectionally synchronizes the state between Slurm nodes and the respective Kubernetes Pods. It translates various state information into formats understood by each side, such as Ready, Running, or Drain, and allows operations made on either side to update the state of the other. The Syncer is deployed along with each Slurm cluster.

Information flow and reconcile operations

The Syncer supports several information flows and reconcile operations.

Slurm drains from Kubernetes

When certain conditions happen on the Kubernetes side that make the respective Slurm node either inoperable or undesired for continued Slurm job scheduling, the Syncer propagates these conditions as a drain on the Slurm node. If the condition clears, the Syncer then removes the drain. A drain from Kubernetes is prefixed with k8s: to indicate within Slurm that the drain originated on the Kubernetes side.

The Syncer only removes or updates drain reasons that are prefixed with k8s:. If a non-prefixed drain is present, it is left as is.

Some of the possible conditions that apply a drain are:

The Kubernetes Pod associated with this Slurm node is not ready.
The Kubernetes Pod associated with this Slurm node has been deleted.
The Kubernetes Pod associated with this Slurm node is pending deletion. See NodeSet Controller.
The Kubernetes Pod associated with this Slurm node is Cordoned. See Pod Controller.

Many Kubernetes-side drains are removed automatically when the originating condition is resolved on the Kubernetes side. If the drain is removed in Slurm before the originating condition is resolved, the Syncer reapplies the drain.
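The prefix rule described in this section can be sketched as a single reconcile step. This is a minimal illustration, not the Syncer's implementation: the function name, the wording of the condition text, and the space after the prefix are assumptions made for the example. Only the k8s: prefix and the apply, reapply, and leave-alone behavior come from the text.

```python
from typing import Optional

K8S_PREFIX = "k8s:"  # prefix named in this section


def reconcile_drain(current_reason: Optional[str],
                    k8s_condition: Optional[str]) -> Optional[str]:
    """Return the drain reason a Slurm node should have after one pass.

    current_reason: drain reason currently set in Slurm, or None if the
                    node is not drained.
    k8s_condition:  text for an active Kubernetes-side condition
                    (hypothetical wording), or None if none is active.
    """
    owned = current_reason is not None and current_reason.startswith(K8S_PREFIX)

    if k8s_condition is not None:
        desired = f"{K8S_PREFIX} {k8s_condition}"
        # Apply (or reapply) the drain when the node is undrained, and update
        # it when the existing drain carries the prefix. Drains without the
        # prefix are never replaced.
        if current_reason is None or owned:
            return desired
        return current_reason

    # No active Kubernetes-side condition: clear only prefixed drains.
    if owned:
        return None
    return current_reason
```

Calling reconcile_drain(None, "pod not ready") yields a prefixed drain, while a drain set by a user without the prefix passes through unchanged in both branches.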

Slurm downed nodes

A node in Slurm may be set to down in three ways:

Automatically by the Slurm Controller
Manually by the user from within Slurm
Upon Pod deletion in Kubernetes

When a node is downed, the Syncer will not automatically resume or drain the node. This allows for more flexibility when managing nodes in Slurm. The Slurm controller transitions nodes out of down per the ReturnToService configuration value.

The default (and recommended) configuration of the Slurm chart uses ReturnToService=2, which will automatically resume any down node that starts communicating with the Slurm controller. If this behavior is not desired, then this value should be adjusted. Using a non-default value will require the user to take action within Slurm after a Pod is updated in Kubernetes, before the node will be usable in Slurm.
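The effect of ReturnToService can be summarized in a small decision function. This sketch reflects the meaning of the values 0, 1, and 2 as described in the Slurm slurm.conf documentation; the function and parameter names are hypothetical and exist only for illustration.

```python
def auto_resumes(return_to_service: int, downed_for_non_response: bool) -> bool:
    """Whether a down node that registers again with a valid configuration
    becomes usable without anyone acting inside Slurm.

    return_to_service:        value of ReturnToService in slurm.conf.
    downed_for_non_response:  True if the node was set down because it
                              stopped responding (not set down manually).
    """
    if return_to_service == 2:
        # Any down node returns to service once it registers again.
        return True
    if return_to_service == 1:
        # Only nodes downed for being non-responsive return automatically.
        return downed_for_non_response
    # 0: the node stays down until its state is changed explicitly.
    return False
```

With the chart default of 2, a node downed by Pod deletion becomes usable once the replacement Pod registers with the Slurm controller. With 0 or 1, a manually downed node needs to be resumed from within Slurm.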

Slurm node deletion

The Syncer updates the state within Slurm following NodeSlice changes. For example, when nodes are removed from NodeSets, these changes are reflected in the underlying NodeSlice(s). After detecting removed NodeSlice entries, the Syncer requests deletion of the corresponding Slurm nodes. Enable this functionality using the Slurm chart option .syncer.config.syncer.slurmNodeCleanUp.
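As a rough sketch of the cleanup decision, the set of Slurm nodes to delete can be derived from the entries that disappeared between two NodeSlice observations. The function name, the argument shapes, and the node names in the test are assumptions; the real Syncer reads NodeSlice objects from Kubernetes rather than plain sets.

```python
from typing import Iterable, Set


def nodes_to_delete(previous_entries: Iterable[str],
                    current_entries: Iterable[str],
                    slurm_nodes: Iterable[str],
                    clean_up_enabled: bool) -> Set[str]:
    """Return the Slurm node names whose deletion should be requested.

    previous_entries / current_entries: node names seen in the NodeSlice(s)
        before and after the change.
    slurm_nodes: node names currently known to Slurm.
    clean_up_enabled: stands in for the slurmNodeCleanUp chart option.
    """
    if not clean_up_enabled:
        return set()
    removed = set(previous_entries) - set(current_entries)
    # Only request deletion for nodes that still exist in Slurm.
    return removed & set(slurm_nodes)
```

Limiting the result to entries that were actually removed keeps the sketch from touching Slurm nodes that were never tracked in a NodeSlice.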

Slurm Node status

The Syncer converts the current running, responding, and drain Slurm states into conditions and labels on the Pod. The conditions SlurmDrain, SlurmRunning, and SlurmNotResponding mirror the state within Slurm. The reason for the drain is propagated into the Message of the SlurmDrain condition. The labels sunk.coreweave.com/running, sunk.coreweave.com/drain, and sunk.coreweave.com/not_responding exist to support dashboards that use metrics from kube-state-metrics, and are not used for any logic within the Operator or Syncer. If a Slurm node is drained from within Slurm, the drain propagates up to the Node as well. See NodeController for more information.
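The mapping from Slurm state to Pod conditions and labels might look like the following sketch. The condition types and label keys are taken from the text. The SlurmNodeState class, the "True"/"False" status strings, and the "true"/"false" label values are assumptions chosen to resemble common Kubernetes conventions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class SlurmNodeState:
    """Hypothetical snapshot of the Slurm-side state of one node."""
    running: bool
    not_responding: bool
    drain_reason: Optional[str]  # None when the node is not drained


def pod_status(state: SlurmNodeState) -> Tuple[List[dict], Dict[str, str]]:
    """Build the Pod conditions and dashboard labels for a Slurm node."""
    def cond(type_: str, active: bool, message: str = "") -> dict:
        return {"type": type_,
                "status": "True" if active else "False",
                "message": message}

    drained = state.drain_reason is not None
    conditions = [
        # The drain reason is carried in the condition message.
        cond("SlurmDrain", drained, state.drain_reason or ""),
        cond("SlurmRunning", state.running),
        cond("SlurmNotResponding", state.not_responding),
    ]
    labels = {
        "sunk.coreweave.com/running": str(state.running).lower(),
        "sunk.coreweave.com/drain": str(drained).lower(),
        "sunk.coreweave.com/not_responding": str(state.not_responding).lower(),
    }
    return conditions, labels
```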

NHC Drain and HPC Verification

While this feature was implemented for CoreWeave’s particular environment, it can be used for similar workflows in other environments.

NHC (Node Health Check) may be used within Slurm, and is often run from prolog or epilog scripts to verify node functionality. CoreWeave Kubernetes HPC verification workflows run similar checks, which will trigger the undrain of older NHC failures. The Syncer identifies drains that may be undrained through the presence of verify-undrain anywhere in the node's drain reason. If this string is present, the Syncer checks the HPCVerification condition of the Pod to see whether a newer verification pass has happened since the drain.
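The undrain check described above reduces to a substring test followed by a timestamp comparison. The sketch below assumes the drain time and the time of the last passing verification are both available as datetime values; how the Syncer actually obtains them from Slurm and from the HPCVerification condition is not covered here, and the function name is hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional

MARKER = "verify-undrain"  # string named in this section


def may_undrain(drain_reason: str,
                drained_at: datetime,
                last_verification_pass: Optional[datetime]) -> bool:
    """Return True if the drain is eligible for removal.

    last_verification_pass: time of the most recent passing HPC
    verification, or None if no pass has been recorded.
    """
    if MARKER not in drain_reason:
        # Drains without the marker are never removed by this path.
        return False
    if last_verification_pass is None:
        return False
    # Only a pass that is newer than the drain counts.
    return last_verification_pass > drained_at
```

An NHC script that wants its drains to be cleared this way would include verify-undrain somewhere in the reason it sets.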

Extra field

Nodes in Slurm have an Extra field which may be used to store additional user-specified information. SUNK uses this field to store several pieces of information, primarily to provide better visibility within Slurm into conditions on the Kubernetes side. The Syncer keeps the Extra field updated to reflect this information. The information is stored as JSON to allow for easy parsing and manipulation.

Users may add additional information into the Extra field in Slurm. The Extra field must contain valid JSON, or the contents will be cleared and replaced at the next synchronization. JSON fields set by the Syncer will overwrite those set by the user if there are conflicts.
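The merge rules for the Extra field can be illustrated with a short sketch. The key names used in the tests are hypothetical, and treating valid JSON that is not an object as invalid is an assumption of this example; the text only states that invalid JSON is cleared and that Syncer fields win on conflict.

```python
import json


def merge_extra(current_extra: str, syncer_fields: dict) -> str:
    """Return the new Extra value after one synchronization.

    current_extra: the Extra string currently stored on the Slurm node.
    syncer_fields: fields the Syncer wants to publish (hypothetical keys).
    """
    try:
        data = json.loads(current_extra) if current_extra else {}
        if not isinstance(data, dict):
            raise ValueError("Extra must be a JSON object")
    except ValueError:
        # json.JSONDecodeError is a ValueError: invalid content is cleared.
        data = {}
    # Syncer-managed fields overwrite user fields with the same name.
    data.update(syncer_fields)
    return json.dumps(data, sort_keys=True)
```

User keys that do not collide with Syncer keys survive the merge, so adding a field such as an owner tag is safe as long as the whole value stays valid JSON.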

Hook API

The hook API provided by the Syncer allows events in Slurm to directly trigger various operations within Kubernetes. Some of these hooks are to facilitate blocking synchronization or immediate actions. The Syncer provides several hooks for node objects.

Pre-Hook

The pre-hook endpoint is used to ensure that other jobs running outside Slurm on the Node are removed before the Slurm jobs start. It also begins the state propagation that triggers the Pod Controller and Node Controller to perform additional actions.

Reboot

This endpoint is only available if the Syncer has permissions to perform operations on the Nodes. The Syncer node permissions are set with the Slurm chart option .syncer.nodePermissions.enabled.

The reboot endpoint allows rebooting the Kubernetes Node associated with a Slurm node. By default, this endpoint sets the PhaseState condition on the Node along with the associated reason production-powerreset which then triggers other Node management tooling to reboot the Node. The condition type and associated reason can be modified under .syncer.hooksAPI.nodeRebootCondition and .syncer.hooksAPI.nodeRebootReason.
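To make the default behavior concrete, the sketch below builds a Node condition of the kind the reboot endpoint is described as setting. The defaults PhaseState and production-powerreset come from the text. The "True" status, the timestamp field, and the function itself are assumptions modeled on the standard Kubernetes Node condition shape, not on the Syncer's code.

```python
from datetime import datetime, timezone
from typing import Optional


def reboot_condition(condition_type: str = "PhaseState",
                     reason: str = "production-powerreset",
                     now: Optional[datetime] = None) -> dict:
    """Build a Node condition that signals a requested reboot.

    condition_type and reason mirror the chart options
    nodeRebootCondition and nodeRebootReason.
    """
    now = now or datetime.now(timezone.utc)
    return {
        "type": condition_type,
        "status": "True",  # assumed value
        "reason": reason,
        "lastTransitionTime": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

Node management tooling that watches for this condition type and reason would then perform the actual reboot.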

Metrics

The Syncer provides a scrapeable metrics endpoint, which exposes various metrics for the nodes, jobs, and the overall Slurm cluster. The PodMonitor deployed with the SUNK chart labels all metrics with their associated Slurm cluster using the slurm_cluster label. The Syncer also exports additional metrics for the standard Go runtime and controller runtime. Labels added by the code are shown below. Additional labels may be added by the scrape configuration.

Label definitions:

account: Slurm account name
id: Slurm job ID
name: Slurm job name
node: Slurm node name
partition: Slurm partition
state: Slurm job current state
user: Slurm user name
message_type: Slurm RPC message type

Metric	Unit	Labels	Description
slurm_controller_rpc_count	count	message_type	RPC count per message type.
slurm_controller_rpc_mean_duration_seconds	seconds	message_type	RPC mean duration per message type.
slurm_job_state	—	partition, account, user, id, name, state	Current state of the job represented in the `state` label.
slurm_job_cpus_allocated	count	partition, account, user, id, node	The number of CPUs allocated to a job by Slurm node. Only present for running jobs.
slurm_job_gpus_allocated	count	partition, account, user, id, node	The number of GPUs allocated to a job by Slurm node. Only present for running jobs.
slurm_job_uptime_seconds	seconds	partition, account, user, id, node	The number of seconds a job has been running. Only present for running jobs.
slurm_jobs_pending	count	partition, account, user	The number of pending jobs in the Slurm cluster.
slurm_jobs_running	count	partition, account, user	The number of running jobs in the Slurm cluster.
slurm_jobs_suspended	count	partition, account, user	The number of suspended jobs in the Slurm cluster.
slurm_node_state	—	node, partition, state	Current state of the node represented in the `state` label.
slurm_node_cpu_alloc	count	node	The number of CPUs allocated per node.
slurm_node_cpu_idle	count	node	The number of CPUs idle per node.
slurm_node_cpu_total	count	node	The total number of CPUs per node.
slurm_node_mem_alloc	MB	node	The amount of allocated memory per node.
slurm_node_mem_total	MB	node	The total amount of memory per node.
slurm_node_gpu_alloc	count	node	The number of GPUs allocated per node.
slurm_node_gpu_idle	count	node	The number of GPUs idle per node.
slurm_node_gpu_total	count	node	The total number of GPUs per node.
slurm_nodes_alloc	count	—	The number of nodes with state allocated.
slurm_nodes_comp	count	—	The number of nodes with state completing.
slurm_nodes_down	count	—	The number of nodes with state down.
slurm_nodes_drain	count	—	The number of nodes with state drain.
slurm_nodes_err	count	—	The number of nodes with state error.
slurm_nodes_fail	count	—	The number of nodes with state fail.
slurm_nodes_idle	count	—	The number of nodes with state idle.
slurm_nodes_maint	count	—	The number of nodes with state maintenance.
slurm_nodes_mix	count	—	The number of nodes with state mix.
slurm_nodes_resv	count	—	The number of nodes with state reserved.
slurm_nodes_total	count	—	The total number of nodes.
slurm_nodes_not_responding	count	—	The number of nodes with state not_responding.
slurm_partition_cpu_alloc	count	partition	The number of CPUs allocated in a partition.
slurm_partition_cpu_idle	count	partition	The number of CPUs idle in a partition.
slurm_partition_cpu_total	count	partition	The total number of CPUs in a partition.
slurm_partition_mem_alloc	MB	partition	The amount of allocated memory in a partition.
slurm_partition_mem_total	MB	partition	The total memory in a partition.
slurm_partition_gpu_alloc	count	partition	The number of GPUs allocated in a partition.
slurm_partition_gpu_idle	count	partition	The number of GPUs idle in a partition.
slurm_partition_gpu_total	count	partition	The total number of GPUs in a partition.
slurm_queue_canceled	count	—	The number of canceled jobs in the Slurm cluster (only those still tracked by slurmctld).
slurm_queue_completed	count	—	The number of completed jobs in the Slurm cluster (only those still tracked by slurmctld).
slurm_queue_completing	count	—	The number of completing jobs in the Slurm cluster.
slurm_queue_configuring	count	—	The number of configuring jobs in the Slurm cluster.
slurm_queue_failed	count	—	The number of failed jobs in the Slurm cluster (only those still tracked by slurmctld).
slurm_queue_node_fail	count	—	The number of jobs stopped due to node failure in the cluster (only those still tracked by slurmctld).
slurm_queue_pending	count	—	The number of pending jobs in the Slurm scheduler queue.
slurm_queue_pending_dependency	count	—	The number of pending jobs in the Slurm scheduler queue with unsatisfied dependencies.
slurm_queue_preempted	count	—	The number of preempted jobs in the Slurm cluster (only those still tracked by slurmctld).
slurm_queue_running	count	—	The number of running jobs in the Slurm cluster.
slurm_queue_suspended	count	—	The number of suspended jobs in the Slurm cluster.
slurm_queue_timeout	count	—	The number of timed out jobs in the Slurm cluster (only those still tracked by slurmctld).
slurm_scheduler_backfill_cycle_last_seconds	seconds	—	The duration of the last scheduler backfill cycle.
slurm_scheduler_backfill_cycle_mean_seconds	seconds	—	The mean duration of the scheduler backfill cycles.
slurm_scheduler_backfill_depth_mean	count	—	The mean depth of the scheduler backfill.
slurm_scheduler_backfilled_jobs_total	count	—	The number of jobs started due to backfilling since last Slurm start.
slurm_scheduler_backfilled_jobs_cycle_total	count	—	The number of jobs started due to backfilling since the last time stats were reset.
slurm_scheduler_backfilled_jobs_heterogeneous_total	count	—	The number of heterogeneous jobs started due to backfilling since last Slurm start.
slurm_scheduler_cycle_last_seconds	seconds	—	The duration of the last scheduler cycle.
slurm_scheduler_cycle_mean_seconds	seconds	—	The mean duration of the scheduler cycles.
slurm_scheduler_cycles_per_minute	cycles/min	—	The number of scheduler cycles per minute.
slurm_scheduler_dbd_queue	count	—	The number of items in the scheduler dbd agent queue.
slurm_scheduler_jobs_submitted	count	—	The number of submitted jobs reported by the scheduler.
slurm_scheduler_jobs_started	count	—	The number of jobs started by the scheduler.
slurm_scheduler_jobs_completed	count	—	The number of jobs completed by the scheduler.
slurm_scheduler_jobs_failed	count	—	The number of jobs failed by the scheduler.
slurm_scheduler_jobs_cancelled	count	—	The number of jobs canceled by the scheduler.
slurm_scheduler_jobs_pending	count	—	The number of jobs pending in the scheduler queue.
slurm_scheduler_jobs_running	count	—	The number of jobs currently running in the scheduler.
slurm_scheduler_queue	count	—	The number of items in the scheduler queue.
slurm_scheduler_threads	count	—	The number of scheduler threads.
slurm_scheduler_cycle_mean_depth	count	—	The mean depth of the scheduler cycles.
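As an example of consuming these metrics, the sketch below computes per-node CPU utilization from slurm_node_cpu_alloc and slurm_node_cpu_total in scraped text. The sample lines, node names, and cluster name are invented for illustration, and the parser handles only the simple labeled form used in the sample. A real consumer would more likely query Prometheus or use a full exposition-format parser.

```python
import re
from typing import Dict

# Invented sample in the Prometheus text exposition format.
SAMPLE = """\
slurm_node_cpu_alloc{node="node-a",slurm_cluster="demo"} 96
slurm_node_cpu_total{node="node-a",slurm_cluster="demo"} 128
slurm_node_cpu_alloc{node="node-b",slurm_cluster="demo"} 0
slurm_node_cpu_total{node="node-b",slurm_cluster="demo"} 128
"""

LINE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def cpu_utilization(text: str) -> Dict[str, float]:
    """Return allocated/total CPU ratio per node from exposition text."""
    alloc: Dict[str, float] = {}
    total: Dict[str, float] = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # skips comments, blank lines, and unlabeled samples
        name = m.group(1)
        labels = dict(LABEL.findall(m.group(2)))
        value = float(m.group(3))
        if name == "slurm_node_cpu_alloc":
            alloc[labels["node"]] = value
        elif name == "slurm_node_cpu_total":
            total[labels["node"]] = value
    # Nodes reporting zero total CPUs are left out to avoid dividing by zero.
    return {n: alloc.get(n, 0.0) / t for n, t in total.items() if t > 0}
```

The same pattern applies to the memory and GPU metrics, which share the node label.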
