Why is my node drained or cordoned?

A Slurm node in the drained state does not accept new jobs. The drain reason identifies the cause; view it with scontrol show node or through the sn, dl, and dld aliases on the login pod. Drain and undrain Slurm nodes covers the full drain workflow, the common drain reasons (NHC: [REASON], prolog pre-hook failed, scheduler: k8s pod deletion timeout, batch job complete failure, and the k8s:* family), the manual undrain procedure, and the sunk:verify-undrain automation marker. Contact CoreWeave support if a node remains drained for an extended period or if its drain reason is not listed in that document.
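To check the drain reason from a script rather than by eye, the sketch below parses the text printed by scontrol show node. It is an illustration only: the helper names parse_drain_info and show_node are hypothetical, and the parsing assumes the usual Key=Value layout of scontrol output, with Reason= running to the end of its line.

```python
import re
import subprocess


def parse_drain_info(scontrol_output: str) -> dict:
    """Pull the State and Reason fields out of `scontrol show node` text.

    Hypothetical helper: assumes whitespace-separated Key=Value fields,
    with Reason= occupying the rest of its line.
    """
    state_match = re.search(r"(?<!\S)State=(\S+)", scontrol_output)
    reason_match = re.search(r"(?<!\S)Reason=(.*)", scontrol_output)

    state = state_match.group(1) if state_match else None
    reason = reason_match.group(1).strip() if reason_match else None

    return {
        "state": state,
        # Drained or draining nodes carry a DRAIN flag in State, e.g. IDLE+DRAIN.
        "drained": state is not None and "DRAIN" in state.upper(),
        "reason": reason,
    }


def show_node(node_name: str) -> dict:
    """Run `scontrol show node <name>` and parse the result.

    Requires scontrol on PATH, for example on the login pod.
    """
    result = subprocess.run(
        ["scontrol", "show", "node", node_name],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_drain_info(result.stdout)
```

Calling show_node with a node name returns a dictionary with the node's state, whether it carries a drain flag, and the reason string, which can then be compared against the reasons listed in Drain and undrain Slurm nodes.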
