In Slurm, COMPLETING is an interim state that occurs while the job epilog script and cleanup operations are running. If these cleanup operations fail, Slurm jobs may remain in the COMPLETING state indefinitely.
In SUNK v6.8.0 and above, a built-in automation handles cleanup of jobs and nodes stuck in the COMPLETING state. For SUNK versions earlier than v6.8.0, follow the manual remediation steps in this guide to address a Slurm node stuck in COMPLETING.
Troubleshooting stuck nodes in SUNK versions earlier than v6.8.0
List the stuck nodes
Identify the job stuck in COMPLETING and fetch its associated node list using the squeue command.
For more information about this and other Slurm commands, see CoreWeave’s documentation about monitoring Slurm node states and Slurm job states.
Set nodes to the DOWN state
For each node, use scontrol to set the node state to DOWN with the reason "cleanup completing".
This step is necessary because Slurm’s default cleanup_completing logic only addresses nodes in the DOWN state.
For nodes in INVALID_REG
If the nodes are also in the INVALID_REG state, set them to DOWN as shown in the previous step.
Delete the node from Slurm with scontrol delete for a clean re-registration.
If scontrol delete returns "Requested nodes are busy", the node is part of an active reservation. Remove the node from the reservation before deleting it, then add it back after the node re-registers:
- Identify the reservation the node belongs to.
- Remove the node from the reservation.
- Delete the node.
- After the node re-registers, add it back to the reservation.
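The manual steps above (listing stuck jobs, setting nodes to DOWN, and working around a busy reservation before deleting a node) can be sketched as a set of small shell helpers. This is a minimal sketch, not the commands from the original guide: the function names are hypothetical, and it assumes that rewriting a reservation's node list with an explicit Nodes= value is acceptable in your cluster. It parses the one-line output of scontrol show reservation, so treat the field matching as an assumption to verify against your Slurm version.

```shell
#!/usr/bin/env bash
# Illustrative helpers only: these function names are hypothetical and are
# not part of Slurm or SUNK.

# Print "<job id> <node list>" for each job in the COMPLETING state.
list_completing_jobs() {
  squeue --noheader --states=COMPLETING --format="%i %N"
}

# Set one node to DOWN with the reason used in this guide.
set_node_down() {
  scontrol update NodeName="$1" State=DOWN Reason="cleanup completing"
}

# Delete a node so that it can re-register cleanly.
delete_node() {
  scontrol delete NodeName="$1"
}

# Print a reservation's current node list (compressed hostlist form).
reservation_nodes() {
  scontrol --oneliner show reservation "$1" \
    | grep -o ' Nodes=[^ ]*' | head -n 1 | cut -d= -f2
}

# Print the name of every reservation whose node list contains the node.
reservations_for_node() {
  node="$1"
  scontrol --oneliner show reservation | while read -r line; do
    name=$(printf '%s\n' "$line" | grep -o 'ReservationName=[^ ]*' | cut -d= -f2)
    nodes=$(printf '%s\n' "$line" | grep -o ' Nodes=[^ ]*' | cut -d= -f2)
    [ -n "$name" ] && [ -n "$nodes" ] || continue
    # Expand the hostlist expression and look for an exact node-name match.
    if scontrol show hostnames "$nodes" </dev/null | grep -Fqx -- "$node"; then
      echo "$name"
    fi
  done
}

# Rewrite the reservation's node list without the given node.
# Refuses to produce an empty list, since an empty Nodes= has other semantics.
remove_node_from_reservation() {
  resv="$1"
  node="$2"
  remaining=$(scontrol show hostnames "$(reservation_nodes "$resv")" \
    | grep -Fvx -- "$node" | paste -sd, -)
  if [ -z "$remaining" ]; then
    echo "refusing to empty reservation $resv" >&2
    return 1
  fi
  scontrol update ReservationName="$resv" Nodes="$remaining"
}

# Restore a previously saved node list once the node has re-registered.
restore_reservation_nodes() {
  scontrol update ReservationName="$1" Nodes="$2"
}

# Example sequence for a hypothetical node "node-001":
#   set_node_down node-001
#   resv=$(reservations_for_node node-001 | head -n 1)
#   saved=$(reservation_nodes "$resv")
#   remove_node_from_reservation "$resv" node-001
#   delete_node node-001
#   ...wait for the node to re-register...
#   restore_reservation_nodes "$resv" "$saved"
```

Saving the node list before removing the node keeps the restore step independent of whatever the reservation looks like after the node re-registers.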
Once the nodes are in the DOWN state, the Slurm job transitions from the COMPLETING state to the COMPLETED state. After a short time, the nodes transition back to IDLE.
If the node is DOWN and the Slurm job remains in the COMPLETING state, first try to cancel the job. If the job is still stuck in the COMPLETING state after that, reconfigure Slurm.
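The fallback described above can be sketched as a single helper. This is a rough illustration under stated assumptions: the function names and the grace period are hypothetical, and it assumes that "reconfigure" refers to scontrol reconfigure, which the original text does not spell out.

```shell
#!/usr/bin/env bash
# Illustrative helpers only: these function names are hypothetical.

# Print the current state of a job, or nothing if squeue no longer lists it.
job_state() {
  squeue --noheader --jobs="$1" --format="%T"
}

# Cancel the job, wait, and reconfigure only if it is still COMPLETING.
# The second argument is an assumed grace period in seconds (default 30).
clear_stuck_job() {
  job="$1"
  wait_seconds="${2:-30}"
  scancel "$job"
  sleep "$wait_seconds"
  if [ "$(job_state "$job")" = "COMPLETING" ]; then
    # Assumption: "reconfigure" in this guide means scontrol reconfigure.
    scontrol reconfigure
  fi
}

# Example for a hypothetical job ID:
#   clear_stuck_job 12345 60
```

Checking the job state again before reconfiguring keeps the more disruptive step from running when cancelling the job was already enough.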