COMPLETING is an interim state that occurs while the job epilog script and cleanup operations are running. If these cleanup operations fail, Slurm jobs may remain in the COMPLETING state indefinitely.
In SUNK v6.8.0 and later, a built-in automation handles cleanup of jobs and nodes stuck in the COMPLETING state. For SUNK versions earlier than v6.8.0, follow the manual remediation steps in this guide to address a Slurm node stuck in COMPLETING.
List the stuck nodes
Use the squeue command to identify the job stuck in COMPLETING and fetch its associated node list.
For more information about this and other Slurm commands, see the documentation for monitoring Slurm node states and Slurm job states.
Set nodes to the DOWN state
For each node, use scontrol to set the node state to DOWN with the reason "cleanup completing", replacing [NODENAME] with the name of the stuck node.
This step is necessary because Slurm’s default cleanup_completing logic only addresses nodes in the DOWN state.
Optional: Re-register nodes in INVALID_REG
Complete this step only if the nodes are also in the INVALID_REG state. If the nodes are in INVALID_REG, set them to DOWN as shown in the previous step. Then delete the node from Slurm for a clean re-registration.
If scontrol delete returns "Requested nodes are busy", the node is part of an active reservation. Remove the node from the reservation before deleting it, then add it back after the node re-registers, replacing [RESERVATION-NAME] with the name of the reservation:
- Identify the reservation the node belongs to.
- Remove the node from the reservation.
- Delete the node.
- After the node re-registers, add it back to the reservation.
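The steps above can be sketched as a handful of shell helpers. This is a minimal sketch, not the guide's original commands: the function names are hypothetical, the squeue output format is an arbitrary choice, and the `nodes-=` / `nodes+=` reservation syntax is assumed to be available in your Slurm version. Run the helpers from a host with Slurm administrator access.

```shell
# Sketch only: helper names are hypothetical; adjust options to your cluster.

# Print "<jobid> <nodelist>" for every job in the COMPLETING state.
list_completing_jobs() {
  squeue --states=COMPLETING --noheader --format="%i %N"
}

# Set one node to DOWN with the reason used in this guide.
# $1 = node name ([NODENAME])
set_node_down() {
  scontrol update nodename="$1" state=DOWN reason="cleanup completing"
}

# Delete a node so it can re-register cleanly (INVALID_REG case only).
# $1 = node name ([NODENAME])
delete_node() {
  scontrol delete nodename="$1"
}

# List reservations, one per line, to find the one containing the node.
# Node lists may be shown in compressed form, e.g. node-[001-004].
show_reservations() {
  scontrol --oneliner show reservation
}

# Remove a node from a reservation.
# $1 = reservation name ([RESERVATION-NAME]), $2 = node name ([NODENAME])
remove_node_from_reservation() {
  scontrol update reservationname="$1" nodes-="$2"
}

# Add a node back to a reservation after it has re-registered.
# $1 = reservation name ([RESERVATION-NAME]), $2 = node name ([NODENAME])
add_node_to_reservation() {
  scontrol update reservationname="$1" nodes+="$2"
}
```

For a node in INVALID_REG that belongs to a reservation, the order would be `set_node_down`, `show_reservations`, `remove_node_from_reservation`, `delete_node`, and finally `add_node_to_reservation` once the node has re-registered.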
Once the nodes are in the DOWN state, the Slurm job transitions from the COMPLETING state to the COMPLETED state, and after a short time, the nodes transition back to IDLE.
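One way to watch for these transitions is to query the job and node states directly. The helpers below are a hedged sketch: the function names are hypothetical and the output formats are only one reasonable choice. A finished job is reported by squeue only while the controller still holds its record.

```shell
# Sketch only: function names are hypothetical.

# Print "<jobid> <state>" for one job, including jobs that have already
# finished but are still known to the controller.
# $1 = job ID
job_state() {
  squeue --jobs="$1" --states=all --noheader --format="%i %T"
}

# Print "<node> <state>" for one node. A node that belongs to several
# partitions may be printed more than once.
# $1 = node name ([NODENAME])
node_state() {
  sinfo --nodes="$1" --noheader --format="%N %T"
}
```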
If the node is DOWN and the Slurm job remains in the COMPLETING state, cancel the job. If the job is still stuck in COMPLETING after cancellation, reconfigure Slurm.
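These two fallback actions map onto standard Slurm commands. The sketch below assumes that "cancel" means scancel and "reconfigure" means scontrol reconfigure; the wrapper names are hypothetical.

```shell
# Sketch only: wrapper names are hypothetical.

# Cancel a job that remains in COMPLETING.
# $1 = job ID
cancel_job() {
  scancel "$1"
}

# Ask the Slurm daemons to re-read their configuration.
reconfigure_slurm() {
  scontrol reconfigure
}
```

Try `cancel_job` first and check the job state again before falling back to `reconfigure_slurm`.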