In Slurm, COMPLETING is an interim state that occurs while the job epilog script and cleanup operations are running. If these cleanup operations fail, Slurm jobs may remain in the COMPLETING state indefinitely. In SUNK v6.8.0 and later, a built-in automation handles cleanup of jobs and nodes stuck in the COMPLETING state. For SUNK versions earlier than v6.8.0, follow the manual remediation steps in this guide to address a Slurm node stuck in COMPLETING.
Do not perform a sync with Argo during this process. Syncing with Argo overwrites the in-cluster changes prematurely.
1. List the stuck nodes

Use the squeue command to identify the job stuck in COMPLETING and fetch its associated node list:
squeue -t CG -o "%.18i %.9P %.8j %.8u %.2t %.10M %.10L %R"
For more information about this and other Slurm commands, see the documentation for monitoring Slurm node states and Slurm job states.
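When several jobs are stuck at once, it can help to reduce the squeue output to a plain list of node names. The following is a minimal sketch rather than part of SUNK: the function name list_completing_nodes is hypothetical, and it assumes squeue and scontrol are available on the PATH. It relies on the %N format field, which for a COMPLETING job lists the nodes that have not yet been returned to service, and on scontrol show hostnames to expand compressed ranges such as node[01-03].

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Slurm or SUNK):
# print one node name per line for every job in the CG state.
list_completing_nodes() {
  # -h drops the header row; %N prints the node list of each job.
  squeue -h -t CG -o "%N" | while read -r nodelist; do
    # Skip empty lines in case a job reports no nodes.
    [ -n "$nodelist" ] || continue
    # Expand compressed hostlists such as node[01-03] into single names.
    scontrol show hostnames "$nodelist" </dev/null
  done | sort -u
}

# Usage: list_completing_nodes
```

Each name printed this way can be substituted for [NODENAME] in the commands that follow.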
2. Set nodes to the DOWN state

For each node, use scontrol to set the node state to DOWN with the reason "cleanup completing". Replace [NODENAME] with the name of the stuck node:
scontrol update node=[NODENAME] state=DOWN reason="cleanup completing"
This step is necessary because Slurm’s default cleanup_completing logic only addresses nodes in the DOWN state.
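If more than one node is affected, the same command can be applied in a loop. This is a sketch under assumptions: mark_nodes_down is a hypothetical function name, and the node names passed to it are the ones identified in the previous step.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set every node passed as an argument to DOWN.
mark_nodes_down() {
  local node
  for node in "$@"; do
    # Same scontrol command as in this step, run once per node.
    scontrol update node="$node" state=DOWN reason="cleanup completing" \
      || echo "failed to update $node" >&2
  done
}

# Usage: mark_nodes_down node01 node02
```

Reporting failures per node, instead of stopping at the first one, keeps the loop going so that the remaining nodes are still updated.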
3. Optional: Re-register nodes in INVALID_REG

Complete this step only if the nodes are also in the INVALID_REG state. First, set the nodes to DOWN as shown in the previous step. Then delete each node from Slurm so that it can re-register cleanly:
scontrol delete node=[NODENAME]
If scontrol delete returns Requested nodes are busy, the node is part of an active reservation. Remove the node from the reservation before deleting it, then add it back after the node re-registers. Replace [RESERVATION-NAME] with the name of the reservation:
  1. Identify the reservation the node belongs to:
    scontrol show node [NODENAME] | grep -i reserv
    
  2. Remove the node from the reservation:
    scontrol update ReservationName=[RESERVATION-NAME] Nodes-=[NODENAME]
    
  3. Delete the node:
    scontrol delete node=[NODENAME]
    
  4. After the node re-registers, add it back to the reservation:
    scontrol update ReservationName=[RESERVATION-NAME] Nodes=+[NODENAME]
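The first three sub-steps above can be combined into one helper, sketched here under assumptions. The function names node_reservation and delete_node_from_slurm are hypothetical, and the sketch assumes that scontrol show node prints a ReservationName=<name> field for reserved nodes, which may vary by Slurm version. Adding the node back to the reservation is left as a manual step because it depends on when the node re-registers.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the reservation a node belongs to, if any.
# Assumes the node record contains a "ReservationName=<name>" field.
node_reservation() {
  scontrol show node "$1" \
    | sed -n 's/.*ReservationName=\([^ ]*\).*/\1/p' \
    | head -n 1
}

# Hypothetical helper: remove the node from its reservation (if any),
# then delete it from Slurm so that it can re-register.
delete_node_from_slurm() {
  local node="$1" resv
  resv="$(node_reservation "$node")"
  if [ -n "$resv" ]; then
    scontrol update ReservationName="$resv" Nodes-="$node" || return 1
    echo "Removed $node from $resv; add it back after it re-registers." >&2
  fi
  scontrol delete node="$node"
}

# Usage: delete_node_from_slurm node01
```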
    
After completing these steps, the impacted nodes should be in the DOWN state. The Slurm job then transitions from the COMPLETING state to the COMPLETED state, and after a short time, the nodes transition back to IDLE. If a node is DOWN and the Slurm job remains in the COMPLETING state, cancel the job with scancel. If the job is still stuck in the COMPLETING state after that, reconfigure Slurm with scontrol reconfigure.
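To check whether a job has actually left the COMPLETING state before escalating, the job state can be queried directly. The following is a minimal sketch, assuming squeue and scancel are on the PATH. The function names job_state and cancel_if_completing are hypothetical. It uses the %T format field, which prints the job state in its extended form.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the extended state of a job, for example COMPLETING.
job_state() {
  squeue -h -j "$1" -o "%T"
}

# Hypothetical helper: cancel the job only if it is still COMPLETING.
cancel_if_completing() {
  local jobid="$1"
  if job_state "$jobid" | grep -qx "COMPLETING"; then
    scancel "$jobid"
  else
    echo "job $jobid is not in COMPLETING"
  fi
}

# Usage: cancel_if_completing 12345
```

A job that has already left the queue produces no state line, so the helper reports it as not completing and does not call scancel.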
Last modified on May 27, 2026