In Slurm, COMPLETING is an interim state that occurs while the job epilog script and cleanup operations are running. If these cleanup operations fail, Slurm jobs may remain in the COMPLETING state indefinitely. In SUNK v6.8.0 and above, a built-in automation handles cleanup of jobs and nodes stuck in the COMPLETING state. For SUNK versions earlier than v6.8.0, follow the manual remediation steps in this guide to address a Slurm node stuck in COMPLETING.

Troubleshooting stuck nodes in SUNK versions earlier than v6.8.0

Do not perform a sync with Argo during this process. Syncing with Argo will overwrite the in-cluster changes prematurely.

1. List the stuck nodes

Identify the job stuck in COMPLETING and fetch its associated node list, using the squeue command:
squeue -t CG -o "%.18i %.9P %.8j %.8u %.2t %.10M %.10L %R"
For more information about this and other Slurm commands, see CoreWeave’s documentation about monitoring Slurm node states and Slurm job states.
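To act on each stuck node individually, it can help to flatten the squeue output into one hostname per line. The following is a minimal sketch, not part of the SUNK tooling: the function name list_completing_nodes is hypothetical, and it assumes squeue and scontrol are on the PATH of the host where it runs.

```shell
# Hypothetical helper: print one node name per line for every job in COMPLETING (CG).
list_completing_nodes() {
  # -h suppresses the header; %N prints only each job's node list,
  # which may be a compressed hostlist such as node-[1-3].
  squeue -h -t CG -o "%N" | while read -r nodelist; do
    # Skip jobs that report no node list.
    [ -n "$nodelist" ] || continue
    # Expand the compressed hostlist into individual hostnames.
    scontrol show hostnames "$nodelist"
  done | sort -u
}
```

Calling list_completing_nodes prints a de-duplicated list of node names that can be fed into the next step.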

2. Set nodes to the DOWN state

For each node, use scontrol to set the node state to DOWN with the reason "cleanup completing", as follows:
scontrol update node=[NODENAME] state=DOWN reason="cleanup completing"
This step is necessary because Slurm’s default cleanup_completing logic only addresses nodes in the DOWN state.
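When several nodes are affected, the scontrol command above can be wrapped in a loop. This is a sketch under the assumption that the node names are passed as arguments; the function name mark_nodes_down is hypothetical.

```shell
# Hypothetical helper: set every node given as an argument to DOWN,
# using the same reason string as the single-node command above.
mark_nodes_down() {
  local node
  for node in "$@"; do
    scontrol update node="$node" state=DOWN reason="cleanup completing"
  done
}
```

For example, mark_nodes_down node-1 node-2 issues one scontrol update per node.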

3. For nodes in INVALID_REG

If the nodes are also in the INVALID_REG state, set them to DOWN as shown in the previous step, then delete each node from Slurm so that it re-registers cleanly:
scontrol delete node=[NODENAME]
If scontrol delete returns "Requested nodes are busy", the node is part of an active reservation. Remove the node from the reservation before deleting it, then add it back after the node re-registers:
  1. Identify the reservation the node belongs to:
    scontrol show node [NODENAME] | grep -i reserv
    
  2. Remove the node from the reservation:
    scontrol update ReservationName=[RESERVATION-NAME] Nodes-=[NODENAME]
    
  3. Delete the node:
    scontrol delete node=[NODENAME]
    
  4. After the node re-registers, add it back to the reservation:
    scontrol update ReservationName=[RESERVATION-NAME] Nodes=+[NODENAME]
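
The reservation steps above can be sketched as three small shell functions. This is an illustration only: the function names are hypothetical, and node_reservation assumes that scontrol show node prints a ReservationName=... field for reserved nodes, which may vary by Slurm version. The scontrol update and delete arguments are taken directly from the steps above.

```shell
# Hypothetical helper: print the reservation a node belongs to (empty if none).
# Assumes `scontrol show node` includes a ReservationName=<name> field.
node_reservation() {
  scontrol show node "$1" | grep -io 'ReservationName=[^ ]*' | head -n 1 | cut -d= -f2
}

# Hypothetical helper: remove the node from its reservation (if any), then delete it.
delete_node() {
  local node="$1" resv="$2"
  if [ -n "$resv" ]; then
    scontrol update ReservationName="$resv" Nodes-="$node"
  fi
  scontrol delete node="$node"
}

# Hypothetical helper: add the node back to its reservation.
# Run this only after the node has re-registered with Slurm.
restore_reservation() {
  local node="$1" resv="$2"
  [ -n "$resv" ] || return 0
  scontrol update ReservationName="$resv" Nodes=+"$node"
}

# Example sequence for one node (replace [NODENAME] with a real node name):
#   resv="$(node_reservation [NODENAME])"
#   delete_node [NODENAME] "$resv"
#   ...wait for the node to re-register...
#   restore_reservation [NODENAME] "$resv"
```

Capturing the reservation name before deleting the node matters because the node record, and with it the reservation field, is gone once scontrol delete succeeds.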
    
After the impacted nodes are in the DOWN state, the Slurm job transitions from the COMPLETING state to the COMPLETED state. After a short time, the nodes transition back to IDLE.

If a node is DOWN and the Slurm job remains in the COMPLETING state, first try to cancel the job (for example, with scancel). If the job is still stuck in COMPLETING after that, reconfigure Slurm (for example, with scontrol reconfigure).
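
The fallback for a job that stays in COMPLETING can be sketched as follows. This is a hedged illustration rather than a documented SUNK procedure: the function names and the 30-second default wait are assumptions, and it assumes that "cancel" maps to scancel and "reconfigure" maps to scontrol reconfigure.

```shell
# Hypothetical helper: succeed if the given job ID is still in COMPLETING (CG).
job_is_completing() {
  # %t prints the compact job state code; CG means COMPLETING.
  [ "$(squeue -h -j "$1" -o "%t" 2>/dev/null | head -n 1)" = "CG" ]
}

# Hypothetical helper: cancel a stuck job, wait, and reconfigure only if it is still stuck.
# $1 is the job ID; $2 is the wait in seconds (assumed default: 30).
unstick_job() {
  local jobid="$1" wait_secs="${2:-30}"
  # Nothing to do if the job has already left COMPLETING.
  job_is_completing "$jobid" || return 0
  scancel "$jobid"
  sleep "$wait_secs"
  if job_is_completing "$jobid"; then
    scontrol reconfigure
  fi
}
```

For example, unstick_job 12345 60 cancels job 12345, waits one minute, and reconfigures only if the job is still in CG. Keeping the reconfigure conditional avoids touching the controller when cancellation alone is enough.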
Last modified on April 27, 2026