# Address nodes stuck during cleanup operations

> Troubleshooting steps for Slurm nodes stuck in the `COMPLETING` state

In Slurm, `COMPLETING` is an interim state that occurs while the job [epilog script](/products/sunk/run_workloads/prolog-epilog) and cleanup operations are running. If these cleanup operations fail, Slurm jobs may remain in the `COMPLETING` state indefinitely.

In SUNK v6.8.0 and later, a built-in automation handles cleanup of jobs and nodes stuck in the `COMPLETING` state. For SUNK versions earlier than v6.8.0, follow the manual remediation steps in this guide to address a Slurm node stuck in `COMPLETING`.

<Warning>
  Do not perform a sync with Argo during this process. Syncing with Argo overwrites the in-cluster changes prematurely.
</Warning>

<Steps>
  <Step title="List the stuck nodes">
    Use the `squeue` command to list the jobs stuck in `COMPLETING` (state code `CG`) and the nodes associated with each job:

    ```bash theme={"system"}
    squeue -t CG -o "%.18i %.9P %.8j %.8u %.2t %.10M %.10L %R"
    ```

    For more information about this and other Slurm commands, see the documentation for monitoring [Slurm node states](/products/sunk/manage_sunk/slurm-node-states) and [Slurm job states](/products/sunk/manage_sunk/slurm-job-states).
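
    When several jobs are stuck, it can help to reduce the `squeue` output to a plain list of node names. The following is a minimal sketch, not part of SUNK: `completing_nodes` is a hypothetical helper name, and it assumes `squeue` and `scontrol` are on your `PATH`. For a job in `COMPLETING`, the `%N` field lists only the nodes that have not yet returned to service.

    ```bash theme={"system"}
    # Print one node name per line for every job in the COMPLETING (CG) state.
    # -h suppresses the header; %N prints each job's node list.
    completing_nodes() {
      local hostlist
      squeue -h -t CG -o "%N" | sort -u | while read -r hostlist; do
        if [ -n "$hostlist" ]; then
          # Expand ranges such as node[01-03] into individual node names.
          scontrol show hostnames "$hostlist"
        fi
      done | sort -u
    }
    ```

    Run `completing_nodes` to print the list, and review it before acting on any node.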
  </Step>

  <Step title="Set nodes to the DOWN state">
    For each node, use `scontrol` to set the node state to `DOWN` with the reason `"cleanup completing"`. Replace `[NODENAME]` with the name of the stuck node:

    ```bash theme={"system"}
    scontrol update node=[NODENAME] state=DOWN reason="cleanup completing"
    ```

    This step is necessary because Slurm's default `cleanup_completing` logic only addresses nodes in the `DOWN` state.
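
    If more than one node is affected, the same command can be applied in a loop. This is a sketch under stated assumptions: `set_nodes_down` and the `DRY_RUN` variable are hypothetical names introduced here for illustration, and the `scontrol` command is the one shown in this step.

    ```bash theme={"system"}
    # Set each node passed as an argument to DOWN with the reason used in this guide.
    # Set DRY_RUN=1 to print the commands instead of running them.
    set_nodes_down() {
      local node
      for node in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
          echo "scontrol update node=$node state=DOWN reason=\"cleanup completing\""
        else
          scontrol update node="$node" state=DOWN reason="cleanup completing"
        fi
      done
    }
    ```

    For example, `DRY_RUN=1 set_nodes_down node01 node02` prints the two commands without changing any node state, which lets you confirm the node names first.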
  </Step>

  <Step title="Optional: Re-register nodes in INVALID_REG">
    Complete this step only if the nodes are also in the `INVALID_REG` state. First, set each affected node to `DOWN` as shown in the previous step.

    Then delete the node from Slurm for a clean re-registration:

    ```bash theme={"system"}
    scontrol delete node=[NODENAME]
    ```

    If `scontrol delete` returns `Requested nodes are busy`, the node is part of an active reservation. Remove the node from the reservation before deleting it, then add it back after the node re-registers. Replace `[RESERVATION-NAME]` with the name of the reservation:

    1. Identify the reservation the node belongs to:

       ```bash theme={"system"}
       scontrol show node [NODENAME] | grep -i reserv
       ```

    2. Remove the node from the reservation:

       ```bash theme={"system"}
       scontrol update ReservationName=[RESERVATION-NAME] Nodes-=[NODENAME]
       ```

    3. Delete the node:

       ```bash theme={"system"}
       scontrol delete node=[NODENAME]
       ```

    4. After the node re-registers, add it back to the reservation:

       ```bash theme={"system"}
       scontrol update ReservationName=[RESERVATION-NAME] Nodes=+[NODENAME]
       ```
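
    The last item in the list depends on the node having re-registered, so one option is to poll before adding the node back. The sketch below is illustrative only: `readd_to_reservation` and `POLL_INTERVAL` are hypothetical names, and it assumes that `sinfo` prints nothing for a node Slurm does not currently know about. Verify that behavior in your cluster before relying on it.

    ```bash theme={"system"}
    # Wait until Slurm lists the node again, then return it to its reservation.
    # Usage: readd_to_reservation NODENAME RESERVATION-NAME [TIMEOUT_SECONDS]
    readd_to_reservation() {
      local node="$1" resv="$2" timeout="${3:-300}" waited=0
      local interval="${POLL_INTERVAL:-10}"
      # Assumption: sinfo prints nothing for a node Slurm does not know about.
      until [ -n "$(sinfo -h -N -n "$node" -o "%N" 2>/dev/null)" ]; do
        if [ "$waited" -ge "$timeout" ]; then
          echo "Timed out waiting for $node to re-register" >&2
          return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
      done
      # Same command as the final item in the list above.
      scontrol update ReservationName="$resv" Nodes=+"$node"
    }
    ```

    The function returns a non-zero status on timeout, so the reservation is left unchanged if the node never reappears.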
  </Step>
</Steps>

After completing these steps, the impacted nodes should be in the `DOWN` state. The Slurm job then transitions from the `COMPLETING` state to the `COMPLETED` state, and after a short time, the nodes transition back to `IDLE`.
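
To confirm these transitions, you can check the node states and count the jobs that remain in `COMPLETING`. This is a minimal sketch, assuming `sinfo` and `squeue` are available; `check_cleanup` is a hypothetical helper name, not a SUNK command.

```bash theme={"system"}
# Print the current state of each node passed as an argument, then report
# how many jobs are still COMPLETING. Returns 0 only when none remain.
check_cleanup() {
  local node remaining
  for node in "$@"; do
    # %N is the node name and %T is the node state in extended form.
    sinfo -h -N -n "$node" -o "%N %T"
  done
  # grep -c . counts non-empty lines; || true keeps a zero count from failing.
  remaining="$(squeue -h -t CG -o "%i" | grep -c . || true)"
  echo "jobs still COMPLETING: $remaining"
  [ "$remaining" -eq 0 ]
}
```

A node that belongs to more than one partition appears once per partition in the `sinfo` output.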

If the node is `DOWN` and the Slurm job remains in the `COMPLETING` state, cancel the job (for example, with `scancel`). If the job is still stuck in `COMPLETING` after cancellation, reconfigure Slurm (for example, with `scontrol reconfigure`).
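
The escalation above can be sketched as a small helper. It assumes that canceling maps to `scancel` and reconfiguring maps to `scontrol reconfigure`, which this guide does not state explicitly. `escalate_job` and its wait time are hypothetical and introduced only for illustration.

```bash theme={"system"}
# Escalate for one job: cancel it, wait, and reconfigure only if it is
# still in the COMPLETING state afterward.
# Usage: escalate_job JOBID [WAIT_SECONDS]
escalate_job() {
  local jobid="$1" wait_seconds="${2:-60}"
  # Assumption: "cancel the job" corresponds to scancel.
  scancel "$jobid"
  sleep "$wait_seconds"
  # squeue prints the job ID only if the job is still in the CG state.
  if [ -n "$(squeue -h -j "$jobid" -t CG -o "%i" 2>/dev/null)" ]; then
    # Assumption: "reconfigure" corresponds to scontrol reconfigure.
    scontrol reconfigure
  fi
}
```

Because a reconfigure affects the whole cluster, the helper runs it only when the job is still listed as `COMPLETING` after the wait.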
