
# Run custom scripts with s6

> Run custom longrun and oneshot scripts on SUNK compute and login nodes using the s6 service manager

SUNK can run custom scripts on Compute and Login nodes with [s6](https://skarnet.org/software/s6/) using [s6-rc](https://skarnet.org/software/s6-rc/), a service manager for s6-based systems. This guide explains how to set up and run [two types of scripts](#determine-the-appropriate-script-type): `longrun` for continuous processes and `oneshot` for tasks that execute and terminate.

This guide is for cluster administrators who need to automate custom scripts on Compute and Login nodes within a SUNK cluster. The same method covers one-time setup tasks, such as installing packages, and services that must keep running.

## Define scripts in `values.yaml`

Define scripts in the appropriate sections of the Slurm chart's `values.yaml` file. Define Compute node scripts in the [`compute.s6`](/products/sunk/reference/slurm-parameters) section, and Login node scripts in the [`login.s6`](/products/sunk/reference/slurm-parameters) section. Each script needs a name, a type, and the script body. The following example defines two scripts for Compute nodes:

```yaml theme={"system"}
compute:
  s6:
    packages:
      type: oneshot
      script: |
        #!/usr/bin/env bash
        apt -y update
        apt -y install nginx
    nginx:
      type: longrun
      timeoutUp: 30000 # 30 seconds
      timeoutDown: 0
      script: |
        #!/usr/bin/env bash
        # exec replaces the shell so that s6 supervises the nginx process directly
        exec nginx -g "daemon off;"
```

The preceding example includes two scripts:

* `packages`: A `oneshot` script that installs `nginx` using the package manager.
* `nginx`: A `longrun` script that starts the `nginx` process and keeps it running.

The `nginx` script is assigned a [`timeoutUp` of 30000 milliseconds](#set-timeouts-for-scripts), which means it has up to 30 seconds to start successfully. Its `timeoutDown` of `0` means there is no limit on how long it can take to stop.

## Define and schedule different script types

The following sections explain how to choose the right script type for your task and how to avoid scheduling conflicts that can occur when scripts extend node startup time.

### Determine the appropriate script type

Decide whether the script is a `longrun` or a `oneshot` based on its purpose:

* Use `longrun` for scripts that should run continuously, like a web server.
* Use `oneshot` for scripts that run once to perform a setup task, like installing software.

### Avoid scheduling conflicts

If a `oneshot` script installs many packages or otherwise extends node startup time, account for the extra time by changing the `orphanedPodDelay` parameter in the `syncer` section of the [Slurm chart's `values.yaml` file](/products/sunk/reference/slurm-parameters).

The full path for this parameter is `syncer.config.syncer.orphanedPodDelay`. By default, the value of `orphanedPodDelay` is `120s`, or 120 seconds.

If a `oneshot` script takes longer to run than the value of `orphanedPodDelay`, increase the value to avoid scheduling conflicts.
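
As a minimal sketch, the following `values.yaml` fragment raises the delay using the parameter path given above. The value `300s` is an illustrative assumption, not a recommendation. Choose a value longer than your slowest `oneshot` script.

```yaml
syncer:
  config:
    syncer:
      # Default is 120s. 300s is a hypothetical value for a slow oneshot script.
      orphanedPodDelay: 300s
```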

## Set timeouts for scripts

Timeouts prevent scripts from hanging indefinitely and help keep nodes healthy. The `timeoutUp` parameter sets the time allowed for the script to start, and `timeoutDown` sets the time allowed for it to stop. Both parameters are optional and are specified in milliseconds. The default for each is `0`, which means the script doesn't time out.

<Warning>
  Without a configured timeout, the s6 script can become unresponsive indefinitely and cause the Slurm compute Pod to stay in a `Not Ready` state.
</Warning>

* For `oneshot` scripts, only `timeoutUp` is relevant, because it sets the maximum time the script has to complete.
* For `longrun` scripts, `timeoutUp` limits how long the process has to start, and `timeoutDown` limits how long it has to stop.
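
For example, a `oneshot` script on a Login node could be limited to five minutes as sketched below. The script name `packages` and the 300000 millisecond value are illustrative assumptions. The structure mirrors the Compute node example earlier in this guide.

```yaml
login:
  s6:
    packages:
      type: oneshot
      timeoutUp: 300000 # 5 minutes, the maximum time for the script to complete
      script: |
        #!/usr/bin/env bash
        apt -y update
        apt -y install nginx
```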

## Define behavior for failed scripts

When a user-defined script fails, the container can continue running silently, continue running and report an error message, or stop. In SUNK, containers stop on script failure by default.

You can control this behavior with the `S6_BEHAVIOUR_IF_STAGE2_FAILS` parameter in the appropriate `env` section of the `values.yaml` file.

* For **Login nodes**, set the parameter in the `login.env` section of the `values.yaml` file.
* For **Compute nodes**, set the parameter in the `compute.nodes.[TARGET-NODE].env` section of the `values.yaml` file.

The `S6_BEHAVIOUR_IF_STAGE2_FAILS` parameter accepts the following values:

| Value | Behavior                                                                                                    |
| ----- | ----------------------------------------------------------------------------------------------------------- |
| `0`   | If a script fails, the container continues to run silently, without providing an error message.             |
| `1`   | If a script fails, the container continues to run, but provides an error message warning about the failure. |
| `2`   | If a script fails, the container stops running. This is the default setting on SUNK.                        |

For more information about the values for `S6_BEHAVIOUR_IF_STAGE2_FAILS`, see the [s6-overlay](https://github.com/just-containers/s6-overlay?tab=readme-ov-file#customizing-s6-overlay-behaviour) customization options.
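
The following sketch sets the parameter for Login nodes so that a failed script produces a warning without stopping the container. It assumes that the `env` section accepts Kubernetes-style `name` and `value` entries. Confirm the exact format in the Slurm chart's parameter reference before using it.

```yaml
login:
  env:
    # Assumed entry format (Kubernetes-style name/value pairs).
    - name: S6_BEHAVIOUR_IF_STAGE2_FAILS
      value: "1" # Keep running, but warn about the failure
```

Under the same assumption, the equivalent entry for Compute nodes goes in the `compute.nodes.[TARGET-NODE].env` section.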
