Run custom scripts with s6
Run user-defined scripts on SUNK
SUNK can run custom scripts on Compute and Login nodes with s6 using s6-rc, a service manager for s6-based systems. This guide explains how to set up and run two types of scripts: `longrun` for continuous processes and `oneshot` for tasks that execute and terminate.
With this setup, you can automate the execution of custom scripts on Compute and Login nodes within the SUNK cluster. Whether you're installing necessary packages or keeping essential services running, this method gives you a straightforward way to manage those processes.
Define scripts in values.yaml
Define scripts in the appropriate sections of the Slurm chart's `values.yaml` file. Define Compute node scripts in the `compute.s6` section, and Login node scripts in the `login.s6` section. Each script needs a name, a type, and the script itself. The following example shows script definitions for Compute nodes:
    compute:
      s6:
        packages:
          type: oneshot
          script: |
            #!/usr/bin/env bash
            apt -y update
            apt -y install nginx
        nginx:
          type: longrun
          timeoutUp: 30000 # 30 seconds
          timeoutDown: 0
          script: |
            #!/usr/bin/env bash
            nginx -g "daemon off;"
The above example includes two scripts:
- `packages`: This `oneshot` script installs `nginx` using the package manager.
- `nginx`: A `longrun` script that starts the `nginx` process and keeps it running.
The `nginx` script is assigned a `timeoutUp` of 30000 milliseconds, which means it has up to 30 seconds to start successfully.
Define and schedule different script types
Determine the appropriate script type
Decide whether the script is a `longrun` or a `oneshot` based on its purpose:
- Use `longrun` for scripts that should run continuously, like a web server.
- Use `oneshot` for scripts that run once to perform a setup task, like installing software.
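The same two types apply to Login nodes. The following is a minimal sketch, assuming the `login.s6` section takes the same `type` and `script` fields as `compute.s6`. The script names `motd` and `heartbeat` and their contents are hypothetical and only illustrate the distinction between the two types.

```yaml
login:
  s6:
    # Hypothetical oneshot: runs once at startup, then exits.
    motd:
      type: oneshot
      script: |
        #!/usr/bin/env bash
        echo "Welcome to the cluster" > /etc/motd
    # Hypothetical longrun: stays in the foreground so s6 can supervise it.
    heartbeat:
      type: longrun
      script: |
        #!/usr/bin/env bash
        while true; do
          date
          sleep 60
        done
```

A `longrun` script should not fork into the background and exit. This is why the `nginx` example in this guide runs with `daemon off;`.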
Avoid scheduling conflicts
If your `oneshot` script installs numerous packages or performs other tasks that extend startup time, you must account for this by adjusting the `orphanedPodDelay` parameter in the `syncer` section of the Slurm chart's `values.yaml` file.
The full path for this parameter is `syncer.config.syncer.orphanedPodDelay`. By default, the value of `orphanedPodDelay` is `120s`, or 120 seconds.
If the time required for a `oneshot` script to run exceeds the value set in `orphanedPodDelay`, increase the value to avoid scheduling conflicts.
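As a sketch of what this change might look like, the fragment below raises the delay to ten minutes. The nesting follows the `syncer.config.syncer.orphanedPodDelay` path given above. The value `600s` is only an illustration that mirrors the format of the `120s` default; choose a value longer than your slowest `oneshot` script actually takes.

```yaml
syncer:
  config:
    syncer:
      # Illustrative value: 600 seconds instead of the 120s default.
      orphanedPodDelay: 600s
```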
Set timeouts for scripts
For finer control, you can set timeouts for your scripts. The `timeoutUp` parameter sets the time allowed for the script to start, and the `timeoutDown` parameter sets the time allowed for it to stop. Both parameters are optional and specified in milliseconds. By default, they are set to `0`, which means that the script never times out.
- For `oneshot` scripts, only `timeoutUp` is relevant, as it's the maximum completion time for the script.
- For `longrun` scripts, both `timeoutUp` and `timeoutDown` control how long the process has to start and stop.
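The fragment below sketches how these timeouts might be applied to the `packages` and `nginx` scripts used earlier in this guide. The specific timeout values are illustrative assumptions, not recommendations.

```yaml
compute:
  s6:
    packages:
      type: oneshot
      # Illustrative: the whole script must finish within 5 minutes.
      timeoutUp: 300000
      script: |
        #!/usr/bin/env bash
        apt -y update
        apt -y install nginx
    nginx:
      type: longrun
      # Illustrative: up to 30 seconds to start, up to 10 seconds to stop.
      timeoutUp: 30000
      timeoutDown: 10000
      script: |
        #!/usr/bin/env bash
        nginx -g "daemon off;"
```

Because the values are in milliseconds, multiply the number of seconds by 1000. For example, 5 minutes is 300 seconds, or `300000`.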
Define behavior for failed scripts
When a user-defined script fails, the container can continue running silently, provide an error message, or stop completely. In SUNK, the default setting has containers stop upon script failure.
You can control this behavior with the `S6_BEHAVIOUR_IF_STAGE2_FAILS` parameter in the appropriate `env` section of the `values.yaml` file.
- For Login nodes, change the value of the parameter in the `login.env` section of the `values.yaml` file.
- For Compute nodes, change the value of the parameter in the `compute.nodes.<target node>.env` section of the `values.yaml` file for each target node.
The `S6_BEHAVIOUR_IF_STAGE2_FAILS` parameter can contain the following values:
Value | Behavior |
---|---|
0 | If a script fails, the container will continue to run silently, without providing an error message. |
1 | If a script fails, the container will continue to run, but will provide an error message warning about the failure. |
2 | If a script fails, the container will stop running completely. This is the default setting on SUNK. |
For more information on the values for `S6_BEHAVIOUR_IF_STAGE2_FAILS`, see the s6-overlay customization options.
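As a rough sketch, the fragment below sets the parameter to `1` for Login nodes, so that a failed script produces a warning without stopping the container. It assumes that the `env` section accepts Kubernetes-style `name`/`value` entries. This guide does not show the exact shape of that section, so confirm it against the chart's values reference before using it.

```yaml
# Assumption: env entries are Kubernetes-style name/value pairs.
login:
  env:
    - name: S6_BEHAVIOUR_IF_STAGE2_FAILS
      value: "1" # continue running, but warn about the failure
```

For Compute nodes, the same entry would go under `compute.nodes.<target node>.env` for each node definition you want to change.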