Run Prolog and Epilog scripts on SUNK

This page explains how SUNK runs Slurm Prolog and Epilog scripts, the best practices to follow when authoring them, and how SUNK behaves when these scripts fail. Use this information to customize job setup and teardown on SUNK clusters while keeping nodes healthy and workloads consistent.

In Slurm, Prolog and Epilog scripts run before and after each job, respectively. They’re commonly used in HPC environments to perform setup and teardown tasks and to customize job behavior.

Prolog scripts run on the primary compute node before the job starts. They ensure pre-job tasks are done, such as setting environment variables, loading software modules, or mounting file systems. They can also collect job metadata for debugging or optimization.
Epilog scripts run after the job finishes, regardless of success or failure. They handle post-job tasks, such as logging metrics, notifying users, archiving outputs, transferring data to external systems, or cleaning up files.

In SUNK, these scripts are deployed and managed as Kubernetes ConfigMaps, providing a centralized way to manage them. This approach ensures that script changes propagate consistently across all Slurm nodes in the cluster. Slurm also supports PrologSlurmctld and EpilogSlurmctld scripts that run on the control node for system-wide policy enforcement. For more information, see the Slurm Prolog and Epilog Guide.

How SUNK handles Prolog and Epilog

Slurm typically uses a single script each for Prolog and Epilog. SUNK extends this with two entrypoint scripts, prolog.sh and epilog.sh, which set up the environment and call run-parts.sh to run every script in the /etc/slurm/prolog.d/ and /etc/slurm/epilog.d/ directories, respectively. This modular approach lets you break a workflow into smaller, reusable scripts.
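To make the directory-based approach concrete, the following is a minimal sketch of how an entrypoint could run every script in a directory. It is not the run-parts.sh that ships with SUNK. The run_dir function name, the lexical ordering, and the stop-at-first-failure behavior are all assumptions made for illustration.

```shell
#!/bin/bash
# Illustrative sketch only: NOT the run-parts.sh that ships with SUNK.
# run_dir (hypothetical name) runs each executable file in a directory in
# lexical order and stops at the first failure, returning that exit code.
run_dir() {
    local dir="$1" script rc
    [ -d "$dir" ] || return 0              # no directory, nothing to run
    for script in "$dir"/*; do
        # Skip anything that is not a regular, executable file.
        [ -f "$script" ] && [ -x "$script" ] || continue
        "$script"
        rc=$?
        if [ "$rc" -ne 0 ]; then
            echo "script ${script} failed with exit code ${rc}" >&2
            return "$rc"
        fi
    done
    return 0
}
```

Under these assumptions, calling run_dir /etc/slurm/prolog.d would run files such as 10-first.sh before 20-second.sh, so a numeric prefix is one way to control ordering.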

Best practices

When using Prolog and Epilog scripts in SUNK, follow these best practices for efficiency, reliability, and security:

Minimize resource usage: Keep scripts lightweight to avoid exceeding Kubernetes pod limits. Scripts should use minimal CPU and memory so they don’t interfere with user jobs, and should run as quickly as possible to avoid delaying job start or cleanup. Avoid Slurm commands such as squeue, scontrol, or sacctmgr in the scripts, because they can cause performance issues.
Test thoroughly: Validate scripts in a test environment before deployment to prevent disruptions in production.
Ensure idempotency: Scripts must be safe to run multiple times without causing unintended effects, especially during retries and restarts. Implement error handling and log errors for debugging.
Handle environment dependencies: Identify what each script depends on and ensure those dependencies are met. Because scripts don’t have a search path set, use fully qualified paths or set the PATH environment variable to run programs.
Log script execution: Include logging to track script execution, identify issues, and capture relevant job details such as job ID and node assignments. Don’t write credentials to logs or hardcode them in scripts; use a secure method to handle secrets.
Review and update regularly: Update scripts regularly to reflect changes in the environment, job requirements, or security policies. Keep scripts version-controlled and document changes to ensure team awareness.
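As an illustration of several of these practices together, here is a minimal sketch of a script that could be placed in a prolog.d directory. It creates a per-job scratch directory, sets PATH explicitly, is safe to rerun, and logs the job ID and node name. The SCRATCH_ROOT and LOG_FILE variables, their default paths, and the function names are hypothetical. SLURM_JOB_ID, SLURM_JOB_UID, and SLURMD_NODENAME are environment variables that Slurm provides to Prolog scripts.

```shell
#!/bin/bash
# Hypothetical prolog.d script: creates a per-job scratch directory.
# SCRATCH_ROOT and LOG_FILE are illustrative names, not SUNK settings.

# Prolog scripts have no search path set, so define one explicitly.
export PATH=/usr/local/bin:/usr/bin:/bin

SCRATCH_ROOT="${SCRATCH_ROOT:-/tmp/job-scratch}"
LOG_FILE="${LOG_FILE:-/tmp/prolog-scratch.log}"

# Append a timestamped line that includes the job ID and node name.
log() {
    printf '%s job=%s node=%s %s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        "${SLURM_JOB_ID:-unknown}" "${SLURMD_NODENAME:-unknown}" "$*" \
        >> "$LOG_FILE"
}

make_scratch() {
    local dir="${SCRATCH_ROOT}/${SLURM_JOB_ID:-unknown}"
    # mkdir -p succeeds if the directory already exists, so reruns are safe.
    if ! mkdir -p "$dir"; then
        log "ERROR could not create ${dir}"
        return 1
    fi
    chmod 700 "$dir"
    # Hand the directory to the job owner when running as root.
    if [ -n "${SLURM_JOB_UID:-}" ] && [ "$(id -u)" -eq 0 ]; then
        chown "$SLURM_JOB_UID" "$dir"
    fi
    log "scratch directory ready: ${dir}"
}

# The exit status of the last command becomes the script's exit status.
make_scratch
```

The script does not call any Slurm commands and does only a small amount of work, which keeps its effect on job start time low.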

Failure handling

Understanding how SUNK reacts when a Prolog or Epilog script fails helps you predict job and node behavior and design scripts that recover safely. How a failure is handled depends on the type of script and the context in which it runs. The following rules apply in Slurm when a Prolog or Epilog script exits with a non-zero exit code:

If Prolog fails, the node is set to a DRAIN state, and the job is requeued. The job is placed in a held state unless nohold_on_prolog_fail is configured in SchedulerParameters.
If PrologSlurmctld fails for a batch job, the job is requeued.
If PrologSlurmctld fails for an interactive job, such as one started with salloc or srun, the job is canceled.
If Epilog fails, the node is set to a DRAIN state.
If EpilogSlurmctld fails, the failure is only logged, and the job isn’t requeued.
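Because a non-zero exit code from Prolog drains the node and requeues the job, a script can use its exit code deliberately to keep jobs off an unhealthy node. The following is a minimal sketch of such a check. The REQUIRED_DIR variable, its placeholder default, and the check_required_dir function are hypothetical, and the sketch assumes that the entrypoint passes a script's non-zero exit code back to Slurm.

```shell
#!/bin/bash
# Hypothetical prolog.d health check. REQUIRED_DIR is an illustrative name;
# /tmp is only a placeholder default for a directory the job depends on.
REQUIRED_DIR="${REQUIRED_DIR:-/tmp}"

# Return 0 if the directory exists and is readable, 1 otherwise.
check_required_dir() {
    if [ -d "$1" ] && [ -r "$1" ]; then
        return 0
    fi
    echo "required directory $1 is missing or unreadable" >&2
    return 1
}

# The status of this last command becomes the script's exit status.
# Assumption: a non-zero status reaches Slurm as a Prolog failure.
check_required_dir "$REQUIRED_DIR"
```

Reserve non-zero exit codes for conditions that make the node unfit to run the job, because each failure takes the node out of service until it is resumed.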
