Prolog and Epilog Scripts

In Slurm, prolog and epilog scripts are executed before and after each job runs, respectively. They're commonly used in HPC environments to perform setup and teardown tasks. These scripts are essential for customizing job behavior and maintaining consistency across workloads.

In SUNK, these scripts are deployed and managed as Kubernetes ConfigMaps, providing a scalable and centralized way to manage them. This approach ensures that script changes are consistently propagated across all Slurm nodes in the cluster.
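As a rough illustration, the sketch below packages a directory of prolog scripts into a ConfigMap with kubectl. The ConfigMap name, namespace, and directory layout are placeholders, not names SUNK requires; consult your SUNK deployment values for the ConfigMaps it actually mounts.

```bash
# Hypothetical example: bundle local prolog scripts into a ConfigMap.
# The name "slurm-prolog-scripts", the "slurm" namespace, and the prolog.d/
# directory are placeholders for whatever your SUNK deployment expects.
kubectl create configmap slurm-prolog-scripts \
  --namespace slurm \
  --from-file=prolog.d/ \
  --dry-run=client -o yaml | kubectl apply -f -
```

Piping through `--dry-run=client -o yaml` and `kubectl apply` keeps the command repeatable, so script updates can be rolled out the same way they were first created.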

This guide explains the best practices for using prolog and epilog scripts in SUNK and how failures are handled.

Overview

  • Prolog scripts run on the primary compute node before the job starts. They ensure pre-job tasks are done, like setting environment variables, loading software modules, or mounting file systems. They can also collect job metadata for debugging or optimization.

  • Epilog scripts run after the job finishes, regardless of success or failure. They handle post-job tasks, such as logging metrics, notifying users, archiving outputs, transferring data to external systems, or cleaning up files.

Slurm also supports PrologSlurmctld and EpilogSlurmctld scripts that run on the control node for system-wide policy enforcement.

For more details, see the Slurm Prolog and Epilog Guide.

How SUNK handles Prolog and Epilog

While Slurm typically uses a single script for Prolog and Epilog, SUNK extends this with entrypoint scripts—prolog.sh and epilog.sh—that set up the environment and call run-parts.sh to execute all scripts in the /etc/slurm/prolog.d/ and /etc/slurm/epilog.d/ directories. This modular approach makes it easy to manage complex workflows by breaking them into smaller, reusable scripts.
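As a sketch of what such a drop-in might look like, a script placed in /etc/slurm/prolog.d/ could prepare per-job scratch space. The scratch path is hypothetical, and the exact environment variables available depend on your Slurm configuration.

```bash
#!/bin/bash
# Hypothetical drop-in for /etc/slurm/prolog.d/: prepare per-job scratch
# space before the job starts. The scratch root is a placeholder path.
set -euo pipefail

SCRATCH_ROOT="/mnt/local-scratch"
JOB_SCRATCH="${SCRATCH_ROOT}/job-${SLURM_JOB_ID}"

mkdir -p "${JOB_SCRATCH}"
# SLURM_JOB_UID is exported to Prolog/Epilog by slurmd; chown accepts a numeric UID.
chown "${SLURM_JOB_UID}" "${JOB_SCRATCH}"
exit 0
```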

Best practices

When using prolog and epilog scripts in SUNK, follow these best practices for efficiency, reliability, and security:

Minimize Resource Usage: Ensure scripts are lightweight to avoid exceeding Kubernetes pod limits. They should use minimal CPU and memory to prevent interference with user jobs. Keep scripts as short and fast as possible to avoid delays in job start or cleanup. Avoid using Slurm commands like squeue, scontrol, or sacctmgr in the script, as they can cause performance issues.
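For example, much of the job information a script needs is already exported as environment variables, so it is usually possible to avoid querying the controller at all. This is only a sketch; the exact variables available depend on the script type.

```bash
# Prefer the environment slurmd already provides over querying the controller.
# Avoid: scontrol show job "${SLURM_JOB_ID}"   (adds load on slurmctld)
# Instead, read the variables exported to Prolog/Epilog:
echo "job=${SLURM_JOB_ID} uid=${SLURM_JOB_UID} node=${SLURMD_NODENAME}"
```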

Test Thoroughly: Validate scripts in a test environment before deployment to prevent disruptions in production.

Ensure Idempotency: Scripts should be able to run multiple times without causing unintended effects, especially during retries and restarts. Implement error handling and log errors for debugging.
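A minimal sketch of an idempotent epilog fragment with basic error handling; the scratch path and log tag are placeholders:

```bash
#!/bin/bash
# Hypothetical idempotent epilog fragment: running it twice has the same
# effect as running it once. The scratch path is a placeholder.
set -euo pipefail

JOB_SCRATCH="/mnt/local-scratch/job-${SLURM_JOB_ID}"

# A second run finds nothing to delete and exits cleanly.
if [ -d "${JOB_SCRATCH}" ]; then
  rm -rf "${JOB_SCRATCH}" || logger -t slurm-epilog "cleanup failed for ${JOB_SCRATCH}"
fi
exit 0
```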

Handle Environment Dependencies: Be aware of environment dependencies and ensure they are met. Since scripts don't have a search path set, use fully qualified paths or set a PATH environment variable for executing programs.
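A short sketch of both approaches; the PATH value and paths are illustrative:

```bash
#!/bin/bash
# Prolog/epilog scripts start without a search path, so either set PATH explicitly...
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

# ...or invoke programs by their fully qualified paths:
/usr/bin/mkdir -p "/tmp/job-${SLURM_JOB_ID}"
```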

Logging: Include logging to track script execution, identify issues, and capture relevant job details like job ID and node assignments. Avoid hardcoding credentials in logs—use secure methods for secrets.
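A minimal logging sketch, assuming the logger utility is present in the node image:

```bash
# Tag messages so they are easy to find in the node's syslog/journal.
logger -t slurm-prolog "job=${SLURM_JOB_ID} node=${SLURMD_NODENAME} prolog starting"

# Log outcomes, never secrets: pull credentials from a secure source
# (for example a mounted secret) instead of echoing them.
```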

Regular Reviews and Updates: Regularly update scripts to reflect changes in the environment, job requirements, or security policies. Keep scripts version-controlled and document changes to ensure team awareness.

Failure handling

Failure handling in prolog and epilog scripts depends on the type of script and the context in which it runs. The following rules apply when a script exits with a non-zero code:

  • If Prolog fails, the node is set to a DRAIN state and the job is requeued. The job is placed in a held state unless nohold_on_prolog_fail is configured in SchedulerParameters (see the configuration sketch after this list).
  • If PrologSlurmctld fails for a batch job, the job is requeued.
  • If PrologSlurmctld fails for an interactive job (one started with salloc or srun), the job is canceled.
  • If Epilog fails, the node is set to a DRAIN state.
  • If EpilogSlurmctld fails, the failure is only logged, and the job is not requeued.
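
For reference, a hedged slurm.conf excerpt showing where nohold_on_prolog_fail would be set; merge it with the SchedulerParameters values already present in your configuration rather than copying it verbatim:

```
# slurm.conf excerpt (sketch only; keep your existing SchedulerParameters values)
SchedulerParameters=nohold_on_prolog_fail
```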