2. Submit a simple job to the cluster

Prepare the job script

The most common way to run jobs on Slurm is to submit a job script with sbatch. A script typically consists of three sections:

Directives: Slurm directives begin with #SBATCH. These directives tell Slurm which resources and settings are needed to run the script.
Environment variables: This section sets up the environment your application needs to run. It is common practice to load modules here and to set environment variables such as $PATH and $LD_LIBRARY_PATH.
Code: The actual code that runs your application.

Below is an example of a basic Slurm job script, which sets Slurm directives using #SBATCH lines, then calls a Python program:

#!/bin/bash

# SBATCH directives
#SBATCH --job-name=my_job # The name of your job
#SBATCH --output=my_job.out # The output file location of your job
#SBATCH --error=my_job.err # The error file location of your job
#SBATCH --time=01:00:00 # The maximum time your job can run - in this case, 1 hour
#SBATCH --nodes=1 # The number of nodes to use
#SBATCH --ntasks-per-node=1 # The number of tasks to run per node
#SBATCH --gres=gpu:1 # The number of GPUs to use

# Export environment variables
export VARIABLE1="variable_1"

echo "Starting job"
date
hostname

echo "Running calculation..."

# In this section, your application is called.
# For example, to run a Python script called `myprog.py`:
python3 /home/username/myprog.py

echo "Calculation complete"

echo "Ending job"
date

Sbatch directives

The #SBATCH directives are used to configure the job submission. Each directive specifies a different aspect of the job, such as its name, output files, time limits, and resource requirements. Below is a table summarizing some common directives:

Directive	Description
`#SBATCH --job-name=`	Sets the name of the job; useful for locating it in status reports later
`#SBATCH --output=`	Specifies the output file for standard output; you can use a full path, or one relative to the submission directory
`#SBATCH --error=`	Specifies the output file for standard error
`#SBATCH --time=`	Sets the time limit for the job; in the example above, 1 hour. If the job is still running after the specified duration, the job is stopped
`#SBATCH --nodes=`	How many nodes the job should request; in the example above, 1 node
`#SBATCH --ntasks-per-node=`	How many tasks per node this job should request; in the example above, 1 task per node
`#SBATCH --gres=gpu:`	How many GPUs the job should request on each node; in the example above, 1 GPU
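
Before submitting, it can help to confirm which lines Slurm will read as directives. The sketch below is one way to do that with grep. The helper name list_sbatch_directives and the throwaway script written to a temporary file are illustrative assumptions, not part of Slurm. Note that a line such as "# SBATCH directives", with a space after the #, is an ordinary comment and is not read as a directive.

```shell
#!/bin/bash
# Minimal sketch: list the #SBATCH directives found in a job script.
# list_sbatch_directives is a hypothetical helper, not a Slurm command.

list_sbatch_directives() {
    # Print only lines that begin with "#SBATCH" (no space after '#').
    grep '^#SBATCH' "$1"
}

# A throwaway stand-in for your own job script.
demo_script="$(mktemp)"
cat > "$demo_script" <<'EOF'
#!/bin/bash
# SBATCH directives
#SBATCH --job-name=my_job
#SBATCH --time=01:00:00
echo "Starting job"
EOF

list_sbatch_directives "$demo_script"
rm -f "$demo_script"
```

To check your own script, pass its path instead of the temporary file.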

Submit the job

Submit your job to the Slurm queue by passing the job script to sbatch. The command below assumes the script above was saved as my_job_script.sh. Slurm then allocates resources to run your job based on the #SBATCH directives in the script. If resources matching these directives are available, the job may start immediately; otherwise, it waits in the queue until other jobs finish.

Example command

sbatch my_job_script.sh

After job submission, a job ID is returned. In this example, the job ID is 371.

Example output

Submitted batch job 371
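
When a later step needs the job ID, for example to check status or cancel the job, it can be captured from the confirmation line rather than copied by hand. The following is a minimal sketch that parses the sample line shown above. The helper name job_id_from_sbatch_output is an assumption made for illustration.

```shell
#!/bin/bash
# Minimal sketch: extract the job ID from sbatch's confirmation line.
# job_id_from_sbatch_output is a hypothetical helper, not a Slurm command.

job_id_from_sbatch_output() {
    # Expects a line like "Submitted batch job 371"; prints the last field.
    printf '%s\n' "$1" | awk '/^Submitted batch job/ {print $NF}'
}

# Sample line taken from the example output above.
job_id="$(job_id_from_sbatch_output "Submitted batch job 371")"
echo "Job ID: $job_id"

# On a cluster, the real output could be captured instead, for example:
#   job_id="$(job_id_from_sbatch_output "$(sbatch my_job_script.sh)")"
```

sbatch also has a --parsable option that prints the job ID without the surrounding text (followed by the cluster name, if one is present), which may be simpler in scripts.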

Check the job status

To check the status of your job, use the squeue command. Running squeue without any arguments shows all jobs in the queue, including their status, time running, and the nodes they are running on. To see a specific job, pass the job ID returned at submission with the -j option. In this example, where the job ID is 371, run squeue -j 371. To see the jobs submitted by a particular user, run squeue -u myname.

squeue

You should see output similar to the following:

JOBID PARTITION     NAME     USER   ST      TIME   NODES NODELIST(REASON)
      h100     my_job   user1  R       2:01      1   h100-208-101
      h100     calc     user2  R       1:38      1   h100-208-153
      h100     calc2    user3  R       1:25      1   h100-208-153

Job state	Explanation
`PD`	Job is pending. The NODELIST(REASON) column details why.
`R`	Job is running. The NODELIST(REASON) column shows which nodes the job is using.
`CG/CD`	Job is completing (`CG`) or has completed (`CD`).
`F`	Job has failed.
`TO`	Job was terminated because it reached its time limit. The time limit is usually set in the job script or in the partition configuration.
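
In scripts, the compact state code is often easier to act on once it is translated into words. The sketch below maps only the codes listed in the table above. The helper name describe_job_state is an assumption, and Slurm defines more states than the ones handled here.

```shell
#!/bin/bash
# Minimal sketch: translate a compact squeue state code into a short description.
# describe_job_state is a hypothetical helper covering only the codes in the table.

describe_job_state() {
    case "$1" in
        PD) echo "pending" ;;
        R)  echo "running" ;;
        CG) echo "completing" ;;
        CD) echo "completed" ;;
        F)  echo "failed" ;;
        TO) echo "timed out" ;;
        *)  echo "unknown state: $1" ;;
    esac
}

describe_job_state R

# On a cluster, the code for one job could come from squeue, for example:
#   describe_job_state "$(squeue -j 371 -h -o %t)"
# Here -h suppresses the header and -o %t prints only the compact state.
```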

For more detailed metrics on your job, view your job in the Slurm / Job Metrics Grafana Dashboard.

Cancel your job

To cancel your job, use the scancel command.

scancel 371

If your job is pending, it will be removed from the queue. If your job is running, it will be stopped.

View job output

Any standard output and error output are collected in the files specified in the sbatch directives in the job script. After your job has completed, you can view the output in these files.

It is recommended to use a text editor to view these files, as their contents may be lengthy.
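
If you would rather stay in the terminal, printing only the last few lines of each file is often enough to see how a job ended. The following is a minimal sketch. The helper name show_job_output is an assumption, and the file names follow the --output and --error values from the example script.

```shell
#!/bin/bash
# Minimal sketch: print the last lines of a job's output and error files.
# show_job_output is a hypothetical helper; adjust the file names to your directives.

show_job_output() {
    local out_file="$1" err_file="$2" lines="${3:-20}"
    local f
    for f in "$out_file" "$err_file"; do
        echo "== $f =="
        if [ -f "$f" ]; then
            tail -n "$lines" "$f"
        else
            echo "(file not found)"
        fi
    done
}

# Show the last 10 lines of each file from the example script.
show_job_output my_job.out my_job.err 10
```

While a job is still running, tail -f my_job.out follows the file as new lines are written.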

View cluster information

To view more details about your cluster, use the sinfo command.

Example command

sinfo

This example output describes a small cluster made up of 3 nodes. Each line shows the partition name, availability, time limit, number of nodes, state, and node list.

Example output

PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
h100              up   infinite      1    mix h100-208-101
h100              up   infinite      1   idle h100-208-[101,103,125]
h100              up   infinite      1  alloc h100-208-153

Node states

Status	Description
`idle`	Nodes are currently idle
`alloc`	Nodes are fully allocated to jobs
`mix`	Nodes are running jobs, but still have some available space
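
To answer a question such as "how many nodes are idle right now?", the NODES column can be summed for a single state. The sketch below does this with awk against the sample text shown above. It assumes the default column layout (NODES in column 4, STATE in column 5), and the helper name count_nodes_in_state is hypothetical.

```shell
#!/bin/bash
# Minimal sketch: sum the NODES column of sinfo output for one state.
# Assumes the default layout shown above: NODES is column 4, STATE is column 5.

count_nodes_in_state() {
    # Reads sinfo output on stdin; $1 is the state to match exactly.
    awk -v state="$1" 'NR > 1 && $5 == state {total += $4} END {print total + 0}'
}

# Sample text copied from the example output above.
sample_sinfo='PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
h100              up   infinite      1    mix h100-208-101
h100              up   infinite      1   idle h100-208-[101,103,125]
h100              up   infinite      1  alloc h100-208-153'

printf '%s\n' "$sample_sinfo" | count_nodes_in_state idle

# On a cluster: sinfo | count_nodes_in_state idle
```

sinfo can also filter by state itself with its -t (--states) option, for example sinfo -t idle.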

Additional statuses

The job status is only visible via squeue while jobs are in a pending or running state, and for a few hours after completion. To see the status of a completed or failed job, use the sacct command instead.

Example command

sacct

In this example output, we can see that the selected job completed successfully. The states of this simple job are shown in three parts:

Example output

JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
371              my_job       h100       root          1  COMPLETED      0:0
371.batch         batch                  root          1  COMPLETED      0:0
371.0        all_reduc+                  root          1  COMPLETED      0:0

The line with a JobID of 371 and a JobName of my_job displays the state of the entire job.
The line prefaced with a JobID of 371.batch and a JobName of batch displays the state of the submitted script.
The final line, with a JobID of 371.0 and a JobName of all_reduc+, displays the status of one of the tests that CoreWeave runs after each job to ensure that the cluster is performing correctly. It typically takes a few seconds to run.
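
To read the state of just one of these lines in a script, the State column can be selected by JobID. The following is a minimal sketch run against the sample output above. It assumes the default columns, where State is the second-to-last field, and the helper name state_for_job_id is hypothetical.

```shell
#!/bin/bash
# Minimal sketch: print the State column for one JobID from sacct output.
# Assumes the default columns shown above, where State is the second-to-last field.

state_for_job_id() {
    # Reads sacct output on stdin; $1 is the exact JobID (for example 371 or 371.batch).
    # Appending "" forces a string comparison, so 371 does not also match 371.0.
    awk -v id="$1" '($1 "") == (id "") {print $(NF - 1)}'
}

# Sample text copied from the example output above.
sample_sacct='JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
371              my_job       h100       root          1  COMPLETED      0:0
371.batch         batch                  root          1  COMPLETED      0:0
371.0        all_reduc+                  root          1  COMPLETED      0:0'

printf '%s\n' "$sample_sacct" | state_for_job_id 371
printf '%s\n' "$sample_sacct" | state_for_job_id 371.batch

# On a cluster: sacct -j 371 | state_for_job_id 371
```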

Busy clusters may contain thousands of jobs. It can be helpful to restrict the output to a specific user, a specific time range, or both.

To see jobs only from a specific user:

Example command

sacct --user=user1

To see jobs from a specific time range:

Example command

sacct --starttime=2025-03-24T06:10 --endtime=2025-03-24T07:20

The --starttime option selects jobs that were in any state after the given time, and the --endtime option selects jobs that were in any state before the given time. Times are written in the ISO 8601 style used above, YYYY-MM-DDTHH:MM.
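
Typing timestamps by hand is error-prone when the window of interest is relative, such as "the last 90 minutes". The sketch below builds a --starttime value in the format used above. It assumes GNU date (for the -d option), and the helper name minutes_ago_iso is hypothetical. The script only prints the resulting sacct command rather than running it.

```shell
#!/bin/bash
# Minimal sketch: build a --starttime value for "N minutes ago".
# Assumes GNU date (the -d option); minutes_ago_iso is a hypothetical helper.

minutes_ago_iso() {
    # Prints a timestamp such as 2025-03-24T06:10 for the given offset.
    date -d "$1 minutes ago" +%Y-%m-%dT%H:%M
}

start="$(minutes_ago_iso 90)"

# Print the command that would list user1's jobs from the last 90 minutes.
echo "sacct --user=user1 --starttime=$start"
```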
