Prepare the job script
The most common way to run jobs on Slurm is to submit job scripts. Typically, a script consists of three sections:
- Directives: Slurm directives begin with #SBATCH. These directives tell Slurm which resources and settings are needed to run the script.
- Environment variables: Variables that your application requires to run. In this section, it is common practice to load modules and set environment variables such as $PATH and $LD_LIBRARY_PATH.
- Code: The actual code that runs your application.
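The three sections could be laid out as in the minimal sketch below. The job name my_job and the file name my_job_script.sh match the names used elsewhere on this page. The directive values, the output file names, the $HOME/bin and $HOME/lib paths, and the echo command are illustrative assumptions, not CoreWeave defaults.

```shell
#!/bin/bash

# Directives: resources and settings Slurm needs to run the script.
# %j in a file name expands to the job ID.
#SBATCH --job-name=my_job
#SBATCH --output=my_job_%j.out
#SBATCH --error=my_job_%j.err
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1

# Environment variables: hypothetical paths, replace them with whatever
# your application needs.
export PATH="$HOME/bin:$PATH"
export LD_LIBRARY_PATH="$HOME/lib:${LD_LIBRARY_PATH:-}"

# Code: a placeholder command standing in for your application.
# SLURM_JOB_ID is set by Slurm inside a running job.
echo "Job ${SLURM_JOB_ID:-unknown} running on $(uname -n)"
```

Slurm only reads #SBATCH lines that appear before the first executable command, so the directives are kept together at the top of the file.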
Sbatch directives
The #SBATCH directives configure the job submission. Each directive specifies a different aspect of the job, such as its name, output files, time limit, and resource requirements. The table below summarizes some common directives:
| Directive | Description |
|---|---|
| #SBATCH --job-name= | Sets the name of the job; useful for locating it in status reports later |
| #SBATCH --output= | Specifies the file for standard output; you can use a full path, or one relative to the submission directory |
| #SBATCH --error= | Specifies the file for standard error |
| #SBATCH --time= | Sets the time limit for the job; for example, --time=01:00:00 sets a limit of 1 hour. A job that is still running when the limit expires is stopped |
| #SBATCH --nodes= | Number of nodes the job requests; for example, --nodes=1 requests 1 node |
| #SBATCH --ntasks-per-node= | Number of tasks the job runs on each node; for example, --ntasks-per-node=1 requests 1 task per node |
| #SBATCH --gres=gpu: | Number of GPUs the job requests on each node; for example, --gres=gpu:1 requests 1 GPU |
Submit the job
Submit your job to the Slurm queue by passing the script to sbatch. For example, sbatch my_job_script.sh submits the my_job_script.sh script to the Slurm scheduler. Slurm then allocates resources to run your job based on the #SBATCH directives in the script. If resources matching these directives are available, the job may start immediately; otherwise, it waits in the queue until other jobs finish.
Example command
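A minimal sketch of the submission, assuming the script is saved as my_job_script.sh in the current directory. On success, sbatch prints a line of the form "Submitted batch job <ID>". The fallback line and the job_id variable are illustrative, so the snippet also runs on a machine without the Slurm client tools.

```shell
# Submit the script when sbatch is available; otherwise fall back to a
# sample line in sbatch's usual output format (an assumption for illustration).
if command -v sbatch >/dev/null 2>&1 && [ -f my_job_script.sh ]; then
    submit_output=$(sbatch my_job_script.sh)
else
    submit_output="Submitted batch job 371"
fi
echo "$submit_output"

# The job ID is the last word of that line; keep it for later commands.
job_id="${submit_output##* }"
echo "Job ID: $job_id"
```

Capturing the ID at submission time avoids having to look it up again when checking or cancelling the job.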
Slurm assigns each submitted job an ID. In this example, the job ID is 371.
Example output
Check the job status
To check the status of your job, use the squeue command. The squeue command shows the status of all jobs in the cluster. To see a specific job, pass the job ID returned by the submission command with the -j option. In this example, where the job ID is 371, the job is located using squeue -j 371.
Jobs can also be targeted by the person who submitted them, identified by username, by running squeue -u myname.
Running squeue without any arguments will show all jobs in the queue, including their status, time running, and the nodes they are running on.
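The three ways of calling squeue described above can be sketched as follows. The job ID 371 comes from this example; the user_name variable is an assumption that defaults to the current user and falls back to the placeholder myname.

```shell
job_id=371                     # job ID from this example; substitute your own
user_name="${USER:-myname}"    # placeholder username if USER is not set

# Query the queue only where the Slurm client tools are installed.
if command -v squeue >/dev/null 2>&1; then
    squeue -j "$job_id"        # a single job
    squeue -u "$user_name"     # every job submitted by one user
    squeue                     # all jobs in the queue
else
    echo "squeue not found; run this on a Slurm login node"
fi
```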
| Job state | Explanation |
|---|---|
| PD | Job is pending. The NODELIST(REASON) column explains why. |
| R | Job is running. The NODELIST(REASON) column shows which nodes the job is using. |
| CG/CD | Job is completing (CG) or has completed (CD). |
| F | Job has failed. |
| TO | Job was terminated because it reached its time limit. The limit is usually set in the job script or in the partition configuration. |
For more detailed metrics on your job, view your job in the Slurm / Job Metrics Grafana Dashboard.
Cancel your job
To cancel your job, use the scancel command with the job ID.
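As a short sketch, cancelling the job from this example (ID 371) might look like the following. The job_id variable is illustrative; substitute the ID that sbatch reported for your job.

```shell
job_id=371    # job ID from this example; substitute your own

# Cancel the job only where the Slurm client tools are installed.
if command -v scancel >/dev/null 2>&1; then
    scancel "$job_id"
else
    echo "scancel not found; run this on a Slurm login node"
fi
```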
View job output
Standard output and standard error are collected in the files specified by the --output and --error directives in the job script. After your job has completed, you can view the output in these files.
View cluster information
To view more details about your cluster, use the sinfo command.
Example command
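A hedged sketch of two common invocations: plain sinfo summarizes each partition, and the -t option restricts the listing to particular node states. The node_states variable is illustrative and lists the states described in the Node states table on this page.

```shell
node_states="idle,alloc,mix"    # states to filter on; adjust as needed

# Query the cluster only where the Slurm client tools are installed.
if command -v sinfo >/dev/null 2>&1; then
    sinfo                       # summary of every partition
    sinfo -t "$node_states"     # only nodes in the listed states
else
    echo "sinfo not found; run this on a Slurm login node"
fi
```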
Example output
Node states
| Status | Description |
|---|---|
| idle | Nodes are currently idle |
| alloc | Nodes are fully allocated to jobs |
| mix | Nodes are running jobs but still have some resources available |
Additional statuses
The job status is only visible via squeue while a job is pending or running, and for a few hours after it completes. To see the status of a completed or failed job, use the sacct command instead.
Example command
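A minimal sketch of an accounting query for the job in this example. The choice of the JobID, JobName, and State columns is an assumption made to match the fields discussed on this page; sacct can report many others.

```shell
job_id=371                       # job ID from this example; substitute your own
fields="JobID,JobName,State"     # columns to display

# Query accounting data only where the Slurm client tools are installed.
if command -v sacct >/dev/null 2>&1; then
    sacct -j "$job_id" --format="$fields"
else
    echo "sacct not found; run this on a Slurm login node"
fi
```

A trailing + in a column value, as in all_reduc+, indicates that sacct truncated the value to fit the column width.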
Example output
- The line with a JobID of 371 and a JobName of my_job displays the state of the entire job.
- The line with a JobID of 371.batch and a JobName of batch displays the state of the submitted script.
- The final line, with a JobID of 371.0 and a JobName of all_reduc+, displays the status of one of the tests that CoreWeave runs after each job to ensure that the cluster is performing correctly. It typically takes a few seconds to run.