SLURM user guide
💡 Please have a look at our tutorial for beginners: tutorials/analysis_slurm
SLURM is the queue manager used on the ABiMS HPC Cluster. You must use SLURM to submit jobs to the cluster.
All the commands presented in this guide are to be used from the host ssh.sb-roscoff.fr
Prerequisites#
- An account on the ABiMS platform -> https://my.sb-roscoff.fr
- An active project -> request one at https://my.sb-roscoff.fr or join one of your collaborators' projects
- ABiMS provides a `demo` SLURM account which allows a few hours of computation for testing purposes. However, you need to request a proper project (including a SLURM account and a project working space) to really use the cluster.
SLURM partitions and nodes#
The ABiMS HPC Cluster is organized into several SLURM partitions. Each partition gathers a set of compute nodes that have similar usage.
Partition | Time limit | Max resources / user | Purpose |
---|---|---|---|
fast | <= 24 hours | cpu=300, mem=1500GB | Default - Regular jobs |
long | <= 30 days | cpu=300, mem=1500GB | Long jobs |
bigmem | <= 60 days | mem=2500GB | On demand - For jobs requiring a lot of RAM |
clc | <= 30 days | cpu=300, mem=1500GB | On demand - Access to CLC Assembly Cell |
gpu | <= 10 days | cpu=300, mem=1500GB | On demand - Access to GPU cards |
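For example, to target a partition other than the default, pass it with `-p`/`--partition` (a minimal sketch; `my_script.sbatch` stands for your own batch script):
# Submit a batch job to the "long" partition instead of the default "fast"
sbatch -p long my_script.sbatch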
The default values#
Param | Default value |
---|---|
--mem | 2GB |
--cpus | 1 |
Submitting a job to the cluster#
There are two commands to submit a job to the cluster:
- `srun` to run jobs interactively
- `sbatch` to submit a batch job
Submit a job using srun#
To learn more about the `srun` command, see the official documentation.
Usage: standalone#
The job will start immediately after you execute the srun command. The outputs are returned to the terminal. You have to wait until the job has terminated before starting a new job. This works with ANY command.
srun hostname
💡 This example shows you that the job runs on a compute node and not on the login node.
Usage: interactive#
If an interaction is needed:
module load r
srun --mem 20GB --pty R
- `--pty`: keeps the interaction possible
- `--mem 20GB`: allocates 20GB of memory to your job instead of the default 2GB
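The same pattern works for other interactive programs; for instance, a sketch of opening an interactive shell on a compute node (the resource values are arbitrary):
# Open an interactive bash session with 4 CPUs and 8GB of RAM on a compute node
srun --cpus-per-task 4 --mem 8GB --pty bash
# Type "exit" to end the session and release the allocation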
Submit a job using sbatch#
To learn more about the `sbatch` command, see the official documentation.
Usage#
The job starts when resources are available. The command only returns the job id. The outputs are sent to file(s). This works ONLY with shell scripts. The batch script may be given to sbatch through a file name on the command line, or if no file name is specified, sbatch will read in a script from standard input.
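For instance, a minimal sketch of submitting a script through standard input with a here-document (the `--mem` value is arbitrary):
# Pipe a small batch script to sbatch through standard input
sbatch --mem 4GB <<'EOF'
#!/bin/bash
srun hostname
EOF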
Batch scripts rules#
The script can contain `srun` commands; each `srun` command is a job step. The script must start with a shebang (`#!`) followed by the path of the interpreter:
#!/bin/bash
or
#!/usr/bin/env python
The execution parameters can be set within the script itself, for example in `bowtie2.sbatch`:
#!/bin/bash
#
#SBATCH --mem 40GB
srun bowtie2 -x hg19 -1 sample_R1.fq.gz -2 sample_R2.fq.gz -S sample_hg19.sam
sbatch bowtie2.sbatch
The script can contain SLURM options right after the shebang and before the script commands, introduced by `#SBATCH`.
Note that the `#SBATCH` syntax does not contain any `!` (unlike the shebang).
Advice: we recommend setting as many parameters as you can in the script, to keep track of your execution parameters for future submissions.
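As an illustration, a sketch of a more fully parameterized batch script (the job name, resource values and bowtie2 invocation are only examples; the parameters are detailed in the next section):
#!/bin/bash
#
#SBATCH --job-name=bowtie2_hg19     # job name shown by squeue
#SBATCH --partition=fast            # partition to use
#SBATCH --cpus-per-task=8           # number of CPUs
#SBATCH --mem=40GB                  # memory per node
#SBATCH --time=12:00:00             # time limit of 12 hours
#SBATCH -o slurm.%N.%j.out          # STDOUT file
#SBATCH -e slurm.%N.%j.err          # STDERR file
srun bowtie2 --threads $SLURM_CPUS_PER_TASK -x hg19 -1 sample_R1.fq.gz -2 sample_R2.fq.gz -S sample_hg19.sam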
Execution parameters#
These parameters are common to the `srun` and `sbatch` commands.
Account#
Accounts are the entities to which job allowances are credited. So far there are no hard limits on the ABiMS HPC Cluster, but a SLURM account is mandatory and enables features such as concurrent job limits or reservations.
A new user has the `demo` account by default, but you need to request or join a proper project to use the cluster.
As soon as it's done:
- To change your default account, so that you won't need to specify the account for each job:
sacctmgr update user $USER set defaultaccount=<project-name>
- To submit a single job to a specific account:
sbatch -A <project-name> script.sbatch
or in your sbatch script:
#!/bin/bash
#
#SBATCH -A world_peace
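To check which SLURM accounts your user is associated with, you can query the accounting database (a sketch; the list of format fields can be adjusted):
# List the accounts (projects) associated with your user
sacctmgr show associations where user=$USER format=Account,User,Partition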
Parameters for log#
#!/bin/bash
#
#SBATCH -o slurm.%N.%j.out # STDOUT file with the Node name and the Job ID
#SBATCH -e slurm.%N.%j.err # STDERR file with the Node name and the Job ID
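Other filename patterns exist; for instance `%x` expands to the job name. A small sketch combining a job name with the log options (the job name is hypothetical):
#SBATCH --job-name=my_analysis      # job name
#SBATCH -o %x.%j.out                # STDOUT file named after the job name and the Job ID
#SBATCH -e %x.%j.err                # STDERR file named after the job name and the Job ID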
Parameters to control the job#
--partition=<partition_names>, -p
Request a specific partition for the resource allocation. Each partition (the equivalent of a queue in SGE) has its own limits: time, memory, nodes, etc.
--mem=<size[units]>
Specify the real memory required per node. The default unit is MB (default value: 2GB).
The job is killed if it exceeds this limit.
Note that you can use the variable $SLURM_MEM_PER_NODE in the command line to synchronize the software settings with the allocated resources.
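For instance, a sketch of passing the allocated memory to a Java tool inside a batch script (the jar name is hypothetical; $SLURM_MEM_PER_NODE is expressed in MB):
#!/bin/bash
#
#SBATCH --mem 8GB
# Give the JVM the same amount of memory as allocated by SLURM
srun java -Xmx${SLURM_MEM_PER_NODE}m -jar my_tool.jar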
--time=<time>, -t
Set a limit on the total run time of the job allocation.
Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
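For example, a sketch requesting two and a half days on the long partition (the values are arbitrary):
#SBATCH --partition=long
#SBATCH --time=2-12:00:00           # 2 days and 12 hours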
Parameters for multithreading#
--cpus-per-task=<ncpus>, --cpus, -c
Request a number of CPUs (default 1)
Note that you can use the variable $SLURM_CPUS_PER_TASK in the command line to avoid a mismatch between the allocated resources and what the job actually uses.
#!/bin/bash
#
#SBATCH --cpus-per-task=8
srun bowtie2 --threads $SLURM_CPUS_PER_TASK -x hg19 -1 sample_R1.fq.gz -2 sample_R2.fq.gz -S sample_hg19.sam
Job information on running job#
List a user's current jobs:
squeue -u <username>
List a user's running jobs:
squeue -u <username> -t RUNNING
List a user's pending jobs:
squeue -u <username> -t PENDING
View accounting information for all of a user's jobs for the current day:
sacct --format=JobID,JobName,User,Submit,ReqCPUS,ReqMem,Start,NodeList,State,CPUTime,MaxVMSize%15 -u <username>
View accounting information for all of a user's jobs for the last 2 days (it's worth an alias):
sacct -a -S $(date --date='2 days ago' +%Y-%m-%dT%H:%M) --format=JobID,JobName,User%15,Partition,ReqCPUS,ReqMem,State,Start,End,CPUTime,MaxVMSize -u <username>
List detailed job information:
scontrol show -dd jobid=<jobid>
Manage jobs#
To cancel/stop a job:
scancel <jobid>
To cancel all jobs for a user:
scancel -u <username>
To cancel all pending jobs for a user:
scancel -t PENDING -u <username>
Job information on ended job#
Jobs that do not respect their resource reservation can be killed automatically by the cluster. Conversely, reserving far more resources than you need is wasteful.
You can obtain post-mortem information for your jobs.
sacct#
SLURM incorporates a mechanism to track the resource consumption of each job.
The sacct command is used to query the SLURM database to track resource consumption:
To consult basic information about a job:
sacct -j $job_id
Display detailed information about a job:
sacct --format=JobID,JobName,User%15,Partition,ReqCPUS,ReqMem,MaxVMSize%15,State,Start,End,CPUTime,NodeList -j $job_id
💡 Advice: you can create an alias in your ~/.bashrc
alias sacctReq='sacct --format=JobID,JobName,User%15,Partition,ReqCPUS,ReqMem,MaxVMSize%15,State,Start,End,CPUTime,NodeList'
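The alias can then be used like the plain command, for example:
# Reload your shell configuration once, then query any job with the alias
source ~/.bashrc
sacctReq -j <jobid>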
Job efficiency#
Efficiency is important here because calculations consume a lot of energy and use shared resources.
Have a look at the dedicated documentation: SLURM job efficiency
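If the `seff` utility (a SLURM contrib tool) is installed on the cluster, it also gives a quick summary of the CPU and memory efficiency of a finished job:
# Summarize CPU and memory efficiency of a completed job
seff <jobid>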