
SLURM job efficiency#

Efficiency is important here because calculations consume a lot of energy and use shared resources.

The aim is to help users better estimate their needs, in particular RAM and time.

🌱 and 💸: less hardware to purchase and less power consumed, hence lower running costs.

Do not over-allocate resources#

For you#

With SLURM, it is in your own interest not to over-allocate resources when requesting a slot for your computing job. If you over-allocate, your job may stay pending for a long time while waiting for resources.

For example, if you request 50 GB of RAM for your computing job while it actually uses 5 GB, the job will wait for a slot with 50 GB available, which can mean waiting several days for computing time. And if you run several jobs in parallel, they may run one by one instead of all at the same time.
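To see why a job is still pending and when SLURM expects it to start, you can query the scheduler (the job ID below is only an illustration):

# Reason why the job is pending (e.g. Priority, Resources)
squeue -j 42532545 -o "%.10i %.9P %.8T %.20r"

# Estimated start time of your pending jobs
squeue -u $USER --start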

For them#

Other users will also suffer, waiting for computing resources that are reserved for no reason.

Prepare your jobs#

Run tests on small datasets before launching your whole analysis.
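For example (assuming FASTQ reads as input, purely as an illustration), a small subset is often enough to get a first estimate of memory and time:

# Keep only the first 1,000 reads (FASTQ stores each read on 4 lines)
head -n 4000 sample.fastq > subset.fastq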

Monitoring at runtime with htop#

A job can be monitored at runtime using htop, an interactive system monitor, process viewer and process manager.

It shows a frequently updated list of the processes running on a computer, normally ordered by the amount of CPU usage. Unlike top, htop provides a full list of processes running, instead of only the top resource-consuming processes. htop uses colour and gives visual information about processor, swap and memory status. htop can also display the processes as a tree [source].

htop must be executed on the compute node where the job is running.

# Retrieve the compute node with your job ID
squeue -j 42532545
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          42532545      long   iqtree  romainl  R   19:43:03      1 cpu-node-95

# Execute htop on the corresponding node
ssh -t cpu-node-95 htop
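On a busy node you may want to restrict the view to your own processes; htop accepts a user filter (node name taken from the example above):

# Show only your own processes on the node
ssh -t cpu-node-95 htop -u $USER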

Post-mortem analysis with reportseff#

You can learn from completed jobs using reportseff, in order to adjust future requests as closely as possible to actual requirements.

reportseff_bars is based on reportseff (report SLURM efficiency).

module load reportseff

reportseff 36467270
  JobID         State       Elapsed  TimeEff   CPUEff   MemEff
  36467270    COMPLETED    00:06:54   0.0%      4.3%     0.9%

You can get additional information on your job using the --format option, such as --format +reqcpus,AveCPU,CPUTime,reqmem,MaxRSS,user,Account,start,end,NNodes,NodeList,QOS,Partition.

For example:

reportseff 36467270 --format +reqcpus,AveCPU,CPUTime,reqmem,MaxRSS,user,Account,start,end,NNodes,NodeList,QOS,Partition
          JobID    State       Elapsed  TimeEff   CPUEff   MemEff   ReqCPUS    AveCPU    CPUTime    ReqMem   MaxRSS   User            Account                   Start                  End           NNodes    NodeList      QOS     Partition 
  36449176_2848  COMPLETED    00:00:31   0.0%     87.1%     2.9%       1      00:00:01   00:00:31   2000M    59292K   croux   grey_zone_in_green_world   2023-12-05T18:50:51   2023-12-05T18:51:22     1      cpu-node-59   normal     fast

reportseff uses sacct to retrieve information, so you can use sacct options (time ranges, output format, etc.).

Moreover, you can retrieve this data for multiple jobs, such as:

# jobs for the slurm output files in the working directory
reportseff

# all of your jobs since November 1st
reportseff  -u $USER --since 2024-11-01

# all your completed jobs since yesterday until 4 pm with additional details
reportseff -u $USER --since now-1days --until teatime --state "COMPLETED" --format +reqcpus,AveCPU,CPUTime,reqmem,MaxRSS,user,Account,start,end,NNodes,NodeList,QOS,Partition

Not very impressive, is it? How can we improve this?

Optimisation#

Memory#

Every job requires a certain amount of memory (RAM) to run. Hence, it is necessary to request the appropriate amount of memory.

Not enough: as SLURM strictly enforces the memory limit of your job, if insufficient memory is requested and allocated, your program may crash.

Too much: if too much memory is allocated, resources that could be used for other jobs are wasted. Usually, you can request somewhat more memory than you think you will need for a specific job, then track how much of it you actually use to fine-tune future requests for similar jobs.

Possible optimisations of the Memory#

  1. Plan your new job
    • Tip 1: If you plan a large-scale analysis, benchmark with a few inputs or a subset first.
    • Tip 2: If you will rerun an analysis with almost the same parameters or inputs, reuse what you learned from the previous runs.
  2. Check reportseff on similar completed jobs
  3. Decrease the reserved memory
#SBATCH --mem 1G

Note that the default memory request is 2 GB.
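Since reportseff relies on sacct, you can also query the peak memory (MaxRSS) of a completed job directly and use it, plus a safety margin, to size the next similar job (job ID taken from the example above):

# Peak memory (MaxRSS) actually used by the completed job
sacct -j 36467270 --format=JobID,ReqMem,MaxRSS,Elapsed,State

A common rule of thumb is to request the observed MaxRSS plus roughly 20-30% of margin.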

Time#

Between the short partition, which is limited to 1 day, and the long partition, which is limited to 30 days, there is a big margin.

Specify as low a time limit as will realistically allow your job to complete; this will enhance your job's opportunity to be "backfilled".

Backfill is a mechanism that allows lower priority jobs to begin earlier in order to fill idle slots, as long as they are completed before the next high priority job is expected to begin based on resource availability. In other words, if your job is small enough, it can be backfilled and scheduled alongside a larger, higher-priority job.

(Figure: backfill scheduling)

Possible optimisations of the Time#

  • Limit the duration of your jobs

Example for 2 days

#SBATCH --time=2-00:00

Accepted formats: "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
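For illustration, a few equivalent ways of writing the same 2-day limit (use only one --time line per script):

#SBATCH --time=2-00:00      # days-hours:minutes
#SBATCH --time=48:00:00     # hours:minutes:seconds
#SBATCH --time=2880         # minutes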

CPU#

CPU is not the easiest parameter to optimise (unless your tool is single-threaded).

Although parallel code execution can save significant time compared to execution on a single core, you may notice that the speed of your code does not increase in proportion to the number of computing resources used.

Indeed, the sequential (= non-parallelizable) portions of your code do not benefit from an increase in the number of cores. Thus, depending on your code, beyond a certain number of cores the speed-up reaches its maximum and it becomes useless to run the code on more resources. For more information on this subject, see Amdahl's law.

(Figure: Amdahl's law, https://en.wikipedia.org/wiki/Amdahl%27s_law)
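As a quick worked example of Amdahl's law: if a fraction p of the run time can be parallelized, the speed-up on n cores is bounded by

S(n) = 1 / ((1 - p) + p / n)

With p = 0.9 (90% of the code is parallelizable) and n = 16 cores, S(16) = 1 / (0.1 + 0.9/16) = 6.4; even with an unlimited number of cores, the speed-up can never exceed 1 / (1 - p) = 10.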

Possible optimisations of the CPU#

  • Refer to the software publication, documentation and benchmarks to see whether there is a recommended number of CPUs or a cap on their number (see the sketch below)
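A minimal sketch of matching the SLURM request to the tool's thread option (mytool and its --threads flag are placeholders; use your tool's actual option):

#SBATCH --cpus-per-task=8

# Give the tool exactly the number of cores reserved by SLURM
mytool --threads "${SLURM_CPUS_PER_TASK}" input.data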

GPU#

Check that your job is actually using the GPU; for example, you can run the nvidia-smi command while it is processing. Misused libraries or parameters can result in the GPU not being used at all.

For example, if your job runs on gpu-node-03:

ssh gpu-node-03 nvidia-smi

This way you can check whether your software (process) is using the whole GPU or only a part of it (MIG).
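As a sketch, a GPU is requested in the submission script with --gres (the gpu partition name is an assumption; check your cluster's partitions), and usage can be watched periodically while the job runs:

# In the submission script
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1

# From a login node: refresh nvidia-smi every 5 seconds on the job's node
ssh -t gpu-node-03 watch -n 5 nvidia-smi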


Inspirations: