SLURM job efficiency#

Efficiency is important here because calculations consume a lot of energy and use shared resources.

The aim is to help users match their requests to their actual needs, in particular for memory (RAM) and time.

🌱 and 💸: fewer equipment purchases and less power consumption, so lower running costs.

Do not over-allocate resources#

For you#

For your own sake, do not request more resources than your job needs. SLURM can only start a job once everything it requested is available, so an over-allocated job will stay pending for a long time.

For example, suppose you request 50 GB of RAM for a job that actually uses 5 GB. The job then waits until 50 GB is free, which could mean waiting several days before it starts. Likewise, if you submit multiple jobs to run in parallel, they may run one after another instead of all at the same time.

For them#

Other users suffer too: they wait for computing resources that are reserved but never used.

Post-mortem analysis#

You can learn from completed jobs: reportseff shows how much of the requested resources a job actually used, so you can adjust future requests as closely as possible to real requirements.

reportseff_bars is based on reportseff (report slurm efficiency)

module load reportseff

reportseff 36467270
  JobID         State       Elapsed  TimeEff   CPUEff   MemEff
  36467270    COMPLETED    00:06:54   0.0%      4.3%     0.9%

Not very bright! How can we improve this?
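As a rough sketch of how to read these percentages: each one is, to a first approximation, what the job used divided by what it requested. The helper name `efficiency_pct` is hypothetical, the memory figures reuse the 50 GB / 5 GB example above, the time figures are illustrative, and reportseff's exact formulas may differ in detail.

```shell
# Hypothetical helper: integer percentage of a requested resource that was used.
# Usage: efficiency_pct USED REQUESTED   (both values in the same unit)
efficiency_pct() {
    echo $(( 100 * $1 / $2 ))
}

# Memory: 5 GB used out of 50 GB requested
echo "MemEff  ~ $(efficiency_pct 5 50)%"        # prints: MemEff  ~ 10%

# Time: a 1-hour run (3600 s) under a 1-day limit (86400 s), rounded down
echo "TimeEff ~ $(efficiency_pct 3600 86400)%"  # prints: TimeEff ~ 4%
```

A low percentage means most of the reservation sat idle while the job ran.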

Optimisation#

Memory#

Every job requires a certain amount of memory (RAM) to run. Hence, it is necessary to request the appropriate amount of memory.

Not enough: SLURM strictly enforces the memory limit of your job, so if you request too little memory, your program may crash.

Too much: if you request too much memory, resources that other jobs could have used are wasted. A reasonable approach is to request somewhat more memory than you expect to need for a first run, then track how much was actually used to fine-tune future requests for similar jobs.

Possible optimisations of the Memory#

  1. Plan your new job
    • Tip 1: If you plan a large-scale analysis, benchmark first with a few inputs or a subset.
    • Tip 2: If you are rerunning an analysis with almost the same parameters or inputs, use the previous run as a reference.
  2. Check reportseff
  3. Decrease the reserved memory
#SBATCH --mem 1G

Note that the default memory is set to 2GB.
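One possible way to turn a measured peak into a new request is sketched below. The helper `suggest_mem_mb` and the 20% default safety margin are assumptions made for illustration, not site policy; the peak value would come from a completed run of a similar job.

```shell
# Hypothetical helper: turn a measured peak (in MB) into a --mem value
# with a safety margin on top.
# Usage: suggest_mem_mb PEAK_MB [MARGIN_PERCENT]   (margin defaults to 20, an arbitrary choice)
suggest_mem_mb() {
    peak_mb=$1
    margin_pct=${2:-20}
    # Integer arithmetic, rounded up to the next whole MB
    echo $(( (peak_mb * (100 + margin_pct) + 99) / 100 ))
}

# Example: a previous run peaked at 5 GB (5120 MB)
echo "#SBATCH --mem=$(suggest_mem_mb 5120)M"   # prints: #SBATCH --mem=6144M
```

Keeping some margin matters because a job that exceeds its memory limit may crash, so the goal is a close fit rather than an exact one.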

Time#

There is a wide gap between the short partition, which is limited to 1 day, and the long partition, which is limited to 30 days.

Specify the lowest time limit that will realistically allow your job to complete; this improves your job's chances of being "backfilled".

Backfill is a mechanism that allows lower priority jobs to begin earlier in order to fill idle slots, as long as they are completed before the next high priority job is expected to begin based on resource availability. In other words, if your job is small enough, it can be backfilled and scheduled alongside a larger, higher-priority job.

Possible optimisations of the Time#

  • Limit the duration of your jobs

Example for 2 days

#SBATCH --time=2-00:00

Accepted formats: "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
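The sketch below shows one way to derive a time limit from an observed runtime and write it in the "days-hours:minutes:seconds" format. The helper `slurm_time` is hypothetical, and doubling the observed runtime is an arbitrary margin chosen for the example.

```shell
# Hypothetical helper: format a number of seconds as days-hours:minutes:seconds.
slurm_time() {
    total=$1
    printf '%d-%02d:%02d:%02d\n' \
        $(( total / 86400 )) \
        $(( total % 86400 / 3600 )) \
        $(( total % 3600 / 60 )) \
        $(( total % 60 ))
}

slurm_time 172800   # 2 days, prints: 2-00:00:00

# A job that ran for 00:06:54 (414 s), with the limit set to twice that
echo "#SBATCH --time=$(slurm_time $(( 414 * 2 )))"   # prints: #SBATCH --time=0-00:13:48
```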

CPU#

CPU is not the easiest parameter to optimise (unless your tool is single-threaded).

Although parallel execution can save significant time compared to execution on a single core, you may notice that the speed of your code does not increase in proportion to the number of cores used.

Indeed, the sequential (= non-parallelizable) portions of your code do not benefit from additional cores. Thus, depending on your code, beyond a certain number of cores the speed-up reaches its ceiling, and running the code on more cores is useless. For more information on this subject, see Amdahl’s law.

Amdahl’s law: https://en.wikipedia.org/wiki/Amdahl%27s_law
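To make the ceiling concrete, here is a small sketch of Amdahl's law, where the speed-up on N cores is 1 / ((1 - P) + P / N) and P is the fraction of the runtime that can be parallelised. The helper `amdahl_speedup` is hypothetical and P = 0.9 is an illustrative value, not a measurement of any particular tool.

```shell
# Hypothetical helper: Amdahl's law speed-up for a parallel fraction P (0..1) on N cores.
# Usage: amdahl_speedup P N
amdahl_speedup() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 1 / ((1 - p) + p / n) }'
}

# Speed-up of a code that is 90% parallelisable, for growing core counts
for cores in 1 8 16 64; do
    echo "$cores cores: x$(amdahl_speedup 0.9 "$cores")"
done
# prints:
# 1 cores: x1.00
# 8 cores: x4.71
# 16 cores: x6.40
# 64 cores: x8.77
```

With P = 0.9 the speed-up can never exceed 10, so going from 16 to 64 cores quadruples the reservation for a gain of less than 40%.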

Possible optimisations of the CPU#

  • Refer to the software's publication, documentation and benchmarks to see whether they recommend a number of CPUs or cap the number they can use
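Once a CPU count is chosen, the tool also has to be told to use it, otherwise the extra cores stay idle. The job script below is a minimal sketch: the resource values are illustrative, and `my_tool` with its `--threads` option is a hypothetical placeholder for your own program.

```shell
#!/bin/bash
# Illustrative values only; adjust them to your own measurements.
#SBATCH --cpus-per-task=4
#SBATCH --mem=6G
#SBATCH --time=0-02:00

# SLURM exports the allocated CPU count when --cpus-per-task is set;
# fall back to 1 when the script runs outside SLURM.
THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "Running with ${THREADS} thread(s)"

# Hypothetical tool invocation, left commented out:
# my_tool --threads "${THREADS}" input.dat
```

Reading the thread count from the environment keeps the request and the tool's setting in sync when the `#SBATCH` line changes.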

Inspirations: