Skip to content

Troubleshooting

💡 The following Troubleshooting can be completed by consulting the IFB Community Forum


[SLURM] Invalid account or account/partition combination specified

Complete message:

srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

Explanation 1

Your current default SLURM account should be the demo one. You may have seen a red notice at login? You can check that using:

$ sacctmgr list user $USER
      User   Def Acct     Admin
---------- ---------- ---------
   cnorris       demo      None
Solution

If you don't already have a project, you have to request one from the platform: https://my.sb-roscoff.fr/manager2/project

Otherwise, you already have a project/account, you can either:

  • Specify at each job your SLURM account:
srun -A my_account command
#!/bin/bash
#SBATCH -A my_account
command
  • Change your default account
sacctmgr update user $USER set defaultaccount=my_account

⚠️ status_bar is updated hourly. So it may still display demo as your default account by don't worry, it should have work.


[RStudio] Timeout or do not start

Try to clean session files and cache:

# Remove (rm) or move (mv) RStudio files
# mv ~/.rstudio ~/.rstudio.backup-2022-27-02
rm -rf ~/.rstudio
rm -rf ~/.local/share/rstudio
rm .RData

Retry.

If it doesn't work, try to remove your configuration (settings will be lost)

rm -rf ~/.config/rstudio

Retry.

If it doesn't work, contact the support (IFB Community Forum)


[JupyterHUB] Timeout or do not start

Kill your job/session using the web interface (Menu "File" --> "Hub Control Panel" --> "Stop server") or in command line:

# Remove running jupyter job
scancel -u $USER -n jupyter

Clean session files, cache:

# Remove (rm) or move (mv) JupyterHUB directories
# mv ~/.jupyter ~/.jupyter.backup-2022-27-02
rm -rf ~/.jupyter 
rm -rf ~/.local/share/jupyter

[GPU] How to know the availability of GPU nodes

We can use sinfo command with "Generic resources (gres)" information.

For example:

sinfo -N -O nodelist,partition:15,Gres:30,GresUsed:50 -p gpu
      NODELIST            PARTITION      GRES                          GRES_USED                                         
      gpu-node-01         gpu            gpu:1g.5gb:14                 gpu:1g.5gb:0(IDX:N/A)                             
      gpu-node-02         gpu            gpu:3g.20gb:2,gpu:7g.40gb:1   gpu:3g.20gb:1(IDX:0),gpu:7g.40gb:0(IDX:N/A)       
      gpu-node-03         gpu            gpu:7g.40gb:2                 gpu:7g.40gb:2(IDX:0-1)    

In other words: * gpu-node-01: 14 profiles 1g.5gb, 0 used * gpu-node-02: 2 profiles 3g.20gb, 1 used * gpu-node-02: 1 profile 7g.40gb, 0 used * gpu-node-03: 1 profile 7g.40gb, 2 used

So we can see which GPU/profiles are immediately available.

More information about this "profile" ("Multi-Instance GPU"): * https://ifb-elixirfr.gitlab.io/cluster/doc/slurm/slurm_at/#gpu-nodes * https://docs.nvidia.com/datacenter/tesla/mig-user-guide/


[SLURM] How to use resources wisely

Be vigilant about the proper use of resources.

Do tests on small datasets before launching your whole analysis.

And check the resources usage:

CPU / Memory

You can use: * htop: on the node, during the job * seff: once your job is finished. * sacct, ...

For example with the seff command, you can check the CPU and memory usage (once your job is finished):

# for the jobid `2435594`
seff 2435594
      Job ID: 2435594
      Cluster: core
      User/Group: myuser/mygroup
      State: COMPLETED (exit code 0)
      Nodes: 1
      Cores per node: 50
      CPU Utilized: 182-04:57:51
      CPU Efficiency: 52.31% of 348-07:04:10 core-walltime
      Job Wall-clock time: 6-23:10:53
      Memory Utilized: 45.86 GB
      Memory Efficiency: 18.34% of 250.00 GB

Here we have requested 50 CPU and 250GB of memory, during several days:

Only 52.31% of CPU is being used (100% of 50 CPU on 52.31% of total time, 52.31% of 50 CPU on 100% of total time, or a mix). It's not really efficient. It could be explained sometimes by I/O operations like read, write or get data over Internet (so CPU are just waiting for data), but it deserves further investigations.

Memory used is only 45.86 GB of 250.00 GB allocated (18.34%). So, next time, ask for less (something like 60 GB should be sufficient).

GPU

Check your job is currently using the GPU, for example, you can use nvidia-smi command during processing. We can misused some libraries, parameters and finally not used the GPU.

For example, if your job runs on gpu-node-03:

ssh gpu-node-03 nvidia-smi

So we can check your software (process) are currently using all of GPU or part of GPU (MIG).


[SLURM][RStudio] /tmp No space left on device / Error: Fatal error: cannot create 'R_TempDir'

Explanation 1

The server on which the job ran must have a full on its /tmp/. Indeed, by default, R by default, is writing temporary files in the /tmp/ directory of the server.

The local directory /tmp/ is limited and shared. It's not a good practice to let a software writing on local disk.

Solution

The solution is to change the default temporary directory and expect that the tool is well developed (and the /tmp not hard-coded).

Please add the following lines at the beginning of your sbatch script.

#!/bin/bash
# SBATCH -p fast

TMPDIR="/shared/projects/my_intereting_project/tmp/"
TMP="${TMPDIR}"
TEMP="${TMPDIR}"
mkdir -p "${TMPDIR}"
export TMPDIR TMP TEMP

module load r/4.1.1
Rscript my_script.R

[SLURM] Illegal instruction (core dumped) on some nodes

Explanation

We have different generations of computing nodes with different brands and generations of CPU.

It can result that some of our CPU have the needed instruction set architecture for your software and some not (too old?).

Solution

It's possible to select a set of nodes with some --constraint based on the nodes Features:

  • Vendor: intel or amd
  • CPU familly: broadwell, haswell, epyc_rome ...
  • CPU instruction set: avx2

To get the features available on the node:

$ sinfo -e --format "%.26N %.4c %.7z %.7m %.20G %f"
                  NODELIST CPUS   S:C:T  MEMORY                 GRES AVAIL_FEATURES
               n[56-57,59]   48  2:24:1  257000               (null) intel,haswell,avx2
  n[70,73,77,83,85,95,113]   96  2:48:1  257000               (null) amd,epyc_rome,avx2
               n[75,96-98]  256 2:128:1  257000               (null) amd,epyc_rome,avx2
               n[76,78-79]   48  4:12:1  257000               (null) amd,opteron_k10
                n[115-118]   48  2:24:1  257000               (null) intel,broadwell,avx2
                       n99   80  4:20:1 1032000               (null) intel,westmere
                      n100  128  4:32:1 2064000               (null) intel,broadwell,avx2
               gpu-node-01   40  2:20:1  128000     gpu:k80:2(S:0-1) intel,broadwell,avx2

Thus, you can for example target nodes that allow avx2 and from amd

#SBATCH --constraint=avx2&amd

💡 If you think that some features are needed, don't hesitate to contact us.