
Troubleshooting

💡 The following troubleshooting tips can be complemented by consulting the IFB Community Forum


[SLURM] Invalid account or account/partition combination specified#

Complete message:

srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

Explanation 1#

Your current default SLURM account is probably the demo one (you may have seen a red notice at login). You can check it using:

$ sacctmgr list user $USER
      User   Def Acct     Admin
---------- ---------- ---------
   cnorris       demo      None
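
You can also list the SLURM accounts (projects) your user is associated with; a quick check, the exact columns may differ on the cluster:

# List the accounts associated with your user
sacctmgr show associations user=$USER format=Account,Partition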
Solution#

If you don't already have a project, you have to request one from the platform: https://my.sb-roscoff.fr/manager2/project

Otherwise, if you already have a project/account, you can either:

  • Specify your SLURM account for each job (on the command line or in your sbatch script):
srun -A my_account command
#!/bin/bash
#SBATCH -A my_account
command
  • Change your default account:
sacctmgr update user $USER set defaultaccount=my_account
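
You can then confirm that the change has been taken into account by listing your user again:

sacctmgr list user $USER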

⚠️ The status_bar is updated hourly, so it may still display demo as your default account. Don't worry, the change should have worked.


[RStudio] Timeout or does not start#

Try to clean session files and cache:

# Remove (rm) or move (mv) RStudio files
# mv ~/.rstudio ~/.rstudio.backup-2022-02-27
rm -rf ~/.rstudio
rm -rf ~/.local/share/rstudio
rm -f ~/.RData

Retry.

If it doesn't work, try removing your configuration (settings will be lost):

rm -rf ~/.config/rstudio
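
If you prefer to keep a copy of your settings rather than deleting them, you can move the directory instead (same idea as the mv examples above; the backup name is only a suggestion):

mv ~/.config/rstudio ~/.config/rstudio.backup-2022-02-27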

Retry.

If it still doesn't work, contact support (IFB Community Forum).


[JupyterHUB] Timeout or does not start#

Kill your job/session using the web interface (menu "File" --> "Hub Control Panel" --> "Stop server") or on the command line:

# Remove running jupyter job
scancel -u $USER -n jupyter

Clean session files and cache:

# Remove (rm) or move (mv) JupyterHUB directories
# mv ~/.jupyter ~/.jupyter.backup-2022-02-27
rm -rf ~/.jupyter
rm -rf ~/.local/share/jupyter
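
Before relaunching a session, you can check that no jupyter job is still running; a quick check:

# Should return no job once the server is stopped
squeue -u $USER -n jupyter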

[GPU] How to check the availability of GPU nodes#

We can use the sinfo command with the "Generic resources (gres)" information.

For example:

sinfo -N -O nodelist,partition:15,Gres:30,GresUsed:50 -p gpu
      NODELIST            PARTITION      GRES                          GRES_USED                                         
      gpu-node-01         gpu            gpu:1g.5gb:14                 gpu:1g.5gb:0(IDX:N/A)                             
      gpu-node-02         gpu            gpu:3g.20gb:2,gpu:7g.40gb:1   gpu:3g.20gb:1(IDX:0),gpu:7g.40gb:0(IDX:N/A)       
      gpu-node-03         gpu            gpu:7g.40gb:2                 gpu:7g.40gb:2(IDX:0-1)    

In other words:

  • gpu-node-01: 14 profiles 1g.5gb, 0 used
  • gpu-node-02: 2 profiles 3g.20gb, 1 used
  • gpu-node-02: 1 profile 7g.40gb, 0 used
  • gpu-node-03: 2 profiles 7g.40gb, 2 used

So we can see which GPU/profiles are immediately available.
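
Once you know which profile is free, you can request it explicitly; a sketch based on the example output above, adjust the profile name and count to your case:

# Request one free 1g.5gb MIG profile on the gpu partition
srun -p gpu --gres=gpu:1g.5gb:1 nvidia-smi -L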

More information about these "profiles" can be found in the NVIDIA "Multi-Instance GPU" (MIG) documentation.


[SLURM] How to use resources wisely#

Be vigilant about the proper use of resources.

Do tests on small datasets before launching your whole analysis.

And check your resource usage:

CPU / Memory#

You can use:

  • htop: on the node, during the job
  • seff: once your job is finished.
  • sacct, ...

For example with the seff command, you can check the CPU and memory usage (once your job is finished):

# for the jobid `2435594`
seff 2435594
      Job ID: 2435594
      Cluster: core
      User/Group: myuser/mygroup
      State: COMPLETED (exit code 0)
      Nodes: 1
      Cores per node: 50
      CPU Utilized: 182-04:57:51
      CPU Efficiency: 52.31% of 348-07:04:10 core-walltime
      Job Wall-clock time: 6-23:10:53
      Memory Utilized: 45.86 GB
      Memory Efficiency: 18.34% of 250.00 GB

Here we requested 50 CPUs and 250 GB of memory for several days:

Only 52.31% of the allocated CPU time was actually used (for example 100% of the 50 CPUs during 52.31% of the time, 52.31% of the 50 CPUs during 100% of the time, or a mix). That is not very efficient. It can sometimes be explained by I/O operations such as reading, writing or downloading data over the Internet (the CPUs are just waiting for data), but it deserves further investigation.

Only 45.86 GB of the 250.00 GB of memory allocated were used (18.34%). So, next time, ask for less (something like 60 GB should be sufficient).
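
sacct gives similar information, including while the job is still running (some fields are only filled in once the job steps finish); a sketch, adjust the format fields to your needs:

sacct -j 2435594 --format=JobID,JobName,Elapsed,AllocCPUS,TotalCPU,MaxRSS,State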

GPU#

Check that your job is actually using the GPU: for example, you can run the nvidia-smi command while the job is processing. A misused library or a wrong parameter can result in the GPU not being used at all.

For example, if your job runs on gpu-node-03:

ssh gpu-node-03 nvidia-smi

This way you can check whether your software (process) is using the whole GPU or only a part of it (MIG).
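
To follow the usage over time, you can loop the nvidia-smi query; a sketch, still assuming the job runs on gpu-node-03 (with MIG, some utilization fields may not be reported):

# Report GPU utilization and memory usage every 5 seconds (Ctrl-C to stop)
ssh gpu-node-03 nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 5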


[SLURM][RStudio] /tmp No space left on device / Error: Fatal error: cannot create 'R_TempDir'#

Explanation 1#

The server on which the job ran probably has a full /tmp/. Indeed, by default, R writes its temporary files in the /tmp/ directory of the server.

The local /tmp/ directory is limited in size and shared between users. It is not good practice to let software write to the local disk.

Solution#

The solution is to change the default temporary directory, provided the tool is well developed (i.e. /tmp is not hard-coded).

Please add the following lines at the beginning of your sbatch script.

#!/bin/bash
#SBATCH -p fast

TMPDIR="/shared/projects/my_interesting_project/tmp/"
TMP="${TMPDIR}"
TEMP="${TMPDIR}"
mkdir -p "${TMPDIR}"
export TMPDIR TMP TEMP

module load r/4.1.1
Rscript my_script.R
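
You can quickly check that R picks up the new temporary directory (tempdir() honours TMPDIR), for example by adding this line after the export in the script:

# Should print a path under your project space, not /tmp
Rscript -e 'cat(tempdir(), "\n")'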

[SLURM] Illegal instruction (core dumped) on some nodes#

Explanation#

We have different generations of computing nodes with different brands and generations of CPU.

As a result, some of our CPUs have the instruction set architecture needed by your software and some do not (too old).

Solution#

It is possible to select a set of nodes with the --constraint option, based on the node Features:

  • Vendor: intel or amd
  • CPU family: broadwell, haswell, epyc_rome ...
  • CPU instruction set: avx2

To get the features available on the nodes:

$ sinfo -e --format "%.26N %.4c %.7z %.7m %.20G %f"
                  NODELIST CPUS   S:C:T  MEMORY                 GRES AVAIL_FEATURES
               n[56-57,59]   48  2:24:1  257000               (null) intel,haswell,avx2
  n[70,73,77,83,85,95,113]   96  2:48:1  257000               (null) amd,epyc_rome,avx2
               n[75,96-98]  256 2:128:1  257000               (null) amd,epyc_rome,avx2
               n[76,78-79]   48  4:12:1  257000               (null) amd,opteron_k10
                n[115-118]   48  2:24:1  257000               (null) intel,broadwell,avx2
                       n99   80  4:20:1 1032000               (null) intel,westmere
                      n100  128  4:32:1 2064000               (null) intel,broadwell,avx2
               gpu-node-01   40  2:20:1  128000     gpu:k80:2(S:0-1) intel,broadwell,avx2

Thus, you can for example target amd nodes that support avx2:

#SBATCH --constraint=avx2&amd
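
On the srun command line, quote the expression so that the shell does not interpret the &; the program name below is only a placeholder:

srun -p fast --constraint="avx2&amd" ./my_program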

💡 If you think that some features are needed, don't hesitate to contact us.


[SLURM] newgrp and other "Permission non accordée" (Permission denied) errors#

The NFS protocol we use to mount project spaces has a limitation on the number of LDAP groups it takes into account: only the first 16 groups the user belongs to are considered.

newgrp allows you to work around this limitation, by changing the user's primary group, when the user belongs to more than 16 groups.
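
You can check how many groups you currently belong to; a quick check:

# NFS only honours the first 16 groups
id -Gn | wc -w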

We all agree that newgrp is a pain!

Access to your project folders with newgrp#

[cnorris@slurm1 ~]$ cd /shared/projects/facts/
-bash: cd: /shared/projects/facts/: Permission non accordée

[cnorris@slurm1 ~]$ newgrp facts

[cnorris@slurm1 ~]$ cd /shared/projects/facts/

Run commands with sg#

sg allows you to launch a command with a specific group's rights (without having to switch groups with newgrp):

[cnorris@slurm1 ~]$ ls /shared/projects/facts/
ls: impossible d'ouvrir le répertoire '/shared/projects/facts/': Permission non accordée

[cnorris@slurm1 ~]$ sg facts -c "ls /shared/projects/facts/"
impossible_script.sh       

TIPS: Recover colours and history after newgrp#

To recover colours and history, we suggest you run these 2 lines, which will add "what might work" to your .bashrc:

echo "alias newgrp='export NEWGRP=1; newgrp'" >> ~/.bashrc
echo 'if [[ ! -z "$NEWGRP" && -z "$SLURM_JOB_ID" ]]; then unset NEWGRP; bash -l; fi' >> ~/.bashrc

Explanation:

alias newgrp='export NEWGRP=1; newgrp'

Before our newgrp runs, we create a NEWGRP variable which will be accessible in the newgrp subshell.

if [[ ! -z "$NEWGRP" && -z "$SLURM_JOB_ID" ]]

A quick test, to see if the NEWGRP variable exists and to check that the SLURM_JOB_ID variable does not.

For the latter, we prefer to be cautious. There is a distinction between interactive and non-interactive shells: newgrp launches a non-interactive shell, which means it does not load profile.d or ~/.bash_history.

On the other hand, we would not like all shells to be considered as interactive, particularly in the case of a SLURM job.

bash -l

This launches a bash shell which will be interactive.

On the negative side, it creates a shell (bash -l) in a subshell (newgrp) in a shell (ssh slurm1). So you need to type exit 3 times to get back out.

unset NEWGRP

This is to avoid looping (I've tested ^^').

So basically, we unset the NEWGRP variable so as not to restart bash -l indefinitely in every new shell :)


[SLURM] sqlite3 Error: disk I/O error#

$ sqlite3  FooBar_AnvioContigs.db ".tables"
Error: disk I/O error

Explanation#

There can be problems when using sqlite3 through the NFS mounts (/shared/projects/...) because of write latency. It sometimes depends on the mount point.

(Note that this also happens if a program you use relies on an sqlite database.)

Solution#

One solution is to activate the WAL ("Write-Ahead Log") option on the database.

BUT to do this, you need to be in a space where the problem does not occur. So, perhaps temporarily use the /tmp of a node, if there is enough space. Otherwise, contact support for help.

# Copy the database to a local disk where the problem does not occur
cp my_database.db /tmp

# Enable Write-Ahead Log on the copy
module load sqlite/3.30.1
sqlite-utils enable-wal /tmp/my_database.db

# Copy the database back to the project space
cp /tmp/my_database.db .
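
You can then verify that WAL is active on the database (with the sqlite module still loaded):

# Should print "wal"
sqlite3 my_database.db "PRAGMA journal_mode;"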