Decomposing the Error Message
Your error message includes the following hint:
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1024 current, 2067021 max
The RLIMIT_NPROC limit controls the total number of processes a user can have; on Linux, each thread counts toward it as well. More specifically, it is a per-process setting: when a process calls fork(), clone(), vfork(), &c, the RLIMIT_NPROC value for that process is compared to the total process count for the user that owns it. If the new process or thread would push the count past that value, the call fails with EAGAIN ("Resource temporarily unavailable"), as you've experienced.
The error message indicates that OpenBLAS was unable to create additional threads because your user had already used up everything RLIMIT_NPROC allowed it.
Since you're running on a cluster, it's unlikely that your user is running many threads (unlike, say, if you were on your personal machine and browsing the web, playing music, &c), so it's reasonable to conclude that OpenBLAS is trying to start multiple threads.
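If you want to confirm the limit from inside Python, a minimal sketch using the standard-library resource module (Unix only) looks like this. The helper names nproc_limits and describe are made up for illustration; the printed line just imitates the format of the OpenBLAS hint.

```python
import resource

def nproc_limits():
    """Return the (soft, hard) RLIMIT_NPROC pair for the current process."""
    return resource.getrlimit(resource.RLIMIT_NPROC)

def describe(value):
    """Render a limit value, treating RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)

soft, hard = nproc_limits()
# The soft limit corresponds to "current" and the hard limit to "max"
# in the OpenBLAS message.
print(f"RLIMIT_NPROC {describe(soft)} current, {describe(hard)} max")
```

On the machine from your error message this would presumably report 1024 current and 2067021 max.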
How OpenBLAS Uses Threads
OpenBLAS can use multiple threads to accelerate linear algebra. You may want many threads for solving a single, larger problem quickly. You may want fewer threads for solving many smaller problems simultaneously.
OpenBLAS has several ways to limit the number of threads it uses. These are controlled via:
export OPENBLAS_NUM_THREADS=4
export GOTO_NUM_THREADS=4
export OMP_NUM_THREADS=4
The priorities are OPENBLAS_NUM_THREADS > GOTO_NUM_THREADS > OMP_NUM_THREADS. (I think this means that OPENBLAS_NUM_THREADS overrides OMP_NUM_THREADS; however, OpenBLAS ignores OPENBLAS_NUM_THREADS and GOTO_NUM_THREADS when compiled with USE_OPENMP=1.)
If none of the foregoing variables is set, OpenBLAS will run using a number of threads equal to the number of cores on your machine (32 in your case).
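To make the lookup order concrete, here is a rough Python model of the priority rule described above. It is only a sketch of the documented behaviour, not OpenBLAS's actual code: the function name resolve_openblas_threads is hypothetical, and it deliberately ignores the USE_OPENMP=1 case.

```python
def resolve_openblas_threads(env, n_cores):
    """Model of the documented priority order (not OpenBLAS's real code).

    env is a dict of environment variables, n_cores is the fallback used
    when none of the variables is set.
    """
    for name in ("OPENBLAS_NUM_THREADS", "GOTO_NUM_THREADS", "OMP_NUM_THREADS"):
        value = env.get(name)
        if value:  # first variable that is set and non-empty wins
            return int(value)
    return n_cores

# OPENBLAS_NUM_THREADS takes priority over OMP_NUM_THREADS.
print(resolve_openblas_threads(
    {"OMP_NUM_THREADS": "8", "OPENBLAS_NUM_THREADS": "2"}, 32))
# Nothing set: fall back to the core count.
print(resolve_openblas_threads({}, 32))
```

You could pass dict(os.environ) as env to see what a given shell session would resolve to under this model.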
Your Situation
Your cluster has 32-core CPUs. You're trying to run 36 instances of Python. Each instance requires 1 thread for Python + 32 threads for OpenBLAS. You'll also need 1 thread for your SSH connection and 1 thread for your shell. That means that you need 36*(32+1)+2=1190 threads, which is more than the 1024 your current RLIMIT_NPROC allows.
The nuclear option for fixing the problem is to use:
export OPENBLAS_NUM_THREADS=1
which should bring you down to 36*(1+1)+2=74 threads.
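The arithmetic is easy to redo for other settings. The sketch below uses the same accounting as above (1 Python thread plus the OpenBLAS threads per instance, plus 2 for SSH and the shell); total_threads is just an illustrative helper name.

```python
def total_threads(n_python, blas_threads, overhead=2):
    """Threads needed: each Python instance uses 1 thread plus its
    OpenBLAS threads; overhead covers the SSH connection and the shell."""
    return n_python * (blas_threads + 1) + overhead

print(total_threads(36, 32))  # 1190, over the 1024 limit
print(total_threads(36, 1))   # 74, comfortably under it
```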
Since you have spare capacity, you could adjust OPENBLAS_NUM_THREADS to a higher value, but then the OpenBLAS instances owned by your separate Python processes will interfere with each other. So there's a trade-off between how fast you get one solution and how fast you get many solutions. Ideally, you can resolve this trade-off by running fewer Pythons per node and using more nodes.
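For a sense of how much headroom the limit leaves, this sketch inverts the thread count used earlier to find the largest OPENBLAS_NUM_THREADS that still fits. The helper name max_blas_threads is made up, and the result is only a ceiling imposed by RLIMIT_NPROC, assuming nothing else of yours is running on the node; it says nothing about which value is actually fastest.

```python
def max_blas_threads(limit, n_python, overhead=2):
    """Largest per-instance OpenBLAS thread count such that
    n_python * (blas_threads + 1) + overhead <= limit."""
    per_process = (limit - overhead) // n_python  # threads available per instance
    return max(per_process - 1, 0)                # minus the Python thread itself

print(max_blas_threads(1024, 36))  # 27, since 36*(27+1)+2 = 1010 <= 1024
```

With 36 instances on 32 cores, anything above 1 already oversubscribes the CPUs, which is where the interference comes from.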