slurm_parallel_job_array_submit.sh [-t HH:MM:SS] [-p partition] [-C avx2|sse] [-N #nodes] commands-file
The "slurm_parallel_job_array_submit.sh" tool allows the rapid submission and execution of a large number of short running jobs in an efficient and simple way.
Several types of problems can be broken down into small, short-running jobs that can run independently: for instance, processing a large number of input files, or performing a parametric analysis, e.g. varying some input parameters on a single problem to find the best fit. While you can use a regular SLURM job array to process such problems, this is by no means the most efficient method, as you need to wait for each job to be scheduled independently.
The "slurm_parallel_job_array_submit.sh" tool allows for the rapid submission and execution of a large number of short running jobs in an efficient and simple way. Simply prepare a list of all jobs you wish to execute in a single file which you pass as argument to "slurm_parallel_job_array_submit.sh".
Note that the tool also supports OpenMP parallel applications (see below).
For example, to run a series of R scripts on the same input while varying the parameters:
Jobs File (commands.txt)
============================
`Rscript MyCode.R 0.05 input1
Rscript MyCode.R 0.10 input1
Rscript MyCode.R 0.15 input1
Rscript MyCode.R 0.20 input1
...
Rscript MyCode.R 0.80 input1
Rscript MyCode.R 0.85 input1
Rscript MyCode.R 0.90 input1
Rscript MyCode.R 0.95 input1`
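For a sweep like this you need not type every line by hand; a short shell loop can generate the commands file (a sketch using GNU seq; adjust the range and command to your own case):
`> seq 0.05 0.05 0.95 | while read p; do echo "Rscript MyCode.R $p input1"; done > commands.txt`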
To execute
==========
`> slurm_parallel_job_array_submit.sh commands.txt`
This will automatically submit the jobs as a job array where each job array instance groups many jobs and runs them in parallel on a compute node. Moreover, the output of each job will be concatenated in order into a single output file per node:
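Once submitted, the job array appears in the queue like any other SLURM job, so you can track its progress with the standard SLURM commands (the job ID 437798 below stands in for whatever ID your submission reports):
`> squeue -u $USER
> sacct -j 437798`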
Output File
===========
`> cat jobs-437798_1.out
============= job 1 ============
(output job 1)
============= job 2 ============
(output job 2)
============= job 3 ============
(output job 3)
============= job 4 ============
(output job 4)
...`
`> cat jobs-437798_1.err
============= job 1 ============
(error output job 1)
============= job 2 ============
(error output job 2)
============= job 3 ============
(error output job 3)
============= job 4 ============
(error output job 4)
...`
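Because the per-node output files use the "============= job N ============" separators shown above, you can extract a single job's output with a short awk filter (a sketch; substitute the job number and file name you need):
`> awk '/^============= job 3 /{f=1; next} /^=============/{f=0} f' jobs-437798_1.out`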
When jobs run, they inherit the environment of your login shell, so all modules that you load and environment variables that you set prior to calling "slurm_parallel_job_array_submit.sh" are passed on to your jobs. Note that with OMP_NUM_THREADS set to 2 as below, each job in the commands.txt file will be allocated 2 cores to run in OpenMP mode.
`> module purge
> module load R/3.4.1
> export SOMEVARIABLENAME="1 2 3 4"
> export OMP_NUM_THREADS=2
> slurm_parallel_job_array_submit.sh commands.txt`
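If you want to confirm that your environment actually reaches the jobs, one quick (entirely optional) check is to append a diagnostic line to the commands file and inspect the corresponding output file afterwards:
`> echo 'env | grep SOMEVARIABLENAME' >> commands.txt`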
This tool allows you to set a few parameters:
1) Set the maximum run time: -t HH:MM:SS
example: allow the job array instances to run for up to 24 hours
`> slurm_parallel_job_array_submit.sh -t 24:00:00 commands.txt`
2) Select the target partition: -p partition
example: use the serial partition (to list the partitions available on your cluster, see the sinfo note after this list)
`> slurm_parallel_job_array_submit.sh -p serial commands.txt`
3) Select the type of node: -C type
example: use "avx2" nodes (other choice is "sse")
`> slurm_parallel_job_array_submit.sh -C avx2 commands.txt`
4) Set the maximum number of nodes to use: -N nodes
example: allow up to 10 nodes, i.e. up to 10 job array instances can run concurrently
`> slurm_parallel_job_array_submit.sh -N 10 commands.txt`
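Regarding option 2 above: if you are unsure which partitions your cluster offers, the standard SLURM sinfo command lists them (partition names vary from site to site):
`> sinfo -s`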
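The options can also be combined. For example, to allow a 24-hour run time on the serial partition, using avx2 nodes and at most 10 concurrent job array instances:
`> slurm_parallel_job_array_submit.sh -t 24:00:00 -p serial -C avx2 -N 10 commands.txt`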
NYUAD HPC Apps Team:
--------------------
- Benoit Marchand
- Guowei He
- Jorge Naranjo
Please refer to the online documentation for further details.