Slurm Quick Start Tutorial
2016-04-08 20:40
Resource sharing on a supercomputer dedicated to technical and/or scientific computing is often organized by a piece of software called a resource manager or job scheduler. Users submit jobs, which are scheduled and allocated resources (CPU time, memory, etc.) by the resource manager.
Slurm is a resource manager and job scheduler designed to do just that, and much more. It was originally created by people at the Livermore Computing Center, and has grown into full-fledged open-source software backed by a large community, commercially supported by the original developers, and installed in many of the Top500 supercomputers.
Gathering information
Slurm offers many commands you can use to interact with the system. For instance, the sinfo command gives an overview of the resources offered by the cluster, while the squeue command shows to which jobs those resources are currently allocated.
By default, sinfo lists the partitions that are available. A partition is a set of compute nodes (computers dedicated to... computing) grouped logically. Typical examples include partitions dedicated to batch processing, debugging, post processing, or visualization.
# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch        up  infinite     2 alloc node[8-9]
batch        up  infinite     6  idle node[10-15]
debug*       up     30:00     8  idle node[0-7]
In the above example, we see two partitions, named batch and debug. The latter is the default partition, as it is marked with an asterisk. All nodes of the debug partition are idle, while two nodes of the batch partition are being used.
The sinfo command can also output the information in a node-oriented fashion, with the argument -N:
# sinfo -N -l
NODELIST     NODES PARTITION STATE     CPUS MEMORY TMP_DISK WEIGHT FEATURES REASON
node[0-1]        2 debug*    idle         2   3448    38536     16 (null)   (null)
node[2,4-7]      5 debug*    idle         2   3384    38536     16 (null)   (null)
node3            1 debug*    idle         2   3394    38536     16 (null)   (null)
node[8-9]        2 batch     allocated    2    246    82306     16 (null)   (null)
node[10-15]      6 batch     idle         2    246    82306     16 (null)   (null)
Note that with the -l argument, more information about the nodes is provided: number of CPUs, memory, temporary disk (also called scratch space), node weight (an internal parameter specifying preferences for node allocation when there are multiple possibilities), features of the nodes (such as processor type, for instance), and the reason, if any, for which a node is down.
You can specify precisely what information you would like sinfo to output by using its --format argument. For more details, have a look at the sinfo manpage with man sinfo.
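For instance, the following hypothetical invocation uses standard sinfo format specifiers to print, for each partition, its name, availability, time limit, node count, and node state:

$ sinfo --format="%P %a %l %D %t"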
The squeue command shows the list of jobs which are currently running (they are in the RUNNING state, noted 'R') or waiting for resources (in the PENDING state, noted 'PD').
# squeue
 JOBID PARTITION NAME USER ST  TIME NODES NODELIST(REASON)
 12345     debug job1 dave  R  0:21     4 node[9-12]
 12346     debug job2 dave PD  0:00     8 (Resources)
 12348     debug job3   ed PD  0:00     4 (Priority)
The above output shows that one job is running, whose name is job1 and whose jobid is 12345. The jobid is a unique identifier used by many Slurm commands when an action must be taken on one particular job. For instance, to cancel job job1, you would use scancel 12345. Time is the time the job has been running so far. Nodes is the number of nodes allocated to the job, while the Nodelist column lists the nodes which have been allocated to running jobs. For pending jobs, that column gives the reason why the job is pending.
In the example, job 12346 is pending because resources (CPUs, or other) are not available in sufficient amounts, while job 12348 is waiting for job 12346, whose priority is higher, to run. Each job is indeed assigned a priority depending on several parameters whose details are beyond the scope of this document. Note that the priority of pending jobs can be obtained with the sprio command.
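For instance, assuming 12346 is the id of a pending job as in the listing above, the following sketch would show how its priority is composed:

$ sprio --jobs=12346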
There are many switches you can use to filter the output by user (--user), by partition (--partition), by state (--state), etc. As with the sinfo command, you can choose what you want squeue to output with the --format parameter.
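As a quick sketch, the following would list only the pending jobs of user dave (the example user from the listing above), showing for each its jobid, name, state, and the reason it is waiting:

$ squeue --user=dave --state=PENDING --format="%i %j %T %r"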
Creating a job
Now the question is: how do you create a job? A job consists of two parts: resource requests and job steps. Resource requests describe the number of CPUs, the expected duration of the computation, the amounts of RAM or disk space needed, etc. Job steps describe the tasks that must be done and the software that must be run.
The typical way of creating a job is to write a submission script. A submission script is a shell script, e.g. a Bash script, whose comments, if they are prefixed with SBATCH, are understood by Slurm as parameters describing resource requests and other submission options. You can get the complete list of parameters from the sbatch manpage (man sbatch).
The script itself is a job step. Other job steps are created with the srun command.
For instance, the following script, hypothetically named submit.sh,
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res.txt
#
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

srun hostname
srun sleep 60
would request one CPU for 10 minutes, along with 100 MB of RAM, in the default queue. When started, the job would run a first job step, srun hostname, which launches the UNIX command hostname on the node on which the requested CPU was allocated. Then, a second job step will start the sleep command.
Interestingly, you can get near-realtime information about your program (memory consumption, etc.) with the sstat command.
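For instance, assuming jobid 12345 from the earlier listing were one of your running jobs, a minimal sstat invocation could look like this:

$ sstat --jobs=12345 --format=JobID,AveCPU,MaxRSS,MaxVMSize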
Note that the --job-name parameter allows giving a meaningful name to the job, and the --output parameter defines the file to which the output of the job must be sent.
Once the submission script is written properly, you need to submit it to Slurm through the sbatch command, which, upon success, responds with the jobid attributed to the job. (The dollar sign below is the shell prompt.)
$ sbatch submit.sh
sbatch: Submitted batch job 99999999
The job then enters the queue in the PENDING state. Once resources become available and the job has the highest priority, an allocation is created for it and it goes to the RUNNING state. If the job completes correctly, it goes to the COMPLETED state; otherwise, it is set to the FAILED state.
Upon completion, the output file contains the result of the commands run in the script file. In the above example, you can see it with cat res.txt.
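Its content should look something like the following (node3 is purely hypothetical; you will see the name of whichever node ran the job):

$ cat res.txt
node3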
Note that you can create an interactive job with the salloc command or by issuing an srun command directly.
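As a sketch, either of the following gives you an interactive session (the resource values are arbitrary examples):

$ salloc --ntasks=1 --time=10:00            # reserve resources, then launch job steps with srun
$ srun --ntasks=1 --time=10:00 --pty bash   # open an interactive shell directly on a compute node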
Going parallel
But, still, the real question is: how do you create a parallel job? There are several ways a parallel job, one whose tasks are run simultaneously, can be created:
by running a multi-process program (SPMD paradigm, e.g. with MPI)
by running a multithreaded program (shared memory paradigm, e.g. with OpenMP or pthreads)
by running several instances of a single-threaded program (so-called embarrassingly parallel paradigm)
by running one master program controlling several slave programs (master/slave paradigm)
In the Slurm context, a task is to be understood as a process. So a multi-process program is made of several tasks. By contrast, a multithreaded program is composed of only one task, which uses several CPUs.
Tasks are requested/created with the --ntasks option, while CPUs, for multithreaded programs, are requested with the --cpus-per-task option. Tasks cannot be split across several compute nodes, so requesting several CPUs with the --cpus-per-task option will ensure all CPUs are allocated on the same compute node. By contrast, requesting the same number of CPUs with the --ntasks option may lead to CPUs being allocated on several, distinct compute nodes.
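To make the difference concrete, compare the following two request headers (a sketch; both ask for eight CPUs in total):

#SBATCH --ntasks=8               # eight single-CPU tasks; they may land on several nodes

versus

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8        # one task with eight CPUs, all on the same node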
More submission script examples
Here are some quick sample submission scripts. For more detailed information, make sure to have a look at the Slurm FAQ and to follow our training sessions. There is also a Script Generation Wizard you can use to help you create submission scripts.
Message passing example (MPI)
#!/bin/bash
#
#SBATCH --job-name=test_mpi
#SBATCH --output=res_mpi.txt
#
#SBATCH --ntasks=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

module load openmpi
mpirun hello.mpi
This requests four cores on the cluster for 10 minutes, using 100 MB of RAM per core. Assuming hello.mpi was compiled with MPI support, mpirun will create four instances of it on the nodes allocated by Slurm.
You can try the above example by downloading the example hello world program from Wikipedia (name it for instance wiki_mpi_example.c), and compiling it with:

module load openmpi
mpicc wiki_mpi_example.c -o hello.mpi
The res_mpi.txt file should contain something like:

0: We have 4 processors
0: Hello 1! Processor 1 reporting for duty
0: Hello 2! Processor 2 reporting for duty
0: Hello 3! Processor 3 reporting for duty
Shared memory example (OpenMP)
#!/bin/bash
#
#SBATCH --job-name=test_omp
#SBATCH --output=res_omp.txt
#
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./hello.omp
The job will be run in an allocation where four cores have been reserved on the same compute node.
You can try it by using the hello world program from Wikipedia (name it for instance wiki_omp_example.c) and compiling it with:
gcc -fopenmp wiki_omp_example.c -o hello.omp
The res_omp.txt file should contain something like:

Hello World from thread 0
Hello World from thread 3
Hello World from thread 1
Hello World from thread 2
There are 4 threads
Embarrassingly parallel workload example
#!/bin/bash
#
#SBATCH --job-name=test_emb
#SBATCH --output=res_emb.txt
#
#SBATCH --ntasks=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

srun printenv SLURM_PROCID
In that configuration, the printenv command will be run four times, and each instance will have its environment variable SLURM_PROCID set to a distinct value.
This setup is useful if the program is based on random draws (e.g. Monte-Carlo simulations): the application permitting, you can have four programs drawing 1000 samples each and combine their outputs (with another program) to get the equivalent of drawing 4000 samples.
Another typical use of this setting is a parameter sweep, where the same computation is carried out by each program except that some high-level parameter takes a distinct value in each case. Examples include the optimisation of an integer-valued parameter through range scanning. In the latter case, each instance of the program simply has to look up the $SLURM_PROCID environment variable and decide, accordingly, what values of the parameter to test.
The same can be set up to process several data files, for instance. Each instance of the program just has to decide which file to read based upon the value set in its $SLURM_PROCID environment variable.
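A minimal sketch of that pattern, replacing the printenv line in the script above (my_program and the data_N.txt input files are hypothetical names):

srun bash -c './my_program data_${SLURM_PROCID}.txt'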
Upon completion, the above job will create the file res_emb.txt with four lines:
0
1
2
3
Master/slave program example
#!/bin/bash
#
#SBATCH --job-name=test_ms
#SBATCH --output=res_ms.txt
#
#SBATCH --ntasks=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

srun --multi-prog multi.conf
With the file multi.conf being, for example, as follows:

0      echo 'I am the Master'
1-3    bash -c 'printenv SLURM_PROCID'
The above instructs Slurm to create four tasks (or processes), one running echo 'I am the Master', and the other three running bash -c 'printenv SLURM_PROCID'. This is typically used in a producer/consumer setup where one program (the master) creates computing tasks for the other programs (the slaves) to perform.
Upon completion of the above job, the file res_ms.txt will contain:

I am the Master
1
2
3

Reposted from: http://www.ceci-hpc.be/slurm_tutorial.html