Slurm is a resource manager and job scheduler.
Users can submit jobs (i.e. scripts containing execution instructions) to slurm so that it can schedule their execution and allocate the appropriate resources (CPU, RAM, etc.) on the basis of a user's preferences or the limits imposed by the system administrators.

The advantages of using slurm on a computational cluster are many. For an overview of them please read [[https://slurm.schedmd.com/overview.html|Slurm's overview page]].

Slurm is **free** software distributed under the [[https://www.gnu.org/licenses/gpl.html|GNU General Public License]].
==== What is a parallel job? ====

//A parallel job consists of tasks that run simultaneously.// Such tasks can be created, for example:
  * by running a multi-process program (e.g. one that uses MPI)
  * by running a multi-threaded program

A multi-process program consists of multiple tasks orchestrated by MPI and possibly executed on different nodes. On the other hand, a multi-threaded program consists of a single task whose threads use several CPUs on the same node.
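As a rough sketch (these are standard sbatch directives, not settings taken from this cluster), the two kinds of parallelism map onto different resource requests:

<code>
# multi-process (e.g. MPI): 8 tasks, which slurm may spread over several nodes
#SBATCH --ntasks=8

# multi-threaded (e.g. OpenMP): 1 task that is given 8 CPUs on a single node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
</code>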
==== Slurm's common user commands ====
| < | < | ||
| $ sinfo | $ sinfo | ||
| - | PARTITION | ||
| - | playground* | ||
| - | playground* | ||
| - | lowmem | ||
| - | lowmem | ||
| - | lowmem-inf | ||
| - | lowmem-inf | ||
| - | highmem | ||
| - | highmem-inf | ||
| - | notebook | ||
| - | notebook | ||
| - | |||
| - | |||
| </ | </ | ||
A * near a partition name indicates the default partition. See ''man sinfo'' for more options.
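The listing can also be restricted to a single partition, for instance (the partition name is taken from the output above):

<code>
sinfo -p lowmem
</code>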
**Display all active jobs of a user**
<code>
$ squeue -u <username>
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   ...
</code>
| < | < | ||
| $ scontrol show partition notebook | $ scontrol show partition notebook | ||
| - | PartitionName=notebook | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | |||
| </ | </ | ||
| < | < | ||
| $scontrol show node maris004 | $scontrol show node maris004 | ||
| - | NodeName=maris004 Arch=x86_64 CoresPerSocket=4 | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | |||
| </ | </ | ||
| < | < | ||
| novamaris [1087] $ scontrol show jobs 1052 | novamaris [1087] $ scontrol show jobs 1052 | ||
| - | JobId=1052 JobName=slurm_engine.sbatch | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | | ||
| - | |||
| </ | </ | ||
**Create two tasks running on different nodes**
<code>
srun -N2 -l hostname
0: maris005
1: maris006
</code>
**Create three tasks running on the same node**
**Create three tasks running on different nodes specifying which nodes should __at least__ be used**

<code>
srun -N3 -w "maris00[5-6]" hostname
</code>
**Create a job script and submit it to slurm for execution**

Suppose ''batch.sh'' contains the following instructions
<code>
#!/usr/bin/env bash
#SBATCH -n 2
#SBATCH -w maris00[5-6]
srun hostname
</code>

then submit it using ''sbatch batch.sh''.

See ''man sbatch'' for all available options.
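On submission ''sbatch'' prints the id assigned to the job, which can then be used with the commands shown earlier (the job id below is only an illustration):

<code>
$ sbatch batch.sh
Submitted batch job 1052
$ squeue -j 1052
$ scontrol show job 1052
</code>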
==== Less-common user commands ====

  * **sshare**
  * **sprio**
  * **sacct**
=== sacctmgr ===

''sacctmgr'' displays (and, with the appropriate privileges, modifies) slurm account information. For instance, to see the limits attached to the available QOSes type
<code>
$ sacctmgr show qos format=Name,MaxCPUsPU,MaxJobsPU
      Name MaxCPUsPU MaxJobsPU
---------- --------- ---------
    normal
playground
  notebook
</code>
=== sshare ===

''sshare'' shows the fair-share usage of an account or user. For instance
<code>
$ sshare -U -u <username>
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
               xxxxx     yyyyyy        ...
</code>

:!: On maris, usage parameters will decay over time according to a PriorityDecayHalfLife of 14 days.
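To see the same fair-share information for every account and user on the cluster, use the standard ''-a'' flag:

<code>
sshare -a
</code>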
=== sprio ===

''sprio'' displays the factors that determine the scheduling priority of pending jobs. To find out what priority a running job was given type
<code>
squeue -o %Q -j <jobid>
</code>
| + | |||
| + | === sacct === | ||
| + | It displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database. For instance | ||
| + | |||
| + | < | ||
| + | sacct -o JobID, | ||
| + | | ||
| + | ------------ ---------- --------- ---------- ---------- ---------- ---------- ------------------- ------------------- | ||
| + | 13180 | ||
| + | 13180.batch | ||
| + | 13183 | ||
| + | 13183.batch | ||
| + | 13183.0 | ||
| + | |||
| + | |||
| + | </ | ||
| + | |||
| + | :!: Use '' | ||
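''sacct'' can also be restricted to a single user or to a time window; a small sketch using standard options (the field names and the date are only illustrative):

<code>
# jobs of a user that started after a given date, with a few common fields
sacct -u <username> -S 2019-01-01 -o JobID,JobName,State,Elapsed,MaxRSS
</code>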
===== Tips =====

To minimize the time your job spends in the queue you could specify multiple partitions, so that the job can start as soon as one of them has free resources. Use ''-p'' with a comma-separated list of partitions, e.g. ''-p lowmem,highmem''.

To have a rough estimate of when your queued job will start type ''squeue --start -j <jobid>''.
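For example (the partition names are taken from the ''sinfo'' listing above; the job id is illustrative):

<code>
# let the job start in whichever of the listed partitions has free resources first
sbatch -p lowmem,highmem batch.sh

# rough estimate of the start time of a queued job
squeue --start -j 1052
</code>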
To translate a job script written for a scheduler other than slurm into slurm's own syntax check SchedMD's //Rosetta Stone of Workload Managers//:
https://slurm.schedmd.com/rosetta.pdf
| + | |||
| + | === top-like node usage === | ||
| + | |||
| + | Should you want to monitor the usage of the cluster nodes in a top-like fashion type | ||
| + | |||
| + | < | ||
| + | sinfo -i 5 -S" | ||
| + | </ | ||
| + | |||
| + | === top-like job stats === | ||
| + | To monitor the resources consumed by your running job type | ||
| + | |||
| + | < | ||
| + | watch -n1 sstat --format JobID, | ||
| + | </ | ||
| + | |||
| + | === Make local file available to all nodes allocated to a slurm job === | ||
| + | |||
| + | To transmit a file to all nodes allocated to the currently active Slurm job use '' | ||
| + | |||
| + | < | ||
| + | > cat my.job | ||
| + | # | ||
| + | | ||
| + | srun / | ||
| + | |||
| + | > sbatch --nodes=8 my.job | ||
| + | srun: jobid 145 submitted | ||
| + | |||
| + | </ | ||
| + | |||
| + | === Specify nodes for a job === | ||
| + | |||
| + | For instance ''# | ||
| + | |||
| + | === Environment variables available to slurm jobs === | ||
| + | |||
| + | Type '' | ||
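A minimal job script illustrating this (the two variables named in the comment are standard slurm output environment variables):

<code>
#!/usr/bin/env bash
#SBATCH -n 2
# prints, among others, SLURM_JOB_ID and SLURM_JOB_NODELIST
env | grep ^SLURM
</code>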
| + | |||