
recipe to build slurm docker

trick slurmdbd and slurm into connecting

added separate munge service

added tag to docker build step

bypass selinux to let docker volumes work

removed duplicate slurmdbd start

added munge key generation
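
One common way to produce such a key is to fill 1 KiB from a random source; a sketch (the exact command used here is not shown in this change):

```
dd if=/dev/urandom bs=1 count=1024 > roles/slurm/files/munge.key
```

The resulting file is then stored vault-encrypted, as the `roles/slurm/files/munge.key` blob below shows.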

polishing

phasing out test system

Added munge files.

Munge will be run in a separate container.

Added ldap client to the docker image.

Needs to be tested though

Encrypted slurmdbd secret
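
The secret lives in `roles/slurm/vars/main.yml` (presumably the `slurm_storage_pass` referenced from `slurmdbd.conf`), which is kept vault-encrypted. Encrypting such a file looks roughly like this, assuming the vault password file mentioned in the readme:

```
ansible-vault encrypt --vault-password-file=.vault_pass.txt roles/slurm/vars/main.yml
```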

Added firewall and Postfix

Needed for slurm but can probably be used elsewhere.

Set selinux to permissive to allow docker volumes, and make starting the services optional.

Without host networking, but it doesn't seem to work.

Obtained changes from live pg-scheduler

Changes after testing on production...

changed timezone to CEST

clean up reference to test system

Updated readme

Fixed mail from docker

This change installs ssmtp to connect to a mail server. Required settings are to be entered in the hosts file.
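
Concretely, `ssmtp.conf` is templated with `{{ mailhub }}` and `{{ rewrite_domain }}`, which are set per host in the inventory, as in the `hosts` entry from this change:

```
[schedulers]
pg-node001 mailhub=172.23.56.1 rewrite_domain=knyft.hpc.rug.nl
```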

updated hostname for mail to work

Added monk user and group

Fixed syntax error

Updated docker and changed Requires into Wants.

Added a tag to the service files.

Changed to overlay driver as it is possible now.

Also added various config files for slurm.

remove obsolete var/spool/slurm mount in db docker
pull/6/head
Egon Rijpkema, 5 years ago
commit e855b05e3b
27 changed files (lines changed per file):

  1. .gitignore (2)
  2. ansible.cfg (2)
  3. hosts (2)
  4. hosts-dev (5)
  5. main.yml (2)
  6. readme.md (20)
  7. roles/common/tasks/firewall.yml (28)
  8. roles/common/tasks/main.yml (3)
  9. roles/common/templates/firewall.sh (89)
  10. roles/slurm/files/Dockerfile (80)
  11. roles/slurm/files/daemon.json (6)
  12. roles/slurm/files/job_submit.lua (94)
  13. roles/slurm/files/ldap.conf (2)
  14. roles/slurm/files/munge.key (57)
  15. roles/slurm/files/munge.service (17)
  16. roles/slurm/files/nslcd.conf (8)
  17. roles/slurm/files/nsswitch.conf (63)
  18. roles/slurm/files/pam_ldap.conf (5)
  19. roles/slurm/files/runslurmctld.sh (7)
  20. roles/slurm/files/slurm.conf (145)
  21. roles/slurm/files/slurm.service (20)
  22. roles/slurm/files/slurmdbd.conf (29)
  23. roles/slurm/files/slurmdbd.service (18)
  24. roles/slurm/files/ssmtp.conf (4)
  25. roles/slurm/tasks/main.yml (107)
  26. roles/slurm/vars/main.yml (7)
  27. slurm.yml (5)

.gitignore

@@ -7,3 +7,5 @@
Session.vim
.netrwhist
*~
*.swp
.vault_pass.txt

ansible.cfg

@@ -1,3 +1,3 @@
[defaults]
hostfile = hosts
inventory = hosts
host_key_checking = False

hosts

@@ -1,5 +1,5 @@
[schedulers]
pg-scheduler
pg-node001 mailhub=172.23.56.1 rewrite_domain=knyft.hpc.rug.nl
[login]
pg-login

hosts-dev

@@ -0,0 +1,5 @@
[slurm]
centos7-test
[interactive]
centos7-test

main.yml

@@ -0,0 +1,2 @@
---
- include: slurm.yml

readme.md

@@ -0,0 +1,20 @@
# ansible playbooks for peregrine
This repository contains an inventory and ansible playbooks for the peregrine cluster.
## Install slurm.
To install slurm:
```
ansible-playbook --vault-password-file=.vault_pass.txt slurm.yml
```
### Skip building of docker images.
Building the docker images takes a lot of time and is only necessary when the Dockerfile has changed. You can skip this step with the following command:
```
ansible-playbook --vault-password-file=.vault_pass.txt slurm.yml --skip-tags build
```
Furthermore, you can prevent the services from starting immediately by providing the `--skip-tags start-service` flag.
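
Both tags can also be skipped in a single run, since `--skip-tags` takes a comma-separated list:

```
ansible-playbook --vault-password-file=.vault_pass.txt slurm.yml --skip-tags build,start-service
```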

roles/common/tasks/firewall.yml

@@ -0,0 +1,28 @@
---
- file:
    path: /root/firewall
    state: directory
    mode: 0700

- name: install firewall script
  template:
    src: ../templates/firewall.sh
    dest: /root/firewall/firewall.sh
    owner: root
    group: root
    mode: 0744

- command: /root/firewall/firewall.sh

# Docker needs to be restarted if present,
# because it writes its own iptables rules.
- name: check if docker is present
  command: /usr/bin/docker -v
  register: result
  ignore_errors: True

- service:
    name: docker
    state: restarted
  when: result|succeeded

roles/common/tasks/main.yml

@@ -0,0 +1,3 @@
---
- include: firewall.yml
  tags: firewall

roles/common/templates/firewall.sh

@@ -0,0 +1,89 @@
#!/bin/bash
# Modified version of firewall encountered on live pg-scheduler
# DIY-firewall (GS)
# scheduler02.hpc.rug.nl
# prevent SYNC-floods:
echo 1 > /proc/sys/net/ipv4/tcp_syncookies
# initialize:
/sbin/iptables -F
/sbin/iptables -X
/sbin/iptables -Z
/sbin/iptables -N LOGDROP
/sbin/iptables -A LOGDROP -j LOG
/sbin/iptables -A LOGDROP -j DROP
# configure default policies:
/sbin/iptables -P INPUT DROP
/sbin/iptables -P OUTPUT DROP
/sbin/iptables -P FORWARD DROP
# vars:
IFACE="{{ansible_default_ipv4.interface}}"
LOOPBACK="127.0.0.0/8"
OPERATOR="129.125.50.41/32" # Nagios server
# kernel tweaks:
/bin/echo "1" > /proc/sys/net/ipv4/icmp_echo_ignore_broadcasts
/bin/echo "0" > /proc/sys/net/ipv4/conf/all/accept_source_route
/bin/echo "1" > /proc/sys/net/ipv4/icmp_ignore_bogus_error_responses
/bin/echo "0" > /proc/sys/net/ipv4/ip_forward
# allow loopback:
/sbin/iptables -A INPUT -i lo -j ACCEPT
/sbin/iptables -A OUTPUT -o lo -j ACCEPT
# allow eth0 (interconnect):
/sbin/iptables -A INPUT -i eth0 -j ACCEPT
/sbin/iptables -A OUTPUT -o eth0 -j ACCEPT
# allow icmp:
/sbin/iptables -A INPUT -i $IFACE -p icmp -j ACCEPT
/sbin/iptables -A OUTPUT -o $IFACE -p icmp -j ACCEPT
# refuse loopback packets incoming eth1
/sbin/iptables -A INPUT -i $IFACE -d $LOOPBACK -j DROP
# allow DNS:
/sbin/iptables -A INPUT -i $IFACE -p tcp --sport 53 -j ACCEPT
/sbin/iptables -A OUTPUT -o $IFACE -p tcp --dport 53 -j ACCEPT
/sbin/iptables -A INPUT -i $IFACE -p udp --sport 53 -j ACCEPT
/sbin/iptables -A OUTPUT -o $IFACE -p udp --dport 53 -j ACCEPT
/sbin/iptables -A INPUT -i $IFACE -p udp --sport 123 -j ACCEPT
/sbin/iptables -A OUTPUT -o $IFACE -p udp --dport 123 -j ACCEPT
# allow smtp out:
/sbin/iptables -A INPUT -i $IFACE -p tcp --sport 25 -j ACCEPT
/sbin/iptables -A OUTPUT -o $IFACE -p tcp --dport 25 -j ACCEPT
# bwp rug:
/sbin/iptables -A INPUT -i $IFACE -p tcp -s 129.125.249.0/24 --dport 22 -m state --state NEW,ESTABLISHED -j ACCEPT
/sbin/iptables -A OUTPUT -o $IFACE -p tcp -d 129.125.249.0/24 --sport 22 -m state --state ESTABLISHED -j ACCEPT
# allow operator:
/sbin/iptables -A INPUT -i $IFACE -p tcp -s $OPERATOR -j ACCEPT
/sbin/iptables -A OUTPUT -o $IFACE -p tcp -d $OPERATOR -j ACCEPT
/sbin/iptables -A INPUT -i $IFACE -p udp -s $OPERATOR -j ACCEPT
/sbin/iptables -A OUTPUT -o $IFACE -p udp -d $OPERATOR -j ACCEPT
/sbin/iptables -A INPUT -i $IFACE -p icmp -s $OPERATOR -j ACCEPT
/sbin/iptables -A OUTPUT -o $IFACE -p icmp -d $OPERATOR -j ACCEPT
# allow gospel/slurm-db:
/sbin/iptables -A INPUT -i $IFACE -p tcp -s 129.125.36.145/32 -j ACCEPT
/sbin/iptables -A OUTPUT -o $IFACE -p tcp -d 129.125.36.145/32 -j ACCEPT
#log incoming packets:
/sbin/iptables -A INPUT -i $IFACE -d 129.125.50.193/32 -j LOGDROP
/sbin/iptables --list
# Save the newly generated config into system config.
# This ensures the firewall is loaded on boot.
/usr/sbin/iptables-save > /etc/sysconfig/iptables

roles/slurm/files/Dockerfile

@@ -0,0 +1,80 @@
#
# Centos7 Slurm Server
#
FROM centos:7
MAINTAINER Egon Rijpkema <e.m.a.rijpkema@rug.nl>
# Openldap client, installing from spacewalk leads to conflicts.
RUN yum install -y openldap-clients nss-pam-ldapd openssh-ldap
# add openldap config
ADD ldap.conf /etc/openldap/ldap.conf
ADD nslcd.conf /etc/nslcd.conf
ADD pam_ldap.conf /etc/pam_ldap.conf
ADD nsswitch.conf /etc/nsswitch.conf
RUN chmod 600 /etc/nslcd.conf
# Add spacewalk client
RUN rpm -Uvh http://yum.spacewalkproject.org/2.4-client/RHEL/7/x86_64/spacewalk-client-repo-2.4-3.el7.noarch.rpm
RUN yum install rhn-client-tools rhn-check rhn-setup rhnsd m2crypto yum-rhn-plugin -y
RUN rhnreg_ks --force --serverUrl=http://spacewalk.hpc.rug.nl/XMLRPC --activationkey=1-ce5e67697e0e3e699dd236564faa2fc4
# Disable all repos in /etc/yum.repos.d/ so packages come from spacewalk
RUN sed -i 's/enabled=1/enabled=0/g' /etc/yum.repos.d/*
RUN sed -i '/name=/a enabled=0' /etc/yum.repos.d/*
# Disable gpgcheck
RUN sed -i 's/gpgcheck = 1/gpgcheck = 0/g' /etc/yum/pluginconf.d/rhnplugin.conf
RUN adduser slurm
# Slurm and dependencies
RUN yum install -y slurm \
slurm-plugins \
slurm-lua \
slurm-slurmdbd \
slurm-sjobexit \
slurm-munge \
slurm-sql \
slurm-perlapi \
slurm-sjstat
# Slurm needs /sbin/mail to work in order to send mail
RUN yum install -y mailx ssmtp
# Add ssmtp config
ADD ssmtp.conf /etc/ssmtp/ssmtp.conf
RUN mkdir /var/log/slurm
RUN chown slurm: /var/log/slurm
RUN mkdir /var/spool/slurm
RUN chown slurm: /var/spool/slurm
ADD slurm.conf /etc/slurm/slurm.conf
ADD slurmdbd.conf /etc/slurm/slurmdbd.conf
ADD job_submit.lua /etc/slurm/job_submit.lua
RUN groupadd -g 500 beheer
RUN useradd -g beheer -u 500 ger
RUN useradd -g beheer -u 501 fokke
RUN useradd -g beheer -u 502 bob
RUN useradd -g beheer -u 505 wim
RUN useradd -g beheer -u 506 robin
RUN useradd -g beheer -u 507 wietze
RUN useradd -g beheer -u 508 ruben
RUN useradd -g beheer -u 509 cristian
RUN groupadd -g 1001 monk
RUN useradd -u 2071 -g monk monk
ADD runslurmctld.sh /runslurmctld.sh
RUN chmod +x /runslurmctld.sh
# our users find UTC confusing
RUN rm /etc/localtime
RUN ln -s /usr/share/zoneinfo/Europe/Amsterdam /etc/localtime

roles/slurm/files/daemon.json

@@ -0,0 +1,6 @@
{
    "storage-driver": "overlay2",
    "storage-opts": [
        "overlay2.override_kernel_check=true"
    ]
}

roles/slurm/files/job_submit.lua

@@ -0,0 +1,94 @@
--[[
This lua script assigns the right QoS to each job, based on a predefined table and
assuming that each partition will have a QoS for short jobs and one for long jobs.
The correct QoS is chosen by comparing the time limit of the job to a given threshold.
The PARTITION_TO_QOS table contains these thresholds and QoS names for all partitions:
for jobs having a time limit below the threshold, the given short QoS will be applied.
Otherwise, the specified long QoS will be applied.
Note that this script should be named "job_submit.lua" and be stored
in the same directory as the SLURM configuration file, slurm.conf.
It will be automatically run by the SLURM daemon for each job submission.
--]]
-- PARTITION    TIME LIMIT    SHORT QOS     LONG QOS
-- NAME         THRESHOLD     NAME          NAME
--              (MINUTES!)
PARTITION_TO_QOS = {
    nodes     = {3*24*60, "nodes",     "nodeslong"     },
    regular   = {3*24*60, "regular",   "regularlong"   },
    gpu       = {1*24*60, "gpu",       "gpulong"       },
    himem     = {3*24*60, "himem",     "himemlong"     },
    short     = {30*60,   "short",     "short"         },
    nodestest = {3*24*60, "nodestest", "nodestestlong" },
    target    = {3*24*60, "target",    "target"        },
    euclid    = {3*24*60, "target",    "target"        }
}

-- Jobs that do not have a partition, will be routed to the following default partition.
-- Can also be found dynamically using something like:
--   sinfo | awk '{print $1}' | grep "*" | sed 's/\*$//'
-- Or by finding the partition in part_list that has flag_default==1
DEFAULT_PARTITION = "regular"

function slurm_job_submit(job_desc, part_list, submit_uid)
    -- If partition is not set, set it to the default one
    if job_desc.partition == nil then
        job_desc.partition = DEFAULT_PARTITION
    end

    -- Find the partition in SLURM's partition list that matches the
    -- partition of the job description.
    local partition = false
    for name, part in pairs(part_list) do
        if name == job_desc.partition then
            partition = part
            break
        end
    end

    -- To be sure, check if a valid partition has been found.
    -- This should always be the case, otherwise the job would have been rejected.
    if not partition then
        return slurm.ERROR
    end

    -- If the job does not have a time limit, set it to
    -- the default time limit of the job's partition.
    -- For some reason (bug?), the nil value is passed as 4294967294.
    if job_desc.time_limit == nil or job_desc.time_limit == 4294967294 then
        job_desc.time_limit = partition.default_time
    end

    -- Now use the job's partition and the PARTITION_TO_QOS table
    -- to assign the right QOS to the job.
    local qos_map = PARTITION_TO_QOS[partition.name]
    if job_desc.time_limit <= qos_map[1] then
        job_desc.qos = qos_map[2]
    else
        job_desc.qos = qos_map[3]
    end
    --slurm.log_info("qos = %s", job_desc.qos)
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    -- if job_desc.comment == nil then
    --     local comment = "***TEST_COMMENT***"
    --     slurm.log_info("slurm_job_modify: for job %u from uid %u, setting default comment value: %s",
    --                    job_rec.job_id, modify_uid, comment)
    --     job_desc.comment = comment
    -- end
    return slurm.SUCCESS
end

slurm.log_info("initialized")

return slurm.SUCCESS

roles/slurm/files/ldap.conf

@@ -0,0 +1,2 @@
TLS_CACERTDIR /etc/openldap/certs
SASL_NOCANON on

roles/slurm/files/munge.key

@@ -0,0 +1,57 @@
$ANSIBLE_VAULT;1.1;AES256
31613263663136343138333434346139326262386431336236323262653537393137666431373134
6433666533396562323935373566373737353463343539660a656639326631636131336539346432
62636161616434363837636335336461343864333230323832653764633039303237653337666363
6337333663333731620a353362343163323636653237386139343333646164346530366462396439
37356532333064303066363937663564383465316231613065656436313238336336656136663361
62616666363038643233356331653162336164656661616662636266303966373036393831333034
61633130343564646666373938383236646633393764326465303239393933626633336161313034
31326638613632373466333661363637633632303363616562663239666130396231336137643335
34323638343231363239313334646662666535326339666636663161326138383436633234373636
64303839633931653833313266386334356434636235376162303837323032663533383536353939
66306661636265353638373133343163656530353366333637313861653162366630323361386437
38636463393634333162303161623063646437333364643961343836393366383035393061383962
62323362366338343132316234616338373861363465386566353935396162366138326665613834
38666166613965616133333133343434383633306234383638616134373834366566373739313162
63323738333733653830656261343664626364363436343765313634323736353961666630633963
34643764363561656431663535316530326263663531636539333537626530623766313931363965
35376439623732626534636634646266326336383535396237363732363134633762323965646635
65643236313338383435353933613235323537363865346337333835303065386263323866623532
39356534643434346135363164386361323563393633626337663666666364376637363765386134
63383133316330366333386266616632393131383338343331393330333632303337353166623133
65316131373133313465323765363663383263366639323635393335653639613936663731373735
39656666616231386364326137353334383331636662613436616537303734326634633933623832
32313366353336623938393932353734333862613765316536316563343366643839326162343261
39643065313564306465383463376436663836396133303339616130636566653333353134653234
65333232613465386530386232613135356538356237396264306635323734343739633766376138
38393732613132393063613263316535343464373762663664656138313833636332373537663964
61666662376434373137333730613063373864346433623237376464323933626663313635646233
64313365643337333064623932343832306431393033333235653237373032646232623234383761
66393835653762306636613136313264303564653964356438616162346263653936393436373263
39633731316666393633393135323461336536666131366338363666343961383962643165336138
38643365363330633937333263343333643534323035653836616535613865656265353566626433
62633162333463643739363063383832386534366635633461306230326233613265353065353036
30613636633636303634653963666463643735353830363935633637373935323463356161633736
37353664353836363038383332616665656632366565303534643636343632343930343138626539
61363463343864333364396332613533626231393139663966623037336130356466323736313138
38313966323230646163306436626136373964376561613463393439663537643933343031373539
61363930653965666437663633383162343962646532333133346334376531336233333332626562
30376261623932393734366661663166643664646565343461393537336465363766313764366465
61616536626535656661333635343335303034393661633430393531636564623663336534633135
66636535313136373032323632633232383964643762343465356439313561343066333765646361
38336330343234666562323564396336373135396338613561613664376332646238653935303537
37623931383961393539326135313632613634383736373130666564323562653362343333313535
64373437383766383539353237323031393838616661323037643062346164356362616364663464
33343364393932653438356136383265613436616436656263363235366363373036646361653564
30396238393439623865643463353964393632383237636663653631313461353833383632316435
61343734393662323938396530363339306636313666343039383839633334353830366161383861
38313065366161333265623733613238316138316635383738303236373130313936353665646362
34626437363866646239303437363437346232356161353936373730646362653264636339623365
33643961653864376233366138626438366664396564356138356639356130643939346230353535
66303230613833653839633437633036373332613032646262356136393431323235383466343330
64366635356464306234616138343736373937663835393766333233666164623065343463633633
38313066366364643165323836633435356436633261386161613030336161363862356639656431
34613462646230626539343831643763393932636530653739373736646233636463323864613636
33636539346531643931626461323831343731666165663463326133663762353633663034373937
33353636313639343833366265353465323266343336656361333262363839343832386331356236
65643838646434346533

roles/slurm/files/munge.service

@@ -0,0 +1,17 @@
[Unit]
Description=Munge auth daemon
After=docker.service
Requires=docker.service
[Service]
TimeoutStartSec=0
Restart=always
ExecStartPre=-/usr/bin/docker stop %n
ExecStartPre=-/usr/bin/docker rm %n
ExecStart=/usr/bin/docker run --rm --name %n \
--volume /srv/slurm/volumes/etc/munge:/etc/munge \
--volume /srv/slurm/volumes/etc/munge:/var/run/munge/ \
hpc/slurm /usr/sbin/munged -f -F
[Install]
WantedBy=multi-user.target

roles/slurm/files/nslcd.conf

@@ -0,0 +1,8 @@
uid nslcd
gid ldap
uri ldap://172.23.47.249
base ou=Peregrine,o=asds
ssl no
tls_cacertdir /etc/openldap/cacerts
binddn cn=clusteradminperegrine,o=asds
bindpw qwasqwas

roles/slurm/files/nsswitch.conf

@@ -0,0 +1,63 @@
#
# /etc/nsswitch.conf
#
# An example Name Service Switch config file. This file should be
# sorted with the most-used services at the beginning.
#
# The entry '[NOTFOUND=return]' means that the search for an
# entry should stop if the search in the previous entry turned
# up nothing. Note that if the search failed due to some other reason
# (like no NIS server responding) then the search continues with the
# next entry.
#
# Valid entries include:
#
# nisplus Use NIS+ (NIS version 3)
# nis Use NIS (NIS version 2), also called YP
# dns Use DNS (Domain Name Service)
# files Use the local files
# db Use the local database (.db) files
# compat Use NIS on compat mode
# hesiod Use Hesiod for user lookups
# [NOTFOUND=return] Stop searching if not found so far
#
# To use db, put the "db" in front of "files" for entries you want to be
# looked up first in the databases
#
# Example:
#passwd: db files nisplus nis
#shadow: db files nisplus nis
#group: db files nisplus nis
passwd: ldap files
shadow: ldap files
group: ldap files
#hosts: db files nisplus nis dns
hosts: files dns myhostname
# Example - obey only what nisplus tells us...
#services: nisplus [NOTFOUND=return] files
#networks: nisplus [NOTFOUND=return] files
#protocols: nisplus [NOTFOUND=return] files
#rpc: nisplus [NOTFOUND=return] files
#ethers: nisplus [NOTFOUND=return] files
#netmasks: nisplus [NOTFOUND=return] files
bootparams: nisplus [NOTFOUND=return] files
ethers: files
netmasks: files
networks: files
protocols: files
rpc: files
services: files sss
netgroup: files sss
publickey: nisplus
automount: files
aliases: files nisplus

roles/slurm/files/pam_ldap.conf

@@ -0,0 +1,5 @@
host 172.23.47.249
base ou=Peregrine,o=asds
binddn cn=clusteradminperegrine,o=asds
bindpw qwasqwas
port 389

roles/slurm/files/runslurmctld.sh

@@ -0,0 +1,7 @@
#!/bin/bash
# Start the nslcd daemon in the background and then start slurm.
nslcd
/usr/sbin/slurmctld -D

roles/slurm/files/slurm.conf

@@ -0,0 +1,145 @@
ClusterName=Peregrine
ControlMachine=knyft.hpc.rug.nl
ControlAddr=knyft.hpc.rug.nl
#BackupController=
#BackupAddr=
#
SlurmUser=root
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurm
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=pmi2
MpiParams=ports=12000-12999
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/cgroup
#PluginDir=
CacheGroups=0
#FirstJobId=
ReturnToService=1
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
Prolog=/etc/slurm/slurm.prolog
PrologFlags=Alloc
Epilog=/etc/slurm/slurm.epilog*
#SrunProlog=
#SrunEpilog=
TaskProlog=/etc/slurm/slurm.taskprolog
#TaskEpilog=/etc/slurm/slurm.taskepilog
#TaskPlugin=affinity
TaskPlugin=task/cgroup
JobSubmitPlugins=lua
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#CheckpointType=checkpoint/blcr
JobCheckpointDir=/var/slurm/checkpoint
#
# Terminate job immediately when one of the processes is crashed or aborted.
KillOnBadExit=1
# Do not automatically requeue jobs after a node failure
JobRequeue=0
# Cgroups already enforce resource limits, SLURM should not do this
MemLimitEnforce=no
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=43200
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=1
#
# SCHEDULING
SchedulerType=sched/backfill
SchedulerPort=7321
SchedulerParameters=bf_max_job_user=200,bf_max_job_test=10000,default_queue_depth=500,bf_window=14400,bf_resolution=300,kill_invalid_depend,bf_continue,bf_min_age_reserve=3600
SelectType=select/cons_res
# 13jan2016: disabled CR_ONE_TASK_PER_CORE (HT off) and CR_ALLOCATE_FULL_SOCKET (deprecated)
SelectTypeParameters=CR_Core_Memory
#SchedulerAuth=
#SchedulerRootFilter=
FastSchedule=1
PriorityType=priority/multifactor
PriorityFlags=MAX_TRES
PriorityDecayHalfLife=7-0
PriorityFavorSmall=NO
# Not necessary if there is a decay
#PriorityUsageResetPeriod=14-0
PriorityWeightAge=5000
PriorityWeightFairshare=100000
PriorityWeightJobSize=0
PriorityWeightPartition=0
PriorityWeightQOS=0
PriorityMaxAge=100-0
#
# Reservations
ResvOverRun=UNLIMITED
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm/slurm.jobcomp
#
# ACCOUNTING
#AcctGatherEnergyType=acct_gather_energy/rapl
#JobAcctGatherFrequency=energy=30
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherParams=UsePss,NoOverMemoryKill
JobAcctGatherType=jobacct_gather/cgroup
# Users have to be in the accounting database
# (otherwise we don't have accounting records and fairshare)
#AccountingStorageEnforce=associations
AccountingStorageEnforce=limits,qos # will also enable: associations
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=knyft.hpc.rug.nl
#AccountingStorageLoc=/var/log/slurm/slurm.accounting
#AccountingStoragePass=
#AccountingStorageUser=
MaxJobCount=100000
#
# Job profiling
#
#AcctGatherProfileType=acct_gather_profile/hdf5
#JobAcctGatherFrequency=30
#
# Health Check
#
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=300
#
# Partitions
#
EnforcePartLimits=YES
PartitionName=DEFAULT State=UP DefMemPerCPU=2000
PartitionName=short Nodes=pg-node[004-210] MaxTime=00:30:00 DefaultTime=00:30:00 AllowQOS=short SelectTypeParameters=CR_Core_Memory TRESBillingWeights="CPU=1.0,Mem=0.1875G" Priority=1
PartitionName=gpu Nodes=pg-gpu[01-06] MaxTime=3-00:00:00 DefaultTime=00:30:00 AllowQOS=gpu,gpulong SelectTypeParameters=CR_Socket_Memory TRESBillingWeights="CPU=1.0,Mem=0.1875G"
PartitionName=himem Nodes=pg-memory[01-07] MaxTime=10-00:00:00 DefaultTime=00:30:00 AllowQOS=himem,himemlong SelectTypeParameters=CR_Core_Memory TRESBillingWeights="CPU=1.0,Mem=0.0234375G"
PartitionName=target Nodes=pg-node[100-103] MaxTime=3-00:00:00 DefaultTime=00:30:00 AllowGroups=pg-gpfs,monk AllowQOS=target SelectTypeParameters=CR_Core_Memory TRESBillingWeights="CPU=1.0,Mem=0.1875G" Priority=2
#PartitionName=euclid Nodes=pg-node[161,162] MaxTime=10-00:00:00 DefaultTime=00:30:00 AllowGroups=beheer,f111959,f111867,p251204,f113751 AllowQOS=target SelectTypeParameters=CR_Core_Memory TRESBillingWeights="CPU=1.0,Mem=0.1875G"
PartitionName=regular Nodes=pg-node[004-099,104-210] MaxTime=10-00:00:00 DefaultTime=00:30:00 AllowQOS=regular,regularlong SelectTypeParameters=CR_Core_Memory TRESBillingWeights="CPU=1.0,Mem=0.1875G" Default=YES
#
# COMPUTE NODES
#
GresTypes=gpu
NodeName=pg-node[004-162] Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN RealMemory=128500 Feature=24cores,centos7
NodeName=pg-gpu[01-06] Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN RealMemory=128500 Gres=gpu:k40:2 Feature=24cores,centos7
NodeName=pg-memory[01-03] Sockets=4 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN RealMemory=1031500 Feature=48cores,centos7
NodeName=pg-memory[04-07] Sockets=4 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN RealMemory=2063500 Feature=48cores,centos7
NodeName=pg-node[163-210] Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 State=UNKNOWN RealMemory=128500 Feature=28cores,centos7

roles/slurm/files/slurm.service

@@ -0,0 +1,20 @@
[Unit]
Description=Slurm queue daemon
After=slurmdbd.service
Wants=slurmdbd.service
[Service]
TimeoutStartSec=0
Restart=always
ExecStartPre=-/usr/bin/docker stop %n
ExecStartPre=-/usr/bin/docker rm %n
# Run daemon in the foreground; systemd and docker do the daemonizing.
ExecStart=/usr/bin/docker run --hostname {{ ansible_fqdn }} --rm --name %n \
--network host \
--volume /srv/slurm/volumes/var/spool/slurm:/var/spool/slurm \
--volume /srv/slurm/volumes/etc/slurm:/etc/slurm \
--volumes-from munge.service \
hpc/slurm /runslurmctld.sh
[Install]
WantedBy=multi-user.target

roles/slurm/files/slurmdbd.conf

@@ -0,0 +1,29 @@
ArchiveEvents=no
ArchiveJobs=yes
ArchiveResvs=no
ArchiveSteps=no
ArchiveSuspend=no
#ArchiveScript=/usr/sbin/slurm.dbd.archive
AuthInfo=/var/run/munge/munge.socket.2
AuthType=auth/munge
DbdHost=knyft.hpc.rug.nl
DebugLevel=info #was: 4
# Temporarily increased to find cause of crashes
#DebugLevel=debug5
PurgeEventAfter=2month
PurgeJobAfter=12months
PurgeResvAfter=1month
PurgeStepAfter=3months
PurgeSuspendAfter=1month
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
SlurmUser=slurm
StorageHost=gospel.service.rug.nl
#StorageHost=172.23.38.125
StoragePort=3306
StoragePass={{ slurm_storage_pass }}
#StoragePass=geheim
StorageType=accounting_storage/mysql
StorageUser=slurmacc_pg
#StorageUser=root
StorageLoc=slurm_pg_accounting

roles/slurm/files/slurmdbd.service

@@ -0,0 +1,18 @@
[Unit]
Description=Slurm database daemon
After=munge.service
Requires=docker.service
[Service]
TimeoutStartSec=0
Restart=always
ExecStartPre=-/usr/bin/docker stop %n
ExecStartPre=-/usr/bin/docker rm %n
# Run daemon in the foreground; systemd and docker do the daemonizing.
ExecStart=/usr/bin/docker run --network host --rm --name %n \
--volume /srv/slurm/volumes/etc/slurm:/etc/slurm \
--volumes-from munge.service \
hpc/slurm /usr/sbin/slurmdbd -D
[Install]
WantedBy=multi-user.target

roles/slurm/files/ssmtp.conf

@@ -0,0 +1,4 @@
root=postmaster
Mailhub={{ mailhub }}
RewriteDomain={{ rewrite_domain }}
TLS_CA_File=/etc/pki/tls/certs/ca-bundle.crt

roles/slurm/tasks/main.yml

@@ -0,0 +1,107 @@
# Build and install a docker image for slurm.
---
- name: Install yum dependencies
  yum: name={{ item }} state=latest update_cache=yes
  with_items:
    - docker-ce
    - docker-python
    - ntp

- name: set selinux in permissive mode to allow docker volumes
  selinux:
    policy: targeted
    state: permissive

- name: install docker config
  template:
    src: files/daemon.json
    dest: /etc/docker/daemon.json

- name: make sure services are started
  systemd:
    name: "{{ item }}"
    state: started
  with_items:
    - docker
    - ntpd

- name: Make docker build dir
  file:
    path: /srv/slurm
    state: directory
    mode: 0755

- name: Make dirs to be used as volumes
  file:
    path: "/srv/slurm/volumes{{ item }}"
    state: directory
    mode: 0777
  with_items:
    - /var/spool/slurm
    - /etc/munge
    - /etc/slurm

- name: Install munge_keyfile
  copy:
    src: files/munge.key
    dest: /srv/slurm/volumes/etc/munge/munge.key

- name: install slurm config files
  template:
    src: files/{{ item }}
    dest: /srv/slurm/volumes/etc/slurm
  with_items:
    - slurm.conf
    - slurmdbd.conf
    - job_submit.lua

- name: install build files
  template:
    src: files/{{ item }}
    dest: /srv/slurm
  with_items:
    - Dockerfile
    - ldap.conf
    - nslcd.conf
    - pam_ldap.conf
    - runslurmctld.sh
    - nsswitch.conf
    - ssmtp.conf

- name: force (re)build slurm image
  docker_image:
    state: present
    force: yes
    path: /srv/slurm
    name: hpc/slurm
    nocache: yes
  tags:
    - build

- name: Install service files.
  template:
    src: files/{{ item }}
    dest: /etc/systemd/system/{{ item }}
    mode: 0644
    owner: root
    group: root
  with_items:
    - munge.service
    - slurmdbd.service
    - slurm.service
  tags:
    - service-files

- name: reload systemd to pick up the new service files
  command: systemctl daemon-reload

- name: make sure services are started.
  systemd:
    name: "{{ item }}"
    state: restarted
  with_items:
    - slurmdbd.service
    - munge.service
    - slurm.service  # slurmctld
  tags:
    - start-service

roles/slurm/vars/main.yml

@@ -0,0 +1,7 @@
$ANSIBLE_VAULT;1.1;AES256
31623737333935623739376631366131393038663161396361303463653639633430393037366335
6230313137353933323231613232366261666530393365610a636332616438386534663766343736
65626462303965653433646662666139343161656639643739643530363630376539323133396630
3737386464373064380a643964653433383239366639366330323539376631633738633531623235
30613534346361326265623663356637316266313331663537366136323162393230323562373537
6665353631383137323465633230613537323733396434633163

slurm.yml

@@ -0,0 +1,5 @@
---
- hosts: schedulers
  become: True
  roles:
    - slurm