Egon Rijpkema
1ead4c42d1
Made the search a little less naive...
to prevent false positives.
4 years ago
Egon Rijpkema
6b8503d559
Prevent error mail when flag file does not exist.
4 years ago
Egon Rijpkema
43cfd2ccff
Prevent sending a mail every minute.
4 years ago
Egon Rijpkema
102c7e622b
Updated mail address.
(I didn't get spammed, so I dare to let it mail to root@peregrine.hpc.rug.nl)
4 years ago
Egon Rijpkema
f37aedf6ff
Refactored the touch alert into its own role.
I did this because I want to use it on the interactive and login nodes.
4 years ago
Egon Rijpkema
f42c6b4f74
Updated deprecated include to import_tasks
4 years ago
Egon Rijpkema
ba0f6445b7
Merge branch 'feature/find-alert'
4 years ago
Egon Rijpkema
63a5816cef
Added ansible automation.
4 years ago
Egon Rijpkema
5545baf9f4
Alerting script
name needs to change
Cronjob needs to be added.
automation needs to be done
4 years ago
Egon Rijpkema
75a9a492d1
don't commit pyc files
5 years ago
Egon Rijpkema
603e0fd72c
Cleaner json and less verbose logging
5 years ago
Egon Rijpkema
101a0d30e6
Cleanup based on atime instead of mtime.
5 years ago
B.E. Droge
320221414b
Increased Waittime to 30 to prevent srun from killing tasks too soon
5 years ago
Egon Rijpkema
a688301c7b
Only monitor lustre mounts on the login node.
(to prevent duplicate warnings)
5 years ago
Egon Rijpkema
c99adf250e
Added this to node_exporter.yml
5 years ago
Egon Rijpkema
1c5fba8685
Merge branch 'feature/prometheus-monitoring'
5 years ago
Egon Rijpkema
182ab6c0c0
Script to convert output of PushProx/clients
5 years ago
Egon Rijpkema
524f467d12
Added node exporter, a prometheus proxy and...
proxy client.
This should allow us to monitor the peregrine cluster.
Bind to 0.0.0.0
Add the client and node_exporter in one yaml
5 years ago
Egon Rijpkema
4e011d35e1
Renamed node.yml to common.yml
5 years ago
Egon Rijpkema
3e5815ad0a
Switched to singular.
5 years ago
Egon Rijpkema
99a07fdc1b
These tasks are for all peregrine hosts.
5 years ago
Egon Rijpkema
9a812c8006
Now running as Slurm user, no hourly restart needed.
5 years ago
Egon Rijpkema
310627bddc
Pinned the uid of the slurm user.
5 years ago
Egon Rijpkema
4e05c244d4
We're using singular group names now.
5 years ago
Egon Rijpkema
723e0a2a3b
File mode should be specified as octal
https://github.com/ansible/ansible/issues/23491
5 years ago
Egon Rijpkema
191a595f17
Remove plural in roles.
5 years ago
Egon Rijpkema
8496c17105
Added tmp dir things to common roles
also running this on all nodes.
5 years ago
Egon Rijpkema
190b1d4b18
Set the tempdir to /local/tmp...
on login and interactive nodes
5 years ago
Egon Rijpkema
8c8ff1f28f
Cronjob files are overwritten by ansible...
5 years ago
E.M.A. Rijpkema
8b2e347fc0
Merge branch 'feature/cleanup-temp' of HPC/pg-playbooks into master
5 years ago
Egon Rijpkema
e5c88ca17a
Changed mtime into 10 for /tmp as suggested.
5 years ago
Egon Rijpkema
c720417e65
cronjob to cleanup temp dirs
5 years ago
Egon Rijpkema
e07db38cd5
Merge branch 'feature/dockerized-slurm'
5 years ago
Egon Rijpkema
1ec290ddc1
Added workaround for slurm user accounts.
Slurm will be restarted every hour.
5 years ago
Egon Rijpkema
e855b05e3b
recipe to build slurm docker
trick slurmdb and slurm into connecting
added separate munge service
added tag to docker build step
bypass selinux to let docker volumes work
removed duplicate slurmdbd start
added munge key generation
polishing
phasing out test system
Added munge files.
Munge will be run in a separate container.
Added ldap client to the docker image.
Needs to be tested though
Encrypted slurmdbd secret
Added firewall and Postfix
Needed for slurm but can probably be used elsewhere.
Set selinux to permissive to allow docker volumes.
and make starting the services optional.
Without host networking, but that does not seem to work.
Obtained changes from live pg-scheduler
Changes after testing on production...
changed timezone to CEST
clean up reference to test system
Updated readme
Fixed mail from docker
This change installs ssmtp to connect to a mailserver. Required settings
are to be entered in the hosts file.
updated hostname for mail to work
Added monk user and group
Fixed syntax error
Updated docker and changed Requires into Wants.
Added a tag to the service files.
Changed to overlay driver as it is possible now.
also added various config files for slurm.
remove obsolete var/spool/slurm mount in db docker
5 years ago
Egon Rijpkema
32c0377bca
Added timeout to scratch cleanup.
5 years ago
Egon Rijpkema
12eef11127
Find -delete -type f still deletes directories.
5 years ago
E.M.A. Rijpkema
ff67ffe14b
Merge branch 'feature/load-nvidia-module' of HPC/pg-playbooks into master
5 years ago
Egon Rijpkema
aacb5c3d22
added -pm ENABLED flag
this prevents reloading of the module after each job start.
5 years ago
Egon Rijpkema
98127fd145
Systemd unit file that will load nvidia module
If the module is not loaded, slurm will not start on the gpu nodes.
5 years ago
Egon Rijpkema
f72265905d
Changed cronjob to run at night.
5 years ago
Egon Rijpkema
916ee9688f
accidentally also deleted directories
5 years ago
Egon Rijpkema
c72d272d08
cronfile and logger now have same name.
5 years ago
Egon Rijpkema
690e648200
switched to syslog instead of mailing
5 years ago
Egon Rijpkema
c098d9ed38
Passwords should not be committed.
5 years ago
Egon Rijpkema
964ee64cdc
forgotten this one
5 years ago
Egon Rijpkema
a60ce419ca
added gitignore
5 years ago
Egon Rijpkema
621b1966e9
Changed day of run from Saturday to Monday morning.
5 years ago
Egon Rijpkema
61d9ea3043
cronjob to purge /scratch
5 years ago
Egon Rijpkema
1d8fa6f087
ansible config to reach pg components
5 years ago