276 Commits (master)

Author SHA1 Message Date
Egon Rijpkema 9b3d8093a8 Changed minimum storage for trigger. 1 month ago
Egon Rijpkema bf0d61b21c Added pg-lustre.yml 1 month ago
Egon Rijpkema c751276347 Added node exporter of pg lustre components. 1 month ago
Egon Rijpkema 9790dc00ae Removed vulture nodes that have been terminated. 5 months ago
B.E. Droge d06255b17f remove esx pg-nodes 5 months ago
Egon Rijpkema ffd0540f0d Removed nodes that are no longer in use. 5 months ago
F. Dijkstra 54657daab0 Added additional GPU nodes with ssd (labeled as nvme) disks. 5 months ago
F. Dijkstra 4cfa01b162 Changed the quota check to also set quota when the quota are very 6 months ago
F. Dijkstra 7ec294f30b This is the actual login_checks.sh script which is in use on the 6 months ago
Egon Rijpkema 53f9a22938 New extreme load alert. 6 months ago
F. Dijkstra e7d0ac6708 Moved nodes with broken IB to vulture partition. 6 months ago
B.E. Droge df5dabb454 reqmem is now specified per job 6 months ago
F. Dijkstra 697ac013b7 Removed users from PrivateData, as this affects the coordinator role. 7 months ago
F. Dijkstra 120a0c4150 Removed settings that have been removed from Slurm 21.08 7 months ago
F. Dijkstra 0e1fc73cca Increased timeout for not using the GPU to 4 hours, since 8 months ago
F. Dijkstra 63d5c01d59 Added PrivateData setting to slurm.conf as setting it only in 8 months ago
F. Dijkstra fe09b7faf5 Added users to PrivateData, as usage on itself did not have the 8 months ago
F. Dijkstra 80c0533eb6 Added the parameter PrivateData to prevent regular users from seeing 8 months ago
Egon Rijpkema 5dc4274e96 Added new prometheus cert for knyft. 10 months ago
Egon Rijpkema c70e4a4af9 Lustre exporter is extremely verbose. 11 months ago
B.E. Droge 7e43402cb0 set pg-node247 and 269 to FUTURE 1 year ago
B.E. Droge 3e775df7a7 remove dh-node11 and 19 1 year ago
root 5074348f17 slurmd_restart should actually restart (not reload) slurmd 1 year ago
root 6735ac1e69 add config tag to config-related steps, do restart of slurmd 1 year ago
B.E. Droge ecc56268c4 remove xdmod scripts 1 year ago
B.E. Droge 39e4b8ad77 disable task affinity for cgroups 1 year ago
B.E. Droge df5090ca69 decrease tmpdisk values to a close power of 10 1 year ago
B.E. Droge c29670ceaf Add TmpFS=/local and TmpDisk values for nodes 1 year ago
root 41f075af42 fix syntax error 1 year ago
root 1bb6ca0329 update db password 1 year ago
root b965d07018 split single slurm logrotate setting into two separate ones 1 year ago
root 56ac7e9194 fix deprecation warning for loop in yum module 1 year ago
root a3eb7a3e72 bump slurm version 1 year ago
B.E. Droge c2fc2e779a make slurm user owner of slurmdbd.conf 1 year ago
B.E. Droge e7cf23fb7d change mode of slurmdbd.conf 1 year ago
B.E. Droge 5e730f4364 move node 267 back to generic list with esx nodes 1 year ago
B.E. Droge 6e684d9056 move nodes with broken ib to vulture, set merlin nodes to future 1 year ago
Egon Rijpkema d2d799cf56 Do not crash when no usage data for a gpu is available. 1 year ago
Egon Rijpkema e23f29f39e Added alerts for ceph health status. 1 year ago
Egon Rijpkema 3d5120363e Scrape ceph on the merlin-management001 1 year ago
B.E. Droge 731fe3d802 Use nodesets, and move non-ib nodes to vulture 2 years ago
B.E. Droge 746d716385 modify link to scientific papers that acknowledge peregrine 2 years ago
Egon Rijpkema c6fcf9ca27 When it is actually full, send an alert about /tmp 2 years ago
B.E. Droge d801c55ff4 fix typo in lua script 2 years ago
B.E. Droge 3e361b5cac set max_rpc_cnt=150 2 years ago
B.E. Droge 9347bd9238 set messagetimeout to 30 2 years ago
B.E. Droge 68806f782e set maxnodes=1 for gpushort 2 years ago
F. Dijkstra 9267d5ffbc Added missing plugstack.conf change. The private tmpdir is now taken 2 years ago
F. Dijkstra 90a5552b47 Changed the limit for short jobs in the gpu partition to 2 hours, 2 years ago
F. Dijkstra ae3532d3f8 Added 2nd node to gpushort partition. 2 years ago