Egon Rijpkema
|
a9d1f4e5bd
|
No 72h predictions on /var.
This disables the 72h disk full prediction on /var. This was done
because it led to false positives. The 8h prediction and diskfull alert
are kept.
|
6 months ago |
B.E. Droge
|
3320c4d570
|
Merge pull request 'Increased timeout for not using the GPU to 4 hours' (#20) from feature/increased_gpu_timeout into master
Reviewed-on: #20
|
6 months ago |
F. Dijkstra
|
0e1fc73cca
|
Increased timeout for not using the GPU to 4 hours, since
AlphaFold needs several hours to initialize when reading its
data from Lustre, and we don't have alternative faster storage.
|
6 months ago |
Egon Rijpkema
|
700c7fd0a6
|
Updated prometheus documentation a little.
|
6 months ago |
E.M.A. Rijpkema
|
454061659a
|
Merge pull request 'Add PrivateData setting to slurmdbd.conf and slurm.conf' (#18) from feature/privatedata into master
Reviewed-on: #18
|
6 months ago |
F. Dijkstra
|
63d5c01d59
|
Added PrivateData setting to slurm.conf as setting it only in
slurmdbd.conf was not sufficient.
|
6 months ago |
F. Dijkstra
|
fe09b7faf5
|
Added users to PrivateData, as usage by itself did not have the
required effect.
|
7 months ago |
F. Dijkstra
|
80c0533eb6
|
Added the parameter PrivateData to prevent regular users from seeing
the cluster accounting data of other users.
|
7 months ago |
G.J.C. Strikwerda
|
32dc935e4c
|
Merge pull request 'Added tree to the list of tools.' (#17) from feature/tree into master
Reviewed-on: #17
|
7 months ago |
F. Dijkstra
|
2b0c012502
|
Added tree to the list of tools.
|
7 months ago |
Egon Rijpkema
|
5dc4274e96
|
Added new prometheus cert for knyft.
Not used in the playbook... yet.
|
9 months ago |
Egon Rijpkema
|
210c8a6911
|
Made build work again.
TODO: Find a better fork of lustre-exporter
|
10 months ago |
Egon Rijpkema
|
c70e4a4af9
|
Lustre exporter is extremely verbose.
We removed all the stdout logging.
|
10 months ago |
B.E. Droge
|
7e43402cb0
|
set pg-node247 and 269 to FUTURE
|
11 months ago |
B.E. Droge
|
3e775df7a7
|
remove dh-node11 and 19
|
11 months ago |
root
|
5074348f17
|
slurmd_restart should actually restart (not reload) slurmd
|
11 months ago |
root
|
6735ac1e69
|
add config tag to config-related steps, do restart of slurmd
|
11 months ago |
B.E. Droge
|
a4cb09cd33
|
Merge branch 'master' of ssh://git.web.rug.nl:222/HPC/pg-playbooks
|
11 months ago |
B.E. Droge
|
ecc56268c4
|
remove xdmod scripts
|
11 months ago |
B.E. Droge
|
39e4b8ad77
|
disable task affinity for cgroups
|
11 months ago |
B.E. Droge
|
df5090ca69
|
decrease tmpdisk values to a close power of 10
|
11 months ago |
B.E. Droge
|
c29670ceaf
|
Add TmpFS=/local and TmpDisk values for nodes
|
11 months ago |
root
|
41f075af42
|
fix syntax error
|
11 months ago |
root
|
1bb6ca0329
|
update db password
|
11 months ago |
root
|
b965d07018
|
split single slurm logrotate setting into two separate ones
|
11 months ago |
root
|
56ac7e9194
|
fix deprecation warning for loop in yum module
|
11 months ago |
root
|
a3eb7a3e72
|
bump slurm version
|
11 months ago |
B.E. Droge
|
c2fc2e779a
|
make slurm user owner of slurmdbd.conf
|
11 months ago |
B.E. Droge
|
e7cf23fb7d
|
change mode of slurmdbd.conf
|
11 months ago |
B.E. Droge
|
5e730f4364
|
move node 267 back to generic list with esx nodes
|
1 year ago |
B.E. Droge
|
6e684d9056
|
move nodes with broken ib to vulture, set merlin nodes to future
|
1 year ago |
Egon Rijpkema
|
d2d799cf56
|
Do not crash when no usage data for a gpu is available.
|
1 year ago |
Egon Rijpkema
|
e23f29f39e
|
Added alerts for ceph health status.
|
1 year ago |
Egon Rijpkema
|
3d5120363e
|
Scrape ceph on the merlin-management001
|
1 year ago |
Egon Rijpkema
|
dba8e6269b
|
Added a slurmdbd storage pass
|
1 year ago |
E.M.A. Rijpkema
|
335e60087c
|
Merge pull request 'Use NodeSets in SLURM config' (#16) from nodesets into master
|
1 year ago |
B.E. Droge
|
731fe3d802
|
Use nodesets, and move non-ib nodes to vulture
|
1 year ago |
B.E. Droge
|
746d716385
|
modify link to scientific papers that acknowledge peregrine
|
2 years ago |
Egon Rijpkema
|
c6fcf9ca27
|
When it is actually full, send an alert about /tmp.
We still omit it from the prediction alerts because we don't like
getting alerts...
|
2 years ago |
B.E. Droge
|
d801c55ff4
|
fix typo in lua script
|
2 years ago |
B.E. Droge
|
3e361b5cac
|
set max_rpc_cnt=150
|
2 years ago |
B.E. Droge
|
9347bd9238
|
set messagetimeout to 30
|
2 years ago |
B.E. Droge
|
68806f782e
|
set maxnodes=1 for gpushort
|
2 years ago |
B.E. Droge
|
9731cc7b04
|
Merge pull request 'Added gpushort partition and removed pg-gpu06 from the list of nodes.' (#15) from gpushort into master
|
2 years ago |
B.E. Droge
|
c5722f5c89
|
Merge pull request 'Moved the location of the job private temporary directory from /local to /local/tmp.' (#14) from localdir into master
|
2 years ago |
F. Dijkstra
|
9267d5ffbc
|
Added missing plugstack.conf change. The private tmpdir is now taken
from /local/tmp instead of /local.
|
2 years ago |
F. Dijkstra
|
90a5552b47
|
Changed the limit for short jobs in the gpu partition to 2 hours,
in line with the gpushort partition.
|
2 years ago |
F. Dijkstra
|
ae3532d3f8
|
Added 2nd node to gpushort partition.
Removed pg-gpu06 from the gpu partitions, since it is out of production
and used as an AMD GPU test machine.
|
2 years ago |
F. Dijkstra
|
0bc104ad17
|
Fixed a typo.
|
2 years ago |
F. Dijkstra
|
795577652c
|
Added gpushort partition and corresponding qos. This makes it possible
to reserve a few nodes for short jobs.
|
2 years ago |