Egon Rijpkema
|
9b3d8093a8
|
Changed minimum storage for trigger.
|
1 month ago |
Egon Rijpkema
|
bf0d61b21c
|
Added pg-lustre.yml
|
1 month ago |
Egon Rijpkema
|
c751276347
|
Added node exporter of pg lustre components.
|
1 month ago |
Egon Rijpkema
|
9790dc00ae
|
Removed vulture nodes that have been terminated.
|
5 months ago |
B.E. Droge
|
d06255b17f
|
remove esx pg-nodes
|
5 months ago |
Egon Rijpkema
|
ffd0540f0d
|
Removed nodes that are no longer in use.
|
5 months ago |
F. Dijkstra
|
54657daab0
|
Added additional GPU nodes with ssd (labeled as nvme) disks.
|
5 months ago |
F. Dijkstra
|
4cfa01b162
|
Changed the quota check to also set quota when the quota are very
small. This allows for setting small default quota.
|
6 months ago |
F. Dijkstra
|
7ec294f30b
|
This is the actual login_checks.sh script which is in use on the
Peregrine cluster. It is unclear where the previous version came
from.
|
6 months ago |
Egon Rijpkema
|
53f9a22938
|
New extreme load alert.
|
6 months ago |
F. Dijkstra
|
e7d0ac6708
|
Moved nodes with broken IB to vulture partition.
|
6 months ago |
B.E. Droge
|
df5dabb454
|
reqmem is now specified per job
|
6 months ago |
F. Dijkstra
|
697ac013b7
|
Removed users from PrivateData, as this affects the coordinator role.
|
7 months ago |
F. Dijkstra
|
120a0c4150
|
Removed settings that have been removed from Slurm 21.08
|
7 months ago |
F. Dijkstra
|
0e1fc73cca
|
Increased timeout for not using the GPU to 4 hours, since
AlphaFold needs several hours to initialize when reading its
data from Lustre, and we don't have alternative faster storage.
|
8 months ago |
F. Dijkstra
|
63d5c01d59
|
Added PrivateData setting to slurm.conf as setting it only in
slurmdbd.conf was not sufficient.
|
8 months ago |
F. Dijkstra
|
fe09b7faf5
|
Added users to PrivateData, as usage by itself did not have the
required effect.
|
8 months ago |
F. Dijkstra
|
80c0533eb6
|
Added the parameter PrivateData to prevent regular users from seeing
the cluster accounting data of other users.
|
8 months ago |
Egon Rijpkema
|
5dc4274e96
|
Added new prometheus cert for knyft.
Not used in playbook.... yet...
|
10 months ago |
Egon Rijpkema
|
c70e4a4af9
|
Lustre exporter is extremely verbose.
We removed all the stdout logging.
|
11 months ago |
B.E. Droge
|
7e43402cb0
|
set pg-node247 and 269 to FUTURE
|
1 year ago |
B.E. Droge
|
3e775df7a7
|
remove dh-node11 and 19
|
1 year ago |
root
|
5074348f17
|
slurmd_restart should actually restart (not reload) slurmd
|
1 year ago |
root
|
6735ac1e69
|
add config tag to config-related steps, do restart of slurmd
|
1 year ago |
B.E. Droge
|
ecc56268c4
|
remove xdmod scripts
|
1 year ago |
B.E. Droge
|
39e4b8ad77
|
disable task affinity for cgroups
|
1 year ago |
B.E. Droge
|
df5090ca69
|
decrease tmpdisk values to a close power of 10
|
1 year ago |
B.E. Droge
|
c29670ceaf
|
Add TmpFS=/local and TmpDisk values for nodes
|
1 year ago |
root
|
41f075af42
|
fix syntax error
|
1 year ago |
root
|
1bb6ca0329
|
update db password
|
1 year ago |
root
|
b965d07018
|
split single slurm logrotate setting into two separate ones
|
1 year ago |
root
|
56ac7e9194
|
fix deprecation warning for loop in yum module
|
1 year ago |
root
|
a3eb7a3e72
|
bump slurm version
|
1 year ago |
B.E. Droge
|
c2fc2e779a
|
make slurm user owner of slurmdbd.conf
|
1 year ago |
B.E. Droge
|
e7cf23fb7d
|
change mode of slurmdbd.conf
|
1 year ago |
B.E. Droge
|
5e730f4364
|
move node 267 back to generic list with esx nodes
|
1 year ago |
B.E. Droge
|
6e684d9056
|
move nodes with broken ib to vulture, set merlin nodes to future
|
1 year ago |
Egon Rijpkema
|
d2d799cf56
|
Do not crash when no usage data for a gpu is available.
|
1 year ago |
Egon Rijpkema
|
e23f29f39e
|
Added alerts for ceph health status.
|
1 year ago |
Egon Rijpkema
|
3d5120363e
|
Scrape ceph on the merlin-management001
|
1 year ago |
B.E. Droge
|
731fe3d802
|
Use nodesets, and move non-ib nodes to vulture
|
2 years ago |
B.E. Droge
|
746d716385
|
modify link to scientific papers that acknowledge peregrine
|
2 years ago |
Egon Rijpkema
|
c6fcf9ca27
|
When it is actually full, send an alert about /tmp.
We still omit it from the prediction alerts because we don't like
getting alerts...
|
2 years ago |
B.E. Droge
|
d801c55ff4
|
fix typo in lua script
|
2 years ago |
B.E. Droge
|
3e361b5cac
|
set max_rpc_cnt=150
|
2 years ago |
B.E. Droge
|
9347bd9238
|
set messagetimeout to 30
|
2 years ago |
B.E. Droge
|
68806f782e
|
set maxnodes=1 for gpushort
|
2 years ago |
F. Dijkstra
|
9267d5ffbc
|
Added missing plugstack.conf change. The private tmpdir is now taken
from /local/tmp instead of /local.
|
2 years ago |
F. Dijkstra
|
90a5552b47
|
Changed the limit for short jobs in the gpu partition to 2 hours,
in line with the gpushort partition.
|
2 years ago |
F. Dijkstra
|
ae3532d3f8
|
Added 2nd node to gpushort partition.
Removed pg-gpu06 from the gpu partitions, since it is out of production
and used as an AMD GPU test machine.
|
2 years ago |