B.E. Droge
5e730f4364
move node 267 back to generic list with esx nodes
1 year ago
B.E. Droge
6e684d9056
move nodes with broken ib to vulture, set merlin nodes to future
1 year ago
Egon Rijpkema
d2d799cf56
Do not crash when no usage data for a gpu is available.
1 year ago
Egon Rijpkema
e23f29f39e
Added alerts for ceph health status.
1 year ago
Egon Rijpkema
3d5120363e
Scrape ceph on the merlin-management001
1 year ago
Egon Rijpkema
dba8e6269b
Added a slurmdbd storage pass
1 year ago
E.M.A. Rijpkema
335e60087c
Merge pull request 'Use NodeSets in SLURM config' (#16) from nodesets into master
2 years ago
B.E. Droge
731fe3d802
Use nodesets, and move non-ib nodes to vulture
2 years ago
B.E. Droge
746d716385
modify link to scientific papers that acknowledge peregrine
2 years ago
Egon Rijpkema
c6fcf9ca27
When it is actually full, send an alert about /tmp
we still omit it from the prediction alerts because we don't like
getting alerts...
2 years ago
B.E. Droge
d801c55ff4
fix typo in lua script
2 years ago
B.E. Droge
3e361b5cac
set max_rpc_cnt=150
2 years ago
B.E. Droge
9347bd9238
set messagetimeout to 30
2 years ago
B.E. Droge
68806f782e
set maxnodes=1 for gpushort
2 years ago
B.E. Droge
9731cc7b04
Merge pull request 'Added gpushort partition and removed pg-gpu06 from the list of nodes.' (#15) from gpushort into master
2 years ago
B.E. Droge
c5722f5c89
Merge pull request 'Moved the location of the job private temporary directory from /local to /local/tmp.' (#14) from localdir into master
2 years ago
F. Dijkstra
9267d5ffbc
Added missing plugstack.conf change. The private tmpdir is now taken
from /local/tmp instead of /local.
2 years ago
F. Dijkstra
90a5552b47
Changed the limit for short jobs in the gpu partition to 2 hours,
in line with the gpushort partition.
2 years ago
F. Dijkstra
ae3532d3f8
Added 2nd node to gpushort partition.
Removed pg-gpu06 from the gpu partitions, since it is out of production
and used as an AMD GPU test machine.
2 years ago
F. Dijkstra
0bc104ad17
Fixed a typo.
2 years ago
F. Dijkstra
795577652c
Added gpushort partition and corresponding qos. This makes it possible to
reserve a few nodes for short jobs.
2 years ago
F. Dijkstra
5599223993
Moved the location of the job private temporary directory from /local to /local/tmp.
This allows for a second private directory in /local, which has the same path
on all nodes, even when using scp or ssh to reach that node.
This directory is available as $LOCALDIR and can be used as a job-private,
node-local scratch directory for multinode jobs.
2 years ago
Egon Rijpkema
1e7d18fbde
The node exporters for dh should be included as well...
2 years ago
Egon Rijpkema
0f3a52e65a
removed nodes that we don't care about. (for now)
2 years ago
B.E. Droge
956197b186
fix types
2 years ago
B.E. Droge
06fa1cb16b
Additional check for memory usage
2 years ago
B.E. Droge
1256535d27
Additional check for memory usage
2 years ago
root
bbf79f38bc
added xdmod prolog/epilog
2 years ago
B.E. Droge
d241bbbb2b
Move kill_invalid_depend to DependencyParameters
2 years ago
root
980cd6628c
add nvme flag to gpu22
2 years ago
root
3c87f0a6cf
better task names, make .d dirs first
2 years ago
root
501fd9d992
only run slurm client playbook on non-storage nodes
2 years ago
root
0134eefb03
Don't copy taskprolog to prolog.d
2 years ago
root
9b4a4ae829
split epilog and prolog
2 years ago
B.E. Droge
86af4f5ea8
Prolog/epilog fix
2 years ago
B.E. Droge
3030b5825e
Move some slurm files from templates to files
2 years ago
B.E. Droge
f447ad2fef
Use copy module instead of template for prologs/epilogs
2 years ago
B.E. Droge
97fc1024ba
Move prolog and epilog scripts to .d directories
2 years ago
B.E. Droge
4397c3f9c4
Move prolog and epilog scripts to .d directories
2 years ago
B.E. Droge
6a5e205bd8
Removed MemLimitEnforce from config, deprecated in SLURM 20
2 years ago
B.E. Droge
fd59c0dc98
Removed FastSchedule from config, removed in SLURM 20
2 years ago
B.E. Droge
be1b27bfd0
Modified name of Euclid CVMFS package
2 years ago
B.E. Droge
34330e91aa
Install texlive on all nodes
2 years ago
Egon Rijpkema
ced403a61d
dh-node19 is Cobbler test node.
2 years ago
F. Dijkstra
c0021bcb1b
Added pack_serial_at_end to SchedulerParameters which should improve
the scheduling for parallel jobs.
2 years ago
B.E. Droge
fba7ddf740
Changed threshold for regular qos
2 years ago
Egon Rijpkema
fd326603d2
Take average usage when multiple gpus are present.
Also removed double return.
2 years ago
Egon Rijpkema
b069bb58f8
really really remove fuse.
2 years ago
root
24e26c3ca8
Increased weight of job age parameter to prevent job starvation.
2 years ago
root
077218bbfb
Added nvme label to GPU nodes with nvme enabled /local
2 years ago