Commit Graph

  • 9b3d8093a8 Changed minimum storage for trigger. master Egon Rijpkema 2022-07-07 14:48:27 +0200
  • bf0d61b21c Added pg-lustre.yml Egon Rijpkema 2022-07-07 14:44:31 +0200
  • c751276347 Added node exporter of pg lustre components. Egon Rijpkema 2022-07-05 14:29:09 +0200
  • 493afa29ae Merge pull request 'Remove ESX vulture nodes from Ansible hosts file' (#28) from fix/remove_vcpu_nodes into master E.M.A. Rijpkema 2022-03-29 16:18:15 +0200
  • 8833b57a68 remove esx nodes from vulture #28 B.E. Droge 2022-03-29 15:58:55 +0200
  • efe46a94f0 Merge pull request 'Removed vulture nodes that have been terminated.' (#27) from fix/remove-vulture into master B.E. Droge 2022-03-29 15:13:04 +0200
  • 9790dc00ae Removed vulture nodes that have been terminated. #27 Egon Rijpkema 2022-03-29 14:59:34 +0200
  • 002d629276 Merge pull request 'Remove ESX pg-nodes' (#26) from remove_esx_nodes into master H. Meijering 2022-03-25 15:55:20 +0100
  • d06255b17f remove esx pg-nodes #26 remove_esx_nodes B.E. Droge 2022-03-25 09:19:02 +0100
  • ffd0540f0d Removed nodes that are no longer in use. Egon Rijpkema 2022-03-24 14:52:17 +0100
  • 7da275e513 Merge pull request 'Added additional GPU nodes with ssd (labeled as nvme) disks.' (#25) from feature/gpu_nvme into master B.E. Droge 2022-03-03 15:59:43 +0100
  • 54657daab0 Added additional GPU nodes with ssd (labeled as nvme) disks. #25 F. Dijkstra 2022-03-03 15:19:00 +0100
  • 50539b1b2e Merge pull request 'Changed login_checks.sh to the version used in production and modified the quota check.' (#24) from feature/login_script_default_quota into master B.E. Droge 2022-03-02 14:46:24 +0100
  • 4cfa01b162 Changed the quota check to also set quota when the quota are very small. This allows for setting small default quota. #24 F. Dijkstra 2022-03-02 09:56:45 +0100
  • 7ec294f30b This is the actual login_checks.sh script which is in use on the Peregrine cluster. It is unclear where the previous version came from. F. Dijkstra 2022-03-02 09:42:53 +0100
  • 53f9a22938 New extreme load alert. Egon Rijpkema 2022-03-01 09:55:44 +0100
  • b1c883dbd1 Merge pull request 'Moved nodes with broken IB to vulture partition.' (#23) from slurm_21.08 into master E.M.A. Rijpkema 2022-02-22 16:55:36 +0100
  • e7d0ac6708 Moved nodes with broken IB to vulture partition. #23 F. Dijkstra 2022-02-22 16:48:00 +0100
  • df5dabb454 reqmem is now specified per job B.E. Droge 2022-02-01 11:37:30 +0100
  • 86b74f07ad Merge pull request 'Removed settings that are no longer available in Slurm 21.08' (#22) from slurm_21.08 into master G.J.C. Strikwerda 2022-01-25 09:06:53 +0100
  • 697ac013b7 Removed users from PrivateData, as this affects the coordinator role. #22 F. Dijkstra 2022-01-25 08:51:43 +0100
  • 120a0c4150 Removed settings that have been removed from Slurm 21.08 F. Dijkstra 2022-01-25 08:46:55 +0100
  • a9d1f4e5bd No 72h predictions on /var. #21 fix/no_alerts_var Egon Rijpkema 2022-01-06 09:20:37 +0100
  • 3320c4d570 Merge pull request 'Increased timeout for not using the GPU to 4 hours' (#20) from feature/increased_gpu_timeout into master B.E. Droge 2021-12-22 16:25:58 +0100
  • 0e1fc73cca Increased timeout for not using the GPU to 4 hours, since AlphaFold needs several hours to initialize when reading its data from Lustre, and we don't have alternative faster storage. #20 #19 F. Dijkstra 2021-12-22 16:21:08 +0100
  • 700c7fd0a6 Updated prometheus documentation a little. Egon Rijpkema 2021-12-21 11:42:22 +0100
  • 454061659a Merge pull request 'Add PrivateData setting to slurmdbd.conf and slurm.conf' (#18) from feature/privatedata into master E.M.A. Rijpkema 2021-12-16 10:16:39 +0100
  • 63d5c01d59 Added PrivateData setting to slurm.conf as setting it only in slurmdbd.conf was not sufficient. #18 F. Dijkstra 2021-12-15 14:08:39 +0100
  • fe09b7faf5 Added users to PrivateData, as usage by itself did not have the required effect. F. Dijkstra 2021-12-06 11:43:34 +0100
  • 80c0533eb6 Added the parameter PrivateData to prevent regular users from seeing the cluster accounting data of other users. F. Dijkstra 2021-12-06 10:08:06 +0100
  • 32dc935e4c Merge pull request 'Added tree to the list of tools.' (#17) from feature/tree into master G.J.C. Strikwerda 2021-11-25 09:45:53 +0100
  • 2b0c012502 Added tree to the list of tools. #17 feature/tree F. Dijkstra 2021-11-25 09:41:55 +0100
  • 5dc4274e96 Added new prometheus cert for knyft. Egon Rijpkema 2021-10-11 15:47:55 +0200
  • 210c8a6911 Made build work again. Egon Rijpkema 2021-09-09 16:59:42 +0200
  • c70e4a4af9 Lustre exporter is extremely verbose. Egon Rijpkema 2021-09-09 15:36:06 +0200
  • 7e43402cb0 set pg-node247 and 269 to FUTURE B.E. Droge 2021-07-27 17:11:52 +0200
  • 3e775df7a7 remove dh-node11 and 19 B.E. Droge 2021-07-27 17:06:39 +0200
  • 5074348f17 slurmd_restart should actually restart (not reload) slurmd root 2021-07-27 11:45:08 +0200
  • 6735ac1e69 add config tag to config-related steps, do restart of slurmd root 2021-07-27 11:04:28 +0200
  • a4cb09cd33 Merge branch 'master' of ssh://git.web.rug.nl:222/HPC/pg-playbooks B.E. Droge 2021-07-26 16:53:57 +0200
  • ecc56268c4 remove xdmod scripts B.E. Droge 2021-07-26 16:53:34 +0200
  • 39e4b8ad77 disable task affinity for cgroups B.E. Droge 2021-07-26 16:53:09 +0200
  • df5090ca69 decrease tmpdisk values to a close power of 10 B.E. Droge 2021-07-26 15:38:52 +0200
  • c29670ceaf Add TmpFS=/local and TmpDisk values for nodes B.E. Droge 2021-07-26 15:27:47 +0200
  • 41f075af42 fix syntax error root 2021-07-26 15:07:10 +0200
  • 1bb6ca0329 update db password root 2021-07-26 14:57:23 +0200
  • b965d07018 split single slurm logrotate setting into two separate ones root 2021-07-26 14:48:17 +0200
  • 56ac7e9194 fix deprecation warning for loop in yum module root 2021-07-26 14:42:32 +0200
  • a3eb7a3e72 bump slurm version root 2021-07-26 14:41:50 +0200
  • c2fc2e779a make slurm user owner of slurmdbd.conf B.E. Droge 2021-07-26 12:34:19 +0200
  • e7cf23fb7d change mode of slurmdbd.conf B.E. Droge 2021-07-26 12:28:56 +0200
  • 5e730f4364 move node 267 back to generic list with esx nodes B.E. Droge 2021-05-18 11:21:56 +0200
  • 6e684d9056 move nodes with broken ib to vulture, set merlin nodes to future B.E. Droge 2021-05-18 11:13:36 +0200
  • d2d799cf56 Do not crash when no usage data for a gpu is available. Egon Rijpkema 2021-05-07 14:24:29 +0200
  • e23f29f39e Added alerts for ceph health status. Egon Rijpkema 2021-05-04 11:44:40 +0200
  • 3d5120363e Scrape ceph on the merlin-management001 Egon Rijpkema 2021-05-04 10:43:05 +0200
  • dba8e6269b Added a slurmdbd storage pass Egon Rijpkema 2021-04-12 15:18:56 +0200
  • 53e4150e7e All the uncommitted files on xcat status-on-xcat root 2021-02-24 12:08:27 +0100
  • 335e60087c Merge pull request 'Use NodeSets in SLURM config' (#16) from nodesets into master E.M.A. Rijpkema 2021-01-05 09:26:57 +0000
  • 731fe3d802 Use nodesets, and move non-ib nodes to vulture #16 B.E. Droge 2021-01-05 10:19:14 +0100
  • 746d716385 modify link to scientific papers that acknowledge peregrine B.E. Droge 2020-12-16 16:52:17 +0100
  • c6fcf9ca27 When it is actually full, send an alert about /tmp Egon Rijpkema 2020-12-11 12:45:44 +0100
  • d801c55ff4 fix typo in lua script B.E. Droge 2020-11-20 17:18:42 +0100
  • 3e361b5cac set max_rpc_cnt=150 B.E. Droge 2020-11-20 14:07:52 +0100
  • 9347bd9238 set messagetimeout to 30 B.E. Droge 2020-11-20 12:00:00 +0100
  • 68806f782e set maxnodes=1 for gpushort B.E. Droge 2020-11-20 10:06:57 +0100
  • 9731cc7b04 Merge pull request 'Added gpushort partition and removed pg-gpu06 from the list of nodes.' (#15) from gpushort into master B.E. Droge 2020-11-19 10:50:40 +0000
  • c5722f5c89 Merge pull request 'Moved the location of the job private temporary directory from /local to /local/tmp.' (#14) from localdir into master B.E. Droge 2020-11-19 10:49:38 +0000
  • 9267d5ffbc Added missing plugstack.conf change. The private tmpdir is now taken from /local/tmp instead of /local. #14 F. Dijkstra 2020-11-18 17:27:58 +0100
  • 90a5552b47 Changed the limit for short jobs in the gpu partition to 2 hours, in line with the gpushort partition. #15 F. Dijkstra 2020-11-18 17:22:36 +0100
  • ae3532d3f8 Added 2nd node to gpushort partition. Removed pg-gpu06 from the gpu partitions, since it is out of production and used as an AMD GPU test machine. F. Dijkstra 2020-11-17 17:38:19 +0100
  • 0bc104ad17 Fixed a typo. F. Dijkstra 2020-11-17 17:35:29 +0100
  • 795577652c Added gpushort partition and corresponding qos. This to be able to reserve a few nodes for short jobs. F. Dijkstra 2020-11-17 17:32:29 +0100
  • 5599223993 Moved the location of the job private temporary directory from /local to /local/tmp. This allows a second private directory in /local, which has the same path on all nodes even when using scp or ssh to reach that node. This directory can be reached using $LOCALDIR and can be used as a job-private, node-local scratch directory for multinode jobs. F. Dijkstra 2020-11-17 13:13:19 +0100
  • 1e7d18fbde The node exporters for dh should be included as well... Egon Rijpkema 2020-10-27 16:59:42 +0100
  • 0f3a52e65a removed nodes that we don't care about. (for now) Egon Rijpkema 2020-10-15 09:29:25 +0200
  • 956197b186 fix types B.E. Droge 2020-09-15 15:32:56 +0200
  • 06fa1cb16b Additional check for memory usage B.E. Droge 2020-09-15 14:14:29 +0200
  • 1256535d27 Additional check for memory usage B.E. Droge 2020-09-15 14:12:31 +0200
  • bbf79f38bc added xdmod prolog/epilog root 2020-08-13 11:27:40 +0200
  • d241bbbb2b Move kill_invalid_depend to DependencyParameters B.E. Droge 2020-08-12 15:56:25 +0200
  • 980cd6628c add nvme flag to gpu22 root 2020-08-12 14:34:49 +0200
  • 3c87f0a6cf better task names, make .d dirs first root 2020-08-11 15:25:08 +0200
  • 501fd9d992 only run slurm client playbook on non-storage nodes root 2020-08-11 15:18:28 +0200
  • 0134eefb03 Don't copy taskprolog to prolog.d root 2020-08-11 08:24:46 +0200
  • 9b4a4ae829 split epilog and prolog root 2020-08-10 14:03:10 +0200
  • 86af4f5ea8 Prolog/epilog fix B.E. Droge 2020-08-10 13:46:31 +0200
  • 3030b5825e Move some slurm files from templates to files B.E. Droge 2020-08-10 13:43:15 +0200
  • f447ad2fef Use copy module instead of template for prologs/epilogs B.E. Droge 2020-08-10 13:39:58 +0200
  • 97fc1024ba Move prolog and epilog scripts to .d directories B.E. Droge 2020-08-10 13:30:44 +0200
  • 4397c3f9c4 Move prolog and epilog scripts to .d directories B.E. Droge 2020-08-10 13:30:35 +0200
  • 6a5e205bd8 Removed MemLimitEnforce from config, deprecated in SLURM 20 B.E. Droge 2020-08-10 11:35:04 +0200
  • fd59c0dc98 Removed FastSchedule from config, removed in SLURM 20 B.E. Droge 2020-08-10 11:33:42 +0200
  • be1b27bfd0 Modified name of Euclid CVMFS package B.E. Droge 2020-06-11 16:35:30 +0200
  • 34330e91aa Install texlive on all nodes B.E. Droge 2020-06-11 16:06:28 +0200
  • ced403a61d dh-node19 is Cobbler test node. Egon Rijpkema 2020-06-10 09:31:36 +0200
  • c0021bcb1b Added pack_serial_at_end to SchedulerParameters which should improve the scheduling for parallel jobs. F. Dijkstra 2020-06-09 13:43:35 +0200
  • fba7ddf740 Changed threshold for regular qos B.E. Droge 2020-06-09 13:13:55 +0200
  • fd326603d2 Take average usage when multiple gpus are present. Egon Rijpkema 2020-06-09 10:34:39 +0200
  • b069bb58f8 really really remove fuse. Egon Rijpkema 2020-05-27 11:06:52 +0200