
Started documenting prometheus.

pull/8/head
Egon Rijpkema 4 years ago
parent
commit
071791aa72
  1. documentation/prometheus.md (17 changed lines)
  2. documentation/prometheus.png (binary)
  3. readme.md (2 changed lines)

17
documentation/prometheus.md

@@ -0,0 +1,17 @@
# Prometheus
Below is a picture of the current prometheus monitoring setup on gearshift. Our setup consists of the following components:
![Gross simplification](./prometheus.png)
## Node exporter
Each peregrine node has a node exporter running. It was installed using the node exporter.yml playbook in the root of this repository.
This playbook applies the node exporter, which does little more than copy the binary (from promtools/results) to the node and install a systemd unit file there. The node exporter listens for requests on port 9100 on each node.
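
To check by hand what a node exporter exposes, its output can be read with a few lines of Python. This is a minimal sketch, assuming the exporter serves the standard prometheus text format on `/metrics` and does not append timestamps to its samples. The function names and the hostname in the usage note are hypothetical.

```python
from urllib.request import urlopen


def fetch_metrics(host, port=9100):
    """Download the raw metrics text from a node exporter (needs network access)."""
    with urlopen(f"http://{host}:{port}/metrics", timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text):
    """Turn exposition-format text into a {series: value} dictionary."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: no trailing timestamp, so the value follows the last space.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples
```

For example, `parse_metrics(fetch_metrics("node01"))["node_load1"]` would give the one-minute load of a node called `node01`, if such a host exists and is reachable.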
## Prometheus server
The server runs in a docker container on knyft. It was installed using the prometheus.yml playbook, which installs the prom_server role. This role also contains the server's configuration files, which define the targets and alerts. The server scrapes the metrics exported by the nodes and stores them in the time series database that is built into the prometheus server. Prometheus also has a web frontend that listens on [knyft](http://knyft.hpc.rug.nl:9090/graph) and is accessible from the management vlan. Via the web interface it is possible to query the data directly and to see the status of the targets that report to the server. Alerts are also shown here.
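
The same data can be queried from a script through the prometheus HTTP API. The sketch below uses the instant query endpoint `/api/v1/query` and the built-in `up` metric to list targets that could not be scraped. It assumes the server address given above and access from the management vlan; the function names and the instance names in the usage note are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Address taken from the text above; adjust if the server moves.
PROMETHEUS_URL = "http://knyft.hpc.rug.nl:9090"


def build_query_url(expr, base=PROMETHEUS_URL):
    """Build the URL for an instant query with a PromQL expression."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


def run_query(expr):
    """Send the query and return the decoded JSON response (needs network access)."""
    with urlopen(build_query_url(expr), timeout=5) as resp:
        return json.load(resp)


def down_instances(response):
    """Return the instances whose 'up' sample is 0 in a vector response."""
    if response.get("status") != "success":
        raise ValueError("query failed: %s" % response.get("error"))
    return sorted(
        result["metric"].get("instance", "unknown")
        for result in response["data"]["result"]
        # Each sample is [timestamp, "value"]; the value arrives as a string.
        if float(result["value"][1]) == 0
    )
```

Calling `down_instances(run_query("up"))` would return a list such as `["node02:9100"]` when that target failed its last scrape.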
## Grafana
Grafana runs in the [rancher environment](https://webhost12.service.rug.nl:8080/login) and queries the prometheus servers on knyft and in the rancher environment itself. It has various dashboards that present the data. (The prometheus server in the rancher environment monitors systems other than peregrine.)
## Alertmanager
Prometheus posts the alerts it raises to the alertmanager in the rancher cloud. The alertmanager filters the alerts: it removes duplicates, for instance when one node is monitored by more than one prometheus server. It is also possible to silence alerts here. The web interface of the alertmanager is [here](http://alertmanager.hpc.dcktest01.rug.nl/). The alertmanager is configured to push alerts to various slack channels.
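
The idea behind the duplicate filtering can be illustrated with a short Python sketch. It is not the alertmanager's own code. It only assumes that two alerts count as the same alert when their label sets are identical, which holds for alerts from two prometheus servers only if the servers do not add distinguishing labels. The `source` field and the alert contents in the usage note are hypothetical.

```python
def fingerprint(alert):
    """Identify an alert by its full label set, ignoring everything else."""
    return frozenset(alert["labels"].items())


def deduplicate(alerts):
    """Keep the first alert seen for every distinct label set."""
    seen = set()
    unique = []
    for alert in alerts:
        key = fingerprint(alert)
        if key not in seen:
            seen.add(key)
            unique.append(alert)
    return unique
```

If two servers both report `{"alertname": "NodeDown", "instance": "node01:9100"}`, `deduplicate` keeps one copy, so only one notification would go out.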

BIN
documentation/prometheus.png

Binary file not shown.

Size: 43 KiB

2
readme.md

@@ -4,7 +4,7 @@ This repository contains an inventory and ansible playbooks for the peregrine cl
## Install slurm.
-To install slurm:
+To install slurm server:
```
ansible-playbook --vault-password-file=.vault_pass.txt slurm.yml
