
Updated prometheus documentation a little.

pull/20/head
Egon Rijpkema 6 months ago
parent commit 700c7fd0a6
  1. documentation/prometheus.md (16)
  2. documentation/prometheus.png (BIN)

documentation/prometheus.md

@@ -7,11 +7,21 @@ Below is a picture of the current prometheus monitoring setup on gearshift. Our
Each peregrine node has a node exporter running. It was installed using the node exporter.yml playbook in the root of this repository.
This playbook applies the node exporter role, which does little more than copy the binary (from promtools/results) to the node and install a systemd unit file. The node exporter listens for scrape requests on port 9100 on each node.
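As an illustration, a minimal sketch of what the role's tasks might look like (the paths, file names and the `node_exporter` service name below are assumptions for illustration, not copied from the actual role):

```yaml
# Hypothetical sketch of the node exporter role's tasks; the real role
# lives in this repository and the exact paths/names may differ.
- name: Copy the node exporter binary to the node
  copy:
    src: promtools/results/node_exporter
    dest: /usr/local/bin/node_exporter
    mode: '0755'

- name: Install the systemd unit file
  template:
    src: node_exporter.service.j2
    dest: /etc/systemd/system/node_exporter.service

- name: Enable and start the node exporter (listens on port 9100)
  systemd:
    name: node_exporter
    state: started
    enabled: yes
    daemon_reload: yes
```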
## Other exporters
Besides node exporter, we're also running a lustre exporter, nvidia_smi exporter, slurm exporter and ipmi exporter. These are built using the [build.sh](/HPC/pg-playbooks/src/branch/master/promtools/build.sh) script, which fires off a Go build environment in docker. Expect this build pipeline to break frequently as the various exporters are updated.
## Prometheus server
The server runs in a docker container on knyft. It was installed using the prometheus.yml playbook, which installs the prom_server role. This role also contains the server's configuration files. The server scrapes the exporters on the nodes and stores the metrics in the time series database that is integrated into the prometheus server. Targets and alerts are configured in these files. Prometheus also has a web frontend that listens on [knyft](http://knyft.hpc.rug.nl:9090/graph) and is accessible from the management vlan. Via the web interface it is possible to query the data directly and to see the status of the exporters reporting to the server. Alerts are also shown here. The official Prometheus documentation can be found [here](https://prometheus.io/docs/introduction/overview/).
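For reference, a hedged sketch of what the scrape and alerting configuration in the prom_server role roughly looks like; the job names, target hostnames, rule file name and alertmanager address below are illustrative assumptions, not the actual config:

```yaml
# Illustrative prometheus.yml fragment; the real targets and rule files
# are kept in the prom_server role.
global:
  scrape_interval: 1m

rule_files:
  - /etc/prometheus/alerts.rules        # assumed file name

scrape_configs:
  - job_name: node                      # node exporter on every node, port 9100
    static_configs:
      - targets: ['pg-node001:9100', 'pg-node002:9100']   # example targets

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager.kube.hpc.rug.nl']        # assumed address
```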
## Grafana
Grafana provides our [dashboards](https://hpc.webhosting.rug.nl/). Currently, Grafana runs from the [rancher kubernetes environment](https://k8s.rug.nl). It queries the prometheus servers on knyft and in the rancher environment itself, and it has various dashboards that present the data. (The prometheus server in the rancher environment monitors other systems than peregrine.) Admin credentials are stored in 1Password. The official grafana documentation can be found [here](https://grafana.com/docs/).
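As a sketch, a Grafana datasource provisioning file for such a setup could look roughly like this; the datasource names, the in-cluster URL and the use of file provisioning are assumptions, since the datasources may well have been configured through the web UI instead:

```yaml
# Hypothetical datasources.yml for Grafana provisioning.
apiVersion: 1
datasources:
  - name: Prometheus (knyft)
    type: prometheus
    access: proxy
    url: http://knyft.hpc.rug.nl:9090
  - name: Prometheus (rancher)
    type: prometheus
    access: proxy
    url: http://prometheus:9090        # assumed in-cluster address
```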
## Alertmanager
Prometheus Alertmanager also runs in the rancher kubernetes environment.
Prometheus posts the alerts it raises to the alertmanager in the rancher cloud. The alertmanager filters the alerts: it deduplicates them when, for instance, one node is monitored by more than one prometheus server. It is also possible to silence alerts here. The web interface of the alertmanager is [here](http://alertmanager.kube.hpc.rug.nl) (the credentials are in 1Password). The alertmanager is configured to push alerts to various Slack channels.
If a node is down and you will not be able to bring it back up in the next few minutes, please silence the alert.
From the [official documentation](https://prometheus.io/docs/alerting/latest/alertmanager/):
> Silences are a straightforward way to simply mute alerts for a given time. A silence is configured based on matchers, just like the routing tree. Incoming alerts are checked whether they match all the equality or regular expression matchers of an active silence. If they do, no notifications will be sent out for that alert. Silences are configured in the web interface of the Alertmanager.
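To make the routing to Slack concrete, here is a hedged sketch of an alertmanager configuration; the channel names, grouping labels, receiver names and webhook URL are assumptions for illustration and do not reflect our actual config:

```yaml
# Illustrative alertmanager.yml fragment.
route:
  group_by: ['alertname', 'instance']   # duplicate alerts for the same node are grouped
  group_wait: 30s
  repeat_interval: 4h
  receiver: slack-default
  routes:
    - match:
        severity: critical
      receiver: slack-critical

receivers:
  - name: slack-default
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX   # assumed webhook
        channel: '#hpc-monitoring'                      # assumed channel
  - name: slack-critical
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX
        channel: '#hpc-alerts'                          # assumed channel
```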

documentation/prometheus.png: binary image updated (43 KiB → 62 KiB); file not shown.
