Below is a picture of the current prometheus monitoring setup on gearshift. Our setup consists of the following components:
Each peregrine node has a node exporter running. It was installed using the node exporter.yml playbook in the root of this repository.
This playbook applies the node exporter role which does little more than copy the binary (from promtools/results) to the node and install a systemd unit file on the node. The node exporter listens for requests on port 9100 on each node.
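The role's actual unit file is not reproduced here, but a typical node exporter unit looks roughly like the sketch below. The binary path, user name and file location are assumptions, not values taken from the role; only the port (9100) comes from the text above.

```ini
# /etc/systemd/system/node_exporter.service -- illustrative sketch
[Unit]
Description=Prometheus node exporter
After=network.target

[Service]
# User and binary path are assumptions; the role may use different values.
User=node_exporter
ExecStart=/usr/local/bin/node_exporter --web.listen-address=:9100
Restart=always

[Install]
WantedBy=multi-user.target
```

Once the unit is running, the exporter can be checked from the node itself with `curl http://localhost:9100/metrics`.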
Besides the node exporter, we also run a Lustre exporter, nvidia_smi exporter, Slurm exporter and IPMI exporter. These are built using build.sh, which fires off a Go build environment in a Docker container. Expect this build pipeline to break regularly as the various exporters are updated.
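As a rough illustration of the build pattern (the actual build.sh, image tag and paths will differ), the idea is to mount the exporter source into a Go container and build a binary into promtools/results:

```shell
#!/bin/sh
# Illustrative sketch only: the Go image tag, source layout and
# exporter name are assumptions, not taken from the real build.sh.
set -e
docker run --rm \
  -v "$(pwd):/src" -w /src \
  golang:1.21 \
  go build -o promtools/results/node_exporter .
```

Because each exporter pins its own Go version and dependencies upstream, a shared build script like this tends to break whenever the exporters are updated, as noted above.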
The server runs in a Docker container on knyft. It was installed using the prometheus.yml playbook, which applies the prom_server role. This role also contains the server's configuration files; targets and alerts are configured in these files. The server scrapes the exporters on the nodes and stores the results in a time series database that is integrated into the Prometheus server. Prometheus also has a web frontend that listens on knyft and is accessible from the management VLAN. Via this web interface it is possible to query the data directly, to see the status of the exporters reporting to the server, and to view alerts. The official Prometheus documentation can be found here.
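The scraping of the node exporters is driven by the target files mentioned above. A minimal fragment of such a scrape configuration might look like the following; the job name and hostnames are illustrative, not the actual Peregrine configuration:

```yaml
# Fragment of a Prometheus scrape configuration (illustrative).
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'pg-node001:9100'
          - 'pg-node002:9100'
```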
Grafana provides our dashboards. It currently runs in the Rancher Kubernetes environment and queries the Prometheus servers on knyft and in the Rancher environment itself. (The Prometheus server in the Rancher environment monitors systems other than Peregrine.) It has various dashboards that present the data. Admin credentials are stored in 1password. The official Grafana documentation can be found here.
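For reference, a Prometheus datasource in Grafana can be provisioned with a small YAML file like the sketch below. The datasource name and URL are assumptions, not the actual address of the server on knyft:

```yaml
# Illustrative Grafana datasource provisioning file.
apiVersion: 1
datasources:
  - name: Prometheus-knyft
    type: prometheus
    access: proxy
    url: http://knyft:9090
```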
Prometheus Alertmanager also runs in the rancher kubernetes environment.
Prometheus posts the alerts it raises to the Alertmanager in the Rancher cloud. The Alertmanager filters the alerts and deduplicates them when one node is monitored by more than one Prometheus server. It is also possible to silence alerts here. The web interface of the Alertmanager is here. (The credentials are in 1password.) The Alertmanager is configured to push alerts to various Slack channels.
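The Slack routing described above is set up in the Alertmanager configuration. A minimal sketch is shown below; the channel name and webhook URL are placeholders, not our real values:

```yaml
# Sketch of an Alertmanager config that pushes alerts to Slack.
route:
  receiver: slack-default
receivers:
  - name: slack-default
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ
        channel: '#alerts'
```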
If a node is down and you won't be able to bring it back up in the next few minutes, please silence the alert.
From the official documentation:
Silences are a straightforward way to simply mute alerts for a given time. A silence is configured based on matchers, just like the routing tree. Incoming alerts are checked whether they match all the equality or regular expression matchers of an active silence. If they do, no notifications will be sent out for that alert. Silences are configured in the web interface of the Alertmanager.
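Concretely, a silence is a set of matchers plus a time window. The web interface builds an object like the one below (this is the shape used by the Alertmanager v2 silences API); the alert and node names here are made up:

```json
{
  "matchers": [
    { "name": "alertname", "value": "NodeDown", "isRegex": false },
    { "name": "instance", "value": "pg-node042:9100", "isRegex": false }
  ],
  "startsAt": "2020-01-01T12:00:00Z",
  "endsAt": "2020-01-01T14:00:00Z",
  "createdBy": "admin",
  "comment": "Node down for maintenance"
}
```

An incoming alert is muted only while the silence is active and all matchers agree with the alert's labels.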