
Fix font, add monitoring and alerting post

master
Devin Dooley 1 year ago
commit fa498ca2a2
12 changed files with 443 additions and 16 deletions
  1. +2   -1   Makefile
  2. +2   -2   assets/styles/fonts.scss
  3. +439 -0   content/blog/monitoring-and-alerting.md
  4. +0   -0   static/fonts/opensans_regular.woff
  5. +0   -13  static/fonts/stylesheet.css
  6. BIN       static/images/monitoring-and-alerting/nginx-vts-dashboard.png
  7. BIN       static/images/monitoring-and-alerting/node-exporter-dashboard.png
  8. BIN       static/images/monitoring-and-alerting/plex-inactive-alert.png
  9. BIN       static/images/monitoring-and-alerting/prometheus-memavailable.png
  10. BIN      static/images/monitoring-and-alerting/prometheus-vts-totals.png
  11. BIN      static/images/monitoring-and-alerting/worldping-dashboard.png
  12. BIN      static/images/monitoring-and-alerting/worldping-endpoint-comparison.png

+2 -1   Makefile

@@ -1,4 +1,5 @@
 build: hugo_build process_images
+server: hugo_build process_images serve
 hugo_build:
 	hugo -v --gc
@@ -6,7 +7,7 @@ hugo_build:
 process_images:
 	./build/image-processing/image-processing -d public/photos
-server:
+serve:
 	./build/server/server -p 1314 -d public
 deploy:


+2 -2   assets/styles/fonts.scss

@@ -1,11 +1,11 @@
 @font-face {
   font-family: 'open_sansregular';
-  src: url('OpenSans-Regular-webfont.woff') format('woff');
+  src: url('../fonts/opensans_regular.woff') format('woff');
   font-weight: normal;
   font-style: normal;
 }

 html {
-  font-family: 'open_sansragular';
+  font-family: 'open_sansregular';
 }

+439 -0   content/blog/monitoring-and-alerting.md

@@ -0,0 +1,439 @@
---
title: "Improving My Monitoring and Alerting"
date: "2020-05-16"
---
Shortly after getting my basic alerting script [marvin](https://devinadooley.com/blog/marvin)
written, I got the feeling that my monitorix/marvin system was a bit too hacky for my liking
and decided to upgrade to a more proper ecosystem. After some research, I settled on using
Prometheus as my metrics aggregator, Grafana for visualizations, and integrating Prometheus'
Alertmanager with a webhook configuration to report alerts to my Matrix rooms. The work
spanned about a month of on-and-off focus, so I wanted to get some documentation written
on what was involved in this setup before I forget too much.
All of this was implemented to target services running on a single Arch Linux server, so YMMV
if you are using this as a reference for your own setup. I'll opt for Arch and AUR packages
over compiled binaries and containers wherever possible, as they are easier to keep in sync with
a bleeding-edge system, but you should follow the official documentation for installing these
packages if you are running a different operating system.
1. [Prometheus](#prometheus)
2. [Prometheus Node Exporter](#prometheus-node-exporter)
3. [systemd Metrics](#systemd-metrics)
4. [nginx VTS Metrics](#nginx-vts-metrics)
5. [Grafana](#grafana)
6. [worldPing](#worldping)
7. [Alerting to Matrix Rooms](#alerting-to-matrix-rooms)
### Prometheus
Prometheus appeared to be the best monitoring toolkit for my use-case, largely because it is
open source, has no commercial offerings, and includes first-class alerting integration. They have a
great write-up in their docs
[comparing their platform to alternative solutions](https://prometheus.io/docs/introduction/comparison/)
that helped sell me on using it. Ultimately, what makes me happy with this solution is the
widespread use of Prometheus and the ease of creating exporters -- for these reasons it was
easy to get metrics on everything I wanted to track with very little friction.
There is an Arch package for Prometheus that you should be able to install and get running
without issue. Install it with `pacman -S prometheus`, and start/enable the daemon with
`systemctl enable prometheus && systemctl start prometheus`. At this point you should be able to
see the Prometheus dashboard in a web browser by pointing to port 9090 of your server.
However, to get anything useful out of Prometheus you will need to set up exporters for it
to scrape and aggregate metrics from.
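For reference, the whole step fits in a few commands (`systemctl enable --now` is just shorthand for enabling and starting in one call, and `/-/ready` is Prometheus' built-in readiness check):
{{< highlight bash >}}
pacman -S prometheus
systemctl enable --now prometheus

# Returns a short readiness message once the server is up on port 9090
curl -s http://localhost:9090/-/ready
{{< / highlight >}}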
### Prometheus Node Exporter
The best way to get system metrics into your Prometheus instance is through their
[node exporter](https://github.com/prometheus/node_exporter). Once again, there is an
Arch package for this that can be installed by running `pacman -S
prometheus-node-exporter`. Like with the `prometheus` package, you'll want to enable and start the
daemon after installing.
At this point you should be able to see your machine's metrics being
exported by pinging port 9100 on your server. Note that Prometheus is designed to scrape metrics
over HTTP, so the easiest way to check that your exporters are working is by pointing a web browser
to the appropriate port.
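A quick terminal check works just as well as a browser here (a sketch; `node_memory_MemAvailable_bytes` is one of the metrics used later in this post):
{{< highlight bash >}}
pacman -S prometheus-node-exporter
systemctl enable --now prometheus-node-exporter

# The exporter serves plain-text metrics over HTTP on port 9100
curl -s http://localhost:9100/metrics | grep '^node_memory_MemAvailable_bytes'
{{< / highlight >}}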
Now that your node exporter is running, you need to tell Prometheus where to scrape these metrics
from. On Arch, the Prometheus configuration file is located at `/etc/prometheus/prometheus.yml` by
default. Open that file up in your text editor of choice, and look for the `scrape_configs` entry.
After a fresh installation, you should see a single entry for the `prometheus` job. You will want
to add a job for your node, so that your new `scrape_configs` will look something like the below
snippet.
{{< highlight yml >}}
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'localhost'
    static_configs:
      - targets: ['localhost:9100']
{{< / highlight >}}
I chose to name the job `localhost` based on the example I saw on the Arch wiki, but you can choose
something more descriptive of your system if you wish.
If you are setting up multiple nodes to scrape, you will want to identify each node
more explicitly in your job naming scheme.
Once your Prometheus configuration file is updated, you will need to restart the service by
running `systemctl restart prometheus`. You should now be able to write PromQL queries against
your node from the Prometheus dashboard (exposed at port 9090 by default). As a test, go to that
dashboard and enter `node_memory_MemAvailable_bytes`, execute the query, and you should be able
to view a graph of your system's available memory. The Prometheus dashboard has great
auto-completion in the query textbox -- try typing in `node_` and look through all the metrics
you now have from your node exporter if you want to get an idea of what is now available to you.
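The same query can also be run against Prometheus' HTTP API if you want to script these checks (a sketch; `/api/v1/query` is the standard instant-query endpoint):
{{< highlight bash >}}
# Instant query through the HTTP API; returns the current value as JSON
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=node_memory_MemAvailable_bytes'
{{< / highlight >}}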
![Prometheus MemAvailable](/images/monitoring-and-alerting/prometheus-memavailable.png)
### systemd Metrics
If you wish to monitor systemd service statuses, you can enable collection of these metrics
through your Prometheus node exporter. There might be other ways to accomplish this,
but I chose to add the systemd collector configuration through command line flags in the service's
unit file. To get the location of the unit file, run `systemctl status prometheus-node-exporter`.
You should see a line showing where the service is loading from:
```
Loaded: loaded (/usr/lib/systemd/system/prometheus-node-exporter.service; ...)
```
Open up that service file, and add command line flags to the `ExecStart` call to suit your needs.
In my case, I wanted to track the `sshd`, `nginx`, `synapse`, `jenkins` and `plexmediaserver`
daemons, so my `ExecStart` command looked like the below.
{{< highlight systemd >}}
ExecStart=/usr/bin/prometheus-node-exporter --collector.systemd --collector.systemd.unit-whitelist="(nginx|sshd|synapse|jenkins|plexmediaserver).service" $NODE_EXPORTER_ARGS
{{< / highlight >}}
A call to `systemctl restart prometheus-node-exporter` should now allow you to write PromQL queries
against the whitelisted services.
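After restarting, a quick way to confirm the collector is active is to grep the exporter output for the new unit-state series:
{{< highlight bash >}}
# One series per (unit, systemd state); the active state should report 1
curl -s http://localhost:9100/metrics | grep 'node_systemd_unit_state{name="nginx.service"'
{{< / highlight >}}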
### nginx VTS Metrics
I use nginx as a reverse proxy for every public-facing site on my server, so tracking
the traffic being routed to each site was one of the most important metrics I wanted to visualize.
Luckily, [nginx-module-vts](https://github.com/vozlt/nginx-module-vts) handles everything needed
to get these metrics -- from monitoring the virtual host traffic to providing an exporter for
Prometheus to scrape. There is a lot of support online for
[nginx-vts-exporter](https://github.com/hnlq715/nginx-vts-exporter) as a means of exporting virtual
host traffic from `nginx-module-vts` to Prometheus, but I found the included exporter to be more
than enough for what I wanted to accomplish.
There is an [AUR package](https://aur.archlinux.org/packages/nginx-mainline-mod-vts/)
for compiling `nginx-module-vts` for the `nginx-mainline` Arch package that, at the time
of writing, works wonderfully. As with any AUR package, you should take a look at the included
PKGBUILD file before installing. Once you are familiar with what it does, clone the repository and
install the package with `makepkg -sci` (or use whatever method you normally use to install AUR
packages).
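A minimal sketch of that workflow, if you install AUR packages by hand:
{{< highlight bash >}}
git clone https://aur.archlinux.org/nginx-mainline-mod-vts.git
cd nginx-mainline-mod-vts

# Always read the build script before installing anything from the AUR
less PKGBUILD
makepkg -sci
{{< / highlight >}}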
After a successful installation, you should be able to find the module at
`/usr/lib/nginx/modules/ngx_http_vhost_traffic_status_module.so`. Make sure that file exists,
then open up your nginx configuration file (likely at `/etc/nginx/nginx.conf`). Adding the below
changes to your nginx configuration should handle loading of the dynamic module and exposing
`localhost:8080/status` for viewing nginx virtual traffic statistics.
{{< highlight nginx >}}
load_module "/usr/lib/nginx/modules/ngx_http_vhost_traffic_status_module.so";
...
http {
    ...
    vhost_traffic_status_zone;
    ...
    server {
        server_name 127.0.0.1;

        location /status {
            vhost_traffic_status_display;
            vhost_traffic_status_display_format html;
            allow 127.0.0.1;
            deny all;
        }

        listen 127.0.0.1:8080;
    }
}
{{< / highlight >}}
Be sure to change the `allow` field or add additional addresses to suit your needs. In my case,
the statistics are only going to be accessible by the local machine for Prometheus scraping,
so this example only exposes the statistics to that machine. You will want to be as restrictive as
possible here to avoid accidentally exposing your metrics to anyone who shouldn't see them.
At this point, you will need to restart your nginx daemon to start serving the new status endpoint.
Before restarting, run `nginx -t` to test your configuration and make sure the dynamic module
loads correctly. Once nginx is back up, you should be able to get a VTS traffic status response
from your server: try running `curl localhost:8080/status/format/prometheus` and make sure you
get an appropriate response.
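Condensed, the check looks something like this (the metric name is one you will query again later):
{{< highlight bash >}}
# Validate the configuration (including the dynamic module load), then restart
nginx -t && systemctl restart nginx

# Per-vhost counters should now show up in the Prometheus-formatted output
curl -s http://localhost:8080/status/format/prometheus | grep nginx_vts_server_requests_total
{{< / highlight >}}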
After you are confident that your newly configured vhost traffic exporter is working
correctly, you will need to configure Prometheus to scrape the metrics. Open up your Prometheus
configuration file at `/etc/prometheus/prometheus.yml` and add the following entry to your
`scrape_configs`.
{{< highlight yml >}}
scrape_configs:
  - job_name: 'nginx-vts'
    metrics_path: '/status/format/prometheus'
    static_configs:
      - targets: ['localhost:8080']
{{< / highlight >}}
After restarting your `prometheus` daemon, you should now be able to write PromQL queries from
your Prometheus dashboard against your new metrics. Autocomplete is again your friend here for
seeing what metrics are available to you. Try running `nginx_vts_server_requests_total{code="2xx"}`
to see all the 200 range responses from your server by hostname, or start typing `nginx_`
to get a list of suggestions.
![Prometheus VTS Totals](/images/monitoring-and-alerting/prometheus-vts-totals.png)
### Grafana
Grafana provides a great interface for visualizing Prometheus metrics and integrates easily with
the existing Prometheus metrics configured above. On Arch, Grafana can be installed through the
`grafana` package, and will start serving its web interface on port 3000 after starting and enabling
the `grafana` service. Once you get the service up and running, you'll need to log in with the user
`admin` and password `admin`, after which you'll be prompted to set up a proper account.
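The install mirrors the earlier packages; the one manual step afterwards is adding your Prometheus server (`http://localhost:9090`) as a data source through the web UI so your dashboards have something to query:
{{< highlight bash >}}
pacman -S grafana
systemctl enable --now grafana

# The web interface should now answer on port 3000
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:3000/login
{{< / highlight >}}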
Grafana is interacted with almost entirely through the web UI, and I find it intuitive enough that I
don't feel it is necessary to document how to use it. I do, however, want to mention a few things
that I found useful in getting started with Grafana.
##### Start with Existing Dashboards
If you don't want to spend a ton of time getting your basic visualizations set up, I'd recommend you
start with dashboards that other people have created and
[shared online](https://grafana.com/grafana/dashboards?orderBy=name&direction=asc). I started off
with a dashboard created to visualize Prometheus node exporter metrics and another for nginx-module-vts
metrics. Once I had them imported, I just deleted or tweaked everything that didn't work, and
restructured and adjusted to what I felt was necessary and helpful.
##### Test Queries on the Prometheus Dashboard
I find it easier to work within the Prometheus dashboard to execute my PromQL queries and set
them up for use within Grafana. Being able to see the actual results of your execution is helpful
for understanding how it will be visualized over time.
##### Rely on the Editor
Just like with the Prometheus dashboard, Grafana will provide helpful suggestions as you type
in your queries (maybe even better suggestions than the Prometheus dashboard). You should also
preview your graphs frequently, and let Grafana tell you when something is wrong. The graph
editor is good at telling you when your query is broken, so listen to it.
### worldPing
There were a few ways to monitor sites and alert on downtime that I considered during this
transition. My previous solution was marvin, but I didn't feel like maintaining him as a one-off
script. There is also `prometheus-blackbox-exporter`, which can monitor sites and alert on them
for you -- a good step up from marvin's capabilities.
I ended up settling on worldPing, a Grafana plugin that monitors your sites over HTTP/S, DNS and
through sending ICMP Echo packets to your domains. They provide 1 million requests per month in
their free tier, which is enough to check HTTPS, DNS and send Echo packets from a few different
locations every couple of minutes for the production domains I manage. This also has the benefit of
not requiring me to manage another server to alert me of my central server's outages, which I would
have had to do with the other solutions.
Grafana provides a pretty neat plugin management CLI -- you should be able to
install worldPing by running `grafana-cli plugins install raintank-worldping-app`, and restarting
the `grafana` service. After installing you should have a new set of worldPing dashboards
available through the Grafana UI that you can configure to your liking. I set mine up to monitor
my two production domains in a way that stays under their free tier limit, and to email
me about any outages.
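For reference, the plugin install amounts to:
{{< highlight bash >}}
grafana-cli plugins install raintank-worldping-app
systemctl restart grafana
{{< / highlight >}}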
![worldPing Dashboard](/images/monitoring-and-alerting/worldping-dashboard.png)
![worldPing Endpoint Comparison](/images/monitoring-and-alerting/worldping-endpoint-comparison.png)
### Alerting to Matrix Rooms
If you are as cheap and paranoid as I am, you may be running a Matrix/synapse node with rooms
that you would like to send alerts to. Prometheus' Alertmanager tool provides an interface
to alert on the values of PromQL queries, and send alerts as JSON to a custom webhook, so I
went ahead and set up a few alerts to cover the bases that my previous monitoring tool
covered: daemons in a failed state and CPU/RAM usage above a threshold.
To get started, you'll need to install the `alertmanager` package with pacman, which should give you
a file to configure at `/etc/alertmanager/alertmanager.yml`. I currently have this configured to
send requests to a webhook I manage at
`http://localhost:4050/services/hooks/YWxlcnRtYW5hZ2VyX3NlcnZpY2U`. This address will be
explained in further detail later, but assuming you are following this document,
you will want a configuration that looks like the below. If you are not setting up Alertmanager
to send alerts to a custom webhook, refer to [their documentation](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#alertmanager_config)
and tailor the example to your use case.
{{< highlight yaml >}}
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://localhost:4050/services/hooks/YWxlcnRtYW5hZ2VyX3NlcnZpY2U'
{{< / highlight >}}
Now we will need to get some actual alerts configured. I created rules that alert if
CPU/RAM usage is above a 90% threshold for 2 minutes, and if any of the daemons I set up to track
are in a failed state for 30 seconds.
I put these into a file at `/etc/prometheus/alerting_rules.yml`:
{{< highlight yaml >}}
groups:
  - name: systemd
    rules:
      - alert: Nginx Inactive
        expr: node_systemd_unit_state{name="nginx.service",state="active"} != 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: Nginx service is inactive
      - alert: SSH Inactive
        expr: node_systemd_unit_state{name="sshd.service",state="active"} != 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: SSH service is inactive
      - alert: Plex Inactive
        expr: node_systemd_unit_state{name="plexmediaserver.service",state="active"} != 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: Plex media server is inactive
  - name: resources
    rules:
      - alert: CPU Resources Above Threshold
        expr: (100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Average CPU usage is exceeding 90%
      - alert: Memory Resources Above Threshold
        expr: ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes) * 100 > 90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Memory usage is exceeding 90%
{{< / highlight >}}
And then referenced this file in my Prometheus configuration at `/etc/prometheus/prometheus.yml`.
{{< highlight yaml >}}
rule_files:
  - /etc/prometheus/alerting_rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
{{< / highlight >}}
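Before restarting anything, it is worth validating both files with `promtool`, which ships alongside Prometheus:
{{< highlight bash >}}
# Validate the rule file and the main configuration before restarting anything
promtool check rules /etc/prometheus/alerting_rules.yml
promtool check config /etc/prometheus/prometheus.yml
{{< / highlight >}}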
With those configurations in place, you should be able to start/enable the `alertmanager` service,
restart the `prometheus` service, and then be ready to start alerting to your webhook at
`http://localhost:4050/services/hooks/YWxlcnRtYW5hZ2VyX3NlcnZpY2U`. The problem is that there is
nothing listening at that address yet.
To manage alerting to my Matrix rooms, I decided to use their
[Go-NEB](https://github.com/matrix-org/go-neb) bot, a newer iteration of their Matrix-NEB
bot written in Go instead of Python. I already had a script in place that would have worked with
a little tweaking, but I decided to use Go-NEB for their Giphy and RSS services which I did not
want to write myself. It also seems easily extensible if I wish to add any extra services in the
future.
There are a few ways to run this bot -- I ran into issues getting the services
to handle requests correctly through their containerized solution, so I ultimately set up
this bot to bootstrap its configuration through a YAML file and manage it through
a systemd unit file. They provide a sample configuration file in their repository,
but there were a couple typos that gave me trouble (and they haven't merged my PR in to
correct them yet) so I'll include a redacted version of my configuration for people
to reference.
{{< highlight yaml >}}
clients:
  - UserID: "@marvin:matrix.devinadooley.com"
    AccessToken: "MARVIN_USER_ACCESS_TOKEN"
    HomeserverURL: "https://matrix.devinadooley.com"
    Sync: true
    AutoJoinRooms: true
    DisplayName: "Marvin"

services:
  - ID: "alertmanager_service"
    Type: "alertmanager"
    UserID: "@marvin:matrix.devinadooley.com"
    Config:
      webhook_url: "http://localhost:4050/services/hooks/YWxlcnRtYW5hZ2VyX3NlcnZpY2U"
      rooms:
        "!ROOM_ID:matrix.devinadooley.com":
          text_template: "{{range .Alerts -}} [{{ .Status }}] {{index .Labels \"alertname\" }}: {{index .Annotations \"description\"}} {{ end -}}"
          html_template: "{{range .Alerts -}} {{ $severity := index .Labels \"severity\" }} {{ if eq .Status \"firing\" }} {{ if eq $severity \"critical\"}} <font color='red'><b>[FIRING - CRITICAL]</b></font> {{ else if eq $severity \"warning\"}} <font color='orange'><b>[FIRING - WARNING]</b></font> {{ else }} <b>[FIRING - {{ $severity }}]</b> {{ end }} {{ else }} <font color='green'><b>[RESOLVED]</b></font> {{ end }} {{ index .Labels \"alertname\"}} : {{ index .Annotations \"description\"}} <a href=\"{{ .GeneratorURL }}\">source</a><br/>{{end -}}"
          msg_type: "m.text"
{{< / highlight >}}
This is likely a bit overwhelming, so let me break down what this configuration specifies to the
program.
When Go-NEB is given a configuration file, it will parse it and add all specified user
credentials and services into an in-memory SQLite database (rather than creating a persistent
SQLite database that is interacted with over JSON HTTP when run without a configuration file). This
configuration tells Go-NEB to
create a service called `alertmanager_service` of type `alertmanager`, and forward
alerts through the @marvin user to a given room when sent to the correct URL.
The `webhook_url` specified ends in the base64-encoding of the service ID -- that is, the base64
encoding of the string "alertmanager_service". According to the comments in the sample
configuration, the `webhook_url` is informational and does not change your actual configuration.
I believe all services are configured to serve under
`http://BASE_URL/services/hooks/$ENCODED_SERVICE_ID`. That is why we configured our Alertmanager
instance to send alerts to this URL earlier.
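You can verify the encoding yourself; the hook path is just the service ID base64-encoded, with the trailing `=` padding dropped:
{{< highlight bash >}}
# Prints "YWxlcnRtYW5hZ2VyX3NlcnZpY2U=" -- the hook path minus the padding
echo -n "alertmanager_service" | base64
{{< / highlight >}}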
Getting a user access token can be tricky. I found the token through the Riot.im webapp interface
under the user settings, but found that the token would not persist as long as I needed. It turns
out these tokens are invalidated on a logout through the Riot client, so for this access token
to persist you need to: log in, retrieve the access token corresponding to that login session,
then close your browser without logging out. This means that any future login under the user
will invalidate the token, so make sure to configure as much as you need to while you are logged
into that session.
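An alternative that avoids the browser dance entirely is to request a dedicated token through the Matrix client-server login API (a sketch, assuming the client API is reachable at your HomeserverURL; the token stays valid until that session is explicitly logged out):
{{< highlight bash >}}
# Request an access token for the bot user directly from the homeserver
curl -s -X POST https://matrix.devinadooley.com/_matrix/client/r0/login \
  -d '{"type": "m.login.password", "user": "marvin", "password": "MARVIN_PASSWORD"}'
{{< / highlight >}}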
After configuring the user, you need to update the ROOM_ID value
(can be retrieved through the Riot interface under the
room settings), and update the `text_template` that is specified (it uses Go templates) to suit your
needs.
You can run this bot any way you wish, though I do so through systemd. I have the following
unit file written at `/etc/systemd/system/go-neb.service`, which runs the bot whose code is
located at `/var/automation/go-neb`. This also assumes you have compiled the `go-neb` binary
in the `WorkingDirectory`.
{{< highlight systemd >}}
[Unit]
Description=Go-NEB Matrix Bot
After=network.target

[Service]
Type=simple
WorkingDirectory=/var/automation/go-neb
Environment=BIND_ADDRESS=:4050
Environment=DATABASE_TYPE=sqlite3
Environment=BASE_URL=http://localhost:4050
Environment=CONFIG_FILE=config.yaml
ExecStart=/var/automation/go-neb/go-neb
Restart=on-failure

[Install]
WantedBy=default.target
{{< / highlight >}}
At this point, you should be able to test your alerting easily by bringing down one of your
tracked services.
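For example, stopping the Plex daemon for a minute or so should trip the `Plex Inactive` rule and post the alert shown below to the room:
{{< highlight bash >}}
# Trip the Plex Inactive alert (it fires after the rule's 30s "for" window),
# then bring the service back so the resolved notification goes out too
systemctl stop plexmediaserver
sleep 90
systemctl start plexmediaserver
{{< / highlight >}}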
![Plex Inactive Alert](/images/monitoring-and-alerting/plex-inactive-alert.png)

static/fonts/OpenSans-Regular-webfont.woff → static/fonts/opensans_regular.woff


+0 -13   static/fonts/stylesheet.css

@ -1,13 +0,0 @@
@font-face {
font-family: 'open_sansregular';
src: url('OpenSans-Regular-webfont.eot');
src: url('OpenSans-Regular-webfont.eot?#iefix') format('embedded-opentype'),
url('OpenSans-Regular-webfont.woff2') format('woff2'),
url('OpenSans-Regular-webfont.woff') format('woff'),
url('OpenSans-Regular-webfont.ttf') format('truetype'),
url('OpenSans-Regular-webfont.svg#open_sansregular') format('svg');
font-weight: normal;
font-style: normal;
}

BIN  static/images/monitoring-and-alerting/nginx-vts-dashboard.png  (1858 × 888, 343 KiB)
BIN  static/images/monitoring-and-alerting/node-exporter-dashboard.png  (1839 × 849, 206 KiB)
BIN  static/images/monitoring-and-alerting/plex-inactive-alert.png  (460 × 107, 10 KiB)
BIN  static/images/monitoring-and-alerting/prometheus-memavailable.png  (1902 × 679, 164 KiB)
BIN  static/images/monitoring-and-alerting/prometheus-vts-totals.png  (1902 × 561, 253 KiB)
BIN  static/images/monitoring-and-alerting/worldping-dashboard.png  (1830 × 907, 140 KiB)
BIN  static/images/monitoring-and-alerting/worldping-endpoint-comparison.png  (1830 × 864, 111 KiB)
