Monitoring & Alerting
This section of the documentation covers monitoring and alerting for the OnBuy platform and the philosophy behind it. It describes the tools and services we currently use to monitor and alert on the platform and its services, and how we use them to keep everything running smoothly. Additionally, in the system section, we cover how to use this platform component to monitor and alert on the services that run on the platform.
Tools
We use a variety of tools to monitor and alert on the platform and services. These include:
- Google Monitoring Metrics (Prometheus)
- Google Cloud Logging
- Google Monitoring Uptime Checks
In general, our philosophy is to use Google Monitoring Metrics (Prometheus) for metrics on the platform and services, Google Cloud Logging for their logs, and Google Monitoring Uptime Checks to monitor their uptime.
We have a large number of exporters set up via the Google Cloud monitoring agent, and a few other exporters running throughout the estate, alongside the standard Google metrics that are collected by default.
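As a rough illustration of how an exporter is wired into the agent, the sketch below assumes the Ops Agent's Prometheus receiver scraping a hypothetical node exporter on port 9100; the receiver names, targets, and intervals are placeholders, not our actual estate configuration.

```yaml
# /etc/google-cloud-ops-agent/config.yaml -- illustrative sketch only
metrics:
  receivers:
    prometheus:
      type: prometheus
      config:
        scrape_configs:
          # Hypothetical node exporter scrape target
          - job_name: node
            scrape_interval: 30s
            static_configs:
              - targets: ["localhost:9100"]
  service:
    pipelines:
      prometheus_pipeline:
        receivers: [prometheus]
```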
For alerting, we use Google Monitoring alert policies, currently mostly written in Monitoring Query Language (MQL) and defined as Terraform resources in the GitLab repo monitoring-tf.
We are currently in the process of migrating to defining these as Prometheus alerting rules, with tests written alongside them to ensure they work as expected.
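As a minimal sketch of where we are heading, here is a cause-based alert expressed as a Prometheus alerting rule, together with a unit test that can be run with `promtool test rules`. The metric name, thresholds, and file names are hypothetical, not our actual rules.

```yaml
# alerts.yml (hypothetical rule file)
groups:
  - name: database
    rules:
      - alert: DatabaseCPUHigh
        # node_cpu_utilisation is a hypothetical metric / recording rule name
        expr: avg by (instance) (node_cpu_utilisation) > 0.9
        for: 10m
        labels:
          severity: WARNING
        annotations:
          summary: "Database CPU utilisation is too high on {{ $labels.instance }}"
```

```yaml
# alerts_test.yml (run with: promtool test rules alerts_test.yml)
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'node_cpu_utilisation{instance="db-1"}'
        values: '0.95x15'   # stays at 95% for 15 minutes
    alert_rule_test:
      - eval_time: 15m
        alertname: DatabaseCPUHigh
        exp_alerts:
          - exp_labels:
              severity: WARNING
              instance: db-1
            exp_annotations:
              summary: "Database CPU utilisation is too high on db-1"
```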
A lot of our alerts are currently cause-based, such as "Database CPU utilisation is too high" or "Queue depth is too high". These target low-level system components, tend to be quite noisy, and do not necessarily mean the system is unhealthy or causing a bad customer experience.
Therefore, we are moving towards monitoring and alerting from 'the spout' where customers consume the system (load balancers, CDN, etc.) and focusing on symptom-based alerts, such as "Users are experiencing slow page loads" or "Users are experiencing 500 errors", whilst still monitoring the cause-based signals. This gives us focused pages that only wake us up when customers are experiencing issues, while still alerting us when system components are unhealthy.
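For illustration, a symptom-based alert over load-balancer traffic might look like the sketch below. The counter name `lb_requests_total` and the 2% threshold are hypothetical placeholders; in practice this would be driven by the load balancer / CDN metrics we already collect.

```yaml
groups:
  - name: customer-symptoms
    rules:
      - alert: HighErrorRateAtEdge
        # lb_requests_total is a hypothetical counter labelled by HTTP response code
        expr: |
          sum(rate(lb_requests_total{code=~"5.."}[5m]))
            /
          sum(rate(lb_requests_total[5m])) > 0.02
        for: 5m
        labels:
          severity: CRITICAL
        annotations:
          summary: "Users are experiencing 500 errors (>2% of requests at the load balancer)"
```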
This paper from a Google SRE is a good read on this topic: My Philosophy on Alerting
Ultimately, however, we want to move towards more modern SRE principles and define Service Level Agreements (SLAs) and Service Level Objectives (SLOs) for the system, so that we can write alerting rules that measure Service Level Indicators (SLIs) and alert on SLA/SLO violations.
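Once SLOs are defined, alerting typically moves to error-budget burn rates. The sketch below follows the multiwindow, multi-burn-rate pattern from the SRE literature, assuming a hypothetical 99.9% availability SLO and the same hypothetical `lb_requests_total` SLI as in the previous sketch.

```yaml
groups:
  - name: slo-burn-rate
    rules:
      # Hypothetical SLI recording rules: fraction of requests that failed, per window
      - record: sli:error_ratio:rate5m
        expr: sum(rate(lb_requests_total{code=~"5.."}[5m])) / sum(rate(lb_requests_total[5m]))
      - record: sli:error_ratio:rate1h
        expr: sum(rate(lb_requests_total{code=~"5.."}[1h])) / sum(rate(lb_requests_total[1h]))
      # Page when the monthly error budget is burning ~14x too fast (99.9% SLO => 0.1% budget)
      - alert: ErrorBudgetFastBurn
        expr: sli:error_ratio:rate1h > (14.4 * 0.001) and sli:error_ratio:rate5m > (14.4 * 0.001)
        labels:
          severity: CRITICAL
        annotations:
          summary: "Error budget is burning 14x faster than the 99.9% SLO allows"
```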
This section of the SRE handbook is a good read on this topic: Alerting best practices
Implementation
Google Monitoring via Alertmanager is set up, and we currently have two main alerting levels: WARNING and CRITICAL. CRITICAL alerts are now routed to PagerDuty, which pages us and will wake up whoever is on call at the time. WARNING alerts are currently not routed to PagerDuty; instead they are routed to a collection of Microsoft Teams channels monitored by the OnBuy technology teams (a routing sketch follows the channel list below).
These channels are in the Monitoring Microsoft Team, and are:
- prod system - Overall system health and alerts
- prod platform - Infrastructure and platform alerts
- development system - Overall system health and alerts
- development platform - Infrastructure and platform alerts
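As a rough sketch of that routing in Alertmanager terms (receiver names, webhook URLs, and keys below are placeholders, not our real configuration):

```yaml
# alertmanager.yml routing sketch -- all receiver details are placeholders
route:
  receiver: teams-prod-system        # default: WARNING and anything unmatched goes to Teams
  routes:
    - matchers:
        - severity = "CRITICAL"
      receiver: pagerduty-oncall     # CRITICAL pages whoever is on call
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>
  - name: teams-prod-system
    webhook_configs:
      # Hypothetical bridge that forwards alerts to the Microsoft Teams channel webhook
      - url: https://example.internal/teams-webhook/prod-system
```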
Future
We are currently in the process of migrating this to incident.io On-call, which will allow us to handle on-call and trigger incidents in our incident management system.