Monitoring at giffgaff
Monitoring remains a critical part of managing any IT system. Monitoring allows service owners to keep track of a system’s health and availability, and detect and prevent failures.
At giffgaff we run a system made up of hundreds of servers that scale up and down all the time, and a rapidly growing number of microservices running alongside our legacy applications. The number of moving parts grows very quickly, and working out what has failed when things go wrong becomes a challenge.
Nine months ago we decided to build a new monitoring and alerting system that would allow us to monitor all of our systems: from physical servers to microservices, some third-party systems, and our legacy applications.
We’ve gone through a number of iterations, starting with a basic setup to demonstrate that our solution was valid, and building on it until we arrived at a robust, highly available, highly scalable system.
Our monitoring solution has Prometheus at its heart, surrounded by a number of components that provide much-needed additional capabilities.
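For readers new to Prometheus: it collects metrics by scraping HTTP endpoints that return them in a simple plain-text exposition format. The snippet below is a minimal sketch of that format using only the Python standard library. It is not giffgaff's code: the `render_metric` helper, the metric name and the labels are all hypothetical and exist purely for illustration. A real service would normally use an official Prometheus client library instead.

```python
# Illustrative sketch only: renders one metric family in the Prometheus
# text exposition format. All names here are hypothetical examples.

def escape_label_value(value):
    """Escape backslashes, double quotes and newlines in a label value."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_metric(name, help_text, metric_type, samples):
    """Return the exposition text for one metric family.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            # Sort labels so the output is stable between calls.
            label_str = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counter for an example service.
text = render_metric(
    "http_requests_total",
    "Total HTTP requests handled.",
    "counter",
    [({"service": "example", "status": "200"}, 42)],
)
print(text)
```

Serving text like this from an HTTP endpoint is what lets Prometheus pull metrics from each target on a schedule, rather than requiring every service to push data to a central collector.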
The diagram below shows the complete architecture:
We run everything in a Kubernetes cluster, which makes scaling any of the components very easy, for both performance and resilience. Outer boxes represent pods, while inner boxes represent…