Monitoring at giffgaff

9 min read · Dec 12, 2019

Monitoring remains a critical part of managing any IT system. It allows service owners to track a system’s health and availability, and to detect and prevent failures.

At giffgaff we run a system made up of hundreds of servers that scale up and down all the time, and a rapidly growing number of microservices running alongside our legacy applications. The number of moving parts grows very quickly, and working out what failed when something goes wrong becomes a challenge.

Nine months ago we decided to build a new monitoring and alerting system that would allow us to monitor all of our systems: from physical servers to microservices, some third-party systems, and our legacy applications.

We’ve gone through a number of iterations, starting with a basic setup to demonstrate that our solution was valid, and building on it until we arrived at a robust, highly available, highly scalable system.

Our monitoring solution has Prometheus at its heart, surrounded by a number of components that provide much-needed additional capabilities.
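For readers new to Prometheus: it collects metrics by periodically scraping an HTTP endpoint on each target, which returns its current values in a plain-text exposition format. The snippet below is a minimal sketch of that format using only the Python standard library. The function `render_counter`, the metric name, and the label values are hypothetical illustrations, not part of our setup; a real service would normally use an official Prometheus client library instead.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label_name, label_value) pairs to the
    counter's current value. Label values are assumed to need no escaping.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric and labels, for illustration only.
samples = {
    (("service", "checkout"), ("status", "200")): 42,
    (("service", "checkout"), ("status", "500")): 3,
}
text = render_counter(
    "http_requests_total", "Total HTTP requests handled.", samples
)
print(text, end="")
```

Each label combination becomes its own time series, which is what makes it possible to query and alert on, say, only the error responses of a single service.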

The diagram below shows the complete architecture:

We run everything in a Kubernetes cluster, which makes scaling any of the components very easy, for both performance and resilience. Outer boxes represent pods, while inner boxes represent…

Written by Matías Costa
