Monitoring at giffgaff

Matías Costa
9 min read · Dec 12, 2019

Monitoring remains a critical part of managing any IT system. It allows service owners to keep track of a system’s health and availability, and to detect and prevent failures.

At giffgaff we run a system made up of hundreds of servers that scale up and down all the time, and a rapidly growing number of microservices running along with our legacy applications. The number of moving parts grows very quickly, and finding out what goes wrong when things go wrong becomes a challenge.

Nine months ago we decided to build a new monitoring and alerting system that would allow us to monitor all of our systems: from physical servers to microservices, some third-party systems, and our legacy applications.

We’ve gone through a number of iterations, starting with a basic setup to demonstrate that our solution was valid, and building on it until we arrived at a robust, highly available, highly scalable system.

Our monitoring solution has Prometheus at its heart, with a number of other components around it that provide much-needed additional capabilities.
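For readers new to Prometheus: it works on a pull model, periodically scraping an HTTP endpoint (conventionally `/metrics`) that each monitored target exposes in a plain-text format. The sketch below is only an illustration of that idea using the Python standard library. It is not our actual code, and the metric name `demo_http_requests_total`, the label names, and port 8000 are all made up for the example. A real service would normally use an official Prometheus client library instead of formatting the text by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters keyed by (method, status); illustrative only.
REQUEST_COUNTS = {("GET", "200"): 0, ("GET", "500"): 0}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP demo_http_requests_total Total HTTP requests handled.",
        "# TYPE demo_http_requests_total counter",
    ]
    for (method, status), value in sorted(counts.items()):
        lines.append(
            f'demo_http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            # Serve the current counter values for Prometheus to scrape.
            body = render_metrics(REQUEST_COUNTS).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # Treat every other path as application traffic and count it.
            REQUEST_COUNTS[("GET", "200")] += 1
            self.send_response(200)
            self.end_headers()


if __name__ == "__main__":
    # Assumed port; a Prometheus scrape job would point at this address.
    HTTPServer(("0.0.0.0", 8000), MetricsHandler).serve_forever()
```

With the server running, requesting `/metrics` returns one line per label combination, which Prometheus would store as a separate time series.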

The diagram below shows the complete architecture:

[Figure: Monitoring Stack]

We run everything in a Kubernetes cluster, which makes scaling any of the components very easy, for both performance and resilience. Outer boxes represent pods, while inner boxes represent…
