Log Aggregation With EFK
Logs are underrated in many enterprise environments. They are often ignored entirely, only noticed when disk space runs low, and at that point usually deleted without review.
Sometimes logs are seen as a way to troubleshoot operational problems, and they can be a good source of forensic information for determining what happened after an incident. However, we think proactive logging can also improve business decisions: logs, and in particular application logs, contain a wide range of information that is not available otherwise.
Why are logs ignored? Because log analysis isn't easy: effective log analysis takes work. Logs come in a variety of shapes and forms, and it can be difficult to extract information from them. The volume of logs generated by distributed applications can be overwhelming and hard to correlate.
Wouldn't it be great if you could aggregate the logs from multiple locations (servers, containers, or even deleted pods) in a single place? Imagine how useful it would be if you could index them and run fast queries to find exactly the data you're looking for.
At giffgaff, we've decided to use the EFK stack (Elasticsearch, Fluentd, Kibana) to provide that capability. EFK allows you to collect, index, search, and visualize log data. It is a variation of the ELK stack, with Fluentd replacing Logstash, and comprises:
- Elasticsearch: a distributed, RESTful search and analytics engine, where all logs are stored.
- Fluentd: an open source data collector providing a unified logging layer. It gathers logs from different sources and feeds them to Elasticsearch.
- Kibana: a web UI that lets you visualize your Elasticsearch data.
This article focuses on the core component of the stack: Elasticsearch. We'll discuss cluster sizing and configuration, and the benefits and limitations of using a managed cluster.
Fluentd is a beast in its own right, and we'll be sharing our experience with it in the near future.
Amazon Elasticsearch Service
Rather than building an Elasticsearch cluster ourselves, we decided to use Amazon Elasticsearch Service, a fully managed service that makes it easy to deploy and run Elasticsearch at scale. The service supports the open source Elasticsearch APIs and provides managed Kibana, built-in alerting, and SQL querying.
We've created a Terraform module so that the code is reusable and easily configurable through input parameters. Among other things, we can configure the number of instances, the instance types, the volume sizes, and whether we want dedicated master nodes in our clusters.
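To give an idea of the configuration surface involved: our actual module is written in Terraform, but by way of illustration, the same knobs exposed through boto3's `es` client look roughly like this (the domain name, region, instance counts, and volume size are placeholder values):

```python
import boto3

es = boto3.client("es", region_name="eu-west-1")

# Placeholder values: our Terraform module exposes these as input variables.
es.create_elasticsearch_domain(
    DomainName="logs",
    ElasticsearchVersion="7.1",
    ElasticsearchClusterConfig={
        "InstanceType": "r5.large.elasticsearch",   # data nodes
        "InstanceCount": 4,
        "DedicatedMasterEnabled": True,
        "DedicatedMasterType": "c5.large.elasticsearch",
        "DedicatedMasterCount": 3,
    },
    EBSOptions={
        "EBSEnabled": True,
        "VolumeType": "gp2",
        "VolumeSize": 512,                          # GiB per data node
    },
)
```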
Once the cluster is created, there are two important endpoints you'll be using: the VPC endpoint and the Kibana endpoint. The former is used by the log collector (Fluentd in our case) to interact with Elasticsearch, while the latter gives you access to the Kibana console. We recommend creating a Route53 record with a friendly URL pointing to the Kibana endpoint, so that it's easy to remember.
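For instance, assuming a private hosted zone (the zone ID, record name, and domain endpoint below are hypothetical), a single UPSERT does the job:

```python
import boto3

route53 = boto3.client("route53")

# Hypothetical zone ID, record name, and domain endpoint.
route53.change_resource_record_sets(
    HostedZoneId="Z0123456789EXAMPLE",
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "kibana.logs.internal.",
                "Type": "CNAME",
                "TTL": 300,
                "ResourceRecords": [
                    {"Value": "vpc-logs-abc123.eu-west-1.es.amazonaws.com"}
                ],
            },
        }]
    },
)
```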
Sizing the cluster
In our Elasticsearch workload, data continuously flows into a set of temporary indices. Indices are rotated daily and kept for the number of days specified in an index management policy. This pattern is commonly known as rolling indices.
Check out our article about Index State Management if you want to learn more about it.
For rolling indices, multiplying the amount of data generated during a representative time period by the retention period gives an indicative figure for the storage needed. For instance, if your systems generate 200 GiB of log data daily with a 30-day retention period, you'd have roughly 6 TiB of data stored in the cluster at any given time.
However, there are other aspects you need to consider:
- Number of replicas: Each replica is a full copy of an index and needs the same amount of disk space. By default, each Elasticsearch index has one replica. It is recommended to have at least one to prevent data loss. Replicas also improve search performance, so you might want more if you have a read-heavy workload.
- Elasticsearch indexing overhead: The on-disk size of an index varies, but is often 10% larger than the source data.
- OS reserved space: By default, Linux reserves 5% of the file system for the root user for critical processes, system recovery, and to safeguard against disk fragmentation problems.
- Amazon ES overhead: Amazon ES reserves 20% of the storage space of each instance (up to 20 GiB) for segment merges, logs, and other internal operations.
A simple formula to calculate your storage requirements is:
Source Data * (1 + Number of Replicas) * 1.45 = Minimum Storage Requirement
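The 1.45 multiplier is just the three overheads above combined: 1.1 for indexing overhead, divided by 0.95 for the OS reserve and by 0.8 for the Amazon ES reserve (1.1 / 0.95 / 0.8 ≈ 1.45). Applying the formula to the 200 GiB/day example:

```python
daily_gib = 200                 # log data generated per day
retention_days = 30
replicas = 1                    # one replica per index

source_data = daily_gib * retention_days              # 6,000 GiB, ~6 TiB
minimum_storage = source_data * (1 + replicas) * 1.45

print(f"{minimum_storage:,.0f} GiB")                  # 17,400 GiB, ~17 TiB
```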
Number of shards
Each Elasticsearch index is split into a number of shards. Choosing the right number of shards helps distribute an index evenly across all data nodes in the cluster. A good rule of thumb is to keep shard size between 10 and 50 GiB: large shards can make it difficult for Elasticsearch to recover from failure, while too many small shards can cause performance issues and out-of-memory errors.
The Elasticsearch default (prior to version 7) is 5 shards per index, and it can be modified using index templates. In our case, we've lowered this setting to 4 shards per index.
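As an illustration, a minimal index template along these lines applies the override to every daily index. The endpoint, template name, and index pattern are placeholders, and this uses the legacy `_template` API of Elasticsearch 6.x/7.x:

```python
import requests

ES = "https://vpc-logs-abc123.eu-west-1.es.amazonaws.com"  # placeholder endpoint

template = {
    "index_patterns": ["logstash-*"],   # match the daily rolling indices
    "settings": {
        "number_of_shards": 4,
        "number_of_replicas": 1,
    },
}

resp = requests.put(f"{ES}/_template/daily-logs", json=template)
resp.raise_for_status()
```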
Our master instances are from the c5 family (c5.large). For the data nodes, we started with m5, but as the load increased, JVMMemoryPressure went through the roof. We decided to move them to r5, which comes with double the memory for the same instance size (at a slightly higher price).
One thing to consider when choosing instance types is the EBS volume size limit, which varies per instance type.
If your source data is above 4 TiB and you need fast access, you might want to consider the i3 family, which provides local NVMe storage.
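Whichever family you pick, JVMMemoryPressure is worth watching closely. Amazon ES publishes it to CloudWatch, so a sketch like this pulls the hourly maximum for the last day (the domain name and account ID are placeholders):

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

# Hourly maximum JVM memory pressure over the last 24 hours.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ES",
    MetricName="JVMMemoryPressure",
    Dimensions=[
        {"Name": "DomainName", "Value": "logs"},        # placeholder domain
        {"Name": "ClientId", "Value": "123456789012"},  # your AWS account ID
    ],
    StartTime=datetime.utcnow() - timedelta(days=1),
    EndTime=datetime.utcnow(),
    Period=3600,
    Statistics=["Maximum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])
```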
Benefits and limitations
After almost two years using the service, we've found a lot of benefits but also some limitations. There are multiple articles comparing Amazon ES and Elasticsearch Service on Elastic Cloud, so we'll stick to our own experience here. First, the benefits:
- Easy to set up: It takes literally minutes to get a cluster up and running.
- Easy to manage: The service simplifies complex management tasks such as hardware provisioning, software installation and patching, failure recovery, backups, and monitoring.
- Highly scalable and highly available: You can easily scale your cluster up or down both vertically and horizontally via a single API call or a few clicks in the AWS console.
As for the limitations:
- Cluster modifications trigger a costly blue/green deployment process. AWS starts a whole new set of nodes and relocates every shard in the cluster to them; the old nodes are removed from the cluster once the process finishes. This can take several hours, depending on the volume of data in the cluster. We've seen this process happen twice in a row while upgrading both instance types and engine version.
- Access to the Elasticsearch administrative APIs is partially restricted, so you cannot change cluster-level settings.
- You’re always a few versions behind. Not a big deal, and sometimes a good thing.
- Kibana plugins: only a limited set of plugins is available, and you cannot install extra ones that you might want.
- Auto-complete functionality in Kibana is gone! After a major upgrade to version 7, auto-completion is no longer available. We really miss this one.
- Index management: index management used to be painful, and until recently we relied on a Python Lambda function to delete indices older than a specified number of days. Since March 2020, however, Amazon Elasticsearch provides built-in index management capabilities, and this functionality has been migrated to Index Management. A sketch of the Lambda approach follows below.
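For reference, a minimal sketch of that Lambda approach, assuming daily indices named like logstash-2020.03.15 and an unauthenticated VPC endpoint (both are illustrative placeholders):

```python
import re
from datetime import datetime, timedelta, timezone

import requests

ES = "https://vpc-logs-abc123.eu-west-1.es.amazonaws.com"  # placeholder endpoint
RETENTION_DAYS = 30


def handler(event, context):
    cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)
    # List all index names, one per line.
    names = requests.get(f"{ES}/_cat/indices?h=index").text.split()
    for name in names:
        match = re.fullmatch(r"logstash-(\d{4})\.(\d{2})\.(\d{2})", name)
        if not match:
            continue
        day = datetime(*map(int, match.groups()), tzinfo=timezone.utc)
        if day < cutoff:
            requests.delete(f"{ES}/{name}").raise_for_status()
```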
It's been almost two years since we started using EFK for log aggregation, and we can say it's made a big difference for the business. It's become one of those indispensable tools that our engineers use daily.
In addition, in an environment where workloads are ephemeral and logs do not last long, it is imperative to aggregate them before they disappear. We've already stated how valuable this information is, and we try to squeeze every insight we can out of it.
Originally published at https://www.giffgaff.io.