Log Aggregation With EFK

  • Elasticsearch: a distributed, RESTful search and analytics engine, where all logs are stored.
  • Fluentd: an open source data collector providing a unified logging layer. It gathers logs from different sources and feeds them into Elasticsearch.
  • Kibana: a web UI that lets you visualize your Elasticsearch data.
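To make the pipeline concrete, a minimal Fluentd configuration for this setup might look like the following sketch. The paths, tag names, and endpoint are placeholders for illustration, not our production config:

```
<source>
  @type tail                       # follow application log files
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.pos
  tag app.*
  <parse>
    @type json                     # parse each line as a JSON event
  </parse>
</source>

<match app.**>
  @type elasticsearch              # ship parsed events to Elasticsearch
  host elasticsearch.example.com
  port 9200
  logstash_format true             # write to daily logstash-YYYY.MM.DD indices
  flush_interval 10s
</match>
```

The `logstash_format true` option is what produces the daily rolling indices discussed below.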

Amazon Elasticsearch Service

Rather than building an Elasticsearch cluster ourselves, we decided to use Amazon Elasticsearch Service, a fully managed service that makes it easy to deploy and run Elasticsearch at scale. The service supports the open source Elasticsearch APIs, managed Kibana, and built-in alerting and SQL querying.

Sizing the cluster

There are some good articles that can help you find the right size for your cluster [1][2]. The next sections summarize the key considerations for this task.

Storage requirements

In our Elasticsearch workload data continuously flows into a set of temporary indices. Indices are rotated daily and kept for a number of days specified in an index management policy. This is commonly known as rolling indices.

  1. Number of replicas: Each replica is a full copy of an index and needs the same amount of disk space. By default, each Elasticsearch index has one replica. It is recommended to have at least one to prevent data loss. Replicas also improve search performance, so you might want more if you have a read-heavy workload.
  2. Elasticsearch indexing overhead: The on-disk size of an index varies, but is often 10% larger than the source data.
  3. OS reserved space: By default, Linux reserves 5% of the file system for the root user for critical processes, system recovery, and to safeguard against disk fragmentation problems.
  4. Amazon ES overhead: Amazon ES reserves 20% of the storage space of each instance (up to 20 GiB) for segment merges, logs, and other internal operations.
Combining these factors (×1.1 for indexing overhead, ÷0.95 for the OS reserve, ÷0.8 for the Amazon ES reserve, which works out to roughly 1.45) gives:

Source Data * (1 + Number of Replicas) * 1.45 = Minimum Storage Requirement
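The formula above can be applied directly. A small sketch (the function name and the example workload figures are illustrative, not from the article):

```python
def minimum_storage_gib(source_data_gib, replicas=1):
    """Estimate minimum storage for rolling indices.

    The 1.45 factor folds together the overheads listed above:
    ~10% indexing overhead (x1.1), the 5% OS-reserved space (/0.95),
    and the 20% Amazon ES reserve (/0.8): 1.1 / 0.95 / 0.8 ~= 1.45.
    """
    return source_data_gib * (1 + replicas) * 1.45

# Hypothetical workload: 100 GiB of logs per day, retained for 14 days,
# with the default single replica.
print(f"{minimum_storage_gib(100 * 14):.0f} GiB")  # → "4060 GiB"
```

Note the retention period matters as much as the daily volume: doubling either one doubles the storage requirement.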

Number of shards

Each Elasticsearch index is split into a number of shards. Choosing a correct number of shards helps distribute an index evenly across all data nodes in the cluster. A good rule of thumb is to keep shard size between 10–50 GiB. Large shards can make it difficult for Elasticsearch to recover from failure. On the other hand, too many small shards can cause performance issues and out-of-memory errors.
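That rule of thumb translates into a simple calculation. A sketch, assuming a 30 GiB target (the midpoint of the 10–50 GiB range; the function name is mine):

```python
import math

def primary_shard_count(index_size_gib, target_shard_gib=30):
    """Pick a primary-shard count that keeps shards near the target size.

    30 GiB sits mid-way in the 10-50 GiB rule of thumb; an index always
    needs at least one primary shard.
    """
    return max(1, math.ceil(index_size_gib / target_shard_gib))

print(primary_shard_count(120))  # → 4 (four shards of ~30 GiB each)
print(primary_shard_count(5))    # → 1 (too small to be worth splitting)
```

Since the number of primary shards is fixed at index creation time, for rolling indices this estimate should be based on the expected size of one day's index.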

Instance types

Our master instances are from the c5 family (c5.large). For the data nodes, we started with m5, but as the load increased, JVMMemoryPressure went through the roof. We decided to move them to r5 instances, which come with double the memory of the equivalent m5 size (at a slightly higher price).

Benefits and limitations

After almost two years using the service, we’ve found a lot of benefits but also some limitations. There are multiple articles comparing Amazon ES and Elasticsearch Service on Elastic Cloud.

The good

  • Easy to set up: It takes literally minutes to get a cluster up and running.
  • Easy to manage: The service simplifies complex management tasks such as hardware provisioning, software installation and patching, failure recovery, backups, and monitoring.
  • Highly scalable and highly available: You can easily scale your cluster up or down both vertically and horizontally via a single API call or a few clicks in the AWS console.

The bad

  • Cluster modifications trigger a costly process. AWS starts a whole new set of nodes, relocates every shard in the cluster to them, and removes the old nodes once the migration finishes. This can take several hours, depending on the volume of data in the cluster. We’ve seen the process happen twice in a row when upgrading both the instance type and the engine version.
  • Access to the Elasticsearch administrative APIs is partially restricted; in particular, cluster-level settings cannot be changed.
  • You’re always a few versions behind. Not a big deal, and sometimes a good thing.
  • Kibana plugins: there’s a limited number of plugins available, and you cannot install extra plugins that you might want.
  • Auto-complete functionality in Kibana is gone! After a major upgrade to version 7, auto-completion is no longer available. We really miss this one.
  • Index management: Index management used to be painful. However, since March 2020, Amazon Elasticsearch Service provides built-in index management capabilities. Until then, we had been using a Python Lambda function to delete logs older than a specified number of days; that functionality has now been migrated to Index State Management (ISM).

Conclusion

It’s been almost two years since we started using EFK for log aggregation, and we can say it’s made a big difference for the business. It’s become one of those indispensable tools our engineers use daily.

Matías Costa

SRE engineer | Technology enthusiast | Learning&Sharing