With growing pressure to conserve budget, tech teams are identifying their largest areas of spend and how best to optimize their stacks. What’s clear from conversations with engineering leaders is that one of the biggest drivers of tech spend is logging. While logs are a pillar of observability, too often they lead to massive storage bills and memory-constrained systems.
Fortunately, there are ways to optimize log costs.
We’ll study how Uber, HubSpot, and Finout handled log cost management. Three common themes emerge:
- Understanding the use cases for logs - what are the business needs? How do logs correlate with revenue?
- Log retention period - similarly, how long should logs be stored to maximize business value?
- Compression (Uber with CLP and HubSpot with ORC) - how can you store logs more cheaply and efficiently?
Painpoint: At Uber, users were demanding an increase in the log retention period from three days to at least a month. On-call engineers needed older log data for postmortem root-cause analysis when Spark jobs failed or stalled. Data scientists wanted access to old Spark logs to improve their ML applications. Other users wanted to analyze logs to monitor anomalies and improve applications. But increasing the log retention period would be costly: with the existing tooling, the HDFS storage costs alone would reach millions of dollars per year. On top of that, Elasticsearch had incurred prohibitive costs, and Uber had already been forced to move away from it. There were several other challenges, like SSDs burning out prematurely and an overall memory-constrained system.
Solution: As a long-term solution, the engineering team decided to explore compression within the logging library. They turned to a tool called CLP (Compressed Log Processor) that offers “unprecedented compression (2x of general-purpose compressors)” while preserving the ability to search the logs without fully decompressing them. Uber had to customize CLP because its logging library writes a single log file at a time, whereas CLP was designed to compress batches of files. Ultimately, Uber rolled out the work in two phases: Phase 1 - compressing a single log file at a time for “modest compression” - and Phase 2 - aggregating the compressed files into a format compatible with CLP.
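CLP itself dictionary-encodes the static text and variables of each log message so that queries can run against the compressed representation. As a much simpler sketch of the underlying idea of searching logs without ever materializing a decompressed file (the data below is illustrative, and gzip stands in for CLP’s format):

```python
import gzip
import io

# Illustrative sample log lines; gzip is a stand-in for CLP's real format,
# which dictionary-encodes log types and variables for far better ratios.
log_lines = [
    "INFO task 1 started",
    "WARN task 2 stalled",
    "INFO task 3 finished",
]
compressed = gzip.compress("\n".join(log_lines).encode())

def grep_compressed(blob: bytes, needle: str) -> list[str]:
    """Stream-decompress `blob` line by line, returning matching lines
    without ever writing the decompressed log to disk."""
    with gzip.open(io.BytesIO(blob), "rt") as fh:
        return [line.rstrip("\n") for line in fh if needle in line]

print(grep_compressed(compressed, "WARN"))  # ['WARN task 2 stalled']
```

CLP goes much further than this streaming trick: because the static parts of each message are stored once in a dictionary, many queries never touch the bulk of the compressed data at all.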
Result: Uber achieved a compression ratio of 169x after Phase 1. Prior to leveraging CLP, storing logs for three days cost $180,000 / year. One month of storing logs would have cost $1.8 million / year. After implementing just Phase 1, the cost was reduced to $10,000 / year for a one month log retention period. Not only was the team able to retain the logs for a month, but they also restored log levels from WARN back to INFO, and reduced storage costs by 17x. Uber will implement Phase 2 compression, which will reduce storage costs even further (by more than 2x) and still enable powerful search analytics.
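The cost figures above can be sanity-checked with back-of-envelope arithmetic (all dollar amounts are from the article; only the arithmetic is added here):

```python
# Back-of-envelope check of the Uber figures cited above.
cost_3_days = 180_000                    # $/year, 3-day retention, pre-CLP
cost_30_days = cost_3_days * (30 / 3)    # linear extrapolation to a month
assert cost_30_days == 1_800_000         # the $1.8M/year figure

cost_after_phase1 = 10_000               # $/year, 1-month retention, post-Phase 1
reduction = cost_3_days / cost_after_phase1
print(f"{reduction:.0f}x cheaper than the old 3-day bill")
```

This comes out to roughly 18x against the old three-day bill, in the same ballpark as the 17x the team reports; the exact ratio depends on which baseline figures are compared.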
Painpoint: The Backend Performance team at HubSpot is tasked with optimizing costs for their backend. When the team analyzed their cost data, they found that logs (specifically hubspot-live-logs-prod) were driving the largest percentage of their S3 costs. After talking to other teams, they learned that logs are first stored as raw JSON in S3 and are then converted to compressed ORC format (a columnar storage format). ORC request log data is 5% the size of raw JSON data. A key discovery was that the Spark compaction job was not keeping up with the volume of the logs and only 30% of the files ended up getting compressed to ORC. The Backend Performance team also found that the request logs and application logs specifically were major contributors to costs.
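The size gap between row-oriented JSON and a columnar layout is easy to demonstrate with nothing but the standard library. A minimal sketch with made-up request-log fields (this is not actual ORC, just an illustration of why columnar layouts compress well):

```python
import json
import random
import zlib

# Toy illustration only (not real ORC): grouping values by column places
# similar bytes next to each other, which generic compressors exploit far
# better than row-oriented JSON lines. The field names and value
# distributions below are synthetic assumptions.
random.seed(0)
rows = [
    {
        "path": random.choice(["/api/contacts", "/api/deals", "/api/tickets"]),
        "status": 200 if random.random() < 0.95 else 500,
        "ms": random.randrange(1, 500),
    }
    for _ in range(5000)
]

row_wise = "\n".join(json.dumps(r) for r in rows).encode()   # JSON lines
columnar = json.dumps({                                       # column-major
    "path": [r["path"] for r in rows],
    "status": [r["status"] for r in rows],
    "ms": [r["ms"] for r in rows],
}).encode()

print(len(zlib.compress(row_wise)), len(zlib.compress(columnar)))
```

Real ORC adds type-aware encodings (run-length, dictionary) on top of this effect, which is how it can reach something like 5% of the raw JSON size.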
Solution: The team settled on two approaches: (1) reduce the cost of logs by storing them for less time, and (2) increase the percentage of logs stored as compressed ORC.
With regard to the former, the Backend Performance team noticed a disconnect between how long files were stored in JSON and ORC formats. JSON files were stored 60% longer than ORC files. A key question arose around how log files were being used. At HubSpot, engineers run about 2,200 log queries a day. Most often, they’ll look at the most recent logs to diagnose issues. Occasionally, they’ll leverage logs that are a few weeks or months old. On very rare occasions, teams will need access to logs that are 6+ months old. The use cases suggested that HubSpot did not need historical log data from 1-2 years ago. Therefore, the teams could lower the TTL of the raw JSON data.
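Lowering the TTL on raw JSON objects in S3 is typically done with a bucket lifecycle rule. A hedged sketch (the prefix and the 90-day window are illustrative, not HubSpot’s actual values):

```json
{
  "Rules": [
    {
      "ID": "expire-raw-json-logs",
      "Filter": { "Prefix": "raw-json/" },
      "Status": "Enabled",
      "Expiration": { "Days": 90 }
    }
  ]
}
```

A rule like this can be applied with the AWS CLI’s `put-bucket-lifecycle-configuration`, after which S3 deletes matching objects automatically once they age out.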
HubSpot also wanted to convert more logs from raw JSON to ORC format, but they needed the right architecture to increase the percentage of logs stored as compressed ORC. Ultimately, they decided to perform the ORC conversion during the staging phase. They also leveraged a new custom event logging pipeline.
Result: The HubSpot team achieved seven figures in yearly cost savings from the ORC conversion and six figures from the TTL reduction. Additionally, the monthly average JSON log cost was reduced by 55.7%. The changes also improved engineers’ experience searching logs.
Painpoint: Finout launched a cost reduction effort to mitigate costs associated with logs. They had just replaced a legacy high-throughput service with a new one, most notably shrinking the fleet from ~300 EC2 instances to ~30 pods on a multi-tenant Kubernetes cluster. They were excited about improved reliability, lower costs, and lower latency. But they were suddenly, inadvertently logging large volumes of data at high throughput, which inevitably led to skyrocketing bills from their logging provider. It became clear that they needed a solution.
Solution: Before architecting a solution, Finout aimed to better understand what logs were actually being used for, and which logs were queried most frequently in their managed ELK. They found that their requests didn’t directly correlate with revenue and that they could get away with a one-day log retention period. They then took a number of steps: eliminating unnecessary service requests, severity-based logging (raising the log level to WARNING, and later to ERROR), dynamic logging (a mechanism that decides per request whether to log it), implementing an in-memory latch (a mechanism that “locks in” a certain state), and switching log vendors.
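Two of those steps, severity-based logging and per-request dynamic logging, can be sketched with Python’s stdlib logging module (the sampling filter below is an illustrative stand-in for whatever mechanism Finout actually built):

```python
import logging
import random

random.seed(1)
logger = logging.getLogger("billing")          # illustrative logger name
logger.addHandler(logging.StreamHandler())

# Severity-based logging: raise the threshold so INFO chatter is dropped
# at the source, before it ever reaches the logging vendor.
logger.setLevel(logging.WARNING)
logger.info("handled request 42")              # suppressed
logger.warning("slow downstream call")         # emitted

# Dynamic logging: a per-record decision about whether to log. Here we
# sample 1% of sub-ERROR records; ERROR and above always pass.
class SampleFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        return record.levelno >= logging.ERROR or random.random() < 0.01

logger.addFilter(SampleFilter())
```

The in-memory latch Finout describes would sit alongside a filter like this, pinning the decision to a state (e.g. "log everything during an incident") rather than rolling the dice per record.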
Result: Eliminating unnecessary service requests reduced logging costs by 5-10%. Severity-based logging cut log volume by ~90%. Finout also cut the associated logging cost by ~50% by switching their managed ELK vendor.
Companies to Support Log Cost Management
In summary, logs are incredibly useful, right up until they become a hefty cost to your company. To reduce that cost, base log retention on business drivers. And if the business requires those logs, compress them or find other ways to manage their storage. Tools like CLP, along with commercial tools like Cribl, LogSlash / cwolves, Vaero, Vector, and Mezmo, can help with this.
- Cribl - Cribl Stream is an observability pipeline that can help with log reduction by eliminating elements that don’t provide analytical value.
- LogSlash - LogSlash is a method for reducing log volume that sits between your log producers (firewalls, systems, applications) and log platforms (Databricks, Splunk, Snowflake, S3). It was created by John Althouse.
- cwolves - cwolves is a startup built around the LogSlash method that offers AI-based normalization and a config builder. The company is also developing a Splunk application to make LogSlash more accessible to Splunk customers.
- Vaero - Vaero is a data pipeline for log data that can help with log cost savings by filtering and routing logs.
- Vector - Vector is an observability data pipeline that enables cost reduction.
- Mezmo - Mezmo offers a telemetry pipeline and log management platform that aid in observability data cost reduction.
If you’re thinking of other ways to help solve the log cost problem, I’d love to connect with you!