This post was originally published on The Data Source on April 6th, 2023, my monthly newsletter covering the top innovations in data infrastructure, engineering and developer-first tooling.
The market for software observability and debugging continues to evolve as a range of tools and modern technologies rethink how it can be done reliably. Looking at the software performance category (think speed, efficiency, responsiveness and reliability), three trends come to mind that I believe will continue to guide opportunities for innovation:
- Automation is at the core of every developer workflow: Automated testing tools (e.g. Mabl, Testim, Applitools, and LambdaTest) and continuous integration tools (e.g. AWS CodePipeline, GitLab CI/CD, and GitHub Actions) have shifted the way in which developers identify and remediate critical bugs in their code. With the rise of automated testing frameworks and DevOps best practices around CI/CD (continuous integration and delivery), developers are empowered to automate as much of their software delivery lifecycle (SDLC) as possible, making it possible to deploy software faster and more reliably.
- Observability is top of mind given the proliferation of distributed systems: With the rise of distributed systems, there’s renewed focus on improving software observability as developers grapple with monitoring, understanding and troubleshooting complex systems that span multiple nodes and services. In recent years, log aggregation tools (e.g. Splunk, Fluentd, Loggly, and Graylog), tracing frameworks (e.g. OpenTelemetry, Jaeger, and Zipkin) and technology such as eBPF (extended Berkeley Packet Filter) and related tooling (e.g. Flowmill, Tracee, Sysdig, Falco, and Cilium) have emerged to provide superior depth and clarity into production systems.
- DevOps continues to “Shift-Left”: From the security world to the modern software development ecosystem, Shift-Left as a concept has contributed to improving code quality, speeding up development cycles and enabling developer productivity. It has become common practice for teams to implement rigorous frameworks for software testing as early as possible in the development lifecycle. Instead of testing in production while software is being used by the customer, teams are shifting the process to the development and QA stages and limiting in-production debugging to edge-case scenarios.
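To make shift-left testing concrete, here’s a minimal sketch (the pricing function and its test below are hypothetical, written with plain assertions rather than any particular framework): a unit test that runs in CI on every commit, catching a bug well before the code reaches production.

```python
# A hypothetical pricing function and its unit test. In a shift-left
# workflow this test runs on every commit in CI, not against production.

def apply_discount(price: float, percent: float) -> float:
    """Return the price after applying a percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

def test_apply_discount():
    assert apply_discount(100.0, 25) == 75.0
    assert apply_discount(19.99, 0) == 19.99
    # Invalid input should fail loudly, not silently corrupt a price.
    try:
        apply_discount(100.0, 150)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for out-of-range percent")

test_apply_discount()
```

The same test wired into a CI pipeline (e.g. GitHub Actions) blocks the merge when it fails, which is exactly the shift from testing-in-production to testing-at-commit-time described above.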
Software Observability & Debugging In < 1 Minute
There’s no question that debugging is an essential part of the SDLC. Quickly identifying and resolving code issues that impact functionality, performance, and security is critical to ensuring a smooth customer experience. While DevOps teams have gotten better at implementing continuous testing throughout the development process, they also need to figure out how to safely debug production environments in order to mitigate issues that make it past QA. Not having the proper monitoring and observability guardrails in production can negatively impact customers and lead to frustration and lost revenue should something go awry.
But as many will tell you, debugging in production is a tedious, complicated and expensive process:
- Production environments often span multiple components, each interacting with one another which can make it hard for developers to identify, isolate and diagnose the root cause of an issue. These environments are “live,” meaning there are user requests and traffic flowing through and new features and updates that are being introduced. Given these constant changes, it’s tricky for developers to accurately capture all that’s going on.
- Given the frequency and volume of activities that production environments are subject to, developers grapple with limited visibility surrounding code bugs. Even with the right tools for collecting metrics, logs, traces and profiles, stitching data together to make an inference can be a lengthy process that can delay resolution of these bugs and impact developer productivity.
- Putting the engineering lift aside, there are real security risks associated with the process. Debugging in production involves making changes to code, which can in turn introduce vulnerabilities if not done correctly. In cases where the debugging process requires analysis of sensitive information, data security risks can stem from its accidental exposure, and mishandling of that data could lead to breaches.
Today, teams are investing in a combination of automated monitoring, observability and alerting solutions to ensure that their critical systems remain secure and highly performant at all times.
Digging into Current Software Observability & Debugging Tools
Software observability and debugging go hand in hand when it comes to building and maintaining reliable systems: Through observability, one can quickly identify issues that can then be fixed through debugging.
To enable real-time visibility into software systems, developers instrument logging, tracing, and metrics to generate insights about system behavior. Application profiling is also leveraged to understand how a program consumes resources in order to help optimize their usage.
These solutions fall into the following categories:
- Logging tools capture events and errors that occur in a system and store them as log files, which are then used to troubleshoot issues and understand system behavior. Examples of logging solutions include Mezmo, Logz.io, and Humio (acquired by CrowdStrike).
- Metrics tools capture numerical data about a system’s performance. Metrics can range from network traffic to latency, throughput and response times and are used to track performance over time by identifying important trends and anomalies. Examples of metrics tools include Chronosphere, Datadog and Grafana.
- Alerting tools capture data about various parts of a working system (e.g. logs, events, performance metrics) and continuously monitor for anomalies. Should a threshold for a particular metric be crossed, the tool notifies the on-call team before the issue creates an outage and impacts end users. Examples of alerting solutions include PagerDuty, New Relic and Honeycomb.
- Distributed tracing tools track transactions (or requests) to form an understanding of how data moves through a system. For distributed systems involving multiple components and microservices, distributed tracing enables developers to generate insights based on the transaction path of data as it flows from one end to another. Examples of tracing solutions include Dynatrace, SigNoz and Lightstep (acquired by ServiceNow).
- Profiling tools capture resource-specific data such as CPU and memory usage and disk I/O to guide teams in optimizing their resource consumption. In addition to performance optimization, profiling tools are helpful for identifying memory-related issues, providing code coverage analysis to flag areas of a program that have not yet been tested, and building a proactive strategy for handling unexpected crashes. Examples of profiling solutions include YourKit, Optimyze (acquired by Elastic), Pyroscope (acquired by Grafana), and Polar Signals.
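As a toy illustration of how the logging, metrics and alerting pillars fit together in application code (the decorator, the 200 ms threshold and the in-memory metric list below are hypothetical stand-ins, not any vendor’s API):

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")

LATENCY_ALERT_MS = 200  # hypothetical alerting threshold
latencies_ms = []       # in-memory stand-in for a metrics backend

def observed(fn):
    """Record a latency metric and emit a structured log line per call."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed_ms = (time.perf_counter() - start) * 1000
        latencies_ms.append(elapsed_ms)                       # metric
        log.info("call=%s latency_ms=%.2f", fn.__name__, elapsed_ms)  # log
        if elapsed_ms > LATENCY_ALERT_MS:                     # alert
            log.warning("ALERT call=%s exceeded %dms",
                        fn.__name__, LATENCY_ALERT_MS)
        return result
    return wrapper

@observed
def handle_request():
    time.sleep(0.01)  # simulate work
    return "ok"

handle_request()
```

In a real system the log line would ship to an aggregator, the latency would land in a time-series store, and the threshold check would live in the alerting tool rather than in the application, but the division of labor is the same.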
While software observability techniques have come a long way and newer categories such as distributed tracing and application profiling are starting to garner mindshare in the DevOps ecosystem, I find that many organizations still struggle to implement end-to-end observability in their production systems. Today’s gaps tend to be around the lack of context for the different metrics being pulled. Current tools are great at capturing relevant insights but fail to correlate data across different metrics to build a comprehensive picture of what’s truly broken.
On eBPF and what’s next in Software Observability & Debugging
In my research and conversations with DevOps practitioners, eBPF has come up as a next-gen tech that holds significant promise for shaping how software observability and debugging is done.
For context, eBPF allows developers to write low-level code that can be safely executed within the Linux kernel, providing a flexible way to observe, analyze and manipulate data as it moves through the kernel. Unlike the original Berkeley Packet Filter (BPF), which was designed for network packet filtering but lacked the flexibility to support complex use cases such as tracing system calls and analyzing performance, eBPF is a highly performant tool for identifying performance and security issues.
More specifically, eBPF allows developers to attach custom probes to various system events in order to trace and analyze how the system behaves at a low level. Examples of system events range from network activity such as connections and disconnections, to application crashes, failed login attempts and memory reaching a critical level. By being able to observe these activities at the kernel level, eBPF can prove valuable for debugging software.
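Real eBPF probes run inside the kernel (they are typically written in restricted C and loaded with tooling such as bcc or libbpf, which requires root privileges), so they can’t be sketched portably here. As a loose userspace analogy for the probe model, attaching a small callback to events and observing them without touching the traced code, Python’s built-in sys.setprofile hook can count function calls:

```python
import sys
from collections import Counter

call_counts = Counter()

def probe(frame, event, arg):
    # Like a probe firing on an event: observe the system, don't modify it.
    if event == "call":
        call_counts[frame.f_code.co_name] += 1

def square(x):
    return x * x

def traced_workload():
    # The "system under observation" -- its code is unchanged by the probe.
    total = 0
    for i in range(10):
        total += square(i)
    return total

sys.setprofile(probe)    # attach the probe to function-call events
result = traced_workload()
sys.setprofile(None)     # detach

print(result, dict(call_counts))
```

A real eBPF probe is the same idea one layer down: a small program attached to a kernel event (a kprobe, tracepoint or network hook) that records data each time the event fires, while the observed workload runs unmodified.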
While eBPF is a relatively new technology, it has been experiencing significant growth and interest from the cloud-native ecosystem with some of the biggest tech companies leveraging the tool internally:
- Datadog for example has released Cloud Workload Security (it uses eBPF to hook into events such as file interactions, privilege escalations, and executions) and Network Performance Monitoring (it uses eBPF to gain visibility into low level networking to power aggregate network flow visualizations down to granular network details).
- New Relic, through its acquisition of Pixie, provides live visibility into Kubernetes and cloud-native workloads. It enables real-time monitoring and automatic instrumentation to collect metrics and traces from clusters in real time.
- Splunk, through its acquisition of Flowmill, leverages network traffic analysis to provide in-depth visibility into the underlying infrastructure layer (microservices, containers and more). Flowmill’s capabilities around high-speed data capture and automatic bug discovery are appealing to customers running applications on the cloud and in complex environments.
In general, there’s a lot of excitement around the use of eBPF to power next-gen observability and debugging frameworks. Below are some benefits that have really stood out to me:
- Non-intrusive approach to observability: Unlike traditional debugging solutions, eBPF is able to monitor and trace kernel events without requiring any changes to application code. This non-intrusive approach introduces little to no overhead, enabling developers to inject custom probes into the system without making any significant change to the application.
- Granular software visibility: By virtue of being implemented at the kernel level, eBPF is able to fetch low-level data about how systems behave in real time. These kernel functions tend to reveal more context and offer better data correlation and accuracy than traditional means of observing systems in motion.
- Expressiveness and programmability of eBPF: Because eBPF was built to be an expressive and programmable framework, developers are able to write their own custom programs and probes to collect and analyze different types of data from kernel events. This equips developers with a superior level of flexibility when writing their own customized debugging solutions versus having to work within the restricted walls of traditional tooling.
As more complex and distributed applications get created over time, technology like eBPF will play an important role in setting the standard for what good software observability and debugging look like. Even though we are still in the early stages of understanding the true scope of what eBPF holds for the future of software reliability, its adoption across big tech companies is an exciting signal, and I expect more innovation to take shape on the startup front.
If you’re an early-stage startup building developer software in this space, I’d love to hear from you! DM me on Twitter or email me!