Work-Bench | Work-Bench Snapshot: Augmenting Streaming and Batch Processing Workflows

The Work-Bench Snapshot Series explores the top people, blogs, videos, and more, shaping the enterprise on a particular topic we’re looking at from an investment standpoint.

Due to the rising demand for large-scale data processing, today’s data systems have undergone significant changes to effectively handle transactional data models as well as support a larger variety of sources, including log and metrics from web servers, sensor data from IoT systems and more. In fact, the current data ecosystem is split between two fundamental computing paradigms, batch processing, where large volumes of data are scheduled and processed offline, and stream processing, where large streams of data are continuously processed for real-time analysis.

Today, there are an increasing number of applications that require both stream and batch processing. For example, financial services organizations utilize stream analytics in areas where it is important to get fast analytical results on critical jobs, such as monitoring fraud detection and analyzing customer behavior data and stock trades. On the other hand, batch processing is used for use cases where it’s more critical to process large volumes of data than it is to get near instant results, such as end-of-cycle data reconciliation.

But the current technologies underpinning these paradigms have evolved significantly over the past couple of years. Let’s dive into it!

Shifting from Kafka to Pulsar

Kafka, the leading platform for high-throughput and low-latency messaging created by LinkedIn, became largely popular because it can publish log and event data from multiple sources into databases in a real-time, scalable and durable manner. But Kafka is not as scalable and performant as it should be: Owing to its monolithic architecture, the storage and serving layers in Kafka are coupled and can only be deployed together. What this means is that every time someone needs to access data from the storage layer, the request has to go through the message broker first which slows down time to query and reduces latency and throughput.

Trend: Apache Pulsar is a next generation messaging and queuing system that came out of Yahoo. Unlike Kafka, Pulsar is architected in a multi-layer way which decouples its compute, storage and messaging layers into separate distinct layers, enabling developers to access data directly from each individual layer. This not only enables instant access to data as it gets published by the broker, but it also significantly increases throughput, data scalability and availability.

StreamNative is a cloud-native event streaming service powered by Apache Pulsar that was founded by the co-creators of Apache Pulsar and BookKeeper.
Pandio is a distributed messaging service that offers Apache Pulsar as a service.

Other tools include: Kesque and Cloudkarafka

Unifying Batch and Stream Processing Systems

Like event streaming platforms, batch processing systems have their own advantages and disadvantages. Innovation in the ETL pipeline has made it easier for the engineers and end users to collaboratively work with and process batch data. But since data in ETL is loaded on schedule, every time the end-user poses a question, the data has to be processed all over again in order to return a particular query. And, as more and more users query from the same pipeline and spin up multiple workflows on an ad-hoc basis, this results in slower query times and higher infrastructure costs.

Trend: Traditionally the operational and analytical stacks have largely been separate owing to the complexities of integrating the two. However, a trend that we’re observing in this space is around unifying these stacks in a model that expresses both batch and streaming computations to offer the best of each of these ecosystems. Here are some tools and their approaches on how they are tackling this space:

Materialize is a SQL streaming database built on top of the Timely Dataflow research project and is used for processing streaming data.
Estuary provides a unified foundational layer for batch and streaming workflows and is built on top of Gazette, an open source streaming infrastructure that enables users to build real-time apps with exactly-once semantics.

Other tools include: Dataflow and Apache Beam

Delivering Stream Processing Capabilities to a More Diverse User Base via SQL

While the first two forward-looking trends dealt with the infrastructure side of the challenge in stream and batch data processing, the user side of the problem also needs to be addressed. Data-driven organizations today have a large number of non-technical employees who need to analyze real-time streaming data but need to do so without having to interact with the complexities of the underlying infrastructure.

Trend: We are seeing a growing number of tools that democratize access to non-technical users by enabling them to query data through SQL, a common programming language among most data practitioners. These tools not only create an end-to-end self-service experience for the users but they also simplify the process by providing a common base for data engineers, data scientists and analysts to work collaboratively.

AthenaX is Uber’s open source streaming analytics platform built on top of Apache Flink that empowers both technical and non-technical users to run and analyze streaming analytics in production using SQL.

Relevant Blog Posts for Additional Reading

Rethinking Flink’s APIs for a Unified Data Processing Framework by Aljoscha Krettek
“As simple as this idea might be, making you wonder why it is that we have separate batch and stream processing systems — including Flink, with its DataSet and DataStream APIs — we live in a complex world where requirements are vastly different when it comes to data processing. More specifically, between batch and streaming, there is a difference that relates to what changes faster: the data or the program. For stream processing-style use cases (such as data pipelines, anomaly detection, ML evaluation, or continuous applications) we need to process data very fast and produce results in real-time, while at the same time the application logic stays relatively unchanged or is rather long-lived. On the other hand, for most batch processing-style use cases like data exploration, ML training, and parameter tuning, the data changes relatively slowly compared to the fast-changing queries and logic.”
Logs & Offsets: (Near) Real Time ELT with Apache Kafka + Snowflake by Adrian Kreuziger
“Unfortunately, as Convoy has grown to 700+ employees over the past year, the one thing in this picture that wasn’t scaling well was our ELT system. ELT, which stands for Extract Load Transform, is the process of extracting data from our various production services, and importing it into our data warehouse. Previously we had relied on a third party service to handle this work for us, but our import latency increased as we continued to scale. Data that used to take 10–15 minutes to import now frequently took 1–2 hours, and for some of the larger datasets you could expect latencies of 6+ hours. As you can imagine, knowing where a truck was 2 hours ago when it’s due at a facility now isn’t very useful. Several teams had already started building workarounds to pull real-time data from our production systems with varying degrees of success, and it became increasingly clear that access to low latency data was critical.”
Kafka Alternative Pulsar Unifies Streaming and Queuing by Susan Hall
“Apache Pulsar combines high-performance streaming (which Apache Kafka pursues) and flexible traditional queuing (which RabbitMQ pursues) into a unified messaging model and API. Pulsar gives you one system for both streaming and queuing, with the same high performance, using a unified API.”

People to Follow on Twitter

Stephan Ewen is a committer and PMC member of the Apache Flink project and CTO of Ververica (formerly data Artisans).

Arjun Narayan is the CEO and co-founder of Materialize, a NYC-based real-time streaming SQL database and was formerly a software engineer at Cockroach Labs.

Ricardo Ferreira is a developer advocate at Elastic and was formerly a member of the developer relations team at Confluent.

Maximilian Michels is a software engineer and committer to Apache Flink and Apache Beam and previously worked on the data infrastructure team at Lyft.

Videos

Unified Data Processing with Apache Flink and Apache Pulsar
This video from Flink Forward discusses how Apache Pulsar and Apache Flink can be combined to power a unified data processing platform.
Designing ETL Pipelines with Structured Streaming and Delta Lake — How to Architect Things Right
In this talk, the speaker, Tathagata Das, a software engineer at Databricks examines streaming design patterns and discusses the key considerations involved in architecting ETL pipelines.
From Batch to Streaming to Both
In this video, the speaker shares how the data platform at Skyscanner evolved from batch to streaming to a unified platform of both.
The Materialize Incremental View Maintenance Engine
In this video, Materialize co-founder Frank McSherry dives into the architectural differences that sets Materialize apart from traditional Spark-like systems and relational databases.

Let’s Chat! 📩

If you’re a startup or data practitioner working on a solution in this space, please reach out! I’d love to chat. We continue to learn and evolve our thinking in this space.