Work-Bench Snapshot: Augmenting Streaming and Batch Processing Workflows
The Work-Bench Snapshot Series explores the top people, blogs, videos, and more, shaping the enterprise on a particular topic we’re looking at from an investment standpoint.
This post was originally published on The Data Source, my monthly newsletter covering the top innovation in data infrastructure, engineering and developer-first tooling.
The movement towards real-time streaming infrastructure is finally crystallizing. In the last 3 years alone, I’ve observed a growing number of startups and products built around streaming data, many of which have received massive venture funding. I have also spoken to data practitioners across medium to large enterprises who are looking to optimize their failing batch workflows to enable more efficient, fresher data pulls and faster time-to-insight. This heightened awareness around streaming systems has not only brought into focus the existing challenges that make their implementation and adoption an engineering nightmare (e.g. ingesting, processing, storing, and serving that data), but has also opened up fresh scope for innovation.
Today, there’s a broad swath of tooling being built to address the reliability, cost, maintenance and usability considerations that real-time applications often raise. These tools span the entire data infrastructure stack across categories including data storage, ingestion, modeling, analytics, AI / machine learning and more. While many of these tools could eventually become part of the fabric underpinning modern data applications and products, I’ve been most fascinated by the ongoing evolution of the “modern” data warehouse and what it means for the future of big data and the creation of next-gen stream processing systems.
This is a meaningful development because, if you look back on the evolution of the batch data infrastructure layer, you’ll find that almost every major technological shift has emerged around the data warehouse. And driving every shift was a list of technical limitations that inhibited how data practitioners performed analytical work (processing, storing, querying, activating data, etc.).
Case in point: the most recent transformation in data warehousing came in the early 2010s, a time when data engineers were collectively trying to get to a point where they could process large-scale queries in a cloud-first world and have access to self-serve data. It was these technical challenges that gave birth to cloud data warehouses, distributed SQL engines and whole new categories of tooling built to support these newfound engineering capabilities.
What followed was the arrival of the first cloud data warehouses, with Google Cloud releasing BigQuery in 2011 and AWS releasing Amazon Redshift in 2012. Then, in 2013, came the next iteration of big data engines, e.g. Apache Impala, Trino (fka PrestoSQL) and more. In 2014, Snowflake (est. 2012) launched to GA and Fivetran (est. 2012) launched its first data connector. In 2015, Dremio launched with a new approach to self-service data without needing traditional ETL. In 2016, we saw Fishtown Analytics (now dbt Labs), maker of the data transformation tool dbt, come to life. And, more recently, in 2017 we saw the makers of Trino / PrestoSQL spin up Starburst Data as a way to redefine traditional data warehousing architectures.
While there are certainly other driving forces that have culminated in the growth and adoption of these products, I’d argue that a lot of it can be attributed to the rise of Snowflake / BigQuery / Redshift as the first data platforms built for the cloud. Snowflake, in particular, offered an opportunity to completely reinvent the SQL data warehouse with a unique elastic architecture that “physically separates and logically integrates compute and storage” and can easily scale up and down. Today, we are seeing a whole new ecosystem of cloud-native tools built around or directly on top of Snowflake in the so-called “modern data stack.” These categories include data observability (Bigeye, Monte Carlo Data), lineage / catalog (Atlan, Stemma), data activation (Hightouch, Census), predictive analytics (Continual), data app builders (Hex), metrics layers (Transform, Trace), business intelligence / dashboarding (Omni, Preset) and more.
Now, in mid-2022, we are sailing through yet another wave of data warehousing transformation, where today’s top engineering limitations are largely centered around querying and manipulating data near-instantly in order to unlock more value from it. It almost feels like every shift in data warehousing is just people wanting to do even more with their data. From my conversations with data infrastructure practitioners, I’m finding that many organizations are starting to ideate on potential solutions for architecting real-time data workflows that could power large-scale analytics apps and enable interactive data experiences. While not everyone has urgent use cases for fetching data in “real-time,” there is certainly a growing awareness of all that you can potentially do with your data if you’re able to instantly query and analyze it (e.g. powering interactive dashboards and data apps, or enabling live, multi-stakeholder engagement with the data).
It’s interesting to see Snowflake respond to these technical hurdles by doubling down on fortifying its streaming capabilities. Its recent offerings show that the cloud data warehouse as we know it now offers significant scope for building better interoperability between batch and streaming data and unlocking newer querying and analytic capabilities. In fact, this concept of a newer and more evolved data warehouse built for the batch-stream world has been in motion for a while now, with products such as Materialize, SingleStore, Imply, Onehouse, DeltaStream, Rockset and more making their way into the database market.
While the market for unified batch and stream databases is still nascent, I'm excited to watch this space grow as I anticipate newer categories of tooling specific to the streaming stack. In particular, I have my eyes on: