Work-Bench Snapshot: Augmenting Streaming and Batch Processing Workflows
The Work-Bench Snapshot Series explores the top people, blogs, videos, and more, shaping the enterprise on a particular topic we’re looking at from an investment standpoint.
This post was originally published on The Data Source, my monthly newsletter covering the top innovations in data infrastructure, engineering, and developer-first tooling. Subscribe here and never miss an issue!
There are a number of free, open-source notebooks, as well as a mix of proprietary solutions, on the market today that data scientists across the enterprise are adopting. Even though computational notebook interfaces have existed for decades, we are starting to see some innovative work being done both at the notebook level and beyond.
Before I jump into some of the ongoing work around next-gen computational notebooks and collaborative data workspaces, let me set some context:
When Project Jupyter spun off from IPython in 2015, the idea was to create an open-source environment, governed by open standards, for interactive computing that supports multiple programming languages. Fundamentally, a Jupyter notebook is a JSON document that stores live code, organized logically into cells, and displays the outputs of executing that code in a way that’s relatively easy for humans to read and write.
Over time, Jupyter notebooks gained popularity within the data science community as they created a new standard for teams to collaborate on code in an interactive way. Just like you would on a Google document, multiple teammates can work off a Jupyter notebook, contributing, embedding, and tracking text and code in real time. Teams can add descriptive text alongside their code to explain what the code is doing, why certain decisions are being made, and how certain analyses are being derived.
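To make the "JSON document" point concrete, here is a minimal sketch of what sits inside a notebook file. The field names follow the publicly documented nbformat 4 layout; the notebook contents and the summarize helper are hypothetical, made up purely for illustration.

```python
import json

# A hand-written, minimal notebook in the nbformat 4 layout.
# The cell contents are invented for this example.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "id": "intro",
            "metadata": {},
            "source": ["# Revenue check\n", "Sum the monthly figures."],
        },
        {
            "cell_type": "code",
            "id": "total",
            "metadata": {},
            "execution_count": 1,
            "source": ["sum([120, 80, 100])"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 1,
                    "metadata": {},
                    "data": {"text/plain": ["300"]},
                }
            ],
        },
    ],
}


def summarize(nb):
    """Return (cell_type, source_text, plain_text_outputs) for each cell."""
    rows = []
    for cell in nb["cells"]:
        source = "".join(cell["source"])
        outputs = [
            "".join(out["data"]["text/plain"])
            for out in cell.get("outputs", [])
            if "text/plain" in out.get("data", {})
        ]
        rows.append((cell["cell_type"], source, outputs))
    return rows


# Round-trip through JSON text, which is what an .ipynb file holds on disk.
serialized = json.dumps(notebook, indent=1)
summary = summarize(json.loads(serialized))
for cell_type, source, outputs in summary:
    print(cell_type, "|", source.splitlines()[0], "|", outputs)
```

The takeaway is that code, prose, and results all travel together in a single file, which is what makes notebooks easy to pass around and, as discussed later, harder to review and version.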
With ~10 million public Jupyter Notebooks on GitHub and a growing number (~2,000) of open job postings on LinkedIn that list “Jupyter Notebook” as a required experience or bonus skill set, it’s clear that Jupyter has become a critical tool in the data science stack.
As organizations continue to grow their data processes to the point where collaboration between teams (and tooling) becomes a friction point, this will unlock some exciting opportunities in the Jupyter ecosystem for a better collaborative front-end experience.
But while there are undoubtedly many things to love about Project Jupyter, there are a few key areas for improvement that came up in my research and conversations with data practitioners that I’d like to call out:
Jupyter was developed around the idea of “shareable” computational notebooks, aimed at making it easier for data scientists, analysts, and scientific practitioners to iterate faster on code and share their experimental results with their teams (typically over GitHub, Jupyter Notebook Viewer, email, etc.). However, my research shows that Jupyter’s sharing functionality doesn’t extend well to non-programming, non-technical counterparts, as it was primarily built with the technical user in mind. It’s a problem that’s becoming more and more apparent in specific industries, such as biotechnology and healthcare, where data scientists and analysts often need to communicate back and forth with non-technical stakeholders such as scientists and researchers.
From my research, it seems like there is consensus that computational notebook interfaces have created a step change in the way that folks collaborate and iterate on processes. What’s missing in the market today is a better UI/UX for collaboration (more on this below) and a better workflow-oriented tool that can effectively close the gap between performing computational work and reporting out on those results in a way that’s secure, easily reproducible, and easily accessible to a diverse group of users.
In fact, one of the themes I’ve seen come up a lot from the data science community is this idea of better data shareability, where you can have a set process in place for creating reproducible work. Today, a number of solutions have emerged in the market, such as Hex, Curvenote, and Noteable, that are fundamentally rethinking the way teams beyond the data science organization collaborate on data through a shared interface. Unlike Jupyter, which caters largely to one particular user group, these next-gen solutions are taking a different approach. Instead of focusing on the needs of the individual data scientist or analyst, their goal is to open up access to data to anyone within the organization.
Building on top of the Jupyter ecosystem, these workflow-specific solutions are hyper-focused on turning data science computations into meaningful and usable outputs (reports, dashboards, etc.), while optimizing for a better data sharing and collaboration experience for users. These tools integrate deeply into the data layer and offer great back-end support and a good query experience.
Also tackling the “computation-to-communication” challenge is another interesting category of solutions, namely data framework products that focus on the development of interactive apps. Tools such as Streamlit, Plotly Dash, Voila, RStudio Shiny and Panel enable ML engineers and data scientists to turn their code and ML scripts into interactive web applications.
You’ve probably heard, over and over again, that good data science practice must follow the rules of good software engineering, and this couldn’t be more true. Data science has evolved to the point where data scientists today are building and deploying their own software.
While traditional notebooks are fairly good at delivering guided experiences for software development, they are still missing some of the core features (autocompletion, unit testing, debugging, version control, code reviews, documentation, etc.) that would encourage best practices around building reliable software. Besides, there are a bunch of known operational challenges around running and executing notebooks that make it quite difficult to go from prototyping to productionizing.
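As a rough illustration of why version control and testing get awkward, here is a minimal sketch of two workarounds teams commonly reach for: clearing outputs before committing so diffs stay readable, and pulling the code cells out into a plain script that can be tested. The helper names (strip_outputs, code_as_script) and the sample notebook are hypothetical; only the nbformat 4 field names are taken from the published format, and this is not how any particular tool implements it.

```python
import copy


def cell_source(cell):
    """nbformat allows source as a list of lines or as a single string."""
    src = cell.get("source", "")
    return "".join(src) if isinstance(src, list) else src


def strip_outputs(nb):
    """Return a copy with code-cell outputs and execution counts cleared."""
    cleaned = copy.deepcopy(nb)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def code_as_script(nb):
    """Join the source of every code cell into one plain-text script."""
    return "\n\n".join(
        cell_source(cell)
        for cell in nb.get("cells", [])
        if cell.get("cell_type") == "code"
    )


# Hypothetical notebook used only for this example.
sample = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "id": "note", "metadata": {},
         "source": ["Helper under test."]},
        {"cell_type": "code", "id": "defn", "metadata": {},
         "execution_count": 7,
         "source": ["def double(x):\n", "    return 2 * x"],
         "outputs": []},
        {"cell_type": "code", "id": "call", "metadata": {},
         "execution_count": 8,
         "source": ["double(21)"],
         "outputs": [{"output_type": "execute_result", "execution_count": 8,
                      "metadata": {}, "data": {"text/plain": ["42"]}}]},
    ],
}

cleaned = strip_outputs(sample)   # safe to commit: no results, no run counters
script = code_as_script(sample)   # plain code that a test can load

# A tiny unit test against code that originally lived in notebook cells.
namespace = {}
exec(script, namespace)
assert namespace["double"](21) == 42
print(script)
```

Having to bolt steps like these on by hand is part of what makes the jump from prototype to production feel heavier in a notebook than in a regular codebase.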
The question is, what does it take to address these challenges?
One of the more recent innovations in the data science notebook space is Deepnote, a Python notebook built around the Jupyter ecosystem. Their approach is to create a completely new computational front-end that brings core software development best practices to the data science workflow. At a quick glance, what really stands out is their focus on real-time collaboration and their growing library of integrations.
It’s exciting to see innovation happen at the UX/UI layer. In addition to improving collaboration and data science reporting capabilities, I think there’s an exciting opportunity around improving the discoverability of notebooks and enabling data governance and auditability (especially for organizations that have a significant number of notebooks under their purview), and I’ll be watching this space as it continues to evolve.
And that’s a wrap, folks! To all the founders and data scientists out there, I’d love to swap notes if this is a space that you’re passionate about. My Twitter DM is open and you can 📩 me at firstname.lastname@example.org!