Work-Bench | Programmable Compute, the Next Big Thing in ML Infrastructure?

This post was originally published on April 14th, 2023 in The Data Source, my monthly newsletter covering the top innovations in data infrastructure, engineering and developer-first tooling. Subscribe here and never miss an issue!

The data and machine learning landscape has transformed over the past five years. With cloud data platforms like Snowflake and Databricks revolutionizing the management of data across distributed systems, the discourse has started shifting towards leveraging advanced machine learning to support business-critical use cases.

Distributed machine learning, too, has taken a front seat. Given the scale and complexity of modern machine learning models, organizations are recognizing the value of enabling collaboration between multiple stakeholders to speed up model development and deployment cycles. And given the need for advanced machine learning and computing techniques to accelerate model training and improve efficiency, big tech companies and large enterprises alike are seeking the right solutions to power their internal tech stacks. This trend may indicate that we are on the cusp of a new phase in the evolution of machine learning infrastructure.

Mapping Out the Evolution of ML Infrastructure

Pulling from the archives, it’s worth looking back to 2016-2017, when Uber and Meta started openly discussing their approach to machine learning infrastructure. We got a glimpse of what Meta’s and Uber’s original production-scale machine learning platforms looked like with the launch of Meta’s FBLearner Flow (2016) and Uber’s Michelangelo (2017). Both FBLearner Flow and Michelangelo were launched as improved versions of their monolithic predecessors, which marked an important step in the evolution of machine learning infrastructure.

Trend 1: ML Workflows Move Off of Monolithic Architectures, Onto Microservices

The first generation of internal machine learning systems was built on monolithic architectures. In the early days of machine learning, hardware and software computing resources were limited and machine learning algorithms were not as sophisticated as they are today. Building machine learning systems as monoliths, where all the processing components are tightly coupled and integrated into a single codebase, was a practical move that made managing these systems fairly simple.

But as machine learning advanced and demand for sophisticated, complex machine learning workflows increased, system modularity and flexibility became a top engineering priority. As described in a blog post by the Uber Engineering team:

“Before Michelangelo, we faced a number of challenges with building and deploying machine learning models at Uber related to the size and scale of our operations. While data scientists were using a wide variety of tools to create predictive models (R, scikit-learn, custom algorithms, etc.), separate engineering teams were also building bespoke one-off systems to use these models in production. As a result, the impact of ML at Uber was limited to what a few data scientists and engineers could build in a short time frame with mostly open source tools.
Specifically, there were no systems in place to build reliable, uniform, and reproducible pipelines for creating and managing training and prediction data at scale. Prior to Michelangelo, it was not possible to train models larger than what would fit on data scientists’ desktop machines, and there was neither a standard place to store the results of training experiments nor an easy way to compare one experiment to another. Most importantly, there was no established path to deploying a model into production.”

Taking a step back, it’s important to note that machine learning workflows vary widely depending on the tasks and operations they support. Workflows exist as fixed pipelines and flexible ones. Fixed pipelines are designed for tasks that have well-defined, stable processing requirements and leave little room for customization. Examples of such tasks include data pre-processing, feature engineering and model training.

Flexible pipelines, on the other hand, are designed for tasks that have dynamic processing needs and that may be subject to change over time. Through flexible pipelines, users can customize their workflows by adding or removing tasks and experiment with different preprocessing steps and models as needed. This is especially valuable in real-world scenarios such as predictive modeling and anomaly detection where the data is constantly evolving.
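To make the distinction concrete, here is a minimal Python sketch. It is an illustration only, not how Michelangelo or FBLearner Flow are implemented: the step functions (`drop_missing`, `scale`, `clip_outliers`) and the `FlexiblePipeline` class are hypothetical names invented for this example.

```python
from typing import Callable, List, Optional

Step = Callable[[list], list]

def drop_missing(rows: list) -> list:
    """Remove rows that have no value."""
    return [r for r in rows if r is not None]

def clip_outliers(rows: list) -> list:
    """Cap every value at 100 (an arbitrary threshold for this sketch)."""
    return [min(r, 100) for r in rows]

def scale(rows: list) -> list:
    """Divide every value by the maximum (falls back to 1 if the max is 0)."""
    top = max(rows, default=0) or 1
    return [r / top for r in rows]

# Fixed pipeline: the steps and their order are hard-coded.
FIXED_STEPS = (drop_missing, scale)

def run_fixed(rows: list) -> list:
    for step in FIXED_STEPS:
        rows = step(rows)
    return rows

class FlexiblePipeline:
    """Flexible pipeline: steps can be added or removed at runtime."""

    def __init__(self, steps: Optional[List[Step]] = None):
        self.steps: List[Step] = list(steps or [])

    def add(self, step: Step, position: Optional[int] = None) -> None:
        # Append by default, or insert at a specific position.
        if position is None:
            self.steps.append(step)
        else:
            self.steps.insert(position, step)

    def remove(self, step: Step) -> None:
        self.steps.remove(step)

    def run(self, rows: list) -> list:
        for step in self.steps:
            rows = step(rows)
        return rows
```

With this sketch, experimenting with an extra preprocessing step is a one-line change (`pipeline.add(clip_outliers, position=1)`), whereas the fixed pipeline would need a code change and a redeploy.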

The monolithic nature of first-generation machine learning platforms posed significant challenges around scaling workflows and reusing and customizing individual components to meet growing business requirements. It’s why Uber struggled with code duplication, longer development cycles and scalability issues.

By building Michelangelo as a microservices-first machine learning system, the team figured out a way for fixed and flexible pipelines to be managed and scaled independently. In a microservices environment, machine learning workflows are broken down into smaller individual components that can be customized without affecting the rest of the system. Developers, too, are able to work independently on each service so that feature creation, testing and deployment can happen concurrently, leading to faster iteration and development cycles.

This phase in the evolution of machine learning infrastructure has made machine learning workflows much simpler to build, deploy and operate at scale.

Trend 2: ML Workflows Are Programmable

The next evolution that transformed Uber’s and Meta’s machine learning architecture was the introduction of programming interfaces.

Programming interfaces are critical to modern software development as they allow different components of a workflow to communicate with each other through seamless integration. As it relates to machine learning infrastructure, programming interfaces make it possible for teams to access and implement different libraries and frameworks into their workflows.

Both Michelangelo and FBLearner Flow have introduced API support for a wide range of machine learning functions (e.g. model training, feature extraction, prediction, etc.). These APIs enable different components to be developed and deployed independently, then connected to create a larger machine learning system. For example, with the Michelangelo PyML API, teams can access frameworks such as TensorFlow, PyTorch and scikit-learn and have the flexibility to experiment with different model architectures. In the case of FBLearner Flow, PyTorch is one of the primary frameworks, and the platform’s APIs can be used to deploy and serve PyTorch models at scale.

The concept of programming interfaces is especially relevant today given the rise of distributed computing. With the proliferation of sophisticated models, organizations are realizing that training large-scale models on massive datasets is a computationally intensive task. Given the mounting requirements for training large models (e.g. the need for specialized hardware, compute and expertise), programming interfaces have enabled Uber and Meta to evolve their machine learning systems to integrate with distributed training frameworks such as Horovod (Uber) and Caffe2 (Meta). These frameworks work by distributing deep learning workloads across multiple GPUs, nodes and clusters, to train models on big data in a fraction of the time.
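As a rough illustration of the idea behind data-parallel training, the sketch below simulates it in plain Python: each "worker" computes a gradient on its own shard of the data, the gradients are averaged (the step an allreduce performs in a real framework), and every worker applies the same update. This is a toy simulation on a one-parameter linear model, not Horovod's or Caffe2's API; all function names here are made up for the example, and it assumes shards of equal size so that the averaged gradient matches the full-batch gradient.

```python
def local_gradient(w: float, shard: list) -> float:
    """Gradient of mean squared error for y ~ w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values: list) -> float:
    """Stand-in for an allreduce: every worker ends up with the average."""
    return sum(values) / len(values)

def train(data: list, num_workers: int, steps: int = 200, lr: float = 0.01) -> float:
    # Split the dataset into one shard per worker (round-robin).
    shards = [data[i::num_workers] for i in range(num_workers)]
    w = 0.0
    for _ in range(steps):
        # In a real system these gradients are computed in parallel on separate GPUs.
        grads = [local_gradient(w, shard) for shard in shards]
        w -= lr * allreduce_mean(grads)
    return w

# Synthetic data following y = 3x for x = 1..8.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w_single = train(data, num_workers=1)
w_parallel = train(data, num_workers=4)
```

Both runs converge to roughly the same weight; the point of distributing the work is that each worker only touches a fraction of the data per step.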

By building Michelangelo and FBLearner Flow as programmable interfaces, machine learning engineering teams are able to focus solely on the business logic of their models and not on the underlying infrastructure powering distributed computing tasks. This allows them to scale up their models and processing power as needed, without having to rewrite code.

This phase in the evolution of machine learning infrastructure has made it possible for teams to program complex and distributed workflows without touching any low-level compute tasks.

Trend 3: We Cracked ML Workflows, Now Onto the Compute Layer

Looking at how machine learning infrastructure has transformed over the years, I realize that what we’ve been dealing with has inherently been a workflow problem. The first evolution was all about breaking monolithic workflows into flexible and modular pipelines that could be optimized for specific tasks. The second was about making pipelines programmable so developers didn’t have to think about the underlying architecture.

Assuming we cracked machine learning workflows, I believe the focus will now shift away from wrangling pipelines to figuring out how to programmatically provision compute so developers don’t have to worry about the nitty-gritty infrastructure details of hosting and deploying their models.

In my research and conversations with ML engineers, two things have become clear:

The process for deploying machine learning models at scale is mired in infrastructure-related issues.

To articulate a few of them:

API Server Issues: Deploying machine learning models often involves standing up an API server to handle inference requests, but when the server is overloaded with requests, the result can be slow response times or even system failure.
Dependency Hell: Machine learning models rely on different software components, Python packages, libraries and frameworks to function properly, but managing these dependencies and making sure they are all compatible across different environments is difficult.
Model Weight Issues: Machine learning models can have a large number of weights depending on the complexity of the model architecture (e.g., deep learning models will typically have millions of weights). Transferring large model weights across different environments can become a bottleneck when managing and deploying ML models.
GPU Processing: GPUs are commonly used for accelerating processing in ML tasks but managing GPU resources when deploying models is not trivial. Teams typically struggle with allocating the correct amount of memory for each of their tasks and distributing the load evenly across multiple GPUs.
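Two of these issues come down to simple arithmetic, which the following sketch tries to show. It estimates how large a set of model weights is from its parameter count and numeric precision, then places models onto GPUs with a greedy largest-first heuristic. Everything here is an assumption made for illustration: the function names are hypothetical, the heuristic is just one simple choice, and the estimate counts weights only (real deployments also need memory for activations, caches and framework overhead).

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_size_gb(num_params: int, dtype: str = "float32") -> float:
    """Approximate size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

def place_models(model_sizes_gb: dict, gpu_capacity_gb: float, num_gpus: int):
    """Greedy placement: largest model first, onto the GPU with the most free memory."""
    free = [gpu_capacity_gb] * num_gpus
    placement = {}
    by_size = sorted(model_sizes_gb.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in by_size:
        gpu = max(range(num_gpus), key=lambda g: free[g])
        if size > free[gpu]:
            raise ValueError(f"model {name!r} ({size} GB) does not fit on any GPU")
        free[gpu] -= size
        placement[name] = gpu
    return placement, free

# A hypothetical 7-billion-parameter model stored in half precision.
size_gb = weight_size_gb(7_000_000_000, "float16")

# Three hypothetical models spread over two 16 GB GPUs.
placement, free = place_models({"a": 10, "b": 6, "c": 5}, gpu_capacity_gb=16, num_gpus=2)
```

Even this toy version hints at why the problem is not trivial: a model that fits on paper can still fail to place once other models have fragmented the available memory.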

The concepts of “Serverless GPUs” and “Programmable Compute” are becoming increasingly top of mind given the cost and scale benefits of serverless computing and processing power offered by GPUs.

By leveraging serverless GPUs, developers in need of sophisticated compute for running machine learning workloads can access GPUs without having to manage or buy specialized hardware. Through programmable computing, developers can remotely execute code on hardware that is physically separate from their local devices and only pay for the resources they use during computation.
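The developer experience this describes can be sketched roughly as follows. This is a toy stand-in, not the API of any of the platforms discussed in this post: the `remote` decorator, its parameters and the hourly rate are all hypothetical. The decorated function runs locally here; a real platform would ship it to a remote GPU worker and meter the time it runs there.

```python
import functools
import time

def remote(gpu: str, usd_per_hour: float):
    """Hypothetical decorator: declare the hardware a function needs in code."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            # Toy behavior: execute locally. A real service would run this remotely.
            result = fn(*args, **kwargs)
            wrapper.billed_seconds += time.perf_counter() - start
            return result
        wrapper.billed_seconds = 0.0
        wrapper.gpu = gpu
        wrapper.usd_per_hour = usd_per_hour
        return wrapper
    return decorator

def usage_cost(seconds: float, usd_per_hour: float) -> float:
    """Pay-per-use billing: only the seconds actually spent computing are charged."""
    return seconds / 3600 * usd_per_hour

@remote(gpu="any-gpu", usd_per_hour=2.0)  # made-up GPU label and price
def embed(texts: list) -> list:
    # Placeholder for real model inference.
    return [len(t) for t in texts]

vectors = embed(["hello", "programmable compute"])
bill = usage_cost(embed.billed_seconds, embed.usd_per_hour)
```

Under the made-up rate above, 30 minutes of actual compute would cost `usage_cost(1800, 2.0)`, i.e. 1.0 USD, while a dedicated machine at the same rate would cost 48 USD per day whether or not it is busy.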

A wave of startups is forming around serverless GPUs and programmable compute, with companies like Anyscale, Coiled, Modal, Banana, Replicate and Beam building high-level abstractions on top of the compute layer. By offloading compute-intensive tasks to remote servers, these solutions eliminate the pain of deploying machine learning workloads and customizing infrastructure to support them. Since serverless GPUs enable the parallel processing of data, these solutions may be able to achieve faster model training and reduce infrastructure costs, because resources are only consumed while a workload is actually running.

While the use of serverless GPUs and programmable computing for machine learning tasks is conceptually exciting, the category is still early in its lifecycle, and the startups that have emerged in this space are still in the early stages of development. But I expect to see continued growth and development in this category, especially as organizations seek to develop and deploy large-scale machine learning models and applications more efficiently.

Machine learning practitioners and startup builders, if this is an area of interest to you, I’d love your thoughts! Please reach out to me here.
