A Guide to Achieving Success with MLOps

  • By MLOps Expert Group
  • 20 April, 2023

Delivering value with Machine Learning is not an easy feat. At Factored, we’ve faced this challenge head-on since our inception. Why? Because Machine Learning is in our DNA. In order to bring success to our clients, we’ve perfected a set of MLOps Principles that create holistic Machine Learning-powered solutions.

We’ve found that the key to succeeding at MLOps is to acknowledge that Machine Learning is a fundamentally experimental endeavor during development, in production, and at every other stage. Delivering value through Machine Learning therefore requires MLOps solutions that empower our clients to continuously experiment, validate, and improve their Machine Learning solutions.

In this blog post, we’ll first delve deeper into how we conceive of the MLOps discipline at Factored, and then we’ll give you a few examples of how we’ve delivered value by creating successful MLOps solutions for our clients.

What is MLOps?

MLOps, short for Machine Learning Operations, entails designing systems that foster, sustain, and improve Machine Learning-powered value creation processes. At Factored, the goal of MLOps is to design, execute, and monitor the systems that make innovative Machine Learning solutions possible for our clients.

Successful MLOps, much like its cousin DevOps, is not the responsibility of a few team members. Instead, it depends on interaction between several roles across many parts of the business, within a culture of collaboration and open communication. Those roles include business managers, data engineers, data scientists, machine learning engineers, and software engineers, among others.

At Factored, through various projects and industry examples, we’ve found the following description of the MLOps life cycle by our founding advisor Andrew Ng invaluable:

Let’s dive deeper into what the above life cycle entails. 

The MLOps Pipeline

MLOps is a set of guidelines, practices, and tools that deliver reliable Machine Learning solutions faster. Even though it’s all about delivering value using ML, the model is only the tip of the iceberg. We’ve found that the model itself is just a small part of the surface area of the whole system: data pipelines, deployment infrastructure, and business logic on top of model scores are all part of the MLOps pipeline.

The Importance of Data

Unlike in traditional software engineering, in MLOps we need the data and the code to be tightly coupled. Data is the oil that powers our Machine Learning solutions and, as advocates of the Data-Centric AI movement, we’ve found that iterating on the data is the key to successful adoption of AI for businesses.

Constant Monitoring

The responsibility of MLOps doesn’t end when a model is deployed into production. Changing market conditions, new business requirements, and data shifts generate a constant need to monitor and adjust our models by going back to the data collection and model training stages of the MLOps life cycle. This brings us to the next part of our post.
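
As an illustration of what monitoring for data shifts can look like, here is a minimal sketch that compares a feature's distribution at training time with its distribution in recent production data using the Population Stability Index (PSI). This is one common choice among many, not a description of a specific Factored system. The function name, the number of bins, and the 0.2 alert threshold (a widely quoted rule of thumb) are all assumptions made for the example.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    n_bins: int = 10,      # assumption: 10 equal-width bins
    eps: float = 1e-6,     # floor for empty bins, avoids log(0)
) -> float:
    """PSI of one numeric feature: reference sample vs. current sample."""
    lo, hi = min(reference), max(reference)
    # Fall back to a width of 1.0 when the reference feature is constant.
    width = (hi - lo) / n_bins or 1.0

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def has_shifted(
    reference: Sequence[float],
    current: Sequence[float],
    threshold: float = 0.2,  # assumption: rule-of-thumb alert level
) -> bool:
    """Flag the feature when its PSI exceeds the chosen threshold."""
    return population_stability_index(reference, current) > threshold
```

A check like `has_shifted(training_values, last_week_values)` could run on a schedule for each monitored feature, with a positive result sending us back to the data collection and model training stages.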

The Need for Speed

The above MLOps life cycle reflects the reality of Machine Learning: it is tightly coupled with data within an ever-changing environment. This coupling creates scenarios where a problem at one end requires new iterations across the whole pipeline in order to validate creative solutions as fast as possible. This need for speed in testing hypotheses characterizes much high-value Machine Learning work: you can’t tell whether an idea is going to work before you test it out.

To this need for fast validation, we add the usual requirements for production-worthy code: scalability and reliability. Let’s examine how these concepts come into play in the aforementioned pipeline.

Scalability

Scalability refers both to the infrastructure and to the processes that run each of the above units in the life cycle. If you have one model in production, you can probably get by running data pipelines manually to re-train your models, and the same goes for deployment and monitoring. However, when you have hundreds or thousands of models, automation becomes a necessity: How can you trigger automatic re-training when the latest monitoring report shows that performance has drifted? How can you automate A/B testing to compare model performance in production? Scale in MLOps therefore refers to the ability to run, sustain, and improve your Machine Learning operations as both your workload and your model inventory grow.
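
To make the re-training question concrete, here is a minimal sketch of a trigger that scans monitoring reports for many models and requests re-training only for those whose performance has dropped too far. Everything in it is hypothetical: the `MonitoringReport` fields, the 5% relative-drop threshold, and the `retrain` callback (which in practice might enqueue a pipeline run) are assumptions for illustration, not part of any particular tool.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class MonitoringReport:
    """Hypothetical summary of one model's latest monitoring run."""
    model_id: str
    baseline_metric: float  # e.g. accuracy measured at validation time
    latest_metric: float    # same metric on recent production data


def needs_retraining(report: MonitoringReport,
                     max_relative_drop: float = 0.05) -> bool:
    """True when the metric fell by more than the allowed relative drop."""
    if report.baseline_metric <= 0:
        return False  # no meaningful baseline to compare against
    drop = (report.baseline_metric - report.latest_metric) / report.baseline_metric
    return drop > max_relative_drop


def trigger_retraining(reports: Iterable[MonitoringReport],
                       retrain: Callable[[str], None],
                       max_relative_drop: float = 0.05) -> List[str]:
    """Call `retrain(model_id)` for every drifting model; return their ids."""
    triggered = []
    for report in reports:
        if needs_retraining(report, max_relative_drop):
            retrain(report.model_id)
            triggered.append(report.model_id)
    return triggered
```

Because the decision rule is a plain function over reports, the same loop works for one model or for thousands, and the threshold can be tuned per model family without touching the orchestration code.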

Reliability

Reliability means that your system is able to perform at the desired level even in the face of adversity (hardware faults, software faults, or human error). An additional challenge that Machine Learning presents is that it can fail silently. For example, a data imputation operation can fail, resulting in nonsensical model inputs, while the model still produces “valid” outputs. To withstand frequent experiments and validation while detecting silent errors, an MLOps system must enable robust logging and testing of data and artifacts throughout the entire pipeline. For example, in the data pipelines that feed model serving and model retraining, we should log and test raw data as well as transformed data; in model training, we should version and test our experiments’ artifacts, code, and training data; and in the model deployment stage, we should be able to version control the served artifacts as well as any business rules on top. These steps form the foundation of a Continuous Integration pipeline for our MLOps system.
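
The silent-failure scenario can be guarded against with a small data test that runs on transformed data before it reaches the model. The sketch below is one possible shape for such a test, under assumptions of our own: rows are plain dictionaries, `validate_model_inputs` and `assert_valid_inputs` are hypothetical helper names, and the expected ranges would come from domain knowledge or training-data statistics.

```python
import logging
import math
from typing import Dict, List, Optional, Tuple

logger = logging.getLogger("data_checks")

Row = Dict[str, Optional[float]]


def validate_model_inputs(
    rows: List[Row],
    expected_ranges: Dict[str, Tuple[float, float]],
) -> List[str]:
    """Return one message per missing, NaN, or out-of-range value."""
    errors = []
    for i, row in enumerate(rows):
        for column, (low, high) in expected_ranges.items():
            value = row.get(column)
            if value is None or (isinstance(value, float) and math.isnan(value)):
                errors.append(f"row {i}: {column} is missing")
            elif not low <= value <= high:
                errors.append(f"row {i}: {column}={value} outside [{low}, {high}]")
    return errors


def assert_valid_inputs(
    rows: List[Row],
    expected_ranges: Dict[str, Tuple[float, float]],
) -> None:
    """Log every problem, then fail loudly instead of scoring bad inputs."""
    errors = validate_model_inputs(rows, expected_ranges)
    for message in errors:
        logger.error(message)
    if errors:
        raise ValueError(f"{len(errors)} invalid model input(s) detected")
```

With a rule such as `{"age": (0.0, 120.0)}`, a failed imputation that leaves NaN values or a sentinel like -999 in the `age` column stops the pipeline with a logged error, rather than letting the model score nonsensical inputs.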