Numerous companies have tried to incorporate machine learning into their products in some form or another. In the early days, it seemed straightforward: you have an idea inspired by a business need, train a model, wrap it behind an API, and you are good to go. However, people soon realized that life wasn’t that simple: teams would lose track of which model was the best; performance during evaluation didn’t reflect performance once the model was used in production; incorrect model predictions would appear more and more often; users would come up with ways to break the system; and so on.
The birth of MLOps
Developing a machine learning based solution is an iterative process, where we constantly try new models and even revisit the metrics we care about. To illustrate this, imagine you worked at an email provider and built a system to detect spam in 2012. Do you think such a system would still work today? Have you noticed how the spam people send you has changed over time? How do you ensure you always have the best possible spam detector deployed?
This is where data scientists and researchers met experienced software engineers, who had long since built a culture, practices, and tools that allow the safe and continuous delivery of updates to their software. Such practices are known as DevOps, and thus the concept of MLOps was born. You can therefore see MLOps as the natural adoption of the DevOps culture for the development of machine learning based applications: the culture, practices, and tools that allow us to continuously deliver better machine learning based software, faster and more safely.
The unique challenges of ML applications
It may seem too simple to deserve its own name, but there are enough intricacies to treat machine learning based applications as a discipline of their own. Compared to systems with no machine learning involved, there are several new ways in which these systems can fail, which requires us to pay special attention to:
- How data is manipulated, to allow high quality experimentation and to ensure consistency with the environments where the machine learning models will be deployed.
- How experiments are tracked, to allow us to reproduce critical results and to clearly see which model is the best at any point in time (see the sketch after this list).
- How we deliver models in a timely manner, to make sure the best possible model is the one deployed and that evaluation results can be reproduced in the production environment.
- How models are monitored and evolved to make sure they keep providing high quality predictions over time.
- How to optimize and validate models for inference, considering that we may need to adapt the models we trained so that they meet throughput and latency expectations.
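To make the experiment tracking point more concrete, here is a minimal sketch using MLflow as one possible tracking tool (the choice of tool is an assumption, not a recommendation from this article); the experiment name, parameters, and metric values are placeholders.

```python
# Minimal experiment-tracking sketch. MLflow is one possible tool choice;
# any comparable tracker works. Names and values below are placeholders.
import mlflow

mlflow.set_experiment("spam-detector")

with mlflow.start_run(run_name="tfidf-logreg-baseline"):
    # Record what was tried...
    mlflow.log_param("vectorizer", "tfidf")
    mlflow.log_param("classifier", "logistic_regression")
    # ...train and evaluate the model here...
    # ...and record how well it did, so the best run is easy to find later.
    mlflow.log_metric("precision", 0.96)  # placeholder value
    mlflow.log_metric("recall", 0.91)     # placeholder value
```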
Dealing with all of that automatically can be overwhelming and prohibitively expensive, depending on the maturity of what you are building. What matters most is to be aware of the things that could go wrong and to plan for them, even if you can’t build everything right away; still, the more you have automated, the quicker you can add value to users, so make sure you know what to expect. Google’s and Microsoft’s MLOps maturity models can help you understand this better: they essentially describe how much of the process you have automated and monitored, and how safe changes are.
Adding automation over time
Let’s consider our spam detector example from above. Early on, a data scientist or machine learning engineer would gather data from multiple sources offline and come up with a set of data transformations to train a model with high precision and recall. Once this is achieved, they might deploy a simple service to demo it to stakeholders, or even have it consumed by the email client backend whenever a new email arrives.
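A minimal sketch of this first, manual iteration, assuming a labeled offline dump of emails; scikit-learn and FastAPI are one possible choice of tools, and the file name, column names, and endpoint are hypothetical:

```python
# Sketch of the first, offline iteration: train a spam classifier, check
# precision/recall, and expose it through a tiny demo service.
# "emails_labeled.csv" and its columns (text, is_spam as 0/1) are hypothetical.
import pandas as pd
from fastapi import FastAPI
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

emails = pd.read_csv("emails_labeled.csv")
X_train, X_test, y_train, y_test = train_test_split(
    emails["text"], emails["is_spam"], test_size=0.2, random_state=42
)

# Keep the data transformations and the classifier in one pipeline so the
# exact same preprocessing runs at training and prediction time.
model = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=50_000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

preds = model.predict(X_test)
print("precision:", precision_score(y_test, preds))
print("recall:", recall_score(y_test, preds))

# Simple demo service the email backend (or a stakeholder demo) could call.
app = FastAPI()

@app.post("/is_spam")
def is_spam(payload: dict) -> dict:
    return {"is_spam": bool(model.predict([payload["text"]])[0])}
```

The service can then be run locally with, for example, `uvicorn spam_demo:app` (module name assumed) and called whenever a new email arrives.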
If you want to keep experimenting with new approaches to improve the performance of the system, you should start by setting up systems to automatically evaluate each model under the same fixed conditions, and by keeping a record of the different experiments and their results. From that record, the spam detector can be updated continuously by picking the best performing model from a central store of models (a.k.a. a model registry) and running all the necessary integration and end-to-end tests, so that we are sure the whole application remains safe after the update.
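A tool-agnostic sketch of that continuous update, assuming experiments were recorded as shown earlier: pick the best run, and only promote it if it beats the currently deployed model and the test suite passes. The helper functions are hypothetical stand-ins for a real test suite and model registry.

```python
# Hypothetical promotion gate. The two helpers are stand-ins for the real
# test suite and model registry; the control flow is the point here.

def run_integration_and_e2e_tests(model_uri: str) -> bool:
    """Stand-in for the real integration / end-to-end test suite."""
    return True  # pretend the tests passed

def register_as_production(model_uri: str) -> None:
    """Stand-in for promoting a model version in a model registry."""
    print(f"Registered {model_uri} as the new production model")

def promote_best_model(experiment_runs: list, production_recall: float) -> None:
    # Pick the candidate with the best evaluation recall (any metric, or a
    # combination of metrics, could drive this choice).
    best = max(experiment_runs, key=lambda run: run["recall"])

    if best["recall"] <= production_recall:
        print("No candidate beats the deployed model; nothing to promote.")
        return
    if not run_integration_and_e2e_tests(best["model_uri"]):
        print("Best candidate failed the test suite; not promoting.")
        return
    register_as_production(best["model_uri"])

# Example usage with made-up run records:
promote_best_model(
    [{"model_uri": "runs:/abc/model", "recall": 0.93},
     {"model_uri": "runs:/def/model", "recall": 0.95}],
    production_recall=0.91,
)
```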
Eventually, users will flag as spam many messages that the model didn’t, which should prompt the data scientist to revisit the current model’s assumptions about the data, but should also make it obvious that a way to monitor performance in production is needed. Engineers then start to set up metrics and alerts to detect changes in the statistics of spam messages, to track how often the model makes each type of mistake it can make (incorrectly marking a message as spam, or incorrectly marking it as not spam), and so on.
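A sketch of what such a monitoring check could compute over a window of user feedback; the feedback schema and the alert thresholds are assumptions, and in a real system these numbers would feed a metrics and alerting stack rather than print statements.

```python
# Hypothetical monitoring check over a window of production feedback.
# Each record pairs the model's prediction with the user's signal
# (e.g. "report spam" / "not spam"); schema and thresholds are assumed.

def check_error_rates(feedback: list, fp_threshold: float, fn_threshold: float) -> None:
    spam = [f for f in feedback if f["user_says_spam"]]
    ham = [f for f in feedback if not f["user_says_spam"]]

    # False negatives: spam the model let through.
    # False positives: legitimate mail the model hid in the spam folder.
    fn_rate = sum(not f["model_said_spam"] for f in spam) / max(len(spam), 1)
    fp_rate = sum(f["model_said_spam"] for f in ham) / max(len(ham), 1)

    if fn_rate > fn_threshold:
        print(f"ALERT: missed-spam rate {fn_rate:.2%} above {fn_threshold:.2%}")
    if fp_rate > fp_threshold:
        print(f"ALERT: false-spam rate {fp_rate:.2%} above {fp_threshold:.2%}")

# Example usage with made-up feedback records:
check_error_rates(
    [{"model_said_spam": False, "user_says_spam": True},
     {"model_said_spam": True, "user_says_spam": True},
     {"model_said_spam": False, "user_says_spam": False}],
    fp_threshold=0.01,
    fn_threshold=0.10,
)
```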
The previous step, combined with the fact that the email provider backend continuously stores new data, makes it possible to develop new models with the latest data available. Assuming no changes from the modeling point of view, we can set up an automated training process, so that data scientists don’t need to manually put the data together and run the experiments again. Those automated training runs would still have to pass the previously set up evaluation systems to decide whether they can be deployed safely.
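Putting it together, a scheduled retraining job could look roughly like the sketch below; every helper is a hypothetical stand-in for code that already exists elsewhere, and in practice the schedule would come from a workflow orchestrator (cron, Airflow, etc.) rather than a plain function call.

```python
# Hypothetical scheduled retraining pipeline: the same training code the data
# scientist wrote, rerun on the latest data and gated by the existing
# evaluation step. All helper functions are stand-ins.

def load_latest_training_data():
    """Stand-in for pulling the freshest labeled emails from the backend."""
    ...

def train_spam_model(data):
    """Stand-in for the training code the data scientist already wrote."""
    ...

def evaluate(model) -> dict:
    """Stand-in for the fixed, automated evaluation suite."""
    return {"precision": 0.0, "recall": 0.0}

def hand_off_for_deployment(model) -> None:
    """Stand-in for registering the model and triggering deployment tests."""
    ...

def scheduled_retraining(min_precision: float = 0.95, min_recall: float = 0.90) -> None:
    data = load_latest_training_data()
    model = train_spam_model(data)
    metrics = evaluate(model)

    # Only models that clear the evaluation gate move on; everything else is
    # logged for a data scientist to inspect.
    if metrics["precision"] >= min_precision and metrics["recall"] >= min_recall:
        hand_off_for_deployment(model)
    else:
        print(f"Retrained model below thresholds, not deploying: {metrics}")

# scheduled_retraining() would be invoked by the orchestrator on its schedule.
```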
We may not be done yet. What if the data scientist wants to try out some new strategies for modeling the problem? These may require changes in dependencies or the use of infrastructure not previously considered (e.g. hardware accelerators). Full automation would allow them to quickly deploy the new ideas for testing in a production-like environment. You can see a diagram of the above strategy of adding automation over time below.
Notice that in the diagram above there are only a couple of elements that strictly relate to the spam detection example: the spam detector and the Email service. That is because the process described here applies to virtually any machine learning based application you wish to build. There will be some caveats based on the type of data being handled, how you can gather feedback from users, and where you will deploy the system.
Final thoughts
MLOps seems pretty straightforward conceptually: the practice of continuously delivering better machine learning models, considering what could go wrong at a particular point in time, and then integrating them into software applications. However, the implementation can be quite expensive and time-consuming for new businesses or products. It can be adopted progressively to avoid high development costs and delays upfront.
It may be hard to visualize an MLOps strategy for your particular application. Reach out to Factored if you want to learn how we can help you streamline the process for your organization.