What is MLOps and Why is it Important?
Machine Learning is becoming ubiquitous across companies of every size, from start-ups to global enterprises. Enterprises are attempting to solve real-world business problems through Machine Learning and AI, and this approach brings new challenges. Data Scientists are building state-of-the-art models using the Model Development Life Cycle. As a reminder, the overall Model Development Life Cycle can be summarized as:
- ML Development: experiment and develop a robust and reproducible model training procedure (training pipeline code), which consists of multiple tasks from data preparation and transformation to model training and evaluation. Typically, Data Scientists begin with data transformation and experimentation in IDEs such as JupyterLab Notebooks, reviewing model quality metrics to identify models for further development or training. They then train the selected models in development, using tools like hyperparameter (HP) tuning or AutoML, to prepare the model for production.
- Training Operationalization: automate the packaging, testing, and deployment of repeatable and reliable training pipelines.
- Continuous Training: repeatedly execute the training pipeline in response to new data, code changes, or on a schedule, potentially with new training settings.
- Model Deployment: package, test, and deploy a model to a serving environment for online experimentation and production serving. An MLOps engineer takes the model creation pipeline from the Data Scientist and deploys the creation, serving, and validation of the model and supporting services into a production environment.
- Prediction Serving: serve the production model via an inference server exposed via a cloud endpoint.
- Continuous Monitoring: monitor and measure the efficacy and efficiency of a deployed model.
- Data and Model Management: this is a central, cross-cutting function for governing ML artifacts to support auditability, traceability, and compliance. Data and model management can also promote shareability, reusability, and discoverability of ML assets.
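The Continuous Training step above names three triggers: new data, code changes, and a schedule. A minimal sketch of that decision follows. The `PipelineState` record, the version strings, and the seven-day default are all hypothetical and chosen for illustration; a real orchestrator would supply its own equivalents.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class PipelineState:
    """Hypothetical snapshot of what the last training run used."""
    last_run_at: datetime
    last_data_version: str
    last_code_revision: str


def retraining_reasons(
    state: PipelineState,
    current_data_version: str,
    current_code_revision: str,
    now: datetime,
    schedule: timedelta = timedelta(days=7),  # assumed interval, not from the text
) -> List[str]:
    """Return the reasons (possibly none) to re-run the training pipeline."""
    reasons = []
    if current_data_version != state.last_data_version:
        reasons.append("new data")
    if current_code_revision != state.last_code_revision:
        reasons.append("code change")
    if now - state.last_run_at >= schedule:
        reasons.append("schedule")
    return reasons


state = PipelineState(
    last_run_at=datetime(2024, 1, 1),
    last_data_version="data-v1",
    last_code_revision="abc123",
)
# Only the data version changed, two days after the last run.
print(retraining_reasons(state, "data-v2", "abc123", datetime(2024, 1, 3)))
```

Returning the list of reasons, rather than a bare boolean, keeps a record of why each training run was started, which supports the auditability goal in the last bullet.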
Data Scientists are supported by Data Engineers, who are responsible for sourcing and cleaning the data used for model development. ML Engineers are responsible for moving models into production and setting up monitoring to mitigate the impact of changes in data profile or model drift. MLOps is focused on automating the testing and deployment of Machine Learning models while simultaneously improving the quality and integrity of the data used, so that models can be iterated on or created in response to model drift and data profile changes. MLOps is a framework designed in response to the needs of the overall Model Development Life Cycle while adopting the popular “shift-left” and “shift-right” concepts from traditional DevOps.
MLOps demands that build tools ensure replicability, repeatability, and reproducibility, and that the data being considered is properly sourced and cleaned. If two people build the same product at the same revision number in the source code repository on different machines (or cloud environments), we should expect identical results. The build process must be self-contained and must not rely on services external to the build environment. A portable, composable, and resilient approach ensures consistency between development and production, which drives the velocity of iterations. Because velocity can come at the cost of stability, balance between the two is critical. In general, MLOps is summarized as:
- An abstracted methodology for interfacing with specialized services so that Model Development Life Cycle work can be done efficiently and in a decoupled manner.
- The ability to continuously build, train and improve models to ensure stability and accuracy during production.
- Strategies and processes to repeatedly transform raw datasets, produce predictive features and respond to business goals.
- An environment that handles the life cycle of continuous training pipelines and resulting models as well as monitoring the quality of prediction results.
All in all, MLOps is a unified vision and an automated Continuous Training process that cycles through the four points above to improve the model in response to ever-evolving business needs.
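The reproducibility requirement described earlier (the same revision built twice should give identical results) can be checked mechanically by comparing artifact digests. The sketch below is only an illustration: the toy least-squares `train` function, the synthetic data, and the `artifact_digest` helper are hypothetical stand-ins for a real training pipeline. It also assumes both builds run in the same environment; real frameworks may need pinned dependencies and deterministic settings before bit-identical results are achievable across machines.

```python
import hashlib
import json
import random


def train(seed: int, n_samples: int = 200) -> dict:
    """Toy 'training' run: least-squares fit of y = w*x + b on seeded synthetic data."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    xs = [rng.uniform(-1, 1) for _ in range(n_samples)]
    ys = [3.0 * x + 0.5 + rng.gauss(0, 0.1) for x in xs]
    mean_x = sum(xs) / n_samples
    mean_y = sum(ys) / n_samples
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    b = mean_y - w * mean_x
    return {"w": w, "b": b}


def artifact_digest(model: dict) -> str:
    """Hash a canonical serialization of the model parameters."""
    payload = json.dumps(model, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Two "builds" at the same revision, with the same seed and the same data.
first_build = artifact_digest(train(seed=42))
second_build = artifact_digest(train(seed=42))
print(first_build == second_build)  # True
```

Comparing digests rather than accuracy numbers catches any divergence between builds, including differences too small to show up in evaluation metrics.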
The goal of MLOps is to deploy the model and achieve ML model lifecycle management holistically across the enterprise, reducing technical friction and moving into production as quickly and with as little risk as possible. A model has no ROI until it is running in production. As with any process, we track a variety of KPIs to evaluate the performance of our MLOps environment and the overall process:
- Commits / Day / Data Scientist: How frequently can a team update pipeline code?
- Development Cycle Time: How long does work take, soup to nuts?
- Defect Rate: How often does a model need to be rolled back or retrained?
- Mean Time to Repair: How quickly can a team redeploy or debug a failing or broken model?
- Mean Time to Respond: How quickly can a team adjust to model drift or other external pressures?
While these are general health metrics to gauge the effectiveness of the MLOps environment and practice, keep in mind that the end goal is a stable model in production. These metrics are not the source of truth for model quality; they inform the decision-making process around these models.
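Two of the KPIs above, Defect Rate and Mean Time to Repair, can be computed directly from a release history. This is a minimal sketch under stated assumptions: the `Deployment` record and its fields are hypothetical, and "defect" is taken to mean any release that was flagged as failed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one model release."""
    deployed_at: datetime
    failed_at: Optional[datetime] = None    # when the model was found to be broken
    repaired_at: Optional[datetime] = None  # when a fix or rollback went live


def defect_rate(deployments: List[Deployment]) -> float:
    """Fraction of releases that had to be rolled back or retrained."""
    if not deployments:
        return 0.0
    failures = sum(1 for d in deployments if d.failed_at is not None)
    return failures / len(deployments)


def mean_time_to_repair(deployments: List[Deployment]) -> Optional[timedelta]:
    """Average time from failure to repair, over repaired failures only."""
    durations = [
        d.repaired_at - d.failed_at
        for d in deployments
        if d.failed_at is not None and d.repaired_at is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


history = [
    Deployment(datetime(2024, 1, 1)),
    Deployment(datetime(2024, 1, 8),
               failed_at=datetime(2024, 1, 9, 10),
               repaired_at=datetime(2024, 1, 9, 12)),
    Deployment(datetime(2024, 1, 15)),
    Deployment(datetime(2024, 1, 22),
               failed_at=datetime(2024, 1, 23, 8),
               repaired_at=datetime(2024, 1, 23, 12)),
]
print(defect_rate(history))          # 0.5
print(mean_time_to_repair(history))  # 3:00:00
```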
In order to work at scale, teams must be self-sufficient so that each team can decide how often and when to release new versions of its products. This is the intention of the “shift-left” thinking that came about as part of DevOps and is being adopted by MLOps. Teams must also have confidence that their work does not create additional work for their peers, and that their MLOps platform facilitates both the control and the collaboration necessary to catch issues early and minimize incurred technical debt.
The choice of which build to release must be based on test results for the features included in that build. Release processes should be automated to the point that they require minimal involvement from engineers, so that multiple projects can be built and released automatically using a combination of automated build systems and deployment tools. Additionally, it must be possible to track the lineage of the data used as well as the model creation process. In summary, the platform must enforce company-specific non-negotiables around security, reliability, and governance to protect against common human errors that can lead to risky outcomes.
This type of thinking is why MLOps is critical to enterprises looking to reduce technical debt and improve model development quality and deployment KPIs. Keep in mind that an effective and self-sustaining MLOps culture and environment is an ideal that many enterprises are marching towards with Kubeflow; we will therefore explore this discussion within the context of Kubeflow. Throughout this course, we will explore MLOps topics and discuss how mature MLOps deployments approach popular problems such as the one solved in this Kaggle Competition example.
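The lineage requirement mentioned above (tracking the data used and the model creation process) can be illustrated with a small record that ties a model artifact to its inputs. This is a sketch only: `LineageRecord`, `record_lineage`, and `matches_dataset` are hypothetical names, and this is not how Kubeflow or any particular platform stores metadata.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def sha256_hex(data: bytes) -> str:
    """Content hash used to identify a dataset or model artifact."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class LineageRecord:
    """Hypothetical lineage entry linking a model to what produced it."""
    dataset_sha256: str
    code_revision: str
    params_json: str
    model_sha256: str


def record_lineage(dataset: bytes, code_revision: str,
                   params: dict, model: bytes) -> LineageRecord:
    """Capture the inputs and output of one training run."""
    return LineageRecord(
        dataset_sha256=sha256_hex(dataset),
        code_revision=code_revision,
        params_json=json.dumps(params, sort_keys=True),
        model_sha256=sha256_hex(model),
    )


def matches_dataset(record: LineageRecord, dataset: bytes) -> bool:
    """Check whether a model was trained on exactly this dataset content."""
    return record.dataset_sha256 == sha256_hex(dataset)


dataset = b"feature,label\n1.0,0\n2.0,1\n"
model = b"serialized-model-bytes"  # placeholder for a real serialized model
record = record_lineage(dataset, "abc123", {"learning_rate": 0.1}, model)
print(json.dumps(asdict(record), indent=2))
print(matches_dataset(record, dataset))
```

Hashing content, rather than recording file names, means a silently modified dataset no longer matches the record, which is the property an audit needs.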