MLOps and The Model Development Life Cycle
Kubeflow Jupyter Notebooks facilitate the traditional Model Development Life Cycle within the proposed MLOps framework.
- ML Development: experiment and develop a robust and reproducible model training procedure (training pipeline code), which consists of multiple tasks from data preparation and transformation to model training and evaluation. Typically, Data Scientists begin with data transformation and experimentation in IDEs such as JupyterLab Notebooks, reviewing model quality metrics to identify models for further development or training. Data Scientists then train their chosen models in development, using tools like hyperparameter (HP) tuning or AutoML, and prepare the model for production.
- Training Operationalization: automate the packaging, testing, and deployment of repeatable and reliable training pipelines.
- Continuous Training: repeatedly execute the training pipeline in response to new data or to code changes, or on a schedule, potentially with new training settings.
- Model Deployment: package, test, and deploy a model to a serving environment for online experimentation and production serving. An MLOps engineer takes the model creation pipeline from the Data Scientist and deploys the creation, serving, and validation of the model and supporting services into a production environment.
- Prediction Serving: serve the production model via an inference server.
- Continuous Monitoring: monitor and measure the effectiveness and efficiency of a deployed model.
- Data and Model Management: a central, cross-cutting function for governing ML artifacts to support auditability, traceability, and compliance. Data and model management can also promote shareability, reusability, and discoverability of ML assets.
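The three Continuous Training triggers listed above (new data, code changes, a schedule) can be sketched as a small decision function. This is an illustration only: `PipelineState`, `should_retrain`, the version strings, and the seven-day default are hypothetical and not part of Kubeflow.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PipelineState:
    """Hypothetical record of what the last training run used."""
    last_run: datetime
    data_version: str   # e.g. a hash or snapshot ID of the training data
    code_version: str   # e.g. a commit ID of the training pipeline code


def should_retrain(state, data_version, code_version, now, max_age=timedelta(days=7)):
    """Return the reasons to retrain; an empty list means no retraining is needed."""
    reasons = []
    if data_version != state.data_version:
        reasons.append("new data")
    if code_version != state.code_version:
        reasons.append("code change")
    if now - state.last_run >= max_age:
        reasons.append("schedule")
    return reasons


state = PipelineState(datetime(2024, 1, 1), data_version="d1", code_version="c1")
print(should_retrain(state, "d2", "c1", now=datetime(2024, 1, 3)))
```

Returning the list of reasons, rather than a bare boolean, makes it easier to record why a given training run was started.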
Taking a step back: before Data Scientists even perform data exploration, the data must be cleansed so that the best data is provided. Clean, high-quality data drives high-quality models, which is the goal of any Model Development Life Cycle. Data Scientists are supported by Data Engineers, who are responsible for sourcing and cleaning the data used for model development. ML Engineers are responsible for moving models into production and setting up monitoring to limit the impact of changes in data profile or model drift. MLOps focuses on automating the testing and deployment of Machine Learning models while simultaneously improving the quality and integrity of the data used to create the models, in response to model drift and data profile changes.
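Monitoring for a change in data profile, as mentioned above, can be illustrated with a very simple check: compare the mean of a feature in live data against the training baseline, measured in baseline standard deviations. This is a minimal sketch under assumptions. The function names and the threshold of 2.0 are hypothetical, and production monitoring would normally use richer statistical tests.

```python
from statistics import mean, stdev


def mean_shift(baseline, live):
    """Shift of the live mean from the baseline mean, in baseline standard deviations."""
    spread = stdev(baseline)
    if spread == 0:
        raise ValueError("baseline has no variance; shift is undefined")
    return abs(mean(live) - mean(baseline)) / spread


def has_drifted(baseline, live, threshold=2.0):
    """Flag drift when the live mean moves more than `threshold` deviations (assumed cutoff)."""
    return mean_shift(baseline, live) > threshold


baseline = [10.0, 11.0, 9.0, 10.0, 10.0]
print(has_drifted(baseline, [10.0, 10.5, 9.5]))   # live values close to the baseline
print(has_drifted(baseline, [14.0, 15.0, 16.0]))  # live values far from the baseline
```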
For data exploration, Data Scientists take a subset of the available data into a personal environment, in this case Jupyter Notebooks, to evaluate data quality and identify patterns. Based on their domain expertise, statistical analysis, and algorithmic knowledge, Data Scientists then determine how to proceed with model creation. Performing this activity within the storage of the Kubeflow Notebook reduces the burden on the network, since data does not have to be pulled repeatedly from a central repository such as S3 or another Data Lake. Executing data exploration computations as close to the Data Scientist as possible, specifically in the Jupyter Notebook or local pipeline volume, reduces overall resource consumption in the system and allows Kubeflow to leverage Kubernetes optimization on behalf of the Data Scientist. In this course, you will replicate this exact step by downloading the data directly into your Jupyter Notebook on the Notebook Server.
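A first data quality pass over such a local subset might look like the sketch below, which counts missing values and computes the mean of each column. The inline CSV and its column names are hypothetical stand-ins for data downloaded to the notebook. In a notebook you would more likely use a dataframe library; the standard library is used here only to keep the example self-contained.

```python
import csv
import io
from statistics import mean

# Hypothetical sample standing in for a data subset stored on the notebook volume.
SAMPLE_CSV = """age,income,label
34,52000,1
,61000,0
45,,1
29,48000,0
"""


def profile(rows):
    """Return per-column missing-value counts and the mean of the non-missing values."""
    summary = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        present = [float(v) for v in values if v != ""]
        summary[column] = {
            "missing": len(values) - len(present),
            "mean": mean(present) if present else None,
        }
    return summary


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for column, stats in profile(rows).items():
    print(column, stats)
```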
Once the desired features for model development have been identified, Data Scientists need to standardize, globally across the organization, not just the features but also how the features are transformed from the source data. This is necessary because Data Lakes are volatile and mutable, and are typically designed to provide high-quality data inputs for dashboards or business intelligence visualizations. Data Lakes can be standardized for the Model Development Life Cycle; however, a Feature Store is designed and optimized for this purpose. Consider the Feature Store as the versioned, consolidated, and standardized input for model development, training, and tuning. In this course, you will not be working with a Feature Store; feature selection and transformation will be done in the personal Jupyter Notebook. In a mature MLOps environment, however, the Jupyter Notebook should query the Feature Store to get the feature list for model development.
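To make the idea of versioned, standardized feature transformations concrete, here is a minimal in-memory sketch. It is not the API of any real Feature Store product: `FeatureRegistry`, its methods, and the `income_per_person` feature are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, Tuple

# Hypothetical stand-in for a feature store: (name, version) -> transformation.
FeatureKey = Tuple[str, int]


class FeatureRegistry:
    def __init__(self) -> None:
        self._transforms: Dict[FeatureKey, Callable[[dict], float]] = {}

    def register(self, name: str, version: int, transform: Callable[[dict], float]) -> None:
        key = (name, version)
        if key in self._transforms:
            # A published version is immutable; a change requires a new version.
            raise ValueError(f"{name} v{version} is already registered")
        self._transforms[key] = transform

    def compute(self, name: str, version: int, record: dict) -> float:
        """Apply the registered transformation for this exact feature version."""
        return self._transforms[(name, version)](record)


registry = FeatureRegistry()
registry.register("income_per_person", 1, lambda r: r["income"] / r["household_size"])
# Version 2 guards against a household size of zero; version 1 stays available.
registry.register("income_per_person", 2, lambda r: r["income"] / max(r["household_size"], 1))

record = {"income": 60000, "household_size": 3}
print(registry.compute("income_per_person", 1, record))
```

Keeping old versions available means a model trained against version 1 can be retrained or audited with exactly the same transformation, even after version 2 exists.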
With the features identified through Data Exploration and, where available, the Feature Store, it is time to create the actual machine learning model. This is also done in the Jupyter Notebook, initially with the data that was pulled into the Notebook Server. In a mature MLOps environment, however, you should expect to pull fresh data from the desired External Data Source according to the Features identified in the Feature Store for the model. In this course, you will do initial model training in the Jupyter Notebook. Once a model has been developed, you will want to implement a Continuous Training process to keep the model up to date with the latest data.
One of the core tenets of MLOps, borrowed from DevOps, is to test as much as possible as early as possible, which improves velocity and reduces future maintenance. Test-driven deployments, with unit tests written by the developers as part of the model creation process, validate model function and quality early. Changing a model, or any step in the model creation process, becomes more expensive the closer the model gets to deployment and production.
Data Scientists use model quality evaluation frameworks and algorithms to compare model performance. In a mature MLOps environment, this step should be further automated so that models that pass the quality inspection are made immediately available to subsequent deployment processes. In this course, you will evaluate the quality of the models manually. In practice, you will want a Continuous Training process to ensure the highest quality model is selected. You will explore Continuous Training further towards the end of this course.
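The early testing and automated quality inspection described above can be sketched as a small quality gate with its own unit tests. The function names, the accuracy metric, and the 0.8 floor are assumptions made for illustration; a real project would choose metrics and thresholds that suit the model.

```python
import unittest


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def passes_quality_gate(candidate_score, production_score, min_score=0.8):
    """Pass only if the candidate clears the floor and does not regress (assumed policy)."""
    return candidate_score >= min_score and candidate_score >= production_score


class QualityGateTest(unittest.TestCase):
    def test_accuracy(self):
        self.assertEqual(accuracy([1, 0, 1, 1], [1, 0, 0, 1]), 0.75)

    def test_gate_rejects_regression(self):
        self.assertFalse(passes_quality_gate(0.85, production_score=0.90))

    def test_gate_accepts_improvement(self):
        self.assertTrue(passes_quality_gate(0.92, production_score=0.90))
```

Saved as a module, these tests can be run with `python -m unittest`. In an automated setup, a model would be handed to the deployment process only when `passes_quality_gate` returns `True`.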
Once an ideal model has been identified through model training and testing, Data Scientists perform Hyperparameter Tuning to make the model as finely tuned as possible. Hyperparameter Tuning and other AutoML activities are also orchestrated from within the Jupyter Notebook. In this course, you will tune the models' hyperparameters manually. In practice, you will want a Continuous Training process to ensure the highest quality model is also finely tuned. You will explore Continuous Training further towards the end of this course.
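Manual hyperparameter tuning often amounts to trying a grid of candidate values and keeping the best result. The sketch below shows that idea with an exhaustive grid search. `grid_search` and `toy_objective` are hypothetical, and the objective is a stand-in for "train the model with these settings and return a validation score".

```python
from itertools import product


def grid_search(objective, grid):
    """Evaluate every combination in `grid` and return (best_params, best_score).

    `objective` takes hyperparameters as keyword arguments and returns a score to maximize.
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(learning_rate, depth):
    # Stand-in for a validation score; by construction it peaks at (0.1, 4).
    return -((learning_rate - 0.1) ** 2) - 0.01 * (depth - 4) ** 2


grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
best_params, best_score = grid_search(toy_objective, grid)
print(best_params)
```

Grid search evaluates every combination, so its cost grows multiplicatively with each added hyperparameter. That cost is one reason to hand tuning to automated tooling as the search space grows.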
Jupyter Notebooks are the vehicle for the Model Development Life Cycle (MDLC) for Data Scientists in Kubeflow because they abstract away the complexities of both Kubeflow and Kubernetes, so that Data Scientists can focus exclusively on their work. Kubeflow project namespaces provide an isolated personal environment; however, this is only the beginning of the overall MLOps process. Once this work is done, Data Scientists can quickly create, compile, and run a Kubeflow Pipeline. Since Kubeflow Pipelines are snapshotted and can be shared, the output of the Model Development Life Cycle is not just a model, but a process by which to continually train the desired model in any environment. With such a Continuous Training process implemented, additional processes can be developed around Continuous Integration of the approved, best-tuned model and Continuous Deployment of the model in production. These are the foundations of MLOps and the concepts that Kubeflow and Kubeflow Pipelines support. We will explore the rest of these concepts throughout the course. In this section, we will proceed with Data Exploration, Feature Selection, and Model Creation by deploying a Kubeflow Pipeline.