The Power of Kale
Kubeflow Pipelines components are built from compiled data science code, the container images that run each pipeline step, and their associated package dependencies.
A well-defined pipeline has the following characteristics:
- A quick way to give our code access to the data we want to use.
- An easy way to pass the data between steps.
- The flexibility to display outputs.
- The ability to revisit / reuse data without repeatedly querying external systems.
- The ability to opt out of steps that were only meant for experimental purposes.
- The confidence that our imports and function dependencies are addressed.
- The flexibility to define hyperparameters and pipeline metrics within our Jupyter environment and store them in our pipeline definitions (a sketch follows this list).
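Kale addresses that last point with cell tags rather than extra code. As a minimal sketch, assuming Kale's cell-tagging convention (tag names such as pipeline-parameters and pipeline-metrics, applied through the notebook cell metadata or the Kale JupyterLab panel), with purely illustrative variable names and values:

```python
# Cell tagged "pipeline-parameters" (Kale tagging convention): simple
# assignments in this cell become pipeline parameters in the generated
# KFP definition and can be overridden per run.
LR = 0.001        # hypothetical learning rate
EPOCHS = 10       # hypothetical epoch count

# Cell tagged "pipeline-metrics": variables referenced here (produced by
# later steps in the notebook) are exported as KFP pipeline metrics.
accuracy
loss
```

Consult the Kale documentation for the exact tag names supported by your version; the point is that hyperparameters and metrics stay in ordinary notebook cells instead of living in a separate pipeline definition.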
Consider the following chores that are eliminated by working with Kale, along with the time saved and the consistency gained by eliminating them:
- Repetitive installation of Python packages: With Kale, all the necessary packages are installed at once.
- Slow pipeline creation: Hand-coding the pipeline is toil-intensive compared to taking advantage of Kale’s pipeline automation capabilities.
- Lots of boilerplate code and code restructuring: Without Kale, the notebook has to be restructured to be compatible with the KFP SDK, which means writing a lot of boilerplate code. With Kale, the original notebook is simply annotated with Kale tags (see the sketch after this list).
- Pipeline visualization setup difficulty: To produce metrics visualizations, specific standards and processes must be followed. With Kale, visualizations are created the same way we normally create them in our notebooks.
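To make the contrast concrete, here is a minimal sketch of what Kale's annotation approach looks like, assuming Kale's cell-tagging convention (tags such as imports, block:<step_name>, prev:<step_name>, and skip); the step names, dataset path, and column names are illustrative assumptions:

```python
# Cell tagged "imports": Kale injects these imports into every generated
# pipeline step so each container resolves its dependencies.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Cell tagged "block:load_data": becomes its own pipeline step.
df = pd.read_csv("data/train.csv")          # hypothetical dataset

# Cell tagged "block:train_model" and "prev:load_data": Kale orders this
# step after load_data and marshals `df` between the two containers.
model = RandomForestClassifier()
model.fit(df.drop(columns=["label"]), df["label"])   # hypothetical label column

# Cell tagged "skip": exploratory output left out of the compiled pipeline.
df.describe()
```

The notebook stays a notebook; the tags alone tell Kale how to cut it into steps, wire the dependencies, and generate the KFP code.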
Kubeflow Pipelines give you the connective tissue to train models with various frameworks, iterate on them, and eventually expose them for serving. This means our entire Model Development Lifecycle lives within our Kubeflow pipeline components and definitions. We now have the power to be intentional and declarative about how our models are developed and how we provide feedback loops to our data science teams. This gives us the capacity to further improve not only the data science code but also the pipeline definitions themselves in order to respond to ever-growing business demands. It lays the foundation for the Continuous Integration and Continuous Deployment processes that ultimately push and support models in production. All of this can become quite daunting if you take the manual approach. What works for your organization today may not survive the test of time, accumulating technical debt as you scale and take on the essential complexity that comes with greater velocity and a broader feature offering. Your data engineering team should not need to scale linearly with your data scientists. Leveraging Kale means you are laying the foundation for an effective and self-sustaining MLOps culture and environment.
If you would like to see how to do this without using Kale, please read through the following blog post: https://www.arrikto.com/blog/developing-kubeflow-pipelines-kaggles-digit-recognizer-competition/. Keep in mind that your pipeline must be structured so that the steps flow as needed, likely some combination of sequential and parallel execution. This structuring also requires managing imports and package dependencies, as well as how intermediate data is moved across pipeline steps. It takes forward-thinking design to understand how you are going to move prepared data and the work done by your components across your platform, with the eventual goal of serving predictions from an inference server endpoint. Pipeline components (and what they consume) greatly impact the health of your model development life cycle. This is especially true of the data used for training, but it also applies to the ConfigMaps, Secrets, PersistentVolumes, and their respective PersistentVolumeClaims that are created within the context of the pipeline. It’s important to understand what is truly immutable, and what can change or be misunderstood. The easiest example is a data lake or a volume: a snapshot of its current state can drastically improve your reproducibility and let you regain control over your MLOps outcomes with immutable datasets. For more technical detail on this specific topic, please refer to the open source documentation: https://www.kubeflow.org/docs/components/pipelines/overview/quickstart/.
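For a sense of the boilerplate involved, the following is a minimal sketch of the manual approach using the KFP v1 SDK; the component functions, package lists, and data are illustrative stand-ins, not a prescription:

```python
from kfp import dsl
from kfp.components import create_component_from_func, InputPath, OutputPath

def load_data(output_csv: OutputPath("CSV")):
    # Each step runs in its own container, so imports must be repeated here.
    import pandas as pd
    df = pd.DataFrame({"x": [1, 2, 3], "y": [0, 1, 0]})  # stand-in for a real query
    df.to_csv(output_csv, index=False)

def train_model(input_csv: InputPath("CSV")):
    import pandas as pd
    df = pd.read_csv(input_csv)
    print("training on", len(df), "rows")  # placeholder for real training code

# Package dependencies must be declared component by component.
load_op = create_component_from_func(load_data, packages_to_install=["pandas"])
train_op = create_component_from_func(train_model, packages_to_install=["pandas"])

@dsl.pipeline(name="manual-demo", description="Hand-built two-step pipeline")
def manual_pipeline():
    load_task = load_op()
    # Explicitly passing the output is what sequences the steps and moves
    # the intermediate data between them.
    train_op(load_task.output)
```

Every one of these decisions (how data is passed, which packages each step installs, what order the steps run in) is something Kale infers from the annotated notebook instead.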
Keep in mind that during execution there are a myriad of supporting technical tasks that will have to be coded into any pipeline created manually. You will need to marshal the work done within a step onto volumes so that data can move between components and subsequent steps, ensuring proper continuity in your pipeline. If you fail to carry forward work from previous steps or to consolidate work run in parallel, you can create problems for tasks further down the line or hand degraded outputs to the steps that consume them. Once you have properly aligned your inputs and outputs, you will also need to compile the code, upload it to the Kubeflow deployment, and actually execute the pipeline. This can be extra difficult if you want to ensure the SAME volumes are passed along during each run. New volumes can be provisioned and “bring less baggage,” but then you are potentially creating a “dead asset” that is not only taking up precious space but can also be a security risk. We must also consider how frequently a pipeline runs and what garbage collection our recurring pipelines require. And this is before we even touch volume access modes, which means we need not only to provision a volume but also to make sure that whatever needs to reach it (independently or alongside other tools) actually CAN access it. This can be a lot to manage, and it can lead to failed smoke tests, slower deployments, or promotion failures. For ease of use and toil-free uptime, we strongly recommend working with Kale!
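As a final illustration of that manual overhead, here is a minimal sketch, again using the KFP v1 SDK, of attaching an existing volume to two steps, compiling the pipeline, and submitting a run; the PVC name, images, host URL, and experiment name are all assumptions for illustration:

```python
import kfp
from kfp import dsl

@dsl.pipeline(name="volume-demo")
def volume_pipeline():
    # Mounting an existing claim keeps the SAME volume across runs;
    # dsl.VolumeOp could provision a fresh one instead, with the "dead
    # asset" and garbage-collection concerns described above.
    vol = dsl.PipelineVolume(pvc="shared-workspace-pvc")  # hypothetical PVC

    prepare = dsl.ContainerOp(
        name="prepare",
        image="python:3.9",
        command=["sh", "-c", "echo prepared > /data/out.txt"],
        pvolumes={"/data": vol},
    )
    consume = dsl.ContainerOp(
        name="consume",
        image="python:3.9",
        command=["sh", "-c", "cat /data/out.txt"],
        pvolumes={"/data": prepare.pvolume},  # reuses the volume and orders the steps
    )

# Compile to a workflow package and submit it to the KFP API server.
kfp.compiler.Compiler().compile(volume_pipeline, "volume_pipeline.yaml")
client = kfp.Client(host="http://localhost:8080/pipeline")  # assumed endpoint
client.create_run_from_pipeline_package(
    "volume_pipeline.yaml", arguments={}, experiment_name="manual-demo"
)
```

Volume provisioning, access modes, compilation, upload, and execution are exactly the chores Kale takes off your plate.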