MLOps: Task and Workflow Orchestration Tools on Kubernetes

TL;DR

Every article that compares and analyses tools provides a table, and this one is no exception. Here it is, so don’t say you didn’t find it.

The Problem

It is important to understand what your system should do and which part of the job you are ready to hand over to a ready-made framework. Though we refer to MLOps when it comes to “Hidden Technical Debt in Machine Learning Systems” [5], it is clear that there is technical debt in any system, whether or not it is related to training models. Systems are built to solve a problem, and if a system approaches production, we all know that “the only good system is a sound system” [6]. Aside from the theoretical concept that your problem can be divided into pieces and delegated to some subsystem or tooling, there is always this “glue” that is specific to a system and the problem it is trying to solve.

  • Whether you want to reuse the framework for something else
  • Define an API for describing workflows, i.e. the tasks and their dependencies
  • Have a centralized scheduler that manages workflow execution (start, cancel, wait for resources etc.)
  • Have an operational dashboard with runs, metrics, artefacts etc.
  • Be well integrated with Kubernetes
  • Be potentially useful for other use cases, such as CI/CD
  • Be tightly coupled to a cloud provider, possibly introducing vendor-specific services (sometimes described as vendor lock-in)
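
To make the “API for describing workflows” point a bit more concrete, here is a minimal sketch in plain Python of what “tasks and their dependencies” can look like. It is not the API of any tool discussed in this article: the Workflow class, its task decorator and the task names are hypothetical, invented only for illustration, and the “scheduler” is nothing more than a topological sort from the standard library (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter
from typing import Callable, Dict, Iterable, List, Set


class Workflow:
    """Hypothetical minimal workflow: named tasks plus their dependencies."""

    def __init__(self) -> None:
        self._tasks: Dict[str, Callable[[], None]] = {}
        self._deps: Dict[str, Set[str]] = {}

    def task(self, name: str, after: Iterable[str] = ()) -> Callable:
        """Decorator that registers a function as a task running after `after`."""
        def register(fn: Callable[[], None]) -> Callable[[], None]:
            self._tasks[name] = fn
            self._deps[name] = set(after)
            return fn
        return register

    def execution_order(self) -> List[str]:
        """Return task names in an order that respects all dependencies."""
        for name, deps in self._deps.items():
            unknown = deps - self._tasks.keys()
            if unknown:
                raise KeyError(f"task {name!r} depends on unknown tasks {sorted(unknown)}")
        # static_order() raises CycleError if the dependencies are circular.
        return list(TopologicalSorter(self._deps).static_order())

    def run(self) -> List[str]:
        """Run every task sequentially and return the order that was used."""
        order = self.execution_order()
        for name in order:
            self._tasks[name]()
        return order


# Example: a three-step pipeline (the step names are assumptions).
wf = Workflow()
log: List[str] = []


@wf.task("extract")
def extract() -> None:
    log.append("extract")


@wf.task("train", after=["extract"])
def train() -> None:
    log.append("train")


@wf.task("evaluate", after=["train"])
def evaluate() -> None:
    log.append("evaluate")


order = wf.run()
```

Everything else in the list above (cancelling runs, waiting for resources, a dashboard, running each task in its own Kubernetes pod) is exactly what this sketch leaves out, and that is the part of the job a real framework takes over.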

Nails and Hammers

Certainly, the main reason for writing a framework for task and workflow orchestration is an opportunity to confuse us. With such a clear goal in mind, it’s no surprise that lots of companies have developed their own tool with a cool website, source code on GitHub, documentation, communities etc. and, most importantly, a logo that will be the decision-driving factor if everything else appears to be the same.

Framework relations to Kubernetes

Kubeflow pipelines Quickstart

  • MLflow Projects: Package data science code in a format to reproduce runs
  • MLflow Models: Deploy machine learning models
  • Model Registry: Store, annotate and discover experiment artefacts

  • Does everything need to be absolutely perfect?
  • What would I do if I faced some ugly reality outside of my Zen framework?
  • Is my code as perfect as my framework?

  • Rollouts: Additional Kubernetes deployment strategies such as Blue-Green and Canary
  • Events: Event-based dependency manager for Kubernetes

Conclusion

Thanks

I would like to thank the following people for their support and contribution to this article:

  • Ales Novak for his remarks and overall support.
  • Gert Ceulemans and Olivier Elshocht for believing in the idea of this article and having enough patience to see it finished.

References

[1] Akio Morita, Wikipedia
[2] Picking A Kubernetes Orchestrator: Airflow, Argo, and Prefect
[3] Airflow vs. Luigi vs. Argo vs. MLFlow vs. KubeFlow
[4] How To Productize ML Faster With MLOps Automation
[5] Hidden Technical Debt in Machine Learning Systems
[6] Blackout JA, The Only Good System Is A Sound System, Live & Direct at YouTube
[7] Kubeflow, Wikipedia
[8] Introducing MLflow: an Open Source Machine Learning Platform
[9] Siddhartha by Hermann Hesse
[10] Apache Airflow, Wikipedia
[11] Argo, Wikipedia
[12] John C. Maxwell, LinkedIn
[13] Dataflow, Wikipedia
