MLOps: Task and Workflow Orchestration Tools on Kubernetes
Kubeflow | MLflow | Metaflow | Flyte | ZenML | Airflow | Argo | Tekton | Prefect | Luigi
“The public does not know what is possible. We do.”
— Akio Morita, co-founder of Sony 
I was given an excellent opportunity while working on a customer project to share some of my findings related to Task and Workflow Orchestration tooling analysis.
After completing one project and moving on to the next, you realize that the landscape of available tools has changed, sometimes drastically, and with it the basis for deciding what to use in your project.
There have already been excellent posts on the topic [2, 3, 4]. This article tries not to repeat the comparisons already made, but to add a data touch to them, with supporting links and code for everything that is written. I hope this guide will be helpful to everyone who wants to make the right choice of a task and workflow orchestration framework that runs on Kubernetes.
All articles that do tools comparisons and analysis provide a table and this one will be no exception. Here it is and don’t say you didn’t find it.
Use Flyte or Prefect. Still too much? OK, use Prefect. No thanks. Find the why below.
It is important to understand what your system should do and which part of the job you are ready to hand over to a ready-made framework. Though we refer to MLOps when it comes to “Hidden Technical Debt in Machine Learning Systems” [5], it is clear that there is technical debt in any system, whether it is related to training models or not. Systems are built to solve a problem, and if a system approaches production, we all know that “the only good system is a sound system” [6]. Aside from the theoretical concept that your problem can be divided into pieces and delegated to some subsystem or tooling, there is always this “glue” that is specific to a system and the problem it is trying to solve.
Sometimes it is hard to understand what the implications of choosing a framework might be. You might read through inspiring (and one must say often excellent) articles on the topic [2, 3, 4] and decide that you have enough information to make the right choice, but realize later, once you start digging into the code, that it wasn’t such a great idea.
In this article we will go through examples of task and workflow orchestration with the help of different frameworks and try to understand what those frameworks are and what their APIs look like.
What is a task or a workflow and why do they need orchestration?
A workflow is a series of tasks that we want to perform on some request. These tasks might not only define dependencies on each other but can have some actual data or artefacts passed in between.
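To make the definition a bit more tangible, here is a minimal sketch in plain Python, with no framework involved. The task names (extract, transform, load) and the WORKFLOW table are made up for illustration; the only thing it tries to show is that a workflow is a set of tasks, their dependencies, and the artefacts handed from one task to the next.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tasks: each one receives the results of its upstream tasks
# as keyword arguments named after those tasks.
def extract():
    return [3, 1, 2]

def transform(extract):
    return sorted(extract)

def load(transform):
    return f"loaded {len(transform)} rows"

# task name -> (callable, names of the tasks it depends on)
WORKFLOW = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "load": (load, ["transform"]),
}

def run_workflow(workflow):
    """Run tasks in dependency order, passing upstream results downstream."""
    graph = {name: set(deps) for name, (_, deps) in workflow.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = workflow[name]
        results[name] = func(**{dep: results[dep] for dep in deps})
    return results

results = run_workflow(WORKFLOW)
print(results["load"])  # loaded 3 rows
```

Everything an orchestration framework adds (scheduling, retries, a dashboard, running each task in its own container) sits on top of this basic idea.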
Important considerations for the selection are:
- Whether you already have or want to use Kubernetes in the future
- Whether you want to reuse the framework for something else
It is important to understand that the choice you make has a significant impact on your dependencies. The more dependencies you have and the more they differ from each other, the more complex your solution becomes, and consequently the more time you spend learning and managing it, until you find yourself in a situation where you have already forgotten what you wanted to achieve in the first place.
A workflow orchestration tool should:
- Be free / open source
- Define an API for describing workflows i.e. the tasks and their dependencies
- Have a centralized scheduler that would manage the workflow execution (start, cancel, wait for resources etc.)
- Have an operational dashboard with runs, metrics, artefacts etc.
- Be well integrated with Kubernetes
- Be potentially useful for other use cases, such as CI/CD
A workflow orchestration tool should not:
- Introduce too many new dependencies or concepts that go off the main topic: task and workflow orchestration
- Be tightly coupled to a cloud provider, possibly introducing vendor-specific services (often described as vendor lock-in)
A workflow orchestration tool would preferably:
- Be easy to install, preferably via a Helm chart
Nails and Hammers
Certainly, the main reason for writing a framework for task and workflow orchestration is an opportunity to confuse us. With such a clear goal in mind, it’s no surprise that lots of companies have developed their own tool with a cool website, GitHub source code, documentation, communities etc. and, most importantly, a logotype that will be the decisive factor if everything else appears to be the same.
To clear up the fog a bit, let’s look at this scheme:
A small remark about TFX (TensorFlow Extended): it is a set of components (or tasks) related specifically to ML with TensorFlow and has nothing to do with their orchestration.
A small remark about Apache Beam + Apache Flink: often used in combination, these are still not task and workflow orchestration frameworks, but are related to what is called the Dataflow concept [13]. It describes not a series of dependent steps, but how a continuous data stream should be processed.
Tools that did not make it to our list for various reasons, but may be worth checking out:
- Mlrun by iguazio: GitHub | Slack
- Kedro by QuantumBlack: GitHub | Discourse
- Couler by Antgroup: GitHub | Twitter
- Dagster: GitHub | Twitter | Slack
- Genie by Netflix: GitHub (Netflix’s Metaflow is on our list)
If you think you do not have enough tools to choose from, you can take a look at:
Let’s get into exploring.
Kubeflow has a lot of components that are unrelated to our main goal:
- Notebook Servers: Using Jupyter notebooks in Kubeflow
- KFServing: Model deployment and serving toolkit
- Katib: Hyperparameter tuning and neural architecture search
- Training Operators: Training of ML models in Kubeflow through operators
- Multi-Tenancy: Multi-user isolation and identity access management (IAM)
A full example can be found here.
The central UI is a rich tool with interactive control over the pipelines and provides full control over the runs, including visualizations.
Kubeflow Pipelines are a good, solid way to define a workflow and its steps in Python, and that is certainly an advantage. But under the hood things get translated to the YAML files of our other contenders:
These frameworks define a YAML based way to interact with the underlying scheduler and we’ll review those separately.
MLflow was created by Databricks and released in 2018 [8]. It is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.
As with Kubeflow, there are concepts unrelated to our task:
- MLflow Tracking: Record and query experiments: code, data, config, results
- MLflow Projects: Package data science code in a format to reproduce runs
- MLflow Models: Deploy machine learning models
- Model Registry: Store, annotate and discover experiment artefacts
When you run an MLflow Project on Kubernetes, MLflow constructs a new Docker image containing the Project’s contents; this image inherits from the Project’s Docker environment. MLflow then pushes the new Project image to your specified Docker registry and starts a Kubernetes Job on your specified Kubernetes cluster. This Kubernetes Job downloads the Project image and starts a corresponding Docker container. Finally, the container invokes your Project’s entry point, logging parameters, tags, metrics, and artifacts to your MLflow tracking server.
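To picture the last hop of that chain, here is a rough sketch of what a Kubernetes Job for a single project run could look like, built as a plain Python dictionary. This is not MLflow's own code or template: the function name, project name, image URI and entry command are all hypothetical, and only the batch/v1 Job layout itself comes from Kubernetes.

```python
import json

def build_job_manifest(project_name, image, entry_command):
    """Return a batch/v1 Job manifest that runs one container to completion."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": project_name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed run automatically
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": project_name,
                            "image": image,            # the freshly pushed project image
                            "command": entry_command,  # the project's entry point
                        }
                    ],
                }
            },
        },
    }

job = build_job_manifest(
    project_name="demo-project",                       # hypothetical name
    image="registry.example.com/demo-project:latest",  # hypothetical registry
    entry_command=["python", "train.py", "--alpha", "0.5"],  # hypothetical entry point
)
print(json.dumps(job, indent=2))
```

Note what is missing: there is one Job per run and nothing that chains several Jobs together, which is the part a workflow orchestrator is supposed to provide.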
A step in the workflow:
A full example can be found here.
The UI tracks all the experiment runs and provides access to resulting artefacts and stored models.
You could potentially do your workflows in MLflow, but to me that looks like doing most of the things (scheduling, for example) from scratch by yourself. It doesn’t mean it’s a bad framework; it just means it was not developed for workflow and task orchestration.
Metaflow is tightly (and I mean very, very tightly) coupled to AWS. It is stated that technically it can run anywhere, but one can tell from its reliance on AWS services that this is not going to happen:
- Datastore: Amazon S3
- Compute: AWS Batch
- Metadata: AWS Fargate + Amazon RDS
- Notebooks: Amazon Sagemaker Notebooks
- Scheduling: AWS Step Functions + Amazon EventBridge
- Large-scale ML: Sagemaker Models
There is a good article on how to use Metaflow with Kubernetes, but just by the amount of text you can tell that there might be some complications involved.
The steps of a workflow can be encapsulated in a python method with a decorator, but the dependencies are managed in a rather unconventional way.
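To show what I mean by unconventional, here is a toy imitation in plain Python. It does not import Metaflow: the step decorator, the ToyFlow base class and the TrainFlow example are all hypothetical stand-ins. The point is only the style, in which each step names its successor from inside its own body instead of the dependencies being declared in one place.

```python
def step(method):
    """Toy marker decorator, standing in for a framework-provided one."""
    method.is_step = True
    return method

class ToyFlow:
    """Runs steps from `start` and follows whatever each step hands to `next`."""

    def next(self, target):
        self._next = target

    def run(self):
        order, current = [], self.start
        while current is not None:
            assert getattr(current, "is_step", False), "only @step methods can run"
            self._next = None
            current()
            order.append(current.__name__)
            current = self._next
        return order

class TrainFlow(ToyFlow):
    @step
    def start(self):
        self.data = [1, 2, 3]
        self.next(self.train)  # the dependency is declared inside the step body

    @step
    def train(self):
        self.model = sum(self.data) / len(self.data)
        self.next(self.end)

    @step
    def end(self):
        self.done = True  # no call to next(): the flow stops here

flow = TrainFlow()
print(flow.run())   # ['start', 'train', 'end']
print(flow.model)   # 2.0
```

The upside is that the flow reads top to bottom like ordinary code; the downside is that you have to open every step to reconstruct the graph.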
A workflow with for each and scheduling.
There was a request for a dashboard, but it never really took off. You can do everything in your cosy notebook, can’t you? I mean, it’s not a problem for you or your colleagues?.. Well, never mind.
This framework is a very vendor-specific way to manage workflows, and there are no obvious benefits that favour it over the other options.
Flyte was created by Lyft and released to the public in 2019. The documentation for most parts is, and I quote: “Coming soon 🛠”. The better news is that most of the information is available on GitHub and lots of code speaks for itself [don’t forget to put a joke here].
From the description provided it’s clear that this framework is exactly what we need:
“Flyte is more than a workflow engine: it provides workflow as a core concept and a single unit of execution called task as a top level concept. Multiple tasks arranged in a data producer-consumer order create a workflow.”
It’s Kubernetes-native, can do Python and is written in Go. I’d be really surprised if that didn’t do the job. Let’s see what’s in there.
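Before looking at the real thing, the producer-consumer idea from the description can be sketched in a few lines of plain Python. This is not Flyte's API: the task decorator, the Node class and the three example tasks are hypothetical. What the sketch tries to show is that no dependency list is written anywhere; the graph falls out of which task consumes whose output.

```python
class Node:
    """A deferred task call: remembers the function and its inputs."""

    def __init__(self, func, inputs):
        self.func = func
        self.inputs = inputs  # other Node objects or plain values

    def upstream(self):
        return [i for i in self.inputs if isinstance(i, Node)]

def task(func):
    """Toy decorator: calling the task records a Node instead of running it."""
    def wrapper(*args):
        return Node(func, args)
    wrapper.__name__ = func.__name__
    return wrapper

def execute(node, cache=None):
    """Resolve a Node by first resolving every producer it consumes from."""
    cache = {} if cache is None else cache
    if id(node) not in cache:
        args = [execute(i, cache) if isinstance(i, Node) else i for i in node.inputs]
        cache[id(node)] = node.func(*args)
    return cache[id(node)]

@task
def load_numbers(n):
    return list(range(n))

@task
def square(values):
    return [v * v for v in values]

@task
def total(values):
    return sum(values)

def workflow(n):
    # The order follows from the data flow: total <- square <- load_numbers.
    return total(square(load_numbers(n)))

graph = workflow(4)
print([node.func.__name__ for node in graph.upstream()])  # ['square']
print(execute(graph))  # 14
```

Because building the graph and executing it are separate steps, a scheduler is free to run each node wherever it likes, for instance in its own pod.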
“All you need is… a dashboard”. And a really good one.
Given that there are gaps in the official website documentation, one might think that this framework is not mature enough, but make no mistake. It is one of the most powerful contenders on the list, with great attention to clean code and ease of use, specifically targeted at the use case under analysis, with excellent tooling, a plugin system for extensions, no strange unrelated concepts, and the list goes on and on.
These comments might be summarized in the following: “It’s awesome”.
(You may want to fix the documentation on the website lads)
“Ichi Wa Zen, Zen Wa Ichi.”
ZenML was released in 2020, and there is even an explanation of why, in case you didn’t get it straight away from the quote above. Interestingly, the German company maiot GmbH that developed it is supported by German and European authorities, and I absolutely love the ironic team introduction.
Surprisingly, there was no information on a dedicated dashboard for ZenML.
It’s a very tidy and beautiful framework in all aspects. It meets all of the requirements and certainly is on top of the hit parade. If you’re up for using this framework you probably need to answer the following questions:
- Do I want beauty everywhere?
- Does everything need to be absolutely perfect?
- What’d I do if I face some ugly reality outside of my Zen framework?
- Is my code as perfect as my framework?
I might have another problem with it besides the absence of a dashboard: working on my grumpy problem with lots of sharp edges and unclear bounds, it’s easy to forget what the original goal was and end up in full Zen, meditating in the rock garden and realizing:
“Nothing was, nothing will be; everything is, everything has being and presence.”— Hermann Hesse, Siddhartha 
« Revenons à nos moutons. »
There are multiple ways to use Airflow with Kubernetes:
- Using the KubernetesPodOperator
- Using the KubernetesExecutor (a bit like the previous method on steroids)
- Using KEDA (Kubernetes Event-driven Autoscaling) from Astronomer
The last method is the latest and the most attractive, as it provides a Helm chart for the whole installation, and we are going to look into it in the example below.
One of the oldest contenders on our list has come a long way, with lots of people using it. Apache projects have a special governance model which they call “T̶h̶i̶s̶ ̶i̶s̶ ̶t̶h̶e̶ ̶w̶a̶y̶” “The Apache Way”. It is certainly a powerful tool with lots of functionality and good integrations. But compared to some other frameworks on our list it feels a bit raw, with lots of unrelated details and complications.
Argo was the ship on which Jason and the Argonauts sailed from Iolcos to Colchis to retrieve the Golden Fleece … In 2021, however, it stands for “Get stuff done with Kubernetes”. The project started in 2017, was developed by Applatix and is a very dynamic project with lots of releases.
Argo Workflows is implemented as a Kubernetes CRD (Custom Resource Definition), and you might have read already that it’s one of the possible engines powering Kubeflow Pipelines.
It also has some parts unrelated to pure workflow orchestration, but on the other hand, they look very appealing to use as well:
- Continuous Delivery: Declarative Continuous Delivery following Gitops
- Rollouts: Additional Kubernetes deployment strategies such as Blue-Green and Canary
- Events: Event based dependency manager for Kubernetes
It is important to understand that pipelines in Argo are YAML-based and that all tools on top add a Python touch to them, basically converting the code to YAML at some point.
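To give an idea of what such a conversion boils down to, here is a small sketch that assembles a two-step Workflow manifest from Python data. The helper functions, step names and images are hypothetical and far simpler than what a real Python layer generates; only the manifest layout (apiVersion, kind, templates, dag tasks with dependencies) follows the Argo Workflows resource. It prints JSON to stay within the standard library, which works because JSON is also valid YAML.

```python
import json

def container_template(name, image, command):
    """One step of the workflow: a container to run."""
    return {"name": name, "container": {"image": image, "command": command}}

def dag_template(name, dependencies):
    """`dependencies` maps each task name to the tasks it must wait for."""
    tasks = []
    for task_name, deps in dependencies.items():
        task = {"name": task_name, "template": task_name}
        if deps:
            task["dependencies"] = list(deps)
        tasks.append(task)
    return {"name": name, "dag": {"tasks": tasks}}

def build_workflow(steps, dependencies):
    templates = [container_template(n, img, cmd) for n, (img, cmd) in steps.items()]
    templates.append(dag_template("main", dependencies))
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "demo-"},  # hypothetical name prefix
        "spec": {"entrypoint": "main", "templates": templates},
    }

# Hypothetical steps: step name -> (image, command)
steps = {
    "prepare": ("alpine:3.18", ["echo", "prepare"]),
    "train": ("alpine:3.18", ["echo", "train"]),
}
manifest = build_workflow(steps, {"prepare": [], "train": ["prepare"]})
print(json.dumps(manifest, indent=2))
```

Seen this way, the Python layers are mostly a nicer way to type the same document, which is worth remembering when you debug a pipeline and end up reading the generated manifest anyway.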
It’s old, rich and beautiful. That is certainly a “✓” in our analysis.
Argo is the right tool for workflow and task orchestration. It has a scheduler, a dashboard and potentially useful related functionality. But it may be just too little to be used on its own. Maybe programming in Python is slightly more convenient than YAML, and that is why Kubeflow Pipelines provide a layer on top.
Tekton is very similar to Argo, and the two can be used interchangeably. Companies such as Red Hat and IBM seem to favour it above others, and there is even a relation to Jenkins if you have missed a good old friend to do your workflows with.
Interestingly there is a Hub that provides reusable components for common workflows.
It’s simple and gets the job done. I have noticed that a lot of people use Argo and Tekton at the same time, and it is still a question why: both frameworks provide enough task and workflow orchestration functionality in the direction of CI/CD. My take on this: the fewer dependencies you have, the better.
Prefect has been around since 2018 and seemingly has no stable release at the time of writing. But that doesn’t mean the framework is bad. On the contrary: it’s one of the best on our list and arguably has the best API among the competition. There is a company behind it, Prefect Technologies, Inc., which provides hosted task and workflow orchestration services. Certainly, the greatest unfair advantage of this company is the collection of portrait overlays on their website.
A short summary: this tool is great. And even if you were wondering “Why not Airflow?” then your curiosity might be satisfied with this post. As with Tekton Hub there is a task library and no shortage of options in there.
Luigi has been around since 2012 and was developed by Spotify. It is no mistake that no website or Slack is mentioned above: the GitHub project is basically all there is. Actually, it is the only project that doesn’t try to make a big deal out of itself. And it comes with some philosophy:
Conceptually, Luigi is similar to GNU Make where you have certain tasks and these tasks in turn may have dependencies on other tasks.
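The Make analogy can be sketched in plain Python without installing anything. The Task class, the build function and the Download and Sort tasks below are hypothetical and only borrow the general idea: a task counts as done when its output exists, and building a task first builds whatever it requires. Real Luigi is considerably richer than this.

```python
import os
import tempfile

class Task:
    """Make-like task: it is complete when its output file exists."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []

    def output(self):
        return os.path.join(self.workdir, f"{type(self).__name__}.txt")

    def complete(self):
        return os.path.exists(self.output())

    def run(self):
        raise NotImplementedError

def build(task, executed):
    """Run missing dependencies first, then the task, skipping finished work."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, executed)
    task.run()
    executed.append(type(task).__name__)

class Download(Task):
    def run(self):
        with open(self.output(), "w") as f:
            f.write("3\n1\n2\n")

class Sort(Task):
    def requires(self):
        return [Download(self.workdir)]

    def run(self):
        with open(self.requires()[0].output()) as f:
            lines = sorted(f.read().split())
        with open(self.output(), "w") as f:
            f.write("\n".join(lines))

workdir = tempfile.mkdtemp()
first, second = [], []
build(Sort(workdir), first)
build(Sort(workdir), second)  # outputs already exist, so nothing runs again
print(first, second)  # ['Download', 'Sort'] []
```

The second call does nothing, which is exactly the Make behaviour: rerunning a half-finished pipeline only repeats the steps whose outputs are missing.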
“People can be in the same place sharing the same experience at the same time, but they can walk away from it having seen very different things.”―John C. Maxwell 
All of this makes sense if you employ Kubernetes in your system. Because if you don’t, you have a different stack and your problem might be much simpler.
There is a simple logic to make the right choices easier. You define a goal and stick with it. And if this is fixed then you can validate everything against this goal with red or yellow flags. This will filter your options significantly, making it easier to decide.
In this analysis we explored 10 frameworks for workflow and task orchestration and found that some of them did not meet the expectations we had in mind.
For example, managing tasks and workflows with a Python API is slightly more convenient than YAML, and having a dashboard makes collaboration within the team much easier.
The leaders are Flyte, Prefect and Luigi. We can shorten this list even further by kicking out Luigi as maybe too simple compared to the others.
Deciding between Flyte and Prefect is hard, because in the context of task and workflow orchestration they are equally excellent tools (though Flyte might fix the docs on their website and the installation via helm chart).
I encourage everyone to try them out and decide for themselves.
Fail fast, fail cheap, fail smart.
If you need a simple and clear answer, use Prefect.
I would like to thank the following people for their support and contribution to this article:
- Dimitri Torfs for his careful review and valuable feedback.
- Ales Novak for his remarks and overall support.
- Gert Ceulemans and Olivier Elshocht for believing in the idea of this article and having enough patience to see it finished.
- Ganna Shchygol for her review and rich findings.
[1] Akio Morita, Wikipedia
[2] Picking A Kubernetes Orchestrator: Airflow, Argo, and Prefect
[3] Airflow vs. Luigi vs. Argo vs. MLFlow vs. KubeFlow
[4] How To Productize ML Faster With MLOps Automation
[5] Hidden Technical Debt in Machine Learning Systems
[6] Blackout JA, The Only Good System Is A Sound System, Live & Direct at YouTube
[7] Kubeflow, Wikipedia
[8] Introducing MLflow: an Open Source Machine Learning Platform
[9] Siddhartha by Hermann Hesse
[10] Apache Airflow, Wikipedia
[11] Argo, Wikipedia
[12] John C. Maxwell, LinkedIn
[13] Dataflow, Wikipedia