MLOps: Task and Workflow Orchestration Tools on Kubernetes

May 28, 2021

“The public does not know what is possible. We do.”
— Akio Morita, co-founder of Sony [1]

I was given an excellent opportunity while working on a customer project to share some of my findings related to Task and Workflow Orchestration tooling analysis.

After completing a project and moving on to the next one, you realize that the landscape of available tools has changed, sometimes drastically, and with it the basis for deciding what to use in your project.

There have already been excellent posts on the topic [2, 3, 4]. This article tries not to repeat the comparisons already made, but to add a data touch to them: supporting links and code for everything that is written here. I hope this guide will be helpful to everyone who wants to make the right choice of a task and workflow orchestration framework that runs on Kubernetes.

TL;DR

All articles that compare and analyse tools provide a table, and this one is no exception. Here it is, and don’t say you didn’t find it.

Use Flyte or Prefect. Still too much? OK, use Prefect. No thanks. Find the why below.

The Problem

It is important to understand what your system should do and which part of the job you are ready to hand over to a ready-made framework. Though we refer to MLOps when it comes to “Hidden Technical Debt in Machine Learning Systems” [5], it is clear that there is technical debt in any system, whether it trains models or not. Systems are built to solve a problem, and if a system approaches production, we all know that “the only good system is a sound system” [6]. Aside from the theoretical concept that your problem can be divided into pieces and delegated to some subsystem or tooling, there is always this “glue” that is specific to a system and the problem it is trying to solve.

Sometimes it is hard to understand the implications of choosing a framework. You might read through inspiring (and, one must say, often excellent) articles on the topic [2, 3, 4] and decide that you have enough information to make the right choice, but realize later, once you start digging into the code, that it wasn’t such a great idea.

In this article we will go through examples of task and workflow orchestration with different frameworks and try to understand what those frameworks are and what their APIs look like.

What is a task or a workflow, and why do they need orchestration?

A workflow is a series of tasks that we want to perform on some request. These tasks may not only depend on each other but also pass actual data or artefacts between them.
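
To make this definition concrete, here is a minimal sketch in plain Python, assuming only the standard library (graphlib needs Python 3.9 or newer). The task names extract, transform and load and the WORKFLOW dictionary are made up for illustration and do not come from any of the frameworks discussed here. The sketch only shows tasks running in dependency order while their results are handed downstream.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical tasks: each one receives the results of its upstream tasks.
def extract(inputs):
    return [1, 2, 3]


def transform(inputs):
    return [x * 10 for x in inputs["extract"]]


def load(inputs):
    return sum(inputs["transform"])


# Task name -> (callable, names of the tasks it depends on).
WORKFLOW = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "load": (load, ["transform"]),
}


def run_workflow(workflow):
    """Run every task after its dependencies and pass results downstream."""
    graph = {name: set(deps) for name, (_, deps) in workflow.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = workflow[name]
        results[name] = func({dep: results[dep] for dep in deps})
    return results


results = run_workflow(WORKFLOW)
```

What this loop lacks is roughly what the frameworks in this article are meant to provide: a scheduler, isolation of each task and a dashboard.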

Important considerations for the selection are:

Whether you already have or want to use Kubernetes in the future
Whether you want to reuse the framework for something else

It is important to understand that the choice you make has a significant impact on your dependencies. The more dependencies you have and the more they differ from each other, the more complex your solution becomes and the more time you spend learning and managing it, until you find yourself in a situation where you have forgotten what you wanted to achieve in the first place.

A workflow orchestration tool should:

Be free / open source
Define an API for describing workflows i.e. the tasks and their dependencies
Have a centralized scheduler that would manage the workflow execution (start, cancel, wait for resources etc.)
Have an operational dashboard with runs, metrics, artefacts etc.
Be well integrated with Kubernetes
Be potentially useful for other use cases, such as CI/CD

A workflow orchestration tool should not:

Introduce too many new dependencies or concepts that go off the main topic: task and workflow orchestration
Be tightly coupled to a cloud provider, possibly introducing vendor-specific services (often described as vendor lock-in)

A workflow orchestration tool should preferably:

Be easy to install, ideally via a Helm chart

Nails and Hammers

Certainly, the main reason for writing a framework for task and workflow orchestration is an opportunity to confuse us. With such a clear goal in mind, it’s no surprise that lots of companies have developed their own tool with a cool website, GitHub source code, documentation, communities etc. and, most importantly, a logotype that will be the decisive factor if everything else appears to be the same.

To clear up the fog a bit, let’s look at this scheme:

A small remark about TFX (TensorFlow Extended): it is a set of components (or tasks) related specifically to ML with TensorFlow and has nothing to do with their orchestration.

A small remark about Apache Beam + Apache Flink: often used in combination, these are still not task and workflow orchestration frameworks but relate to what is called the Dataflow concept [13], which describes not a series of dependent steps but how a continuous data stream should be processed.

Tools that did not make it to our list for various reasons, but may be worth checking out:

Mlrun by iguazio: GitHub | Slack
Kedro by QuantumBlack: GitHub | Discourse
Couler by Antgroup: GitHub | Twitter
Dagster: GitHub | Twitter | Slack
Genie by Netflix: GitHub (Netflix’s Metaflow is on our list)

If you think you don’t have enough tools to choose from, you can take a look at:

Let’s get into exploring.

Kubeflow | Twitter | GitHub | Slack

Google started developing Kubeflow in 2018 and issued a stable release in 2020. It is based on Google’s internal method for deploying TensorFlow models, called TensorFlow Extended (TFX) [7].

Kubeflow has a lot of components that are unrelated to our main goal:

Notebook Servers: Using Jupyter notebooks in Kubeflow
KFServing: Model deployment and serving toolkit
Katib: Hyperparameter tuning and neural architecture search
Training Operators: Training of ML models in Kubeflow through operators
Multi-Tenancy: Multi-user isolation and identity access management (IAM)

Luckily, Kubeflow Pipelines, the component for workflow orchestration, can be installed and used separately from all other components and includes a UI, an SDK and an API.

Installation

Kubeflow Pipelines can only be installed with kustomize; a Helm installation is not possible:

Example

A full example can be found here.

Dashboard

The central UI is a rich tool with interactive control over the pipelines and provides full control over the runs, including visualizations.

Comments

Kubeflow Pipelines are a good, solid way to define a workflow and its steps in Python, and that is certainly an advantage. But under the hood things get translated to the YAML files of our other contenders:

These frameworks define a YAML based way to interact with the underlying scheduler and we’ll review those separately.

MLflow | Twitter | GitHub | Slack

MLflow was created by Databricks and released in 2018 [8]. It is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.

As with Kubeflow, some of its concepts are unrelated to our task:

MLflow Tracking: Record and query experiments: code, data, config, results
MLflow Projects: Package data science code in a format to reproduce runs
MLflow Models: Deploy machine learning models
Model Registry: Store, annotate and discover experiment artefacts

MLflow is exceptionally good at managing ML experiments, but running multistep workflows with MLflow Projects on Kubernetes (experimental) is more of an exception than a core feature of MLflow.

When you run an MLflow Project on Kubernetes, MLflow constructs a new Docker image containing the Project’s contents; this image inherits from the Project’s Docker environment. MLflow then pushes the new Project image to your specified Docker registry and starts a Kubernetes Job on your specified Kubernetes cluster. This Kubernetes Job downloads the Project image and starts a corresponding Docker container. Finally, the container invokes your Project’s entry point, logging parameters, tags, metrics, and artifacts to your MLflow tracking server.
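
To get a feeling for what “starts a Kubernetes Job” boils down to, here is a rough sketch that builds a Kubernetes batch/v1 Job manifest as a plain Python dictionary. This is not MLflow’s code and does not use MLflow’s API. The function build_job_manifest, the job name, the image and the command are all hypothetical placeholders; only the manifest fields themselves are standard Kubernetes.

```python
import json


def build_job_manifest(job_name, image, command):
    """Return a Kubernetes batch/v1 Job manifest as a plain dictionary."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed run
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": job_name, "image": image, "command": command}
                    ],
                }
            },
        },
    }


# All three values below are hypothetical placeholders.
manifest = build_job_manifest(
    job_name="example-project-run",
    image="registry.example.com/example-project:latest",
    command=["python", "train.py", "--alpha", "0.5"],
)

# JSON is a subset of YAML, so this text could be saved and given to kubectl.
manifest_json = json.dumps(manifest, indent=2)
```

The container entry is where the pushed Project image and the entry point command would end up.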

Installation

Example

A step in the workflow:

A workflow:

A run:

A full example can be found here.

Dashboard

The UI tracks all the experiment runs and provides access to resulting artefacts and stored models.

Comments

You could potentially run your workflows in MLflow, but to me that looks like doing most things (scheduling, for example) from scratch by yourself. That doesn’t mean it’s a bad framework; it just means it wasn’t developed for workflow and task orchestration.

Metaflow | Twitter | GitHub | Gitter

Metaflow was developed by Netflix and released to the public in 2019. If you find it hard to read a lot of text or you are a scientific-comics-type of person you can check out their why section.

It is tightly (and I mean very-very) coupled to AWS. It is stated that technically it can run anywhere, but one can tell from the relations to AWS services that this is not going to happen:

Datastore: Amazon S3
Compute: AWS Batch
Metadata: AWS Fargate + Amazon RDS
Notebooks: Amazon Sagemaker Notebooks
Scheduling: AWS Step Functions + Amazon EventBridge
Large-scale ML: Sagemaker Models

There is a good article on how to use Metaflow with Kubernetes, but just by the amount of text you can tell that there might be some complications involved.

The steps of a workflow can be encapsulated in Python methods with a decorator, but the dependencies are managed in a rather unconventional way.
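
A small standard-library imitation of this style may help. I am assuming here that “unconventional” refers to each step naming its successor from inside its own body instead of declaring the graph in one place. The step decorator, the MiniFlow base class and the HelloFlow example below are hypothetical stand-ins written for this sketch, not Metaflow’s actual classes.

```python
def step(func):
    """Mark a method as a workflow step (stand-in for a framework decorator)."""
    func.is_step = True
    return func


class MiniFlow:
    """Hypothetical base class: runs steps by following next() calls."""

    def next(self, target):
        # Remember which step should run after the current one.
        self._next_step = target

    def run(self):
        executed = []
        current = self.start
        while current is not None:
            if not getattr(current, "is_step", False):
                raise TypeError(f"{current.__name__} is not decorated with @step")
            self._next_step = None
            current()
            executed.append(current.__name__)
            current = self._next_step
        return executed


class HelloFlow(MiniFlow):
    @step
    def start(self):
        self.numbers = [1, 2, 3]  # state is kept on the flow object
        self.next(self.double)

    @step
    def double(self):
        self.numbers = [n * 2 for n in self.numbers]
        self.next(self.end)

    @step
    def end(self):
        self.total = sum(self.numbers)  # no next() call: the flow stops here


flow = HelloFlow()
order = flow.run()
```

The consequence of this design is that the shape of the workflow is only visible by reading every step body, since there is no single place where the dependencies are listed.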

Installation

Example

A workflow with for each and scheduling.

Dashboard

There was a request for a dashboard, but it never really took off. You can do everything in your cosy notebook, can’t you? I mean, it’s not a problem for you or your colleagues, is it?.. Well, never mind.

Comments

This framework is a very vendor-specific way to manage workflows, and there are no obvious benefits that favour it over the other options.

Flyte | Twitter | GitHub | Slack

Flyte was created by Lyft and released to the public in 2019. The documentation for most parts is, I cite: “Coming soon 🛠”. The better news is that most of the information is available on GitHub and lots of code speaks for itself [don’t forget to put a joke here].

From the description provided it’s clear that this framework is exactly what we need:

Flyte is more than a workflow engine — it provides workflow as a core concept and a single unit of execution called task as a top level concept. Multiple tasks arranged in a data producer-consumer order create a workflow.
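
The “data producer-consumer order” can be illustrated without any framework. In the sketch below, which is my own illustration and not Flyte’s API, every task is a plain function and its parameter names are assumed to be the names of the tasks that produce its inputs. The execution order is then derived from the data alone. All function names are hypothetical.

```python
import inspect
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical tasks: a parameter name refers to the task producing that value.
def raw_data():
    return [3, 1, 2]


def sorted_data(raw_data):
    return sorted(raw_data)


def largest(sorted_data):
    return sorted_data[-1]


def smallest(sorted_data):
    return sorted_data[0]


def spread(largest, smallest):
    return largest - smallest


def run_by_dataflow(tasks):
    """Order tasks by who consumes whose output, then run them."""
    by_name = {task.__name__: task for task in tasks}
    graph = {
        name: set(inspect.signature(task).parameters)
        for name, task in by_name.items()
    }
    values = {}
    for name in TopologicalSorter(graph).static_order():
        task = by_name[name]
        kwargs = {p: values[p] for p in inspect.signature(task).parameters}
        values[name] = task(**kwargs)
    return values


# The order in which tasks are listed does not matter.
values = run_by_dataflow([spread, smallest, largest, sorted_data, raw_data])
```

No dependency is declared explicitly anywhere; the workflow graph falls out of which task consumes which value.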

It’s Kubernetes-native, can do Python and is written in Go. I’d be really surprised if that didn’t do the trick. Let’s see what’s in there.

Installation

At the time of writing, a Helm chart installation is still in progress. As with Kubeflow, there is some kustomization involved, but for a start the installation can be pretty simple:

Example

Dashboard

“All you need is… a dashboard”. And a really good one.

Comments

Given the gaps in the official website documentation, one might think that this framework is not mature enough, but don’t be mistaken. It is one of the most powerful contenders in the list, with high attention to clean code and ease of use, specifically targeted at the use case analysed here, with excellent tooling, a plugin system for extensions, no strange unrelated concepts, and the list goes on and on.

These comments might be summarized in the following: “It’s awesome”.

(You may want to fix the documentation on the website lads)

ZenML | Twitter | GitHub | Slack

“Ichi Wa Zen, Zen Wa Ichi.”

ZenML was released in 2020, and there is even an explanation of the name in case you didn’t get it straight away from the quote above. Interestingly, the German company maiot GmbH that developed it is supported by German and European authorities, and I absolutely love the ironic team introduction.

It is built on top of TFX (TensorFlow Extended) and adds a designer touch to all aspects of managing individual Steps in Pipelines with Datasources running on different Backends.

Installation

Example

Dashboard

Surprisingly, there was no information on a dedicated dashboard for ZenML.

Comments

It’s a very tidy and beautiful framework in all aspects. It meets all of the requirements and certainly is on top of the hit parade. If you’re up for using this framework you probably need to answer the following questions:

Do I want beauty everywhere?
Does everything need to be absolutely perfect?
What’d I do if I face some ugly reality outside of my Zen framework?
Is my code as perfect as my framework?

I might have another problem with it besides the absence of a dashboard: while working on my grumpy problem with lots of sharp edges and unclear bounds, it’s easy to forget what the original goal was and end up in full Zen, meditating in the rock garden and realizing:

“Nothing was, nothing will be; everything is, everything has being and presence.”— Hermann Hesse, Siddhartha [9]

« Revenons à nos moutons. »

Airflow | Twitter | GitHub | Slack

Apache Airflow started at Airbnb in 2014, became an Apache Incubator project in 2016 and a top-level Apache project in 2019 [10].

There are multiple ways to use Airflow with Kubernetes:

Using the KubernetesPodOperator
Using the KubernetesExecutor (a bit like the previous method on steroids)
Using KEDA (Kubernetes Event-driven Autoscaling), an approach promoted by Astronomer

The last method is the latest and the most attractive, as it comes with a Helm chart for the whole installation, and it is the one we are going to look into in the example below.

Installation

Example

Dashboard

Comments

One of the oldest contenders in our list has come a long way, with lots of people using it. Apache projects have a special governance model, which they call “T̶h̶i̶s̶ ̶i̶s̶ ̶t̶h̶e̶ ̶w̶a̶y̶” “The Apache Way”. It is certainly a powerful tool with lots of functionality and good integrations. But compared to some other frameworks from our list it feels a bit raw, with lots of unrelated details and complications.

Argo | Twitter | GitHub | Slack

Argo was the ship on which Jason and the Argonauts sailed from Iolcos to Colchis to retrieve the Golden Fleece [11]… In 2021, however, it stands for “Get stuff done with Kubernetes”. The project was started in 2017 by Applatix and is very dynamic, with lots of releases.

Argo Workflows is implemented as a Kubernetes CRD (Custom Resource Definition), and you might have read already that it’s one of the possible engines powering Kubeflow Pipelines.

It also has some parts unrelated to pure workflow orchestration, but they look very appealing in their own right:

Continuous Delivery: Declarative Continuous Delivery following GitOps
Rollouts: Additional Kubernetes deployment strategies such as Blue-Green and Canary
Events: Event based dependency manager for Kubernetes

It is important to understand that pipelines in Argo are YAML-based and that all tools on top add a Python touch to it, basically converting the code to YAML at some point.
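
The “Python touch” can be pictured as code that assembles the manifest for you. The sketch below is my own minimal illustration with the standard library, not the code of Kubeflow Pipelines or any other tool. The helper build_workflow, the step names and the images are hypothetical; the manifest layout (a Workflow with an entrypoint, a DAG template and container templates) follows the Argo Workflows format as far as I know it, so check it against the Argo documentation before relying on it.

```python
import json


def build_workflow(steps, dependencies):
    """Build an Argo-style Workflow manifest as a plain dictionary.

    steps: task name -> (container image, command list)
    dependencies: task name -> list of upstream task names
    """
    dag_tasks = []
    for name in steps:
        task = {"name": name, "template": name}
        if dependencies.get(name):
            task["dependencies"] = list(dependencies[name])
        dag_tasks.append(task)

    # The first template is the DAG, the rest are the containers it refers to.
    templates = [{"name": "main", "dag": {"tasks": dag_tasks}}]
    for name, (image, command) in steps.items():
        templates.append(
            {"name": name, "container": {"image": image, "command": command}}
        )

    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "example-"},
        "spec": {"entrypoint": "main", "templates": templates},
    }


# Hypothetical two-step pipeline: "train" runs after "prepare".
workflow = build_workflow(
    steps={
        "prepare": ("python:3.9", ["python", "-c", "print('prepare')"]),
        "train": ("python:3.9", ["python", "-c", "print('train')"]),
    },
    dependencies={"train": ["prepare"]},
)

# JSON is a subset of YAML, so this text is already a valid YAML document.
workflow_json = json.dumps(workflow, indent=2)
```

Whatever convenience a Python SDK adds, the scheduler in the cluster only ever sees a document like the one in workflow_json.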

Installation

Example

Dashboard

It’s old, rich and beautiful. That is certainly a “✓” in our analysis.

Comments

Argo is the right tool for workflow and task orchestration. It has a scheduler, a dashboard and potentially useful related functionality. But it may be just too little to be used on its own. Maybe programming in Python is slightly more convenient than writing YAML, and that is why Kubeflow Pipelines provide a layer on top.

Tekton | Twitter | GitHub | Slack

Tekton is very similar to Argo and can be used interchangeably. Companies such as Red Hat and IBM seem to favour it over others, and there is even a relation to Jenkins, in case you have missed a good old friend to do your workflows with.

Interestingly, there is a Hub that provides reusable components for common workflows.

Installation

Example

Dashboard

Comments

It’s simple and gets the job done. I have noticed that a lot of people use Argo and Tekton at the same time, and it is still a question why: both frameworks provide enough task and workflow orchestration functionality in the direction of CI/CD. My take on this: the fewer dependencies you have, the better.

Prefect | Twitter | GitHub | Slack

Prefect has been around since 2018 and seemingly has no stable release at the time of writing. But that doesn’t mean the framework is bad. On the contrary: it’s one of the best in our list and arguably has the best API among the competition. The company behind it, Prefect Technologies, Inc., provides hosted task and workflow orchestration services. Certainly, the greatest unfair advantage of this company is the collection of portrait overlays on their website.

Installation

Example

Dashboard

Comments

A short summary: this tool is great. And if you were wondering “Why not Airflow?”, your curiosity might be satisfied by this post. As with Tekton Hub, there is a task library and no shortage of options in there.

Luigi | GitHub

Luigi has been around since 2012 and was developed by Spotify. It’s not a mistake that no website or Slack is mentioned above: the GitHub project is all there is. Actually, it’s the only project that doesn’t try to make a big deal out of itself. And it comes with some philosophy:

Conceptually, Luigi is similar to GNU Make where you have certain tasks and these tasks in turn may have dependencies on other tasks.
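
The Make analogy is easy to sketch with the standard library: a task counts as done when its output file exists, and building a task first builds whatever it requires. The Task base class, the build function and the two example tasks below are hypothetical and written for this illustration. They loosely borrow the vocabulary of requires and run but are not Luigi’s implementation.

```python
import os
import tempfile


class Task:
    """Hypothetical base class: a task is complete when its output file exists."""

    output_path = None

    def requires(self):
        return []

    def complete(self):
        return os.path.exists(self.output_path)

    def run(self):
        raise NotImplementedError


def build(task, log):
    """Run missing dependencies first, then the task itself, like make does."""
    if task.complete():
        return
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(type(task).__name__)


workdir = tempfile.mkdtemp()  # scratch directory for the example outputs


class WriteNumbers(Task):
    output_path = os.path.join(workdir, "numbers.txt")

    def run(self):
        with open(self.output_path, "w") as f:
            f.write("1\n2\n3\n")


class SumNumbers(Task):
    output_path = os.path.join(workdir, "sum.txt")

    def requires(self):
        return [WriteNumbers()]

    def run(self):
        with open(WriteNumbers.output_path) as src:
            total = sum(int(line) for line in src)
        with open(self.output_path, "w") as dst:
            dst.write(str(total))


first_run, second_run = [], []
build(SumNumbers(), first_run)   # runs both tasks
build(SumNumbers(), second_run)  # nothing to do, outputs already exist
```

Running the build a second time does nothing, which is exactly the behaviour one expects from make when all targets are up to date.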

Installation

Example

Dashboard

Comments

The simplest of them all. If you’re looking for simplicity and light dependencies, and if Hadoop, Hive, Pig or Cascading mean anything to you, this would be a great choice for your system.

Conclusion

“People can be in the same place sharing the same experience at the same time, but they can walk away from it having seen very different things.”―John C. Maxwell [12]

All of this makes sense if you employ Kubernetes in your system. Because if you don’t, you have a different stack and your problem might be much simpler.

There is a simple logic to make the right choices easier. You define a goal and stick with it. And if this is fixed then you can validate everything against this goal with red or yellow flags. This will filter your options significantly, making it easier to decide.
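
As a sketch of this filtering logic, assume that a missing must-have requirement raises a red flag and a missing nice-to-have raises a yellow one. The requirement names, the MUST_HAVE split and the three candidates below are entirely hypothetical and are not a rating of the tools in this article.

```python
REQUIREMENTS = ["open_source", "python_api", "dashboard", "helm_chart"]
MUST_HAVE = {"open_source", "dashboard"}  # hypothetical split, adjust to your goal

# Hypothetical candidates, not the tools from this article.
CANDIDATES = {
    "tool_a": {"open_source": True, "python_api": True, "dashboard": True, "helm_chart": False},
    "tool_b": {"open_source": True, "python_api": False, "dashboard": False, "helm_chart": True},
    "tool_c": {"open_source": True, "python_api": True, "dashboard": True, "helm_chart": True},
}


def flags(features):
    """Return (red, yellow): unmet must-haves and unmet nice-to-haves."""
    red = [r for r in REQUIREMENTS if r in MUST_HAVE and not features[r]]
    yellow = [r for r in REQUIREMENTS if r not in MUST_HAVE and not features[r]]
    return red, yellow


def shortlist(candidates):
    """Drop candidates with red flags, then sort by number of yellow flags."""
    rated = {name: flags(features) for name, features in candidates.items()}
    keep = [name for name, (red, _) in rated.items() if not red]
    return sorted(keep, key=lambda name: len(rated[name][1]))


ranking = shortlist(CANDIDATES)
```

The point is not the code but the discipline: the requirements are written down before looking at the candidates, so a nice logo cannot change the outcome.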

In this analysis we explored 10 frameworks for workflow and task orchestration and found that some of them did not meet the expectations we had in mind.

For example, managing tasks and workflows with a Python API is slightly more convenient than with YAML, and having a dashboard makes collaboration within the team much easier.

The leaders are Flyte, Prefect and Luigi. We can shorten this list even further by dropping Luigi as maybe too simple compared to the others.

Deciding between Flyte and Prefect is hard, because in the context of task and workflow orchestration they are equally excellent tools (though Flyte might fix the docs on their website and the installation via helm chart).

I encourage everyone to try them out and decide for themselves.

Fail fast, fail cheap, fail smart.

If you need a simple and clear answer―use Prefect.

Thanks

I would like to thank the following people for their support and contribution to this article:

My colleagues:

Dimitri Torfs for his careful review and valuable feedback.
Ales Novak for his remarks and overall support.
Gert Ceulemans and Olivier Elshocht for believing in the idea of this article and having enough patience to see it finished.

My friends:

Ganna Shchygol for her review and rich findings.

References

[1] Akio Morita, Wikipedia
[2] Picking A Kubernetes Orchestrator: Airflow, Argo, and Prefect
[3] Airflow vs. Luigi vs. Argo vs. MLFlow vs. KubeFlow
[4] How To Productize ML Faster With MLOps Automation
[5] Hidden Technical Debt in Machine Learning Systems
[6] Blackout JA — The Only Good System Is A Sound System Live & Direct at YouTube
[7] Kubeflow, Wikipedia
[8] Introducing MLflow: an Open Source Machine Learning Platform
[9] Siddhartha by Hermann Hesse
[10] Apache Airflow, Wikipedia
[11] Argo, Wikipedia
[12] John C. Maxwell, LinkedIn
[13] Dataflow, Wikipedia

Written by Anton Chernov