Orchestrating ML workflows with Airflow
Remember when you, as a data scientist, worked on an exciting idea or project, conducted exhaustive data exploration and cleaning, tested super interesting modeling techniques, and then hoped to somehow make it to production? Most of us know how this story continues. Despite the fact that companies are very excited to introduce Machine Learning capabilities, there are still many business adoption pain points and technical challenges along the way.
Getting ML into production is hard. Who can help you ship it? Who would own, monitor and maintain the models? Would you just hand over your code to another team and call your job done? Or does the line end at a Jupyter notebook? Nowadays, solving the ML problem itself is important, but it's just one piece of the puzzle in a successful machine-learning project.
The rise of MLOps
At Productboard, we have always considered Machine Learning a complex, end-to-end mission. In other words, data scientists should be able to translate business problems into data problems, research different types of solutions, write solid code, and make sure the solution behaves as designed.
To this end, we started exploring MLOps and experimenting with ML as engineers do. This includes packaging production-ready code, setting up CI/CD pipelines, and deploying and monitoring production services. No worries, product implementation is still a task for our software engineers 🙂
First boss fight
Our MLOps journey began after we finished researching our pilot ML project. We ended up with standard ML workflows that pull data, clean it, and retrain models for thousands of customers every day. So we started looking for a place under the sun (read: Kubernetes) where our ML pipelines could happily prosper.
Our requirements for such a framework were:
- Production-ready: Battle-tested by mature companies, which also serves as a good indication of its limitations
- Extensible & flexible: Configuration as code, usable for any future ML project or even non-ML initiatives (Data team, wink wink)
- Easy to set up: Deploying the framework in Kubernetes should work out of the box, and running it locally should be possible as well
- Active development: The team behind the framework should have a solid feature roadmap and be active in fixing issues
- Community & documentation: Ability to search for answers and advice when struggling
- Future-ready: Vendor and programming language agnostic and easily scalable
After some quick research of the ML landscape, we ended up with the following shortlist (as of the beginning of 2021):
- Airflow: Open-source workflow orchestrator originally developed at Airbnb. We were excited about KubernetesPodOperator as well as decent documentation, solid hype, and community. However, we were not sure whether Airflow would handle our ML needs.
- Kubeflow: Open-source ML toolkit for Kubernetes. We liked the ML-oriented tools such as Jupyter Notebooks, as well as its model serving, decent documentation and community. However, we struggled with the syntax and Pipelines developer experience.
- In addition to these big players, we found a few niche tools such as Prefect, Argo and MLflow. We also considered SageMaker, but none of the tools was full-stack or robust enough.
After hands-on testing of Airflow and Kubeflow, we had no major objections to either tool. Kubeflow offers more than just task orchestration, but we felt it had a steeper learning curve, whereas Airflow focuses on general workflows rather than specializing in ML.
Our Airflow journey
After more than a year of using Airflow, we are enjoying the ride despite a few hiccups, mostly due to our inexperience. We use the community Helm chart for easy out-of-the-box Kubernetes deployment. Recently, the official Helm chart and Airflow 2 were finally released.
Our setup is rather basic. We have multiple ML code repositories and push DAGs to one central repository via CI/CD pipelines. Our production code is packaged as Docker images, built and pushed to a central image repository. Airflow syncs the DAGs from the central repository, pulls the images with the ML code from ECR, and runs tasks based on the DAG definitions. This gives us endless opportunities to run any code that can be packaged in Docker.
In DAGs, we use KubernetesPodOperator along with TaskGroup to dynamically create as many tasks as needed, segregating each customer's data in its own Kubernetes Pod.
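A minimal sketch of this pattern follows. The DAG name, image, namespace, and customer list are all illustrative placeholders, not our actual configuration, and the exact operator import path varies between cncf.kubernetes provider versions:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)
from airflow.utils.task_group import TaskGroup

# Hypothetical customer list; in practice this could come from a config
# file or an external service at DAG-parse time.
CUSTOMER_IDS = ["acme", "globex", "initech"]

with DAG(
    dag_id="retrain_models",
    schedule_interval="@daily",
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    with TaskGroup(group_id="retrain_per_customer") as retrain:
        for customer_id in CUSTOMER_IDS:
            # One Pod per customer keeps each customer's data isolated.
            KubernetesPodOperator(
                task_id=f"retrain_{customer_id}",
                name=f"retrain-{customer_id}",
                namespace="airflow",
                image="ml-retrain:latest",  # illustrative image name
                arguments=["--customer-id", customer_id],
            )
```

Because the loop runs when Airflow parses the DAG file, the number of tasks tracks the customer list without any manual changes to the DAG.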
Airflow is a Kubernetes-friendly project, so we can use constraints such as affinities, tolerations, annotations, concurrency limits, and resource management for each task to keep our workloads under control. One example could be defining the node where the tasks should run:
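For illustration, a task pinned to a dedicated node group might look like the sketch below. The node label, toleration, resource values, and image name are made up, and these parameters have changed shape across provider versions, so treat this as a pre-Airflow-2-era sketch rather than a drop-in snippet:

```python
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

train = KubernetesPodOperator(
    task_id="train_model",
    name="train-model",
    namespace="airflow",
    image="ml-train:latest",  # illustrative image name
    # Schedule the Pod only on nodes labeled for ML production workloads.
    node_selector={"nodegroup": "ml-production"},
    # Tolerate the taint that keeps other workloads off those nodes.
    tolerations=[{
        "key": "dedicated",
        "operator": "Equal",
        "value": "ml-production",
        "effect": "NoSchedule",
    }],
    # Request and cap resources so one task cannot starve the node.
    resources={
        "request_cpu": "1", "request_memory": "2Gi",
        "limit_cpu": "2", "limit_memory": "4Gi",
    },
)
```

Combined with cluster autoscaling, these constraints let the node group grow only when matching tasks are queued.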
Currently, we run around 5,000 tasks per day on nodes backed by c5.2xlarge instances. Our node group scales between zero and 20 nodes: if no tasks are running, we don't waste any money, and if there is a high peak of queued tasks, autoscaling kicks in and Airflow tasks take as many resources as needed. We also use a separate research node group for unpredictable tasks so they don't draw resources from production workflows.
Challenges, lessons & future opportunities
- Data handover: We naively assumed that a dataset cleaned in one task could simply be passed on to a training task. However, Airflow has no out-of-the-box data backend except XCom, which is suitable for passing small values but not large datasets. At first, we attached EFS volumes, which worked great, but after some time we abandoned the multi-task approach altogether: all logical steps of the ML code now run in a single container. This way, we don't need to deal with deleting leftover data or coordinating multiple containers, and the DAGs are simpler.
- Internal metastore: Airflow uses a database to store configuration and data related to tasks and DAGs. With thousands of tasks running every day, the database grew significantly over several months, and we started to experience timeouts and restarts of the metrics endpoint. To fix this, we set up an internal DAG that cleans the metastore and removes old logs and records, using teamclairvoyant/airflow-maintenance-dags as an inspiration.
- Central DAG repository: We found this solution to be the least painful, though still far from perfect. With a growing team, or multiple teams, it becomes a single point of failure where an unintended commit or mistake could erase others' tasks. Apple, with its multi-tenant Airflow, eventually had to figure this out as well.
- DAGs: With more complicated workflows, DAG files became messy and hard to read. The files are full of configuration, and it's easy to overlook something or make a mistake. We are considering introducing external configs or automating DAG creation, with inspiration from ajbosco/dag-factory.
- Monitoring: Airflow comes with an internal UI. However, it is almost impossible to navigate across thousands of data points and get a consistent, aggregated view. You can use epoch8/airflow-exporter to export aggregated metrics to Prometheus, for example, and then visualize them in Grafana. An alternative approach is querying the internal metastore database directly to get an aggregated picture of what is happening in Airflow.
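As a sketch of that last approach, aggregated task counts can be pulled straight from the metastore's `task_instance` table. The snippet below runs against an in-memory SQLite stand-in for illustration (production Airflow typically uses Postgres or MySQL, and the real table has many more columns than the two assumed here):

```python
import sqlite3


def task_state_counts(conn):
    """Count task instances per DAG and state from the Airflow metastore.

    Assumes the standard `task_instance` table with `dag_id` and `state`
    columns; works the same against Postgres with a different driver.
    """
    cur = conn.execute(
        "SELECT dag_id, state, COUNT(*) FROM task_instance "
        "GROUP BY dag_id, state"
    )
    return {(dag_id, state): n for dag_id, state, n in cur}


if __name__ == "__main__":
    # Tiny in-memory stand-in for the metastore.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE task_instance (dag_id TEXT, state TEXT)")
    conn.executemany(
        "INSERT INTO task_instance VALUES (?, ?)",
        [("retrain", "success"), ("retrain", "success"),
         ("retrain", "failed"), ("cleanup", "success")],
    )
    print(task_state_counts(conn))
```

A query like this can feed a Grafana dashboard or a simple alerting script without touching the Airflow UI at all.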
Even though one might question why a data scientist would deploy and maintain such a system, we found ourselves enjoying this journey. It gives us a lot of independence and ownership over our ML workflows and brings us closer to production. We also feel very positive about how other companies are using Airflow, which gives us more confidence to tackle more complex ML systems in the future.
Interested in joining our growing team in a different role? Well, we’re hiring across the board! Check out our careers page for the latest vacancies.