Why Airflow is our go-to tool for modern workflow management

And should be yours too

Airflow Machine Learning Data Pipelines Astronomer Big Data Data Engineering

Posted by Juan Martin Pampliega

on April 20, 2022 · 9 mins read

Why We Use Airflow

A Bit Of Context...

As data becomes one of, if not the most relevant input for companies to generate real business value, Data Engineers have had to rise to the occasion. Their challenge: building increasingly complex data pipelines.

These data pipelines stand at the core of any data product:

Ingestion: your data has to come from somewhere, right? From relational databases to APIs and web-scraping. Data Engineers have to ingest from all of these completely different data sources.
Transformation: if you need to transform your data now is the time. Some pipelines only clean up ingested data others transform it into insightful business metrics. Maybe you only want to store it raw and skip this step entirely!
Storage: the data from the pipeline has to end up somewhere. Maybe a raw S3 file in your Data Lake after running an ELT process or a new table. Now data is ready to be consumed by your users, a Bussiness Intelligence tool or Machine Learning system.

These pipelines might process batch data (discrete and usually large amounts of data over a given period, like a day, a week, or a month) or streaming data (continuous amounts of data with no clear discretization or separation ).

Data pipelines are complex enough, but the devil is in the details. The previously mentioned steps must be executed in a specifc order and have dependencies between each other: how to make sure that you have ingested from all your sources before transforming and storing them? What happens if our pipeline fails to ingest one of those sources?

For example, let's say we want to count the number of new users we've got for a given date. We'd build a data pipeline that ingests customer data for the given date from two different databases, transforms this input by counting the number of users, and then stores it in a separate table. It's easy to see that if we didn't make sure these tasks were executed sequentially then our output wouldn't be correct! The same goes for failures: if we fail to ingest from any of its sources then we wouldn't really be counting all of our users.

And this is a simple scenario: what if our pipeline has even more tasks and dependencies?

Thankfully, there's a way to avoid all of these headaches: Data Engineers use workflow management tools to help them deal with all of those nasty scenarios.

Not so long ago, in a galaxy not so far away, data pipelines were executed using something like cron jobs (a [simple] job scheduler on Unix-like operating systems) However, data pipelines are becoming more complex with more dependencies and steps. What was once a simple query that selected data from two tables, joined them, and inserted the resulting data into a third table is now a complex workflow of steps with several data pipelines across different data sources and systems.

Since CRON is just a job scheduler that triggers a script or command at a specific time it proved insufficient to support this scenario. You can detect that cron and other job scheduling systems are not enough to manage modern data workflows when you need to develop custom code to handle:

Automatic retries with configurable wait times.
Backfilling old data (Eg. What if I created the workflow on the 10th of the month and I want the tool to run it for all dates since the beginning of the month and keep doing daily executions from now on).
Determining if a job was successful or not.
Running multiple non-dependent jobs in parallel and determining if all of them were successful or not to find out if a workflow execution was successful.
Checks to define if all data (or other) dependencies are met to determine if a workflow execution can start.
A monitoring solution to see high-level metrics and determine the number of jobs in execution, amount ready to start executing, amount failed, the general status of workflow execution.
Re-running part of the workflow and all dependent tasks.
Being able to version control workflow definitions and having procedural logic in the workflow definition file that might vary the composition of the workflow according to business configurations defined for a certain environment.

The answer came in the form of automated orchestration tools. In 2014, looking to address challenges in their own workflow management, Airbnb developed Apache Airflow: a tool that would make executing, troubleshooting, monitoring and adapting workflows much easier.

Airflow specializes in executing and coordinating tasks that involve discretely manipulating batches of data. If you're looking for a tool that can handle streams of data, Apache Airflow is not the tool for you, but they have some recommendations about that!

Airflow, Apache Airflow

Let’s go back to the distant past of 2014 for a minute. Our boss asked us to run an ETL job each day at midnight, so as to feed that data into daily reports and visualizations.

You created a trusty cron job in a forgotten instance, and that solution kept your boss happy for a few days. But one morning your boss taps your shoulder (remember this is 2014, so we are all working in open space offices instead of our homes!) and angrily tells you that the reports didn’t get yesterday's data, right when your boss was supposed to show them to their supervisor.

You check the cron job in your instance but you find nothing, leaving you to scratch your head in confusion. Later in the day you find out that one of the APIs you use to extract data had a small amount of downtime at night, causing your whole pipeline to fail.

That’s just one of the hidden complexities of creating an orchestration platform: How do I know how and why it failed? How do I define complex dependencies between my tasks? What if I want to retry a few minutes after failure? Is it possible to run multiple of these tasks in parallel? Does it usually fail at the same time/day? How long does it take to run?

It’s hard to ask those features out of a simple cron job!

But Airflow does all of this and more. All of these features come out of the box with Airflow:

The order of your tasks is configured in a simple Python file, called a DAG, letting you define actions on both failure and success.
Retries are also part of this DAG file, allowing you to decide what to do if your process fails!
Easily report those failures and successes to your platform of choice! It’s easy to send reports through email or Slack messages.
And best of all: you have the power of Python in your hands, to dynamically create and modify any of these features.

You’ll only need to understand the following components, making Airflow extremely simple to pick up:

DAG (Directed Acyclic Graph): Group of tasks with dependencies (workflow)
Operator: Type of task to be executed. Eg. Bash command, Python script, SQL statement, etc.
Task: Operator instance that defines one node in the DAG.
DAG Run: Instance in time when a DAG is executed. Airflow maintains control and dependencies between tasks.
Task Instance: Specific execution of a task in the context of a DAGRun.

Why We Love Airflow At Mutt Data

Easy-to-use Out Of The Box

The entry barrier for using Airflow is low as its main language is Python: the most widely used programming language for data teams.

Working with many clients translates to working with many different requirements and development environments: some use it on the Cloud, others run it on their own private servers. We find being able to easily set up and run Airflow is key to allowing us to deliver value and results efficiently.

This benefit isn´t limited to programmers. Anyone with access to Airflow´s UI can re-run jobs, check logs and troubleshoot for issues to discover if anything has gone wrong. Its friendly UI and modern web application grants access to a variety of metrics including: performance, execution times, processes status and execution tries. This makes monitoring workflows accessible. Many functions including starting, pausing and ending a workflow can be carried out directly from the application.

Team Player

Airflow facilitates the execution of tasks in commonly used providers such as Google Cloud, Amazon AWS, and Microsoft Azure. Additionally, instead of manually programming connections you can use the previously defined “Airflow Operators” allowing integration with other systems like databases, mailing alerts, and frameworks.

Scalability & Customization

Airflow allows users to scale horizontally as more tasks or processes are added to the workflow. This way you'll never be caught by surprise by a sudden compute intensive workload.

But that's not the only reason why Airflow is scalable. Since Airflow defines all of its workflows as code, DAGs can be dynamically generated. Consuming an outside resource like a database or configuration file will do the trick. Adding a new DAG for your latest product now requires no coding involved, allowing anyone to schedule powerful ETL jobs.

Being able to define workflows as Python code means Airflow enables building higher-level abstractions in a framework-like manner. You can create functions that generate a typical combination of tasks that is frequently repeated to avoid code repetition and simplify changes. Also, you could have a builder object that creates DAGs (Airflow workflows) dynamically from configuration files. This could be achieved by any procedural language but Python being a dynamic language and the defacto standard for data apps is the ideal option. These characteristics and Airflow's popularity makes it an ideal candidate to use as a baseline to construct a framework to create a DataOps style platform.

You can read more about our experience building DataOps platforms here.

Based on Python and an Open Source community, Airflow is easily extensible and customizable with a robust ecosystem of integrations and extensions.

Astronomer

If you´ve heard about Airflow before this post, you may have also heard about Astronomer. If not, you can read more about them on our latest press release where we announced our technological partnership.

Astronomer delivers orchestration for the modern data platform, and acts as a steward of the Apache Airflow project. Presently at the center of the Airflow community and ecosystem, Astronomer, is the driving force behind releases and the Airflow Roadmap.

Astronomer also offers self-hosted and SaaS managed Airflow with commercial support, building out a full data orchestration platform.

Why We Recommend Astronomer

When compared to simply using Airflow, Astronomer stands out by allowing seamless deployment, infinite scalability, greater system visibility, reduced day 2 operation times and, beyond break/fix support from experts.

As part of their goal of growing the Airflow community, Astronomer offers a myriad of useful resources including: talks, podcasts, community forums, slack channels, videos, documents and customer support. They also offer Apache Airflow certifications with training and exams that cover variables, pools, DAG, versioning, templatings, taskflow apis, sub dags, SLAs and many more concepts in order to stand out in the community as an Airflow Expert.

Getting Started Using Airflow

By now, you have probably gathered we like using Airflow and believe the platform delivers many benefits to its users. If you´re interested in learning more, we have taken the liberty of recommending some useful links. Hope you enjoy them!

After getting your feet wet, we recommend taking one of Astronomer’s certifications!

Wrapping Up

We have successfully built data and machine learning pipelines for adtech platforms, ad real time bidding systems, marketing campaign optimization, Computer Vision, NLP and many more projects using Airflow. Not only do we recommend the tool but we continue to build on our expertise and incentivize our team members to get Astronomer certified.

Need help figuring out the best way to boost your business with data? We are excited to help, you can find out more about us here.

We hope you’ve found this post useful, and at least mildly entertaining. If you like what you’ve read so far, got some mad dev skills, a Data Engineering itch to build modern data pipelines, and like applying machine learning to solve tough business challenges, send us your resume here or visit our lever account for [current team openings!]({{ site.jobs_page }})