I've never used Prefect, but they wrote a detailed piece called "Why Not Airflow?" that hits on many of the relevant issues: https://medium.com/the-prefect-blog/why-not-airflow-4cfa423299c4

In my own experience with Airflow I identified three major issues (some of which are covered at the above link):

  1. Scheduling is based on fixed points. (Docs here: https://airflow.apache.org/docs/stable/scheduler.html. Look how confusing that is!) When we think about schedules, we naturally think "when is this thing supposed to run?" It might be a specific time, or an interval description like "every hour" or "every day at 02:30", but it is almost certainly not "...the job instance is started once the period it covers has ended" or "The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period", as the Airflow docs put it. Our natural conception of scheduling is future-oriented, whereas Airflow's is past-oriented. One way this manifests: if I have a "daily" job and it first runs at, say, 2020-04-01T11:57:23-06:00 (roughly now), its next run will be at 2020-04-02T11:57:23-06:00. That is effectively never what I want. I want to be able to set up a job to run, e.g., daily at 11:00, and then, since it's a little after 11:00 right now, kick off a manual run immediately without impacting that future schedule. Airflow can't do this. It tries to paper over its weird notion of scheduling by supporting "@daily", "@hourly", and cron expressions, but these are all translated into its bizarre internal interval concept (see the sketch below).

(Counterpoint: their schedule model does give rise to built-in backfill support, which is cool.)
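
To make the interval semantics concrete, here's a minimal sketch against the 1.x-era API documented at the link above (the dag_id and task name are mine):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    # A "daily" DAG in the 1.x-era API. Nothing here says *when during the
    # day* to run; schedule_interval only describes the width of the data
    # period each run "covers".
    dag = DAG(
        dag_id="daily_example",
        start_date=datetime(2020, 4, 1),
        schedule_interval="@daily",  # or a timedelta, or a cron string
    )

    noop = DummyOperator(task_id="noop", dag=dag)

    # The run labeled execution_date=2020-04-01 does not start at the start
    # date; it starts at 2020-04-02T00:00, one full schedule_interval later,
    # at the END of the period it covers. This is exactly the past-oriented
    # behavior quoted from the docs above.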

  2. Schedules are optimized for machines, not humans (see https://airflow.apache.org/docs/stable/timezone.html). [Upfront note about a weird bias of mine: I am cursed to trip over every timezone bug present in any system I use. As a result I have become very picky and opinionated about timezone handling.]

We run jobs on a schedule because of human concerns, not machine concerns. Any system that forces humans to bear the load of thinking about the gnarly details of time, rather than making the machine do it, is not well designed. Originally, Airflow would only run in UTC. They have since added support for running in other timezones, but they still do not support DST, which basically means they don't actually support timezones. Now, standardizing on UTC certainly makes sense for some use cases at some firms, but for any firm headquartered in the US that mainly does business in the US, DST is a reality that affects humans, and that means we have to deal with it. If we deny that, we're going to have problems.

For example, if I run a job at 05:00 UTC-7 (a.k.a. Mountain Standard Time), chosen so that it will complete and make data available by 08:00 UTC-7 when employees start arriving at work, I am setting myself up for problems every March, when my employees change their clocks and start showing up at 08:00 UTC-6 (which is 07:00 UTC-7!) because they are now on Mountain Daylight Time. If I insist on scheduling in UTC or at a fixed UTC offset, I will probably have to move half my schedules twice a year. That's crazy! Computers can do this for us!
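The mechanics are easy to demonstrate in plain Python (stdlib zoneinfo, Python 3.9+), no Airflow required; the dates are chosen around the US DST change on 2020-03-08:

    from datetime import datetime
    from zoneinfo import ZoneInfo

    denver = ZoneInfo("America/Denver")

    # The same wall-clock time, 05:00 in Denver, on either side of the
    # March 2020 DST transition:
    winter = datetime(2020, 3, 6, 5, 0, tzinfo=denver)   # MST
    summer = datetime(2020, 3, 13, 5, 0, tzinfo=denver)  # MDT

    print(winter.utcoffset())  # -1 day, 17:00:00  -> UTC-7
    print(summer.utcoffset())  # -1 day, 18:00:00  -> UTC-6

    # A schedule pinned to 12:00 UTC fires at 05:00 Denver time in winter
    # but 06:00 Denver time in summer, an hour late relative to the humans
    # whose clocks moved. A schedule expressed as "05:00 America/Denver"
    # tracks those humans automatically.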

  3. DAGs cannot be dynamic. At the time I was seriously evaluating Airflow at Craftsy, this is what killed it.

A powerful technique in software design is to make our code data-driven. We don't often use that term, but it's a common technique, so common, in fact, that we don't much notice it anymore. The simple way to think of it: I should be able to make my software do new things by giving it new input rather than by writing new code.

Consider a page like this one (from a former employer):

https://shop.mybluprint.com/knitting/supplies/cloudborn-superwash-merino-worsted-twist-splash-yarn/60774

No doubt you've visited thousands of such pages in your life as an internet user. And as an engineer, you know how they work. See that 60774 at the end? That's an ID, and we can infer that a request router will match against this URL, pull off that ID, and look it up in a database. The result of that lookup will be fed into a template, and the result of that template rendering will be the page we see. In this way, one request handler and one template can render any product in the system, and the consequence is that adding new products requires only that we add data.
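Here's a hedged sketch of that pattern using Flask; I have no idea what mybluprint.com actually runs, and the route, table, and template names here are invented:

    import sqlite3

    from flask import Flask, render_template

    app = Flask(__name__)

    @app.route("/knitting/supplies/<slug>/<int:product_id>")
    def product_page(slug, product_id):
        # One handler serves every product: the ID pulled off the URL
        # drives a database lookup...
        conn = sqlite3.connect("shop.db")
        row = conn.execute(
            "SELECT name, price, description FROM products WHERE id = ?",
            (product_id,),
        ).fetchone()
        conn.close()
        # ...and one template renders whatever came back. Adding a product
        # means adding a row, not deploying new code.
        return render_template("product.html", name=row[0], price=row[1],
                               description=row[2])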

Airflow doesn't work this way!

In Airflow's marketing material (for lack of a better term), they say that you build up your DAG with code, and that this is better than specifying static configuration. What they don't tell you is that your DAG-constructing code is expected to evaluate to the same result every time. To change the shape of your DAG, you must release new code. Sometimes this arguably makes sense: if my DAG at v1 is A -> B, and I change it in v2 to be A -> B -> C, perhaps it makes sense for that to be a new thing, or a new version of a thing. But what if my DAG is A -> B -> C, and I want to parallelize B, perhaps over an unpredictable number of input file chunks, as in A -> {B0, B1, ..., Bn} -> C, where n is unknown until runtime? Airflow doesn't allow this, because, again, our DAG-construction code must evaluate to the same shape every run. This means that if we want data to drive our code, that data must be stored inline with the code, and we must re-deploy our code whenever that data changes.
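Here is roughly what the naive attempt looks like (1.x-era API again; count_input_chunks and the shell scripts are hypothetical):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        dag_id="fan_out",
        start_date=datetime(2020, 4, 1),
        schedule_interval="@daily",
    )

    a = BashOperator(task_id="A", bash_command="produce_chunks.sh", dag=dag)
    c = BashOperator(task_id="C", bash_command="merge_chunks.sh", dag=dag)

    # The trap: this module is evaluated at DAG-parse time, not at task
    # runtime, so n must already be known (and stable across every re-parse
    # by the scheduler, webserver, and workers). If it's task A that
    # determines the chunk count, there is no point at which this code can
    # ask it.
    n = 4  # would need to be count_input_chunks(), but see above
    for i in range(n):
        b = BashOperator(task_id="B%d" % i,
                         bash_command="process_chunk.sh %d" % i,
                         dag=dag)
        a >> b >> c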

This is not good. I have built multiple flows using Luigi that expand at runtime to thousands of dynamically constructed task nodes, and whose behavior could be adjusted between runs by adding or changing rows in a table. These flows cannot be expressed in Airflow. You will find posts suggesting the contrary (e.g. https://towardsdatascience.com/creating-a-dynamic-dag-using-apache-airflow-a7a6f3c434f3), but note what is going on there: configuration is being fed to the DAG code, but that configuration is stored with the code, and changing it requires a code push. If you can't feed it input without a code push, it's not dynamic.
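For contrast, here's a minimal sketch of Luigi's documented dynamic-dependencies mechanism, the one those flows relied on (task and file names are invented, and read_chunk_count stands in for whatever table or manifest drives the run):

    import luigi

    class ProcessChunk(luigi.Task):
        chunk_id = luigi.IntParameter()

        def output(self):
            return luigi.LocalTarget("out/chunk_%d.done" % self.chunk_id)

        def run(self):
            with self.output().open("w") as f:
                f.write("processed\n")

    class MergeChunks(luigi.Task):
        def output(self):
            return luigi.LocalTarget("out/merged.done")

        def run(self):
            # Read the fan-out width at RUNTIME, from a table, a manifest,
            # an upstream listing, rather than baking it into the module.
            n = read_chunk_count()  # hypothetical lookup
            # Yielding tasks from run() makes Luigi schedule them as dynamic
            # dependencies; execution resumes here once all are complete.
            targets = yield [ProcessChunk(chunk_id=i) for i in range(n)]
            with self.output().open("w") as f:
                f.write("merged %d chunks\n" % len(targets))

The key design difference: Luigi resolves dependencies while the flow runs, so the graph's shape is an output of the run rather than an input to it.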

Airflow and the team at Airbnb that built it deserve a lot of credit for popularizing DAG-oriented structuring of data jobs in a way that Luigi (which predates it by years) failed to do. The slick UI, built-in scheduler, and built-in job executor are likewise praiseworthy. Ultimately, though, I've found that tightly coupling your flow structure to your scheduling system is a misfeature. The fact that Luigi jobs must be initiated by an outside force is actually a powerful simplification: it means a Luigi program is just a program, one that can be run from anywhere and does not (necessarily) require complex execution infrastructure. (Prefect can be used in this way as well, or with its own supplied scheduler.)

I also concede that there is value in wholesale adoption of Airflow (or something like it) as the central unifying structure of one's data-wrangling universe. Regardless of the specific tech, having a single central scheduler is a great idea, because it makes the answers to "where is X scheduled?" and "is there anything that runs at time Y?" trivial to find. What's worrisome about Airflow specifically in that role is all the things it prevents you from doing, or allows only through dirty hacks like writing DAGs that use Luigi internally, or using code generation to push dynamism to "build time".

Lastly, I have to concede that Airflow's sheer popularity is a vote in its favor. There's a lot of enthusiasm and momentum behind it, which bodes well for future feature additions and so on. There are even managed Airflow-as-a-service products already, like Astronomer. I think it's still early, though. I've had a serious interest in dependency-structured data workflows since at least 2007, and until I encountered Luigi in 2014 I was aware of zero products addressing this need other than giant commercial monsters like Informatica. There's still a great deal of room for innovation and new players in this space.

There, that's my Airflow rant :)