Type: New Feature
Affects Version/s: None
Fix Version/s: None
Currently Airflow runs on a date basis. All the scheduling and running logic assumes that ETLs depend on the date they are run. However, there is another set of use cases where it is not the date that varies, but the dataset itself.
One example application is processing genomic data. This data doesn't change, and the use case is to run all the DAGs you have on samples, rather than on dates. The same applies to services that need to run a set of operations on a dataset exactly once.
For now, one way to solve this is to create a DAG per user, schedule it with None, and trigger it manually from the UI or CLI. The drawback is that each DAG then shows only one column in the dates, since new datasets just create new DAGs.
Of course, backfill processes would run a specific DAG on all the samples, rather than on just a specific one.
The features of such a system would be as follows:
* Dates are irrelevant: different dates produce the same output for the same dataset, so only one run per dataset is required
* Date-based scheduling is irrelevant: the addition of a new dataset is the only thing that would trigger new DAGRuns
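To make the intended semantics concrete, here is a minimal sketch in plain Python. It does not use Airflow, and it is not how Airflow works today. The class `DatasetScheduler` and its methods `add_dag`, `add_dataset` and `backfill` are hypothetical names made up for illustration. The sketch assumes a run is identified by a (DAG, dataset) pair instead of a (DAG, date) pair.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

RunKey = Tuple[str, str]  # (dag_id, dataset_id); hypothetical replacement for (dag_id, date)


@dataclass
class DatasetScheduler:
    """Toy scheduler (hypothetical) that keys runs by dataset instead of by date."""

    dags: Dict[str, Callable[[str], None]] = field(default_factory=dict)
    datasets: List[str] = field(default_factory=list)
    completed: Set[RunKey] = field(default_factory=set)

    def _run_once(self, dag_id: str, dataset_id: str) -> bool:
        """Run a DAG on a dataset unless that pair has already run."""
        key = (dag_id, dataset_id)
        if key in self.completed:
            return False  # only one run per dataset is required
        self.dags[dag_id](dataset_id)
        self.completed.add(key)
        return True

    def add_dag(self, dag_id: str, task: Callable[[str], None]) -> None:
        """Register a DAG; nothing runs until a dataset arrives or backfill is called."""
        self.dags[dag_id] = task

    def add_dataset(self, dataset_id: str) -> List[RunKey]:
        """A new dataset is the only trigger: run every registered DAG on it."""
        if dataset_id not in self.datasets:
            self.datasets.append(dataset_id)
        return [(d, dataset_id) for d in self.dags if self._run_once(d, dataset_id)]

    def backfill(self, dag_id: str) -> List[RunKey]:
        """Run one DAG on every known dataset it has not processed yet."""
        return [(dag_id, ds) for ds in self.datasets if self._run_once(dag_id, ds)]


if __name__ == "__main__":
    scheduler = DatasetScheduler()
    scheduler.add_dag("align_reads", lambda sample: print("align_reads on", sample))
    scheduler.add_dataset("sample_A")  # triggers align_reads on sample_A
    scheduler.add_dataset("sample_A")  # same dataset again: no new run
    scheduler.add_dag("call_variants", lambda sample: print("call_variants on", sample))
    scheduler.backfill("call_variants")  # runs the new DAG on all known samples
```

In this sketch the set of completed (DAG, dataset) pairs plays the role that execution dates play in the current model, which is why re-adding a dataset or repeating a backfill does nothing.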
There are a few questions I would like to ask:
* How tightly coupled is the current design of the scheduler/executors in Airflow to dates?
* Is this a contribution someone would be interested in (besides me)?
* Is there any work in progress on a similar feature?