  1. Apache Airflow
  2. AIRFLOW-72

Implement proper capacity scheduler


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Auto Closed
    • None
    • 2.0.0
    • Component/s: scheduler

    Description

      The scheduler is supposed to maintain queues and pools according to a "capacity" model. However, this model is not currently implemented properly. As a result, pools can be oversubscribed, race conditions exist in queuing/dequeuing, and there are probably other issues as well.

      This Jira epic tracks all issues related to pooling/queuing and the (TBD) roadmap to a proper capacity scheduler.

      Why queuing/scheduling is broken:

      Locking is not properly implemented, and cannot be, because the check for slot availability is spread throughout the scheduler, taskinstance and executor. This makes obtaining a slot non-atomic and results in oversubscription. It also leads to race conditions, such as two tasks being picked from the queue at the same time, because the scheduler determines that a queued task still needs to be sent to the executor although an earlier run already sent it.
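
      The non-atomic check can be illustrated with a small, deterministic sketch. NaivePool and interleaved_run are hypothetical names made up for this illustration, not Airflow code; the interleaving is written out by hand rather than produced by real concurrency.

```python
class NaivePool:
    """Hypothetical pool whose availability check and claim are separate steps."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def has_free_slot(self):
        # Step 1: the check, done without holding any lock.
        return self.used < self.capacity

    def claim(self):
        # Step 2: the claim, which trusts a check made earlier.
        self.used += 1


def interleaved_run(pool):
    # Both callers check before either one claims. Separate code paths
    # cannot rule out this interleaving without a shared lock.
    scheduler_ok = pool.has_free_slot()
    task_instance_ok = pool.has_free_slot()
    if scheduler_ok:
        pool.claim()
    if task_instance_ok:
        pool.claim()
    return pool.used
```

      With a pool of capacity 1, both checks pass and both claims succeed, so the pool ends up holding two slots.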

      To fix this, Pool handling needs to be centralized (code-wise) and use a mutex (with_for_update()) on the database records. The scheduler/taskinstance can then do something like:

      slot = Pool.obtain_slot(pool_id)
      Pool.release_slot(slot)
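
      A minimal sketch of what such a centralized Pool could look like, under stated assumptions: the pool and pool_slot tables, the instance-based Pool class and make_connection are all hypothetical, and the standard-library sqlite3 module is used so the example is self-contained. SQLite has no row-level SELECT ... FOR UPDATE, so BEGIN IMMEDIATE stands in here for the row lock that with_for_update() would take on a database that supports it.

```python
import sqlite3
import uuid

# Hypothetical schema: one row per pool, one row per slot currently held.
SCHEMA = """
CREATE TABLE pool (id TEXT PRIMARY KEY, capacity INTEGER NOT NULL);
CREATE TABLE pool_slot (
    id TEXT PRIMARY KEY,
    pool_id TEXT NOT NULL REFERENCES pool(id)
);
"""


class Pool:
    """Single place where slots are taken and returned (illustrative only)."""

    def __init__(self, conn):
        self.conn = conn

    def obtain_slot(self, pool_id):
        """Return a new slot id, or None if the pool is full or unknown."""
        # BEGIN IMMEDIATE takes SQLite's write lock up front, so the count
        # and the insert below cannot interleave with another writer.
        self.conn.execute("BEGIN IMMEDIATE")
        try:
            row = self.conn.execute(
                "SELECT capacity FROM pool WHERE id = ?", (pool_id,)
            ).fetchone()
            used = self.conn.execute(
                "SELECT COUNT(*) FROM pool_slot WHERE pool_id = ?", (pool_id,)
            ).fetchone()[0]
            slot = None
            if row is not None and used < row[0]:
                slot = uuid.uuid4().hex
                self.conn.execute(
                    "INSERT INTO pool_slot (id, pool_id) VALUES (?, ?)",
                    (slot, pool_id),
                )
            self.conn.execute("COMMIT")
            return slot
        except Exception:
            self.conn.execute("ROLLBACK")
            raise

    def release_slot(self, slot):
        """Free a previously obtained slot."""
        self.conn.execute("BEGIN IMMEDIATE")
        self.conn.execute("DELETE FROM pool_slot WHERE id = ?", (slot,))
        self.conn.execute("COMMIT")


def make_connection(path=":memory:"):
    # isolation_level=None puts sqlite3 in autocommit mode, so the explicit
    # BEGIN/COMMIT statements above control each transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.executescript(SCHEMA)
    return conn
```

      Because the availability check and the claim share one transaction, a caller either gets a slot or gets None; there is no window in which two callers can both see the last free slot.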

            People

              Assignee: Unassigned
              Reporter: Bolke de Bruin (bolke)
              Votes: 6
              Watchers: 19
