Uploaded image for project: 'Apache Airflow'
  1. Apache Airflow
  2. AIRFLOW-7063

dag.clear() slowness caused by multiple UNION statements and tis.count()

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.10.9
    • Fix Version/s: 1.10.10
    • Component/s: webserver
    • Labels:
      None

      Description

      When multiple ExternalTaskMarker are used, dag.clear() becomes very slow when clearing all the ExternalTaskMarker together. The slowness turns out to come from this line of code in dag.clear():

              if dry_run:
                  tis = tis.all()
                  session.expunge_all()
                  return tis
      
              count = tis.count()   <------- This line is the culprit
              do_it = True
              if count == 0:
                  return 0
      

      This is the sql generated by tis.count() when there are three ExternalTaskMarker being cleared together. Note there's nothing wrong with the sql and it's reasonably efficient when executed on postgres even when the number of UNION statements is bigger (e.g. 30 UNION statements takes about 13ms in the docker container I started with breeze)
      But it takes more than three minutes for sqlalchemy to construct this count query before it goes to the database.

      The fix is really simple, just get rid of the count() and query all the entries from the db instead. The function becomes ten times faster when tis.count() is removed.

       There are multiple places people are complaining about similar problems with sqlalchemy count() being slower than the query itself. It does not look like sqlalchemy is going to fix this issue:
      https://stackoverflow.com/questions/14754994/why-is-sqlalchemy-count-much-slower-than-the-raw-query
      https://gist.github.com/hest/8798884

       

      [2020-03-14 09:42:50,264] {base.py:1203} INFO - SELECT count(*) AS count_1
      FROM (SELECT anon_2.anon_3_anon_4_task_instance_try_number AS anon_2_anon_3_anon_4_task_instance_try_number, anon_2.anon_3_anon_4_task_instance_task_id AS anon_2_anon_3_anon_4_task_instance_task_id, anon_2.anon_3_anon_4_task_instance_dag_id AS anon_2_anon_3_anon_4_task_instance_dag_id, anon_2.anon_3_anon_4_task_instance_execution_date AS anon_2_anon_3_anon_4_task_instance_execution_date, anon_2.anon_3_anon_4_task_instance_start_date AS anon_2_anon_3_anon_4_task_instance_start_date, anon_2.anon_3_anon_4_task_instance_end_date AS anon_2_anon_3_anon_4_task_instance_end_date, anon_2.anon_3_anon_4_task_instance_duration AS anon_2_anon_3_anon_4_task_instance_duration, anon_2.anon_3_anon_4_task_instance_state AS anon_2_anon_3_anon_4_task_instance_state, anon_2.anon_3_anon_4_task_instance_max_tries AS anon_2_anon_3_anon_4_task_instance_max_tries, anon_2.anon_3_anon_4_task_instance_hostname AS anon_2_anon_3_anon_4_task_instance_hostname, anon_2.anon_3_anon_4_task_instance_unixname AS anon_2_anon_3_anon_4_task_instance_unixname, anon_2.anon_3_anon_4_task_instance_job_id AS anon_2_anon_3_anon_4_task_instance_job_id, anon_2.anon_3_anon_4_task_instance_pool AS anon_2_anon_3_anon_4_task_instance_pool, anon_2.anon_3_anon_4_task_instance_pool_slots AS anon_2_anon_3_anon_4_task_instance_pool_slots, anon_2.anon_3_anon_4_task_instance_queue AS anon_2_anon_3_anon_4_task_instance_queue, anon_2.anon_3_anon_4_task_instance_priority_weight AS anon_2_anon_3_anon_4_task_instance_priority_weight, anon_2.anon_3_anon_4_task_instance_operator AS anon_2_anon_3_anon_4_task_instance_operator, anon_2.anon_3_anon_4_task_instance_queued_dttm AS anon_2_anon_3_anon_4_task_instance_queued_dttm, anon_2.anon_3_anon_4_task_instance_pid AS anon_2_anon_3_anon_4_task_instance_pid, anon_2.anon_3_anon_4_task_instance_executor_config AS anon_2_anon_3_anon_4_task_instance_executor_config
      FROM (SELECT anon_3.anon_4_task_instance_try_number AS anon_3_anon_4_task_instance_try_number, anon_3.anon_4_task_instance_task_id AS anon_3_anon_4_task_instance_task_id, anon_3.anon_4_task_instance_dag_id AS anon_3_anon_4_task_instance_dag_id, anon_3.anon_4_task_instance_execution_date AS anon_3_anon_4_task_instance_execution_date, anon_3.anon_4_task_instance_start_date AS anon_3_anon_4_task_instance_start_date, anon_3.anon_4_task_instance_end_date AS anon_3_anon_4_task_instance_end_date, anon_3.anon_4_task_instance_duration AS anon_3_anon_4_task_instance_duration, anon_3.anon_4_task_instance_state AS anon_3_anon_4_task_instance_state, anon_3.anon_4_task_instance_max_tries AS anon_3_anon_4_task_instance_max_tries, anon_3.anon_4_task_instance_hostname AS anon_3_anon_4_task_instance_hostname, anon_3.anon_4_task_instance_unixname AS anon_3_anon_4_task_instance_unixname, anon_3.anon_4_task_instance_job_id AS anon_3_anon_4_task_instance_job_id, anon_3.anon_4_task_instance_pool AS anon_3_anon_4_task_instance_pool, anon_3.anon_4_task_instance_pool_slots AS anon_3_anon_4_task_instance_pool_slots, anon_3.anon_4_task_instance_queue AS anon_3_anon_4_task_instance_queue, anon_3.anon_4_task_instance_priority_weight AS anon_3_anon_4_task_instance_priority_weight, anon_3.anon_4_task_instance_operator AS anon_3_anon_4_task_instance_operator, anon_3.anon_4_task_instance_queued_dttm AS anon_3_anon_4_task_instance_queued_dttm, anon_3.anon_4_task_instance_pid AS anon_3_anon_4_task_instance_pid, anon_3.anon_4_task_instance_executor_config AS anon_3_anon_4_task_instance_executor_config
      FROM (SELECT anon_4.task_instance_try_number AS anon_4_task_instance_try_number, anon_4.task_instance_task_id AS anon_4_task_instance_task_id, anon_4.task_instance_dag_id AS anon_4_task_instance_dag_id, anon_4.task_instance_execution_date AS anon_4_task_instance_execution_date, anon_4.task_instance_start_date AS anon_4_task_instance_start_date, anon_4.task_instance_end_date AS anon_4_task_instance_end_date, anon_4.task_instance_duration AS anon_4_task_instance_duration, anon_4.task_instance_state AS anon_4_task_instance_state, anon_4.task_instance_max_tries AS anon_4_task_instance_max_tries, anon_4.task_instance_hostname AS anon_4_task_instance_hostname, anon_4.task_instance_unixname AS anon_4_task_instance_unixname, anon_4.task_instance_job_id AS anon_4_task_instance_job_id, anon_4.task_instance_pool AS anon_4_task_instance_pool, anon_4.task_instance_pool_slots AS anon_4_task_instance_pool_slots, anon_4.task_instance_queue AS anon_4_task_instance_queue, anon_4.task_instance_priority_weight AS anon_4_task_instance_priority_weight, anon_4.task_instance_operator AS anon_4_task_instance_operator, anon_4.task_instance_queued_dttm AS anon_4_task_instance_queued_dttm, anon_4.task_instance_pid AS anon_4_task_instance_pid, anon_4.task_instance_executor_config AS anon_4_task_instance_executor_config
      FROM (SELECT task_instance.try_number AS task_instance_try_number, task_instance.task_id AS task_instance_task_id, task_instance.dag_id AS task_instance_dag_id, task_instance.execution_date AS task_instance_execution_date, task_instance.start_date AS task_instance_start_date, task_instance.end_date AS task_instance_end_date, task_instance.duration AS task_instance_duration, task_instance.state AS task_instance_state, task_instance.max_tries AS task_instance_max_tries, task_instance.hostname AS task_instance_hostname, task_instance.unixname AS task_instance_unixname, task_instance.job_id AS task_instance_job_id, task_instance.pool AS task_instance_pool, task_instance.pool_slots AS task_instance_pool_slots, task_instance.queue AS task_instance_queue, task_instance.priority_weight AS task_instance_priority_weight, task_instance.operator AS task_instance_operator, task_instance.queued_dttm AS task_instance_queued_dttm, task_instance.pid AS task_instance_pid, task_instance.executor_config AS task_instance_executor_config
      FROM task_instance
      WHERE task_instance.dag_id LIKE %(dag_id_1)s AND task_instance.task_id IN (%(task_id_1)s, %(task_id_2)s, %(task_id_3)s, %(task_id_4)s) AND task_instance.execution_date >= %(execution_date_1)s AND task_instance.execution_date <= %(execution_date_2)s UNION SELECT task_instance.try_number AS task_instance_try_number, task_instance.task_id AS task_instance_task_id, task_instance.dag_id AS task_instance_dag_id, task_instance.execution_date AS task_instance_execution_date, task_instance.start_date AS task_instance_start_date, task_instance.end_date AS task_instance_end_date, task_instance.duration AS task_instance_duration, task_instance.state AS task_instance_state, task_instance.max_tries AS task_instance_max_tries, task_instance.hostname AS task_instance_hostname, task_instance.unixname AS task_instance_unixname, task_instance.job_id AS task_instance_job_id, task_instance.pool AS task_instance_pool, task_instance.pool_slots AS task_instance_pool_slots, task_instance.queue AS task_instance_queue, task_instance.priority_weight AS task_instance_priority_weight, task_instance.operator AS task_instance_operator, task_instance.queued_dttm AS task_instance_queued_dttm, task_instance.pid AS task_instance_pid, task_instance.executor_config AS task_instance_executor_config
      FROM task_instance
      WHERE task_instance.dag_id LIKE %(dag_id_2)s AND task_instance.task_id IN (%(task_id_5)s) AND task_instance.execution_date >= %(execution_date_3)s AND task_instance.execution_date <= %(execution_date_4)s) AS anon_4 UNION SELECT task_instance.try_number AS task_instance_try_number, task_instance.task_id AS task_instance_task_id, task_instance.dag_id AS task_instance_dag_id, task_instance.execution_date AS task_instance_execution_date, task_instance.start_date AS task_instance_start_date, task_instance.end_date AS task_instance_end_date, task_instance.duration AS task_instance_duration, task_instance.state AS task_instance_state, task_instance.max_tries AS task_instance_max_tries, task_instance.hostname AS task_instance_hostname, task_instance.unixname AS task_instance_unixname, task_instance.job_id AS task_instance_job_id, task_instance.pool AS task_instance_pool, task_instance.pool_slots AS task_instance_pool_slots, task_instance.queue AS task_instance_queue, task_instance.priority_weight AS task_instance_priority_weight, task_instance.operator AS task_instance_operator, task_instance.queued_dttm AS task_instance_queued_dttm, task_instance.pid AS task_instance_pid, task_instance.executor_config AS task_instance_executor_config
      FROM task_instance
      WHERE task_instance.dag_id LIKE %(dag_id_3)s AND task_instance.task_id IN (%(task_id_6)s) AND task_instance.execution_date >= %(execution_date_5)s AND task_instance.execution_date <= %(execution_date_6)s) AS anon_3 UNION SELECT task_instance.try_number AS task_instance_try_number, task_instance.task_id AS task_instance_task_id, task_instance.dag_id AS task_instance_dag_id, task_instance.execution_date AS task_instance_execution_date, task_instance.start_date AS task_instance_start_date, task_instance.end_date AS task_instance_end_date, task_instance.duration AS task_instance_duration, task_instance.state AS task_instance_state, task_instance.max_tries AS task_instance_max_tries, task_instance.hostname AS task_instance_hostname, task_instance.unixname AS task_instance_unixname, task_instance.job_id AS task_instance_job_id, task_instance.pool AS task_instance_pool, task_instance.pool_slots AS task_instance_pool_slots, task_instance.queue AS task_instance_queue, task_instance.priority_weight AS task_instance_priority_weight, task_instance.operator AS task_instance_operator, task_instance.queued_dttm AS task_instance_queued_dttm, task_instance.pid AS task_instance_pid, task_instance.executor_config AS task_instance_executor_config
      FROM task_instance
      WHERE task_instance.dag_id LIKE %(dag_id_4)s AND task_instance.task_id IN (%(task_id_7)s) AND task_instance.execution_date >= %(execution_date_7)s AND task_instance.execution_date <= %(execution_date_8)s) AS anon_2) AS anon_1
      [2020-03-14 09:42:50,265] {base.py:1208} INFO - "\x1b[1m{'dag_id_1': 'agg_dag', 'task_id_1': 'start', 'task_id_2': 'daily_tas_0', 'task_id_3': 'daily_tas_1', 'task_id_4': 'daily_tas_2', 'execution_date_1': datetime.datetime(2015, 1, 4, 0, 0, tzinfo=pendulum.timezone("UTC")), 'execution_date_2': datetime.datetime(2015, 1, 4, 0, 0, tzinfo=pendulum.timezone("UTC")), 'dag_id_2': 'daily_dag', 'task_id_5': 'daily_tas', 'execution_date_3': datetime.datetime(2015, 1, 3, 0, 0, tzinfo=pendulum.timezone("UTC")), 'execution_date_4': datetime.datetime(2015, 1, 3, 0, 0, tzinfo=pendulum.timezone("UTC")), 'dag_id_3': 'daily_dag', 'task_id_6': 'daily_tas', 'execution_date_5': datetime.datetime(2015, 1, 2, 0, 0, tzinfo=pendulum.timezone("UTC")), 'execution_date_6': datetime.datetime(2015, 1, 2, 0, 0, tzinfo=pendulum.timezone("UTC")), 'dag_id_4': 'daily_dag', 'task_id_7': 'daily_tas', 'execution_date_7': datetime.datetime(2015, 1, 4, 0, 0, tzinfo=pendulum.timezone("UTC")), 'execution_date_8': datetime.datetime(2015, 1, 4, 0, 0, tzinfo=pendulum.timezone("UTC"))}\x1b[0m"
      

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                yuqian90 Qian Yu
                Reporter:
                yuqian90 Qian Yu
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: