mysql rds metastore - db.m5.large instance, 5.7.26 version
task_instance table has 2,848,160 rows
dag_run table has 22768 rows
dag table has 23 rows
log table has 17,916,891 rows
airflow 1.10.3, using LocalExecutor, python 2.7, single ec2 m5.4xlarge, parallelism set to 45 (ie max 45 tasks at once). Just using externally triggered dags, no SLAs. No subdags/backfills. 4 gunicorn workers. Using dynamic dags
Everything was fine until yesterday, around 300 dag runs every day. Now today these issues appear all of a sudden (no code change, environment change.etc). I suspect the task_instance table has gotten too big and causing scheduler and mysql issues.
1.
'Recent tasks' are showing blank on the web ui home page. admin/airflow/task_stats fails to display with 504 error after few mins but dag_stats endpoint shows dags are in running state
2.
dag_runs are stuck in running state > 20 hrs, seems no new tasks are being run (they are stuck in scheduled/queued state)
I then tried terminating the EC2 and getting a new one, the dagruns and tasks would then start finishing but then after few hours got into same situation as points 1/2 above. I believe certain dag ids (with many tasks) are hitting the issue, will know m
You can see notable change in the graphs just after 4th november midnight (that is when the issue started). Around 30 dagruns (yes there are diff execution_dates running for same dagid at same time) start around 11pm each night.
Scheduler/webserver pids have remained up the entire time, no ec2 autoheals happened
scheduler log shows:
[2019-11-07 12:25:18,287] {jobs.py:1185} INFO - DAG daga has 29/45 running and queued tasks
[2019-11-07 12:25:18,287] {jobs.py:1185} INFO - DAG daga has 30/45 running and queued tasks
[2019-11-07 12:25:18,287] {jobs.py:1185} INFO - DAG daga has 31/45 running and queued tasks
[2019-11-07 12:25:18,287] {jobs.py:1185} INFO - DAG daga has 32/45 running and queued tasks
[2019-11-07 12:25:18,287] {jobs.py:1185} INFO - DAG daga has 33/45 running and queued tasks
[2019-11-07 12:25:18,287] {jobs.py:1185} INFO - DAG daga has 34/45 running and queued tasks
[2019-11-07 12:25:18,287] {jobs.py:1185} INFO - DAG daga has 35/45 running and queued tasks
[2019-11-07 12:25:21,435] {jobs.py:1185} INFO - DAG dage has 0/45 running and queued tasks
[2019-11-07 12:25:21,435] {jobs.py:1185} INFO - DAG dage has 0/45 running and queued tasks
[2019-11-07 12:25:21,435] {jobs.py:1185} INFO - DAG daga has 36/45 running and queued tasks
[2019-11-07 12:25:21,435] {jobs.py:1185} INFO - DAG daga has 36/45 running and queued tasks
[2019-11-07 12:25:21,435] {jobs.py:1185} INFO - DAG daga has 36/45 running and queued tasks
[2019-11-07 12:25:21,435] {jobs.py:1185} INFO - DAG daga has 36/45 running and queued tasks
[2019-11-07 12:25:21,435] {jobs.py:1185} INFO - DAG daga has 36/45 running and queued tasks
[2019-11-07 12:25:21,435] {jobs.py:1185} INFO - DAG daga has 36/45 running and queued tasks
[2019-11-07 12:25:21,435] {jobs.py:1185} INFO - DAG daga has 36/45 running and queued tasks
[2019-11-07 12:25:21,435] {jobs.py:1185} INFO - DAG daga has 36/45 running and queued tasks
[2019-11-07 12:25:21,435] {jobs.py:1185} INFO - DAG daga has 36/45 running and queued tasks
[2019-11-07 12:25:21,435] {jobs.py:1185} INFO - DAG daga has 36/45 running and queued tasks
[2019-11-07 12:25:21,436] {jobs.py:1185} INFO - DAG daga has 36/45 running and queued tasks
[2019-11-07 12:25:21,436] {jobs.py:1185} INFO - DAG daga has 36/45 running and queued tasks
[2019-11-07 12:25:21,436] {jobs.py:1185} INFO - DAG daga has 37/45 running and queued tasks
[2019-11-07 12:25:21,436] {jobs.py:1185} INFO - DAG daga has 38/45 running and queued tasks
[2019-11-07 12:25:21,436] {jobs.py:1185} INFO - DAG daga has 39/45 running and queued tasks
[2019-11-07 12:25:21,436] {jobs.py:1185} INFO - DAG daga has 40/45 running and queued tasks
[2019-11-07 12:25:21,436] {jobs.py:1185} INFO - DAG daga has 41/45 running and queued tasks
[2019-11-07 12:25:21,436] {jobs.py:1185} INFO - DAG daga has 42/45 running and queued tasks
[2019-11-07 12:25:21,436] {jobs.py:1185} INFO - DAG daga has 43/45 running and queued tasks
[2019-11-07 12:25:21,436] {jobs.py:1185} INFO - DAG daga has 44/45 running and queued tasks
[2019-11-07 12:25:21,436] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:21,436] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:21,436] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:21,437] {jobs.py:1185} INFO - DAG dage has 0/45 running and queued tasks
[2019-11-07 12:25:42,193] {jobs.py:1185} INFO - DAG dagb has 0/45 running and queued tasks
[2019-11-07 12:25:42,194] {jobs.py:1185} INFO - DAG dagb has 0/45 running and queued tasks
[2019-11-07 12:25:42,194] {jobs.py:1185} INFO - DAG dagb has 0/45 running and queued tasks
[2019-11-07 12:25:42,194] {jobs.py:1185} INFO - DAG dagb has 0/45 running and queued tasks
[2019-11-07 12:25:42,194] {jobs.py:1185} INFO - DAG dagb has 0/45 running and queued tasks
[2019-11-07 12:25:42,194] {jobs.py:1185} INFO - DAG dagb has 0/45 running and queued tasks
[2019-11-07 12:25:42,194] {jobs.py:1185} INFO - DAG dagb has 0/45 running and queued tasks
[2019-11-07 12:25:42,194] {jobs.py:1185} INFO - DAG dagb has 0/45 running and queued tasks
[2019-11-07 12:25:42,194] {jobs.py:1185} INFO - DAG dagb has 0/45 running and queued tasks
[2019-11-07 12:25:42,194] {jobs.py:1185} INFO - DAG dagb has 0/45 running and queued tasks
[2019-11-07 12:25:49,493] {jobs.py:1185} INFO - DAG dagc has 0/45 running and queued tasks
[2019-11-07 12:25:49,493] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,494] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,494] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,494] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,494] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,494] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,494] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,494] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,494] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,495] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,495] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,495] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,495] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,495] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,495] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,495] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,495] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,495] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,496] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,496] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,496] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,496] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,496] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,496] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,496] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,496] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-07 12:25:49,496] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-06 11:53:08,813] {jobs.py:1185} INFO - DAG daga has 45/45 running and queued tasks
[2019-11-06 11:53:08,813] {jobs.py:1191} INFO - Not executing <TaskInstance: task1 2019-11-03 00:00:00+00:00 [scheduled]> since the number of tasks running or queued from DAG daga is >= to the DAG's task concurrency limit of 45
select state,count(1) from airflowdb.task_instance --not mysql syntax (querying via Presto)
where dag_id = 'daga' and execution_date in ( timestamp'2019-11-03 00:00:00',timestamp'2019-11-04 00:00:00',timestamp'2019-11-06 00:00:00',timestamp'2019-11-07 00:00:00') --these are the dates in the dag_run table with state of running
group by state
success 871
failed 34
upstream_failed 68
skipped 568
scheduled 486
<null> 573