Apache Airflow / AIRFLOW-2145

Deadlock after clearing a running task


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.9.0
    • Fix Version/s: 1.10.1
    • Component/s: None
    • Labels: None

      Description

      TL;DR The essence of the issue is that whenever a currently running task is cleared, the dagrun enters a deadlocked state and fails.

       

      We see this in production with Celery executors and TimeDeltaSensor, and I've been able to reproduce it locally with both TimeDeltaSensor and WebHDFSSensor.

      Here's the minimal example:

      from datetime import datetime, timedelta
      
      import airflow
      from airflow.operators.sensors import TimeDeltaSensor
      from airflow.operators.dummy_operator import DummyOperator
      
      with airflow.DAG(
          'foo',
          schedule_interval='@daily',
          start_date=datetime(2018, 1, 1)) as dag:
      
          wait_for_upstream_sla = TimeDeltaSensor(
              task_id="wait_for_upstream_sla",
              delta=timedelta(days=365*10)
          )
      
          do_work = DummyOperator(task_id='do_work')
      
          dag >> wait_for_upstream_sla >> do_work
      

       

      Sequence of actions, relevant DEBUG-level logs, and UI screenshots (attached)

      airflow clear foo -e 2018-02-22 --no_confirm && airflow backfill foo -s 2018-02-22 -e 2018-02-22
      [2018-02-23 17:17:45,983] {__init__.py:45} INFO - Using executor SequentialExecutor
      [2018-02-23 17:17:46,069] {models.py:189} INFO - Filling up the DagBag from /Users/grol/Drive/dev/reporting/dags
      ...
      [2018-02-23 17:17:47,563] {jobs.py:2180} DEBUG - Task instance to run <TaskInstance: foo.wait_for_upstream_sla 2018-02-22 00:00:00 [scheduled]> state scheduled
      ...
      
      

      Now we clear all of the DAG's tasks externally:

      airflow clear foo -e 2018-02-22 --no_confirm
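
      As a hedged aside (my reading of the "externally set to shutdown" warning in the logs below, not Airflow's actual code): clearing a task instance that is currently running does not simply reset it to None; it is moved to a shutdown state, which the running job notices on its next heartbeat and exits. A minimal, self-contained sketch of that state transition, using a hypothetical helper name:

      # Illustrative sketch only, not Airflow's implementation. It mirrors the
      # behaviour suggested by the "externally set to shutdown" log line below:
      # a RUNNING task instance is not reset to None but moved to SHUTDOWN, so
      # the running LocalTaskJob can take the "poison pill" on its next
      # heartbeat, while non-running instances are reset to None.
      RUNNING = "running"
      SHUTDOWN = "shutdown"

      def cleared_state(current_state):
          """Hypothetical helper: the state a task instance ends up in after a clear."""
          if current_state == RUNNING:
              return SHUTDOWN
          return None

      assert cleared_state(RUNNING) == SHUTDOWN     # wait_for_upstream_sla
      assert cleared_state("scheduled") is None     # do_work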
      

      Running the clear command against the still-running sensor causes the following (a sketch of the resulting deadlock evaluation follows the logs):

      [2018-02-23 17:17:55,258] {base_task_runner.py:98} INFO - Subtask: [2018-02-23 17:17:55,258] {sensors.py:629} INFO - Checking if the time (2018-02-23 16:19:00) has come
      [2018-02-23 17:17:58,844] {jobs.py:184} DEBUG - [heart] Boom.
      [2018-02-23 17:18:03,848] {jobs.py:184} DEBUG - [heart] Boom.
      [2018-02-23 17:18:08,856] {jobs.py:2585} WARNING - State of this instance has been externally set to shutdown. Taking the poison pill.
      [2018-02-23 17:18:08,874] {helpers.py:266} DEBUG - There are no descendant processes to kill
      [2018-02-23 17:18:08,875] {jobs.py:184} DEBUG - [heart] Boom.
      [2018-02-23 17:18:08,900] {helpers.py:266} DEBUG - There are no descendant processes to kill
      [2018-02-23 17:18:08,922] {helpers.py:266} DEBUG - There are no descendant processes to kill
      [2018-02-23 17:18:09,005] {sequential_executor.py:47} ERROR - Failed to execute task Command 'airflow run foo wait_for_upstream_sla 2018-02-22T00:00:00 --local -sd DAGS_FOLDER/foo.py' returned non-zero exit status 1.
      [2018-02-23 17:18:09,012] {jobs.py:2004} DEBUG - Executor state: failed task <TaskInstance: foo.wait_for_upstream_sla 2018-02-22 00:00:00 [shutdown]>
      [2018-02-23 17:18:09,018] {models.py:4584} INFO - Updating state for <DagRun foo @ 2018-02-22 00:00:00: backfill_2018-02-22T00:00:00, externally triggered: False> considering 2 task(s)
      [2018-02-23 17:18:09,021] {models.py:1215} DEBUG - <TaskInstance: foo.do_work 2018-02-22 00:00:00 [None]> dependency 'Previous Dagrun State' PASSED: True, The task did not have depends_on_past set.
      [2018-02-23 17:18:09,021] {models.py:1215} DEBUG - <TaskInstance: foo.do_work 2018-02-22 00:00:00 [None]> dependency 'Not In Retry Period' PASSED: True, The context specified that being in a retry period was permitted.
      [2018-02-23 17:18:09,027] {models.py:1215} DEBUG - <TaskInstance: foo.do_work 2018-02-22 00:00:00 [None]> dependency 'Trigger Rule' PASSED: False, Task's trigger rule 'all_success' requires all upstream tasks to have succeeded, but found 1 non-success(es). upstream_tasks_state={'skipped': 0, 'successes': 0, 'failed': 0, 'upstream_failed': 0, 'done': 0, 'total': 1}, upstream_task_ids=['wait_for_upstream_sla']
      [2018-02-23 17:18:09,029] {models.py:4643} INFO - Deadlock; marking run <DagRun foo @ 2018-02-22 00:00:00: backfill_2018-02-22T00:00:00, externally triggered: False> failed
      [2018-02-23 17:18:09,045] {jobs.py:2125} INFO - [backfill progress] | finished run 1 of 1 | tasks waiting: 1 | succeeded: 0 | kicked_off: 1 | failed: 0 | skipped: 0 | deadlocked: 0 | not ready: 1
      [2018-02-23 17:18:09,045] {jobs.py:2129} DEBUG - Finished dag run loop iteration. Remaining tasks [<TaskInstance: foo.do_work 2018-02-22 00:00:00 [scheduled]>]
      [2018-02-23 17:18:09,045] {jobs.py:2160} DEBUG - *** Clearing out not_ready list ***
      [2018-02-23 17:18:09,048] {jobs.py:2180} DEBUG - Task instance to run <TaskInstance: foo.do_work 2018-02-22 00:00:00 [None]> state None
      [2018-02-23 17:18:09,049] {jobs.py:2186} WARNING - FIXME: task instance {} state was set to None externally. This should not happen
      [2018-02-23 17:18:09,053] {models.py:1215} DEBUG - <TaskInstance: foo.do_work 2018-02-22 00:00:00 [scheduled]> dependency 'Task Instance State' PASSED: True, Task state scheduled was valid.
      [2018-02-23 17:18:09,053] {models.py:1215} DEBUG - <TaskInstance: foo.do_work 2018-02-22 00:00:00 [scheduled]> dependency 'Previous Dagrun State' PASSED: True, The task did not have depends_on_past set.
      [2018-02-23 17:18:09,056] {models.py:1215} DEBUG - <TaskInstance: foo.do_work 2018-02-22 00:00:00 [scheduled]> dependency 'Task Concurrency' PASSED: True, Task concurrency is not set.
      [2018-02-23 17:18:09,056] {models.py:1215} DEBUG - <TaskInstance: foo.do_work 2018-02-22 00:00:00 [scheduled]> dependency 'Not In Retry Period' PASSED: True, The task instance was not marked for retrying.
      [2018-02-23 17:18:09,061] {models.py:1215} DEBUG - <TaskInstance: foo.do_work 2018-02-22 00:00:00 [scheduled]> dependency 'Trigger Rule' PASSED: False, Task's trigger rule 'all_success' requires all upstream tasks to have succeeded, but found 1 non-success(es). upstream_tasks_state={'skipped': 0, 'successes': 0, 'failed': 0, 'upstream_failed': 0, 'done': 0, 'total': 1}, upstream_task_ids=['wait_for_upstream_sla']
      [2018-02-23 17:18:09,061] {models.py:1190} INFO - Dependencies not met for <TaskInstance: foo.do_work 2018-02-22 00:00:00 [scheduled]>, dependency 'Trigger Rule' FAILED: Task's trigger rule 'all_success' requires all upstream tasks to have succeeded, but found 1 non-success(es). upstream_tasks_state={'skipped': 0, 'successes': 0, 'failed': 0, 'upstream_failed': 0, 'done': 0, 'total': 1}, upstream_task_ids=['wait_for_upstream_sla']
      [2018-02-23 17:18:09,061] {jobs.py:2274} DEBUG - Adding <TaskInstance: foo.do_work 2018-02-22 00:00:00 [scheduled]> to not_ready
      [2018-02-23 17:18:09,067] {jobs.py:184} DEBUG - [heart] Boom.
      

      Interestingly, once the success condition of the TimeDeltaSensor is eventually met, in production we see the following final state in the UI (see the attached screenshots): the DAG run is marked failed, while the TimeDeltaSensor task is marked succeeded, even though there's no evidence of success in the Celery executor logs.
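
      The "Deadlock; marking run ... failed" line (models.py:4643) is the DagRun state update deciding that the run can no longer make progress. A rough, self-contained sketch of that decision as I read the logs above (an interpretation, not Airflow's actual code): the run is treated as deadlocked when unfinished task instances remain, none of them has its dependencies met, and none is running. After the clear, wait_for_upstream_sla sits in shutdown, so do_work's all_success trigger rule reports done: 0 of total: 1 and nothing can proceed:

      # Rough sketch of the deadlock check implied by the logs above (an
      # interpretation, not Airflow's code): a DagRun is deadlocked when
      # unfinished tasks remain, none of them can run, and none is running.
      SUCCESS, FAILED, RUNNING, SHUTDOWN, SCHEDULED = (
          "success", "failed", "running", "shutdown", "scheduled")
      FINISHED = {SUCCESS, FAILED, "skipped", "upstream_failed"}

      def dagrun_is_deadlocked(task_states, deps_met):
          """task_states: {task_id: state}; deps_met: {task_id: bool}."""
          unfinished = [t for t, s in task_states.items() if s not in FINISHED]
          if not unfinished:
              return False
          no_deps_met = not any(deps_met[t] for t in unfinished)
          none_running = all(task_states[t] != RUNNING for t in unfinished)
          return no_deps_met and none_running

      # After the external clear: the sensor is in SHUTDOWN, so do_work's
      # all_success trigger rule fails (done: 0, total: 1) and nothing is
      # running -> the run is marked failed as deadlocked.
      states = {"wait_for_upstream_sla": SHUTDOWN, "do_work": SCHEDULED}
      deps = {"wait_for_upstream_sla": False, "do_work": False}
      assert dagrun_is_deadlocked(states, deps)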

       

        Attachments

        1. image-2018-02-23-18-59-11-828.png (24 kB, George Roldugin)
        2. image-2018-02-23-19-00-37-741.png (24 kB, George Roldugin)
        3. image-2018-02-23-19-00-55-630.png (24 kB, George Roldugin)
        4. image-2018-02-23-19-01-45-012.png (24 kB, George Roldugin)
        5. image-2018-02-23-19-01-57-498.png (25 kB, George Roldugin)
        6. image-2018-02-23-19-02-18-837.png (38 kB, George Roldugin)


              People

              • Assignee: Unassigned
              • Reporter: George Roldugin
              • Votes: 1
              • Watchers: 5
