Apache Airflow / AIRFLOW-6776

Celery Executor Creates Zombie Tasks on Worker Death



    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.10.7, 1.10.9
    • Fix Version/s: None
    • Component/s: celery, scheduler, worker
    • Labels:
    • Environment:
      Dockerized CeleryExecutor (ElastiCache Redis) on AWS EC2 instances.


      UPDATE: Since originally reporting this, we have moved off our custom fork and are running apache-airflow 1.10.7. The problem persists in 1.10.7 and likely through 1.10.9.


      When a Celery worker dies unexpectedly, the scheduler is never notified of the final state of that worker's tasks. If a worker dies while tasks are running, those tasks eventually time out and are rescheduled on another available worker. However, they are never removed from the executor's running set, so their state never changes (which can block anything else from being scheduled, depending on the parallelism configuration). Restarting the airflow scheduler clears the running set, but that is not a long-term solution.
      We believe this issue persists on the latest version of airflow (CeleryExecutor.sync() is still written to react silently to unexpected states).
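The leak described above can be illustrated with a minimal, self-contained sketch (this is not actual Airflow code; `FakeBackend` and `SimpleCeleryExecutor` are illustrative stand-ins): a sync loop that only clears a task on a terminal result-backend state, and silently ignores everything else, never releases a task whose worker died.

```python
# Simplified model of the zombie-task leak. A task whose worker died
# never reaches a terminal state in the result backend, so sync()
# never removes it from the executor's running set.

TERMINAL = {"SUCCESS", "FAILURE"}

class FakeBackend:
    """Stands in for the Celery result backend (e.g. Redis)."""
    def __init__(self):
        self.states = {}

    def state(self, task_id):
        # A task whose worker died stays PENDING forever.
        return self.states.get(task_id, "PENDING")

class SimpleCeleryExecutor:
    def __init__(self, backend):
        self.backend = backend
        self.running = set()  # task keys the scheduler believes are in flight

    def queue(self, task_id):
        self.running.add(task_id)

    def sync(self):
        for task_id in list(self.running):
            if self.backend.state(task_id) in TERMINAL:
                self.running.discard(task_id)
            # else: silently ignore -- the bug. A non-terminal state is
            # assumed to mean "still running", even if the worker is gone.

backend = FakeBackend()
executor = SimpleCeleryExecutor(backend)
executor.queue("task_a")
executor.queue("task_b")

backend.states["task_a"] = "SUCCESS"  # task_a finishes normally
# task_b's worker is killed: its backend state never leaves PENDING

executor.sync()
print(executor.running)  # task_b stays in the running set, leaking a slot
```

With parallelism=9, nine such leaked entries are enough to stop the executor from scheduling anything at all, which matches the behavior we observed.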
      Steps to reproduce:
      Start a long-running dag and kill the container running airflow worker while tasks are running.
      The attached screenshot shows metrics gathered during the creation of a zombie. We started a dag with many long-running tasks and killed the worker between 15:25 and 15:30 (one worker, parallelism=9).


        Attachments: Zombie_Metrics.png (67 kB, Kimberly Orr)



            • Assignee: Ash Berlin-Taylor
            • Reporter: Kimberly Orr
            • Votes: 2
            • Watchers: 2


              • Created: