UPDATE: Since originally reporting this, we have moved off our custom fork and are running apache-airflow 1.10.7. The problem persists on 1.10.7 and likely through 1.10.9.
When a Celery worker dies unexpectedly, the scheduler never updates the state of the tasks that worker was running. If a worker dies mid-execution, its tasks eventually time out and are rescheduled on another available worker. However, they are never removed from the executor's running state, so their recorded state never changes (which can block anything else from being scheduled, depending on the parallelism configuration). Restarting the airflow scheduler resets the running state, but that is not a long-term solution.
We believe this issue persists on the latest version of Airflow: CeleryExecutor.sync() is still written to silently ignore unexpected task states.
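The failure mode described above can be illustrated with a minimal sketch. This is not actual Airflow code; `ToyExecutor`, its state names, and its `running` dict are simplifications invented for illustration. The point is the shape of the bug: an executor that removes tasks from its running set only on recognized terminal states will leak any task stuck in an unrecognized state, such as one whose worker died before reporting a result.

```python
# Hypothetical, simplified executor illustrating the bug pattern.
# A task whose state is neither SUCCESS nor FAILURE is silently skipped
# by sync(), so it stays in `running` forever and permanently consumes
# a parallelism slot.

SUCCESS, FAILURE, UNKNOWN = "SUCCESS", "FAILURE", "UNKNOWN"

class ToyExecutor:
    def __init__(self, parallelism):
        self.parallelism = parallelism
        self.running = {}  # task key -> last observed state

    def open_slots(self):
        return self.parallelism - len(self.running)

    def sync(self):
        for key, state in list(self.running.items()):
            if state in (SUCCESS, FAILURE):
                del self.running[key]
            # Any other state (e.g. UNKNOWN after a worker death) falls
            # through silently: the task is never cleared from `running`.

executor = ToyExecutor(parallelism=9)
executor.running = {"task_a": SUCCESS, "task_b": UNKNOWN}
executor.sync()
# task_a is cleared, but task_b leaks: one slot is lost to the zombie.
print(executor.open_slots())  # 8, not 9
```

Each worker death leaks more slots this way, until open slots reach zero and nothing can be scheduled, which matches the behavior we observed with parallelism=9 below.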
Steps to reproduce:
Start a long-running DAG and, while tasks are running, kill the container running the airflow worker.
The attached screenshot shows metrics gathered during the creation of a zombie. We started a DAG with many long-running tasks and killed the worker between 15:25 and 15:30 (one worker, parallelism=9).