After not monitoring Airflow for a while, I noticed that tasks had not been running for several days.
My setup: scheduler and webserver running in one pod, using the KubernetesExecutor. Four DAGs, none of them very large: one running once per day, two every 30 minutes, and one every 2 minutes.
Airflow had log messages such as these:
... and a bit further down:
In the Kubernetes cluster, there were no pods created by Airflow (they'd all finished and been deleted).
After digging into the logs around the time the jobs stopped progressing, I noticed that at that point the KubernetesJobWatcher stopped logging the state changes of pods, even though I could still see log messages for new pods being created.
It's hard to tell why this happened: if the subprocess running the job watcher had died, the heartbeat should have detected it; if the Watch had thrown an exception, there should have been logs (there weren't) and the watcher should have restarted.
I have a few theories as to what might have happened:
- The Watch hung indefinitely - although I can't see any issues against the Kubernetes python client that suggest other people have had this issue
- The KubernetesJobWatcher died, but the heartbeat was not functioning correctly
- The Watcher experienced a large gap between watch requests, meaning some relevant events were "lost", leaving the respective tasks in the "running" state
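To make the first and third theories concrete, below is a minimal sketch of a watch loop that bounds each request with `timeout_seconds` (so a silently hung connection can never block forever) and resumes from the last seen `resourceVersion` (so a gap between requests doesn't drop events). The namespace, label selector, and function names here are illustrative assumptions, not Airflow's actual watcher code:

```python
def extract_resource_version(event):
    """Return the resourceVersion carried by a watch event, used to resume
    the next watch request from where the previous one left off."""
    return event["object"].metadata.resource_version

def watch_pods(namespace="airflow", label_selector="airflow-worker"):
    # Imported inside the function so the helper above is usable without
    # the kubernetes package installed; names below are from kubernetes
    # client 9.0.0 (client, config, watch).
    from kubernetes import client, config, watch

    config.load_incluster_config()
    v1 = client.CoreV1Api()
    resource_version = None
    while True:
        w = watch.Watch()
        kwargs = {
            "namespace": namespace,
            "label_selector": label_selector,
            # Ask the API server to close the stream after 60s, so a hung
            # connection is bounded and the loop re-issues the watch.
            "timeout_seconds": 60,
        }
        if resource_version:
            # Resume from the last observed event instead of re-listing,
            # so events between requests are not silently lost.
            kwargs["resource_version"] = resource_version
        for event in w.stream(v1.list_namespaced_pod, **kwargs):
            resource_version = extract_resource_version(event)
            print(event["type"], event["object"].metadata.name)
```

If the real watcher's `Watch.stream` call has no timeout at all, theory one (an indefinite hang) would produce exactly the symptom I saw: no exception, no log, no restart.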
Unfortunately I don't have the answers, so I'm posting this in the hope that someone has some additional insight.
As a side note: I'm using Kubernetes client version 9.0.0.
My only suggestion for a fix is to periodically check which pods are actually running, reconcile that against the "running" queue in the executor, and maybe force-restart the job watcher if the state has diverged.
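The core of that reconciliation is a simple set comparison. Here is a sketch of what I mean; the function name and the idea of comparing by pod name are my assumptions, not anything Airflow provides:

```python
def find_divergence(executor_running, cluster_pods):
    """Compare the pod names the executor believes are running against the
    pods actually present in the cluster.

    executor_running: set of pod names in the executor's "running" queue
    cluster_pods:     set of worker pod names actually observed via the API

    Returns (missing, orphans):
      missing - executor thinks these are running, but no pod exists
                (the stuck-task symptom described above)
      orphans - pods exist, but the executor has no record of them
    """
    missing = executor_running - cluster_pods
    orphans = cluster_pods - executor_running
    return missing, orphans
```

If `missing` is non-empty for longer than some grace period, the watcher has likely lost events and could be force-restarted (or the tasks failed/rescheduled).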