We recently ran into an issue with our Airflow instance that caused the scheduler to enter what looks like a deadlocked state in the middle of operation. In this state, all DAG runs were listed as 'scheduled' and nothing at all appeared to be happening.
Initially, I thought this might be an issue with our configuration, but I couldn't work out why it wouldn't have arisen earlier. Looking at the logs, I've also been seeing some strange behavior that I can't explain.
The most notable thing is that the Executor Class listed under all of our jobs is now 'NoneType', where it was previously 'LocalExecutor'. According to our logs, this change first appeared when we updated our instance, two days before the first deadlock. However, I have since cleared the database altogether and find that, even starting from scratch, 'NoneType' still appears.
In these same logs, I can see jobs continuously running for this DAG run, but their start and end times are less than a second apart. At the same time, all task instances are listed as either 'success' or 'scheduled', so I'm not sure what the running jobs actually are.
If I look in the Task Instance Details for any of these scheduled tasks, I see
Looking at the Airflow scheduler logs, nothing seems awry.
So to summarize, the scheduler seems to be doing its job, as DAG runs are properly scheduled and set as 'running'; however, the task instances themselves are not completing. Because the jobs list 'NoneType' instead of 'LocalExecutor', my theory is that there is some issue with the LocalExecutor that's causing it not to execute jobs properly. Again, clearing the database didn't help, and I now run into this deadlock almost immediately with a test DAG I'm running.
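To help rule out configuration as the cause, here is a minimal sketch of how the configured executor could be checked outside of Airflow itself. It assumes the usual conventions: the setting lives under `[core]` as `executor` in `airflow.cfg`, the file sits in `AIRFLOW_HOME` (defaulting to `~/airflow`), and the `AIRFLOW__CORE__EXECUTOR` environment variable overrides the file. The function name `configured_executor` is purely illustrative and not part of Airflow.

```python
import configparser
import os
from pathlib import Path


def configured_executor(environ=None, cfg_path=None):
    """Return the executor name found in the environment or airflow.cfg, or None.

    Hypothetical helper for illustration; it only reads settings and does not
    import or call Airflow.
    """
    environ = os.environ if environ is None else environ

    # Assumption: AIRFLOW__{SECTION}__{KEY} variables take precedence over the file.
    from_env = environ.get("AIRFLOW__CORE__EXECUTOR")
    if from_env:
        return from_env

    if cfg_path is None:
        # Assumption: airflow.cfg lives in AIRFLOW_HOME, which defaults to ~/airflow.
        home = environ.get("AIRFLOW_HOME", str(Path.home() / "airflow"))
        cfg_path = Path(home) / "airflow.cfg"

    # Disable interpolation because values such as log formats can contain '%'.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(cfg_path)  # a missing file is ignored and leaves the parser empty
    return parser.get("core", "executor", fallback=None)


if __name__ == "__main__":
    print(configured_executor())
```

If this prints `LocalExecutor` while the jobs still show 'NoneType', that would suggest the setting itself is fine and the problem lies in how the executor is picked up or recorded at runtime.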
If I can provide any additional information, please let me know. I'd love to get this resolved or figured out, as we're currently unable to use Airflow because of this.