Description
The scheduler may experience a "livelock" situation while starting up: async event handlers running on a ThreadPoolExecutor block while waiting on other events that have not yet been executed. If enough of these blocking handlers run simultaneously to occupy every worker thread, no further event processing occurs and the scheduler stalls.
More specifically, this section of TaskGroups is affected:
CompletableFuture<Set<String>> result =
    batchWorker.execute(storeProvider -> taskScheduler.schedule(storeProvider, taskIds));
Set<String> scheduled = null;
try {
  scheduled = result.get();
batchWorker#execute submits to a queue that is not processed until a SchedulerActive event is fired within the scheduler. SchedulerActive is sent via an AsyncEventBus, the same bus that triggers the above code from TaskGroups, so the SchedulerActive handler needs a worker thread from the pool whose threads are blocked in result.get(). Therefore, the following sequence of events will cause a livelock:
TaskStateChange=pending
TaskStateChange=pending
TaskStateChange=pending
TaskStateChange=pending
TaskStateChange=pending
TaskStateChange=pending
TaskStateChange=pending
TaskStateChange=pending
DriverRegistered
Any other events may occur between the above calls, but the important sequence is N TaskStateChange=pending events, where N is the value of the -async_worker_threads flag, followed by DriverRegistered.
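The stall described above can be reproduced outside the scheduler with a minimal sketch. It does not use any Aurora classes: StarvationDemo, stalls, and gate are hypothetical names, a fixed-size pool stands in for the async worker threads, gate.join() stands in for result.get(), and the final submitted task stands in for the handler that would unblock the queue. The 500 ms timeout is arbitrary and is only there so the demo returns instead of hanging.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StarvationDemo {

  /**
   * Submits `pending` handlers that block on a gate, then one handler that
   * would open the gate. Returns true if the gate-opening handler could not
   * run within the timeout, i.e. the pool is starved.
   */
  public static boolean stalls(int workerThreads, int pending) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(workerThreads);
    CompletableFuture<Void> gate = new CompletableFuture<>();
    try {
      // Each of these occupies a worker thread until the gate is completed.
      for (int i = 0; i < pending; i++) {
        pool.execute(gate::join);
      }
      // The only task that can open the gate is queued behind the blocked ones.
      Runnable openGate = () -> gate.complete(null);
      Future<?> activator = pool.submit(openGate);
      try {
        activator.get(500, TimeUnit.MILLISECONDS);
        return false; // a free worker ran the activator
      } catch (TimeoutException e) {
        return true; // every worker is blocked; the activator never started
      }
    } finally {
      // Release the blocked workers so the demo can exit cleanly.
      gate.complete(null);
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(stalls(4, 4)); // true: N blocking handlers on N threads
    System.out.println(stalls(4, 3)); // false: one thread remains free
  }
}
```

With as many blocking handlers as worker threads, the unblocking task stays in the queue indefinitely. With one fewer, it runs and everything proceeds, which matches the N = -async_worker_threads threshold in the sequence above.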
This issue was exacerbated by f2755e1, which has the subtle effect of no longer using GatingDelayExecutor#closeDuring(). That method would enqueue all these events until storage recovery is complete. The on-demand execution greatly increases the likelihood of the above event sequence, since driver registration begins strictly after storage recovery completes.