AURORA-1953

Scheduler livelock during startup

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.18.0
    • Fix Version/s: 0.19.0
    • Component/s: Scheduler
    • Labels: None

    Description

      The scheduler may "livelock" during startup: async event handlers running on a ThreadPoolExecutor block until other events, which have not yet been executed, are processed. If enough of these blocking handlers run simultaneously, no further event processing occurs and the scheduler stalls.

      More specifically, this section of TaskGroups is afflicted:

      CompletableFuture<Set<String>> result = batchWorker.execute(storeProvider ->
          taskScheduler.schedule(storeProvider, taskIds));
      
      Set<String> scheduled = null;
      try {
        // Blocks the calling event-handler thread until the batch worker has run the task.
        scheduled = result.get();
      

      batchWorker#execute submits to a queue that is not processed until a SchedulerActive event is fired within the scheduler. SchedulerActive is delivered through the same AsyncEventBus that triggers the above code in TaskGroups, so both depend on the same worker threads. Therefore, the following sequence of events will cause a livelock:

      TaskStateChange=pending
      TaskStateChange=pending
      TaskStateChange=pending
      TaskStateChange=pending
      TaskStateChange=pending
      TaskStateChange=pending
      TaskStateChange=pending
      TaskStateChange=pending
      DriverRegistered
      

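      Other events may be interleaved with the above, but the important sequence is N TaskStateChange=pending events, where N is the value of -async_worker_threads, followed by DriverRegistered. The N pending events occupy every worker thread, each blocked in result.get(), so nothing queued behind them can run.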
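The stall described above can be reproduced in isolation. The following is a minimal sketch, not Aurora code: StartupStallSketch, activationRuns, and batchWorkerReady are hypothetical names, the fixed pool stands in for the AsyncEventBus executor, and the future stands in for the batch worker queue that only drains after activation.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the scheduler's event handling; not Aurora code.
class StartupStallSketch {

  /**
   * Returns true if the "activation" task ran within the timeout.
   * workerThreads plays the role of -async_worker_threads; pendingEvents is the
   * number of TaskStateChange=pending handlers queued ahead of activation.
   */
  static boolean activationRuns(int workerThreads, int pendingEvents, long timeoutMs) {
    ExecutorService eventPool = Executors.newFixedThreadPool(workerThreads);
    // Completed only when the activation task runs, like the batch worker queue
    // that is not processed until SchedulerActive has been handled.
    CompletableFuture<Void> batchWorkerReady = new CompletableFuture<>();
    CountDownLatch activated = new CountDownLatch(1);
    try {
      for (int i = 0; i < pendingEvents; i++) {
        // Each handler blocks its pool thread, as result.get() does in TaskGroups.
        eventPool.execute(batchWorkerReady::join);
      }
      // Submitted last, so it queues behind the pending-event handlers.
      eventPool.execute(() -> {
        batchWorkerReady.complete(null);
        activated.countDown();
      });
      return activated.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      batchWorkerReady.complete(null); // release blocked threads so the demo can exit
      eventPool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("8 threads, 7 pending: " + activationRuns(8, 7, 500));
    System.out.println("8 threads, 8 pending: " + activationRuns(8, 8, 500));
  }
}
```

With one thread to spare, the activation task runs and unblocks the waiting handlers. Once the number of blocked handlers reaches the pool size, the activation task never leaves the queue and the call times out.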

      This issue was exacerbated by f2755e1, which has the subtle effect of no longer using GatingDelayExecutor#closeDuring(). That method would otherwise enqueue all these events until storage recovery is complete. The on-demand execution greatly increases the likelihood of the above event sequence, since driver registration begins strictly after storage recovery completes.
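To illustrate the gating behavior mentioned above, here is a simplified sketch of an executor that holds submitted tasks while a critical operation runs and releases them afterwards. It is a hypothetical analogue only: GatedExecutorSketch, its closeDuring(Runnable) signature, and the demo event names are assumptions for illustration and do not reproduce Aurora's GatingDelayExecutor.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.Executor;

// Hypothetical, simplified gate; not Aurora's GatingDelayExecutor.
class GatedExecutorSketch implements Executor {
  private final Executor delegate;
  private final Queue<Runnable> held = new ArrayDeque<>();
  private boolean closed = false;

  GatedExecutorSketch(Executor delegate) {
    this.delegate = delegate;
  }

  @Override
  public synchronized void execute(Runnable task) {
    if (closed) {
      held.add(task); // gate closed: hold the task for later
    } else {
      delegate.execute(task);
    }
  }

  /** Runs work with the gate closed, then releases held tasks in submission order. */
  void closeDuring(Runnable work) {
    synchronized (this) {
      closed = true;
    }
    try {
      work.run();
    } finally {
      flush();
    }
  }

  private synchronized void flush() {
    closed = false;
    Runnable task;
    while ((task = held.poll()) != null) {
      delegate.execute(task);
    }
  }

  /** Submits two events during a simulated recovery and records the execution order. */
  static List<String> demo() {
    List<String> log = new ArrayList<>();
    GatedExecutorSketch gate = new GatedExecutorSketch(Runnable::run); // delegate runs inline
    gate.closeDuring(() -> {
      gate.execute(() -> log.add("pending-1"));
      gate.execute(() -> log.add("pending-2"));
      log.add("recovery-complete");
    });
    return log;
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

In this sketch, both events submitted during the gated operation are recorded only after "recovery-complete", which mirrors the described effect of deferring event handling until storage recovery is done.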

          People

            jordanly Jordan Ly
            wfarner Bill Farner