Mesos / MESOS-851

Scheduler Driver does not guarantee that abort() prevents further calls on the Scheduler.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.16.0
    • Component/s: c++ api, java api, python api
    • Labels:
      None

      Description

      This came up while reviewing: https://reviews.apache.org/r/15853/

      Our documentation for abort mentions that no more callbacks can be made to the scheduler:
      /**
       * Aborts the driver so that no more callbacks can be made to the
       * scheduler. The semantics of abort and stop have deliberately been
       * separated so that code can detect an aborted driver (i.e., via
       * the return status of SchedulerDriver::join, see below), and
       * instantiate and start another driver if desired (from within the
       * same process). Note that 'stop()' is not automatically called
       * inside 'abort()'.
       */
      virtual Status abort() = 0;

      However, this is inaccurate: abort() only performs a dispatch to the SchedulerProcess, so any messages already queued ahead of that dispatch will still be processed (and may invoke Scheduler callbacks) before the abort takes effect:

      Status MesosSchedulerDriver::abort()
      {
        Lock lock(&mutex);

        if (status != DRIVER_RUNNING) {
          return status;
        }

        CHECK(process != NULL);

        // XXX: This does not immediately signal the SchedulerProcess to stop
        // processing messages!
        dispatch(process, &SchedulerProcess::abort);

        return status = DRIVER_ABORTED;
      }
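
      To make the ordering problem concrete, here is a minimal single-threaded sketch. It does not use libprocess or the Mesos API: ToyProcess, simulateLateAbort and the callback names in the log are hypothetical stand-ins, and the only assumption carried over from the code above is that abort() appends an event to the back of a FIFO queue and returns immediately.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an actor-style process: events run in FIFO order.
class ToyProcess {
public:
  // Appends an event to the back of the queue, like a plain dispatch.
  void dispatch(std::function<void()> event) {
    assert(event != nullptr);  // an empty std::function would throw when run
    queue_.push_back(std::move(event));
  }

  // Drains the queue in order, as the process's own thread would.
  void run() {
    while (!queue_.empty()) {
      std::function<void()> event = std::move(queue_.front());
      queue_.pop_front();
      event();
    }
  }

private:
  std::deque<std::function<void()>> queue_;
};

// Two messages are already queued when the driver "aborts".
// Returns everything the scheduler side observed, in order.
std::vector<std::string> simulateLateAbort() {
  ToyProcess process;
  std::vector<std::string> log;
  bool aborted = false;

  // Mirrors a process that checks its own 'aborted' flag before each callback.
  auto callback = [&](const std::string& name) {
    if (!aborted) {
      log.push_back(name);
    }
  };

  process.dispatch([&] { callback("resourceOffers"); });
  process.dispatch([&] { callback("statusUpdate"); });

  // abort(): the flag only flips once this event reaches the front of the queue.
  process.dispatch([&] { aborted = true; });
  log.push_back("abort() returned");

  // A message that arrives after the abort event is correctly dropped.
  process.dispatch([&] { callback("frameworkMessage"); });

  process.run();
  return log;
}
```

      In this model the two callbacks queued before the abort event are still delivered after abort() has returned; only the message queued behind the abort event is suppressed.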

      The driver's stop() call has a similar issue in terms of possibly making additional calls on the Scheduler after stop() is called.

      This problem is mirrored in the ExecutorDriver's stop and abort functions as well.

      So far, I see a few possible fixes:

      1. Expose the 'volatile bool aborted' member variable of SchedulerProcess and set it inside MesosSchedulerDriver::abort. stop() would need a similar boolean.

      2. Provide a "priority dispatch" mechanism in libprocess, wherein the DispatchEvent can be sent to the front of the queue. (stop() can also use this).

      3. Terminate the process when abort/stop are called and handle it appropriately in the finalize() function. However, this changes the existing functionality: schedulers can no longer make driver calls to kill tasks, launch tasks, etc. after being aborted.
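
      Options 1 and 2 can be compared in a toy model. This is a sketch under stated assumptions, not the actual fix: PriorityToyProcess, dispatchFront, AbortStrategy and callbacksAfterAbort are hypothetical names, the queue is drained on a single thread, and std::atomic<bool> is swapped in for the 'volatile bool aborted' member mentioned in option 1.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical FIFO event queue with an extra "front of the queue" entry point.
class PriorityToyProcess {
public:
  // Plain dispatch: the event waits behind everything already queued.
  void dispatch(std::function<void()> event) {
    assert(event != nullptr);
    queue_.push_back(std::move(event));
  }

  // Option 2 (assumed mechanism): the event jumps ahead of queued messages.
  void dispatchFront(std::function<void()> event) {
    assert(event != nullptr);
    queue_.push_front(std::move(event));
  }

  // Drains the queue in order on the calling thread.
  void run() {
    while (!queue_.empty()) {
      std::function<void()> event = std::move(queue_.front());
      queue_.pop_front();
      event();
    }
  }

private:
  std::deque<std::function<void()>> queue_;
};

enum class AbortStrategy { PlainDispatch, SharedFlag, PriorityDispatch };

// Queues two messages, aborts using the given strategy, then drains the queue.
// Returns the callbacks that were still delivered to the scheduler.
std::vector<std::string> callbacksAfterAbort(AbortStrategy strategy) {
  PriorityToyProcess process;
  std::atomic<bool> aborted{false};
  std::vector<std::string> delivered;

  // Every callback is guarded by the flag, as in option 1.
  auto deliver = [&](const std::string& name) {
    if (!aborted.load()) {
      delivered.push_back(name);
    }
  };

  process.dispatch([&] { deliver("resourceOffers"); });
  process.dispatch([&] { deliver("statusUpdate"); });

  switch (strategy) {
    case AbortStrategy::PlainDispatch:     // current behaviour
      process.dispatch([&] { aborted.store(true); });
      break;
    case AbortStrategy::SharedFlag:        // option 1: set the flag directly
      aborted.store(true);
      break;
    case AbortStrategy::PriorityDispatch:  // option 2: abort event goes first
      process.dispatchFront([&] { aborted.store(true); });
      break;
  }

  process.run();
  return delivered;
}
```

      In this model, PlainDispatch still delivers both queued callbacks, while SharedFlag and PriorityDispatch deliver none. With a real multi-threaded process, a callback that is already executing when abort() is called would still run to completion under either option; the sketch does not model that case.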

            People

            • Assignee:
              bmahler Benjamin Mahler
              Reporter:
              bmahler Benjamin Mahler
