Uploaded image for project: 'Aurora'
  1. Aurora
  2. AURORA-1933

Scheduler can process rescind before offer

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels:
      None

      Description

      I observed the following in production:

      Jun  6 00:31:32 compute1159-dca1 aurora-scheduler[23675]: I0606 00:31:32.510 [Thread-77638, MesosCallbackHandler$MesosCallbackHandlerImpl:229] Offer rescinded: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
      Jun  6 00:31:32 compute1159-dca1 aurora-scheduler[23675]: I0606 00:31:32.903 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl:211] Received offer: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
      Jun  6 00:31:34 compute1159-dca1 aurora-scheduler[23675]: I0606 00:31:34.815 [TaskGroupBatchWorker, VersionedSchedulerDriverService:123] Accepting offer 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 with ops [LAUNCH]
      

      Notice the rescind was processed before the offer was given. This means the offer is in the offer storage, but using it is invalid. It will cause whatever task launched with it to fail with Task launched with invalid offers: Offer 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 is no longer valid

        Issue Links

          Activity

          Show
          zmanji Zameer Manji added a comment - https://reviews.apache.org/r/59853/
          Hide
          zmanji Zameer Manji added a comment -
          commit 73234d09d803ebc718c0c0ea454c5629714ef8e6
          Author: Zameer Manji <zmanji@apache.org>
          Date:   Tue Jun 6 14:21:22 2017 -0700
          
              Process rescinds in the same thread pool as offers.
              
              In a a production environment I was able to observe the following:
              ```
              I0606 00:31:32.510 [Thread-77638, MesosCallbackHandler$MesosCallbackHandlerImpl:229] Offer rescinded: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
              I0606 00:31:32.903 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl:211] Received offer: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
              I0606 00:31:34.815 [TaskGroupBatchWorker, VersionedSchedulerDriverService:123] Accepting offer 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 with ops [LAUNCH]
              ```
              
              Notice that the offer rescind was processed before the actual offer. This is
              possible because there is a race in the `MesosCallbackHandlerImpl`. The offer is
              processed in the executor (to prevent blocking) and the rescind is handled
              directly. This means the offer procecssing thread (`SchedulerImpl-0`) is racing
              against the callback thread (`Thread-77638`).
              
              In normal operation, there will be seconds to minutes between a rescind and an
              offer, but in some cases an offer can be rescinded very quickly in clusters that
              use oversubscription modules.
              
              To fix this, we move the rescind processing into the same executor as the offer
              processing to ensure they are processed in the order they are received. Without
              fixing this, the rescinded offer exists in the offer manager and can be used
              later to launch a task. This task will immediately fail to launch because the
              offer is invalid.
              
              In this patch, I have also added a metric and logging to record when we fail to
              remove an offer from the offer manager, and cleaned up the logging to allow
              operators to see when an offer was recieved. With this logging, an operator can
              grep for the offer id and see the entire lifecycle of the offer in the
              scheduler.
              
              Bugs closed: AURORA-1933
              
              Reviewed at https://reviews.apache.org/r/59853/
          
           .../scheduler/mesos/MesosCallbackHandler.java      | 25 +++++++++++++++-------
           .../mesos/VersionedSchedulerDriverService.java     |  2 +-
           .../aurora/scheduler/offers/OfferManager.java      | 13 ++++++++++-
           .../scheduler/offers/OfferManagerImplTest.java     | 10 +++++++++
           4 files changed, 40 insertions(+), 10 deletions(-)
          
          Show
          zmanji Zameer Manji added a comment - commit 73234d09d803ebc718c0c0ea454c5629714ef8e6 Author: Zameer Manji <zmanji@apache.org> Date: Tue Jun 6 14:21:22 2017 -0700 Process rescinds in the same thread pool as offers. In a a production environment I was able to observe the following: ``` I0606 00:31:32.510 [Thread-77638, MesosCallbackHandler$MesosCallbackHandlerImpl:229] Offer rescinded: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 I0606 00:31:32.903 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl:211] Received offer: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 I0606 00:31:34.815 [TaskGroupBatchWorker, VersionedSchedulerDriverService:123] Accepting offer 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 with ops [LAUNCH] ``` Notice that the offer rescind was processed before the actual offer. This is possible because there is a race in the `MesosCallbackHandlerImpl`. The offer is processed in the executor (to prevent blocking) and the rescind is handled directly. This means the offer procecssing thread (`SchedulerImpl-0`) is racing against the callback thread (`Thread-77638`). In normal operation, there will be seconds to minutes between a rescind and an offer, but in some cases an offer can be rescinded very quickly in clusters that use oversubscription modules. To fix this, we move the rescind processing into the same executor as the offer processing to ensure they are processed in the order they are received. Without fixing this, the rescinded offer exists in the offer manager and can be used later to launch a task. This task will immediately fail to launch because the offer is invalid. In this patch, I have also added a metric and logging to record when we fail to remove an offer from the offer manager, and cleaned up the logging to allow operators to see when an offer was recieved. With this logging, an operator can grep for the offer id and see the entire lifecycle of the offer in the scheduler. Bugs closed: AURORA-1933 Reviewed at https://reviews.apache.org/r/59853/ .../scheduler/mesos/MesosCallbackHandler.java | 25 +++++++++++++++------- .../mesos/VersionedSchedulerDriverService.java | 2 +- .../aurora/scheduler/offers/OfferManager.java | 13 ++++++++++- .../scheduler/offers/OfferManagerImplTest.java | 10 +++++++++ 4 files changed, 40 insertions(+), 10 deletions(-)

            People

            • Assignee:
              zmanji Zameer Manji
              Reporter:
              zmanji Zameer Manji
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development