AURORA-1945

Rescinds received but not processed in time before offer accept

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Scheduler
    • Labels:
      None

      Description

      The following race condition for offers is currently possible:

      1. Scheduler receives an offer and adds it to the executor queue for processing.
      2. The executor processes the offer and adds it to the HostOffers list.
      3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, if the executor is under heavy load, there can be a delay between receiving the rescind and processing it.
      4. Scheduler accepts the offer before the rescind is processed by the executor. This launches a task with an invalid offer, which leads to TASK_LOST.

      The following logs show this in action:

      Mesos:

      I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
      W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
      W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
      I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
      

      Aurora:

      I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
      I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH] 
      I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid 
      I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
      W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X. 
      

      We should find a way to prioritize rescinds, or process them immediately, to avoid this delay. The fix must also account for the earlier race condition fixed by AURORA-1933 so that we do not reintroduce it.
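
The race in steps 1 to 4 can be reproduced deterministically with a small model. This is only a sketch: `RescindRaceSketch`, `onOffer`, `onRescind`, `tryAccept`, and `reproduceRace` are hypothetical names, not Aurora's actual classes or methods, and a single-threaded `ExecutorService` is assumed to stand in for the scheduler's executor queue. A blocking task simulates the load that delays the rescind.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical model of the race; none of these names are Aurora's real classes.
class RescindRaceSketch {
  // Offers the scheduler believes it can still use (stands in for the HostOffers list).
  private final Set<String> hostOffers = ConcurrentHashMap.newKeySet();
  // Single-threaded queue standing in for the scheduler's executor.
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  // Steps 1 and 2: the offer is queued, then added once the executor gets to it.
  void onOffer(String offerId) {
    executor.execute(() -> hostOffers.add(offerId));
  }

  // Step 3: the rescind is queued behind whatever work is already pending.
  void onRescind(String offerId) {
    executor.execute(() -> hostOffers.remove(offerId));
  }

  // Step 4: the accept path only consults the offer list.
  boolean tryAccept(String offerId) {
    return hostOffers.contains(offerId);
  }

  // Returns true when the stale offer was accepted before the rescind was processed.
  static boolean reproduceRace() {
    RescindRaceSketch s = new RescindRaceSketch();
    CountDownLatch offerAdded = new CountDownLatch(1);
    CountDownLatch releaseLoad = new CountDownLatch(1);
    try {
      s.onOffer("OFFER_X");
      s.executor.execute(offerAdded::countDown);
      // Simulated load: this task occupies the executor until it is released.
      s.executor.execute(() -> {
        try {
          releaseLoad.await();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
      offerAdded.await();

      s.onRescind("OFFER_X");                         // stuck behind the load
      boolean acceptedStale = s.tryAccept("OFFER_X"); // offer still looks valid

      releaseLoad.countDown();
      s.executor.shutdown();
      s.executor.awaitTermination(5, TimeUnit.SECONDS);
      // After the queue drains, the rescind has finally removed the offer.
      return acceptedStale && !s.tryAccept("OFFER_X");
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }
}
```

In this model the accept succeeds only because the rescind is still waiting in the queue, which matches the ordering of the "Accepting offer" and "Offer rescinded" lines in the Aurora log above.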

          Activity

          jordanly Jordan Ly added a comment - https://reviews.apache.org/r/61804/
          jordanly Jordan Ly added a comment -

          Merged.

          commit 62e46cdea5b3a2143e2fed601aca814346af750b
          Author: Jordan Ly <jordan.ly8@gmail.com>
          Date:   Thu Aug 24 09:36:57 2017 -0700
          
              Fix race condition where rescinds are received but not processed before offer is accepted
          
              The current race condition for offers is possible:
              ```
              1. Scheduler receives an offer and adds it to the executor queue for processing.
              2. The executor processes the offer and adds it to the HostOffers list.
              3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, there is a lot of load on the executor so there might be a delay between receiving the rescind and processing it.
              4. Scheduler accepts the offer before the rescind is processed by the executor. This will result in launching a task with an invalid offer leading to TASK_LOST.
              ```
              The following logs show this in action:
          
              Mesos:
              ```
              I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
              W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
              W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
              I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
              ```
              Aurora:
              ```
              I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X
              I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH]
              I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid
              I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
              W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X.
              ```
              I would like to temporarily ban offers if we receive a rescind but the offer has not yet been added (ie. still in the executor queue). Then, when we actually process the offer we will not assign it to tasks since we know it has been rescinded already. When we ban the offer, we will also add a command to unban the offer to the executor queue so that future offers will not be affected. This solution should also avoid the race condition fixed in: https://issues.apache.org/jira/browse/AURORA-1933
          
              Testing Done:
              `./gradlew test`
          
              Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.
          
              I will verify this patch on a live cluster as well before submitting.
          
              Bugs closed: AURORA-1945
          
              Reviewed at https://reviews.apache.org/r/61804/
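
The ban-and-unban idea from the commit message can be sketched as follows. This is not the merged patch: `OfferBanSketch` and its methods are hypothetical names, and the sketch makes one simplifying assumption that differs from the description, namely that every rescind bans the offer immediately on the callback thread, whether or not the offer has been added yet. The queued follow-up task then removes the offer and lifts the ban, so later offers are not affected.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the ban/unban approach; not Aurora's actual implementation.
class OfferBanSketch {
  private final Set<String> hostOffers = ConcurrentHashMap.newKeySet();
  private final Set<String> bannedOffers = ConcurrentHashMap.newKeySet();
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  // Queued as before, but a banned (already rescinded) offer is never added.
  void onOffer(String offerId) {
    executor.execute(() -> {
      if (!bannedOffers.contains(offerId)) {
        hostOffers.add(offerId);
      }
    });
  }

  // The ban takes effect immediately; removal and unban are queued behind the offer.
  void onRescind(String offerId) {
    bannedOffers.add(offerId);
    executor.execute(() -> {
      hostOffers.remove(offerId);
      bannedOffers.remove(offerId);
    });
  }

  // The accept path refuses banned offers even if they are still in the list.
  boolean tryAccept(String offerId) {
    return !bannedOffers.contains(offerId) && hostOffers.contains(offerId);
  }

  // Returns true when the rescinded offer is never accepted and the ban is cleaned up.
  // offerProcessedFirst selects whether the offer is added before the load arrives.
  static boolean rescindHonored(boolean offerProcessedFirst) {
    OfferBanSketch s = new OfferBanSketch();
    CountDownLatch releaseLoad = new CountDownLatch(1);
    Runnable load = () -> {
      try {
        releaseLoad.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    };
    try {
      if (offerProcessedFirst) {
        CountDownLatch offerAdded = new CountDownLatch(1);
        s.onOffer("OFFER_X");
        s.executor.execute(offerAdded::countDown);
        offerAdded.await();
        s.executor.execute(load);
      } else {
        s.executor.execute(load); // the offer itself is stuck in the queue
        s.onOffer("OFFER_X");
      }
      s.onRescind("OFFER_X");
      boolean acceptedWhileQueued = s.tryAccept("OFFER_X");

      releaseLoad.countDown();
      s.executor.shutdown();
      s.executor.awaitTermination(5, TimeUnit.SECONDS);
      boolean acceptedAfterDrain = s.tryAccept("OFFER_X");
      return !acceptedWhileQueued && !acceptedAfterDrain && s.bannedOffers.isEmpty();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }
}
```

Because the offer task is always queued before the matching rescind task, the unban cannot run before the offer has been processed, so the ban covers the whole window in which the offer could otherwise be accepted. A small window remains between the ban check and the actual accept call; the sketch does not try to close it.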
          

            People

            • Assignee:
              jordanly Jordan Ly
              Reporter:
              jordanly Jordan Ly
