Spark / SPARK-24755

Executor loss can cause task to not be resubmitted



    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.3, 2.4.0
    • Component/s: Spark Core
    • Labels:


      As part of SPARK-22074, when an executor is lost, TaskSetManager.executorLost currently checks "if (successful(index) && !killedByOtherAttempt(index))" to decide whether a task needs to be resubmitted for the partition.

      Consider the following:

      For partition P1, tasks T1 and T2 are running on exec-1 and exec-2 respectively (one of them being a speculative task).

      T1 finishes successfully first.

      This results in setting "killedByOtherAttempt(P1) = true" because of the still-running T2.
      We also end up killing task T2.

      Now, if/when exec-1 goes MIA:
      executorLost will no longer schedule a task for P1 - since killedByOtherAttempt(P1) == true; even though P1's only output was produced by T1 on exec-1 and there is no other copy of it around (T2 was killed when T1 succeeded).
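      The scenario above can be reduced to a minimal, self-contained sketch. This is not actual Spark code: the object, the `outputExecutor` field, and the `needsResubmitOnExecutorLost` helper are hypothetical simplifications loosely modeled on TaskSetManager's `successful` and `killedByOtherAttempt` state. It only demonstrates why the SPARK-22074 condition evaluates to false for P1 even though the only copy of its output was on the lost executor.

      ```scala
      // Hypothetical sketch of the per-partition state, NOT Spark internals.
      object ExecutorLostSketch {
        val successful = Array(false)            // partition P1 not yet finished
        val killedByOtherAttempt = Array(false)  // set when a duplicate attempt is killed
        // Which executor holds the output of the successful attempt (simplification).
        var outputExecutor: Option[String] = None

        // T1 succeeds on `exec`; if a speculative copy (T2) was running, it is
        // killed and killedByOtherAttempt is flipped for the partition.
        def taskFinished(partition: Int, exec: String, hadSpeculativeCopy: Boolean): Unit = {
          successful(partition) = true
          outputExecutor = Some(exec)
          if (hadSpeculativeCopy) killedByOtherAttempt(partition) = true
        }

        // The SPARK-22074 check: resubmit only if the partition succeeded AND
        // no other attempt was killed for it.
        def needsResubmitOnExecutorLost(partition: Int, lostExec: String): Boolean =
          outputExecutor.contains(lostExec) &&
            successful(partition) && !killedByOtherAttempt(partition)
      }
      ```

      Walking the scenario through the sketch: after `taskFinished(0, "exec-1", hadSpeculativeCopy = true)`, losing exec-1 gives `needsResubmitOnExecutorLost(0, "exec-1") == false`, i.e. P1 is never resubmitted despite its output being gone.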

      I noticed this bug as part of reviewing PR #21653 for SPARK-13343.

      Essentially, SPARK-22074 causes a regression (which I don't usually observe due to the shuffle service, sigh) - and as such the fix is broken IMO.

      I don't have a PR handy for this, so if anyone wants to pick it up, please do feel free!
      +CC Yuanjian Li, who fixed SPARK-22074 initially.





            • Assignee: hthuynh2 Hieu Tri Huynh
            • Reporter: mridulm80 Mridul Muralidharan

