Spark / SPARK-28699

Caching an indeterminate RDD can lead to incorrect results on stage rerun


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.3.3, 2.4.3, 3.0.0
    • Fix Version/s: 2.3.4, 2.4.4, 3.0.0
    • Component/s: Spark Core

    Description

      This is another case of an indeterminate stage/RDD producing incorrect results when a stage is rerun.

      We can reproduce this with the following code; thanks to Tyson for reporting it!

      import scala.sys.process._
      import org.apache.spark.TaskContext

      val res = spark.range(0, 10000 * 10000, 1).map { x => (x % 1000, x) }
      // Kill an executor in the stage that performs repartition(239).
      val df = res.repartition(113).cache.repartition(239).map { x =>
        if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 &&
            TaskContext.get.stageAttemptNumber == 0) {
          throw new Exception("pkill -f -n java".!!)
        }
        x
      }

      val r2 = df.distinct.count()

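      To see why rerunning the shuffle can change which rows end up where, here is a minimal sketch in plain Scala (no Spark required; `roundRobin` is a hypothetical helper, not a Spark API). It mimics round-robin distribution, the strategy `repartition()` uses: partition assignment depends on the order in which elements arrive, so a rerun stage that recomputes its input in a different order can fill the partitions differently even though the overall data is the same.

      ```scala
      object RoundRobinSketch {
        // Hypothetical helper: assign elements to numPartitions buckets
        // round-robin, i.e. purely by arrival position.
        def roundRobin[T](elems: Seq[T], numPartitions: Int): Map[Int, Seq[T]] =
          elems.zipWithIndex
            .groupBy { case (_, i) => i % numPartitions }
            .map { case (p, xs) => p -> xs.map(_._1) }

        def main(args: Array[String]): Unit = {
          val firstRun = Seq(1, 2, 3, 4, 5, 6)
          val rerun    = Seq(2, 1, 3, 4, 6, 5) // same data, different fetch order

          val a = roundRobin(firstRun, 2)
          val b = roundRobin(rerun, 2)

          // Same multiset of elements overall, but partition contents differ:
          println(a(0)) // prints List(1, 3, 5)
          println(b(0)) // prints List(2, 3, 6)
          println(a == b) // prints false
        }
      }
      ```

      In the repro above, `repartition(113).cache` pins the first run's (order-dependent) partitioning in memory; when the `pkill` loses an executor and only some partitions are recomputed, the recomputed ones can follow a different assignment, so rows are duplicated or dropped and `distinct.count()` comes out wrong.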


          People

            Assignee: Yuanjian Li
            Reporter: Yuanjian Li
            Votes: 0
            Watchers: 9

