Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-23207

Shuffle+Repartition on an DataFrame could lead to incorrect answers

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0
    • 2.1.4, 2.2.3, 2.3.0
    • SQL

    Description

      Currently shuffle repartition uses RoundRobinPartitioning, the generated result is nondeterministic since the sequence of input rows are not determined.

      The bug can be triggered when there is a repartition call following a shuffle (which would lead to non-deterministic row ordering), as the pattern shows below:
      upstream stage -> repartition stage -> result stage
      (-> indicate a shuffle)
      When one of the executors process goes down, some tasks on the repartition stage will be retried and generate inconsistent ordering, and some tasks of the result stage will be retried generating different data.

      The following code returns 931532, instead of 1000000:

      import scala.sys.process._
      
      import org.apache.spark.TaskContext
      val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
        x
      }.repartition(200).map { x =>
        if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
          throw new Exception("pkill -f java".!!)
        }
        x
      }
      res.distinct().count()
      

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            jiangxb1987 Xingbo Jiang
            jiangxb1987 Xingbo Jiang
            Votes:
            1 Vote for this issue
            Watchers:
            21 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment