Spark / SPARK-4019

Shuffling with more than 2000 reducers may drop all data when partitions are mostly empty or cause deserialization errors if at least one partition is empty


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.2.0
    • Component/s: Spark Core
    • Labels: None

Description

      sc.makeRDD(0 until 10, 1000).repartition(2001).collect()
      

      returns `Array()`.

      Spark 1.1.0 doesn't have this issue. Tried with both the HASH and SORT shuffle managers.
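
      The empty result is what you'd expect if the map output sizes are being averaged away. A minimal sketch of the arithmetic, assuming the over-2000-partition map output status keeps only a single average block size (hypothetical names, not Spark's actual MapStatus internals):

      // 10 records spread over 2001 shuffle blocks, almost all of them empty.
      val numBlocks  = 2001
      val blockSizes = Array.fill[Long](numBlocks)(0L)
      (0 until 10).foreach(i => blockSizes(i) = 40L)  // ~40 bytes per record

      // Averaging over ALL blocks (integer division) rounds down to zero:
      val avgSize = blockSizes.sum / numBlocks        // 400 / 2001 == 0

      // A reducer that treats an advertised size of 0 as "nothing to fetch"
      // skips every block, so collect() returns Array().
      val fetched = blockSizes.indices.count(_ => avgSize > 0)  // == 0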

      This problem can also manifest itself as Snappy deserialization errors if the average map output status size is non-zero but there is at least one empty partition, e.g.

      sc.makeRDD(0 until 100000, 1000).repartition(2001).collect()
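
      Under the same simplified model (hypothetical names again), with 100000 records the average is non-zero and the failure flips: an empty partition is advertised as the average size rather than 0, the reducer fetches it anyway, and decompressing the resulting zero-byte stream is what surfaces as the Snappy deserialization error:

      // Enough data that the average no longer rounds to zero:
      val totalBytes = 400000L            // rough serialized size of 100000 ints
      val avgSize    = totalBytes / 2001  // == 199, non-zero

      // An empty block is now advertised as 199 bytes instead of 0, so the
      // reducer fetches it, receives zero bytes, and the Snappy codec fails
      // while decompressing an empty stream.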


People

    Assignee: Josh Rosen (joshrosen)
    Reporter: Xiangrui Meng (mengxr)
    Votes: 0
    Watchers: 5
