Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-4550

In sort-based shuffle, store map outputs in serialized form

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • 1.2.0
    • 1.4.0
    • Shuffle, Spark Core
    • None

    Description

      One drawback with sort-based shuffle compared to hash-based shuffle is that it ends up storing many more java objects in memory. If Spark could store map outputs in serialized form, it could

      • spill less often because the serialized form is more compact
      • reduce GC pressure

      This will only work when the serialized representations of objects are independent from each other and occupy contiguous segments of memory. E.g. when Kryo reference tracking is left on, objects may contain pointers to objects farther back in the stream, which means that the sort can't relocate objects without corrupting them.

      Attachments

        1. kryo-flush-benchmark.scala
          2 kB
          Sandy Ryza
        2. SPARK-4550-design-v1.pdf
          145 kB
          Sandy Ryza

        Issue Links

          Activity

            People

              sandyr Sandy Ryza
              sandyr Sandy Ryza
              Votes:
              1 Vote for this issue
              Watchers:
              21 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: