Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-7041

Avoid writing empty files in BypassMergeSortShuffleWriter

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • None
    • None
    • Shuffle, Spark Core
    • None

    Description

      In BypassMergeSortShuffleWriter, we may end up opening disk writers files for empty partitions; this occurs because we manually call open() after creating the writer, causing serialization and compression input streams to be created; these streams may write headers to the output stream, resulting in non-zero-length files being created for partitions that contain no records. This is unnecessary, though, since the disk object writer will automatically open itself when the first write is performed. Removing this eager open() call and rewriting the consumers to cope with the non-existence of empty files results in a large performance benefit for certain sparse workloads when using sort-based shuffle.

      Attachments

        Issue Links

          Activity

            People

              joshrosen Josh Rosen
              joshrosen Josh Rosen
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: