
SPARK-23029: Doc spark.shuffle.file.buffer units are KiB when no units specified


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.2.1
    • Fix Version/s: 2.3.0
    • Component/s: Spark Core
    • Labels: None

      Description

      Setting spark.shuffle.file.buffer without a unit suffix, even to what looks like its default value (32768, intended as the documented 32k bytes), makes shuffles fail. This appears to affect small to medium sized partitions; strangely, the error is an OutOfMemoryError, yet the same job succeeds with large partitions (at least >32 MB).

      pyspark --conf "spark.shuffle.file.buffer=$((32*1024))"
      /gpfs/bbp.cscs.ch/scratch/gss/spykfunc/_sparkenv/lib/python2.7/site-packages/pyspark/bin/spark-submit pyspark-shell-main --name PySparkShell --conf spark.shuffle.file.buffer=32768
      version 2.2.1
      
      >>> spark.range(1e7, numPartitions=10).sort("id").write.parquet("a", mode="overwrite")
      
      [Stage 1:>                                                        (0 + 10) / 10]18/01/10 19:34:21 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 11)
      java.lang.OutOfMemoryError: Java heap space
      	at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:75)
      	at org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.<init>(DiskBlockObjectWriter.scala:107)
      	at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:108)
      	at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
      	at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
      	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
      	at org.apache.spark.scheduler.Task.run(Task.scala:108)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      
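      The failure is consistent with the unit mix-up in the title: a unitless spark.shuffle.file.buffer value is parsed as KiB, so 32768 requests a 32 MiB BufferedOutputStream per open shuffle writer instead of the intended 32 KiB. A rough back-of-the-envelope sketch follows; the 200 reduce partitions and the one-writer-per-partition model are assumptions (the default spark.sql.shuffle.partitions, plus what the BypassMergeSortShuffleWriter frames in the trace suggest), while the 32768 and the 10 concurrent tasks come from the report itself.

      buffer_bytes = 32768 * 1024    # "32768" parsed as KiB -> 32 MiB per writer
      reduce_partitions = 200        # assumed default spark.sql.shuffle.partitions
      concurrent_tasks = 10          # "(0 + 10) / 10" in the progress bar above
      total = buffer_bytes * reduce_partitions * concurrent_tasks
      print("%.1f GiB of write buffers" % (total / 2**30))   # ~62.5 GiB -> OOM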

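      For completeness, a minimal sketch of the safe spelling (the app name is illustrative): give spark.shuffle.file.buffer an explicit unit suffix so it cannot be misread.

      from pyspark.sql import SparkSession

      spark = (
          SparkSession.builder
          .appName("shuffle-buffer-units")
          # "32k" is the documented default (32 KiB); a bare "32768" would be
          # read as 32768 KiB, i.e. a 32 MiB buffer per open shuffle writer.
          .config("spark.shuffle.file.buffer", "32k")
          .getOrCreate()
      )

      spark.range(10**7, numPartitions=10).sort("id").write.parquet("a", mode="overwrite")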

            People

            • Assignee: Fernando Pereira (ferdonline)
            • Reporter: Fernando Pereira (ferdonline)
            • Votes: 0
            • Watchers: 3
