SPARK-23029

Doc spark.shuffle.file.buffer units are kb when no units specified

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.2.1
    • Fix Version/s: 2.3.0
    • Component/s: Spark Core
    • Labels: None

    Description

      Setting spark.shuffle.file.buffer, even to a value meant to match its default, causes shuffles to fail.
      This appears to affect small to medium-sized partitions. Strangely, the error is an OutOfMemoryError, yet the job works with large partitions (at least >32MB).
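
The issue title points to the likely cause: a value given without a unit suffix is read as KiB, not bytes, so 32768 requests a 32 MiB buffer per shuffle file writer instead of 32 KiB. The sketch below only illustrates that arithmetic. `size_as_kib` and `buffer_bytes` are hypothetical helpers, not Spark's own parser, and the figure of 200 writers per task is an assumption (the default value of spark.sql.shuffle.partitions), not something measured in this report.

```python
import re

# Hypothetical, simplified stand-in for Spark's size parsing (not Spark code).
# Multipliers convert each supported suffix to KiB.
_UNIT_TO_KIB = {"k": 1, "kb": 1, "m": 1024, "mb": 1024,
                "g": 1024 ** 2, "gb": 1024 ** 2}


def size_as_kib(value):
    """Interpret a size string as KiB; a bare number is taken to be KiB."""
    match = re.fullmatch(r"(\d+)\s*([a-z]*)", value.strip().lower())
    if match is None:
        raise ValueError("unparseable size: %r" % value)
    number, unit = int(match.group(1)), match.group(2)
    if unit == "":
        return number  # no suffix: the number is already in KiB
    if unit not in _UNIT_TO_KIB:
        raise ValueError("unsupported unit: %r" % unit)
    return number * _UNIT_TO_KIB[unit]


def buffer_bytes(value):
    """Size in bytes of one writer's buffer for the given setting."""
    return size_as_kib(value) * 1024


intended = buffer_bytes("32k")    # what the reporter meant
actual = buffer_bytes("32768")    # what the unitless value requests

# Assumption: one buffered writer per shuffle partition, 200 partitions.
assumed_writers_per_task = 200
worst_case_mib = assumed_writers_per_task * actual // (1024 ** 2)

print("intended buffer: %d bytes" % intended)
print("actual buffer:   %d bytes" % actual)
print("worst case per task: %d MiB" % worst_case_mib)
```

Under these assumptions a single task could try to allocate several GiB of buffers, which would be consistent with the heap exhaustion inside BufferedOutputStream shown in the trace.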

      pyspark --conf "spark.shuffle.file.buffer=$((32*1024))"
      /gpfs/bbp.cscs.ch/scratch/gss/spykfunc/_sparkenv/lib/python2.7/site-packages/pyspark/bin/spark-submit pyspark-shell-main --name PySparkShell --conf spark.shuffle.file.buffer=32768
      version 2.2.1
      
      >>> spark.range(1e7, numPartitions=10).sort("id").write.parquet("a", mode="overwrite")
      
      [Stage 1:>                                                        (0 + 10) / 10]18/01/10 19:34:21 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 11)
      java.lang.OutOfMemoryError: Java heap space
      	at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:75)
      	at org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.<init>(DiskBlockObjectWriter.scala:107)
      	at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:108)
      	at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
      	at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
      	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
      	at org.apache.spark.scheduler.Task.run(Task.scala:108)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
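
One way to avoid the ambiguity is to always pass the setting with an explicit unit suffix. The helper below is a minimal sketch, not part of PySpark: `bytes_to_size_string` is a hypothetical name, and it only handles byte counts that are whole multiples of 1024.

```python
def bytes_to_size_string(num_bytes):
    """Render a byte count as a size string with an explicit 'k' suffix."""
    if num_bytes <= 0 or num_bytes % 1024 != 0:
        raise ValueError("expected a positive multiple of 1024 bytes")
    return "%dk" % (num_bytes // 1024)


# Build the --conf argument from a byte count instead of passing the raw number.
conf_arg = "spark.shuffle.file.buffer=" + bytes_to_size_string(32 * 1024)
print(conf_arg)  # spark.shuffle.file.buffer=32k
```

The resulting string can be passed as `pyspark --conf "spark.shuffle.file.buffer=32k"`, which states the unit explicitly rather than relying on the default interpretation.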
      

          People

            Assignee: Fernando Pereira (ferdonline)
            Reporter: Fernando Pereira (ferdonline)