Details
- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Affects Version/s: 2.2.1
- Fix Version/s: None
Description
Setting spark.shuffle.file.buffer, even to its default value, makes shuffles fail.
This appears to affect small to medium-sized partitions. Strangely, the error raised is an OutOfMemoryError, yet it works with large partitions (at least >32 MB).
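
A plausible explanation, not verified in this report: spark.shuffle.file.buffer is a size property whose documented default is the string "32k", and Spark parses a bare number for it as KiB, so $((32*1024)) = 32768 requests a 32768 KiB (32 MiB) buffer per shuffle output stream rather than the intended 32768 bytes. A minimal sketch that pins the value with an explicit unit (the SparkSession builder usage here is illustrative, not part of the original repro):

from pyspark.sql import SparkSession

# Sketch: give the buffer size an explicit unit so a bare number
# cannot be interpreted as KiB.
spark = (
    SparkSession.builder
    .appName("shuffle-buffer-unit")
    .config("spark.shuffle.file.buffer", "32k")  # 32 KiB, the documented default
    .getOrCreate()
)

# Same job as the repro below; with a 32 KiB buffer it should complete normally.
spark.range(int(1e7), numPartitions=10).sort("id").write.parquet("a", mode="overwrite")

The original repro, which passes the bare number, and the resulting stack trace: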
$ pyspark --conf "spark.shuffle.file.buffer=$((32*1024))"
/gpfs/bbp.cscs.ch/scratch/gss/spykfunc/_sparkenv/lib/python2.7/site-packages/pyspark/bin/spark-submit pyspark-shell-main --name PySparkShell --conf spark.shuffle.file.buffer=32768
version 2.2.1
>>> spark.range(1e7, numPartitions=10).sort("id").write.parquet("a", mode="overwrite")
[Stage 1:> (0 + 10) / 10]
18/01/10 19:34:21 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 11)
java.lang.OutOfMemoryError: Java heap space
    at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:75)
    at org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.<init>(DiskBlockObjectWriter.scala:107)
    at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:108)
    at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
    at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
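
The stack trace is consistent with that reading: BypassMergeSortShuffleWriter opens a DiskBlockObjectWriter for every reduce partition, and each writer eagerly allocates a BufferedOutputStream of spark.shuffle.file.buffer bytes, which is exactly the allocation that fails above. Back-of-the-envelope arithmetic under two assumptions not stated in the report, namely that the bare value is parsed as KiB and that the sort shuffles into the spark.sql.shuffle.partitions default of 200 reduce partitions:

# Hypothetical heap demand per map task under the assumptions above.
buffer_bytes = 32768 * 1024    # "32768" parsed as KiB -> 32 MiB per writer
reduce_partitions = 200        # assumed spark.sql.shuffle.partitions default
per_task = buffer_bytes * reduce_partitions
print("{:.2f} GiB per map task".format(per_task / 2.0 ** 30))  # 6.25 GiB

That dwarfs a default 1 GiB heap before any data is written, which would explain why the failure surfaces as an OutOfMemoryError inside DiskBlockObjectWriter.open rather than during the shuffle itself.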