Details
Type: Improvement
Status: Patch Available
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.2.0, 2.9.2
Fix Version/s: None
Component/s: None
Labels: None
Description
If a map task outputs multiple files, moving them from the temporary directory to the output directory can be slow in object stores (GCS, S3, etc.).
To improve performance, FileOutputCommitter should move more than one file at a time in parallel (see the sketch below).
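For illustration only, not the actual patch: a minimal sketch of what parallelizing the commit-phase moves could look like using Hadoop's FileSystem API from Scala. The helper name parallelMove, the flat listing of the temp directory, and the fixed thread-pool size are assumptions.

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: move every file under tempDir to outputDir
// using a fixed-size thread pool instead of one rename at a time.
def parallelMove(fs: FileSystem, tempDir: Path, outputDir: Path, threads: Int): Unit = {
  val pool = Executors.newFixedThreadPool(threads)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
  try {
    val files = fs.listStatus(tempDir).map(_.getPath)
    val moves = files.map { src =>
      Future {
        // On object stores a rename is effectively copy + delete, so issuing
        // the per-file renames concurrently hides much of the per-file latency.
        fs.rename(src, new Path(outputDir, src.getName))
      }
    }
    moves.foreach(f => Await.result(f, Duration.Inf))
  } finally {
    pool.shutdown()
  }
}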
Repro:
Start spark-shell:
spark-shell --num-executors 2 --executor-memory 10G --executor-cores 4 --conf spark.dynamicAllocation.maxExecutors=2
From spark-shell:
val df = (1 to 10000).toList.toDF("value").withColumn("p", $"value" % 10).repartition(50)
df.write.partitionBy("p").mode("overwrite").format("parquet").options(Map("path" -> s"gs://some/path")).saveAsTable("parquet_partitioned_bench")
With the fix, execution time drops from 130 seconds to 50 seconds.
Attachments
Issue Links
- relates to MAPREDUCE-7267 During commitJob, enable merge paths with multi threads (Open)