Hadoop Map/Reduce / MAPREDUCE-7185

Parallelize part files move in FileOutputCommitter


    Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.0, 2.9.2
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Target Version/s:

      Description

      If a map task outputs multiple files, it can be slow to move them from the temporary directory to the output directory in object stores (GCS, S3, etc.).

      To improve performance, FileOutputCommitter should move multiple files in parallel instead of renaming them one at a time.
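      The idea behind the patch can be sketched with plain java.nio: submit each rename to a thread pool instead of looping sequentially. This is an illustrative sketch only (class name, pool size, and directory layout are made up here); the actual patch applies the same pattern inside FileOutputCommitter using Hadoop's FileSystem API.

      ```java
      import java.io.IOException;
      import java.nio.file.*;
      import java.util.*;
      import java.util.concurrent.*;

      // Sketch: move every file from a temp directory to an output directory
      // in parallel. In object stores a rename is a copy+delete, so running
      // moves concurrently hides per-file latency.
      public class ParallelMove {
          public static void moveAll(Path tempDir, Path outDir, int threads)
                  throws IOException, InterruptedException, ExecutionException {
              ExecutorService pool = Executors.newFixedThreadPool(threads);
              try (DirectoryStream<Path> files = Files.newDirectoryStream(tempDir)) {
                  List<Future<?>> moves = new ArrayList<>();
                  for (Path src : files) {
                      Path dst = outDir.resolve(src.getFileName());
                      // Each move is an independent task on the pool.
                      moves.add(pool.submit(() -> {
                          Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING);
                          return null;
                      }));
                  }
                  // Wait for completion and propagate any failure.
                  for (Future<?> f : moves) {
                      f.get();
                  }
              } finally {
                  pool.shutdown();
              }
          }

          public static void main(String[] args) throws Exception {
              Path temp = Files.createTempDirectory("attempt_");
              Path out = Files.createTempDirectory("output_");
              for (int i = 0; i < 8; i++) {
                  Files.writeString(temp.resolve("part-" + i), "data" + i);
              }
              moveAll(temp, out, 4);
              int n = 0;
              try (DirectoryStream<Path> moved = Files.newDirectoryStream(out)) {
                  for (Path p : moved) n++;
              }
              System.out.println("moved " + n + " files"); // prints "moved 8 files"
          }
      }
      ```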

      Repro:
      Start spark-shell:

      spark-shell --num-executors 2 --executor-memory 10G --executor-cores 4 --conf spark.dynamicAllocation.maxExecutors=2
      

      From spark-shell:

      val df = (1 to 10000).toList.toDF("value").withColumn("p", $"value" % 10).repartition(50)
      df.write.partitionBy("p").mode("overwrite").format("parquet").options(Map("path" -> s"gs://some/path")).saveAsTable("parquet_partitioned_bench")
      

      With the fix, execution time drops from 130 seconds to 50 seconds.

        Attachments

        1. MAPREDUCE-7185.patch (4 kB), Igor Dvorzhak

    People

    • Assignee: Igor Dvorzhak (medb)
    • Reporter: Igor Dvorzhak (medb)
    • Votes: 0
    • Watchers: 7
