Hadoop Map/Reduce
MAPREDUCE-7185

Parallelize part files move in FileOutputCommitter


Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.0, 2.9.2
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

  If a map task outputs multiple files, moving them from the temporary directory to the output directory can be slow on object stores (GCS, S3, etc.).

  To improve performance, FileOutputCommitter should move multiple files in parallel instead of one at a time.

      Repro:
      Start spark-shell:

      spark-shell --num-executors 2 --executor-memory 10G --executor-cores 4 --conf spark.dynamicAllocation.maxExecutors=2
      

      From spark-shell:

      val df = (1 to 10000).toList.toDF("value").withColumn("p", $"value" % 10).repartition(50)
      df.write.partitionBy("p").mode("overwrite").format("parquet").options(Map("path" -> s"gs://some/path")).saveAsTable("parquet_partitioned_bench")
      

  With the fix, execution time drops from 130 seconds to 50 seconds.
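The idea behind the proposed change can be sketched as follows: submit each per-file rename to a thread pool so that slow, copy-based "renames" on object stores overlap rather than run sequentially. This is an illustrative sketch only, using `java.nio.file` as a stand-in for Hadoop's `FileSystem` API; the class and method names below are hypothetical, not the actual patch.

```java
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

public class ParallelCommitSketch {
    // Move every source file into destDir using a fixed-size thread pool,
    // so that per-file latency (high on object stores) overlaps.
    static void moveAll(List<Path> sources, Path destDir, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (Path src : sources) {
                futures.add(pool.submit(() -> {
                    Files.move(src, destDir.resolve(src.getFileName()),
                            StandardCopyOption.REPLACE_EXISTING);
                    return null;
                }));
            }
            // Wait for all moves; get() rethrows the first failure,
            // mirroring a fail-fast job commit.
            for (Future<?> f : futures) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate a task-attempt temp dir with several part files.
        Path temp = Files.createTempDirectory("attempt");
        Path out = Files.createTempDirectory("output");
        List<Path> parts = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            parts.add(Files.createFile(temp.resolve("part-" + i)));
        }
        moveAll(parts, out, 4);
        int n = 0;
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(out)) {
            for (Path p : ds) n++;
        }
        System.out.println("moved " + n + " files");
    }
}
```

In the real committer the pool size would presumably be bounded by a configuration property, and a failed move would need to abort the whole commit, which the fail-fast `get()` loop above approximates.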


  People

    Assignee: Igor Dvorzhak (medb)
    Reporter: Igor Dvorzhak (medb)
