Hadoop Map/Reduce
MAPREDUCE-7465

Performance problem in FileOutputCommitter for big lists processed by a single thread


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.2.3, 3.3.2, 3.2.4, 3.3.5, 3.3.3, 3.3.4, 3.3.6
    • Fix Version/s: None
    • Labels: performance

    Description

      When committing a big Hadoop job (for example via Spark) that has many partitions,
      the class FileOutputCommitter processes thousands of directories/files to rename with a single thread. This is a performance issue, caused by many waits on FileSystem storage operations.

      I propose that above a configurable threshold (default=3, configurable via the property 'mapreduce.fileoutputcommitter.parallel.threshold'), the class FileOutputCommitter processes the list of files to rename using parallel threads, via the default JVM ExecutorService (ForkJoinPool.commonPool()).

      See Pull-Request: https://github.com/apache/hadoop/pull/6378
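
      As an illustration only, here is a minimal sketch of the idea (this is NOT the actual code of the pull-request): below the threshold keep the current single-threaded loop, above it fan the renames out on the common pool. The class and method names (ParallelRenameSketch, renameAll) are hypothetical.

{code:java}
// Minimal sketch of the proposed behaviour (NOT the code of the pull-request).
// Class and method names are illustrative only.
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelRenameSketch {

  // Property name and default value taken from the proposal above.
  static final String PARALLEL_THRESHOLD_PROP =
      "mapreduce.fileoutputcommitter.parallel.threshold";
  static final int PARALLEL_THRESHOLD_DEFAULT = 3;

  static void renameAll(Configuration conf, FileSystem fs,
      List<Path> sources, Path destDir) throws IOException {
    int threshold = conf.getInt(PARALLEL_THRESHOLD_PROP, PARALLEL_THRESHOLD_DEFAULT);
    if (sources.size() <= threshold) {
      // Small lists: keep the current single-threaded behaviour.
      for (Path src : sources) {
        fs.rename(src, new Path(destDir, src.getName()));
      }
    } else {
      // Big lists: parallel streams run on the default JVM pool
      // (ForkJoinPool.commonPool()), so the threads overlap the
      // FileSystem waits instead of serialising them.
      sources.parallelStream().forEach(src -> {
        try {
          fs.rename(src, new Path(destDir, src.getName()));
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    }
  }
}
{code}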

      Notice that sub-class instances of FileOutputCommitter are supposed to be created at runtime depending on a configurable property ([PathOutputCommitterFactory.java|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/PathOutputCommitterFactory.java]).

      But, for example with Parquet + Spark, this is buggy and cannot be changed at runtime.
      There is an ongoing JIRA and PR to fix it in Parquet + Spark: https://issues.apache.org/jira/browse/PARQUET-2416
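
      For reference, a hedged sketch of how the committer implementation is normally selected at runtime through PathOutputCommitterFactory; the property names used below are the ones defined by that class ("mapreduce.outputcommitter.factory.class" and the per-scheme "mapreduce.outputcommitter.factory.scheme.<scheme>"), while the factory class used as the value is only a placeholder.

{code:java}
// Sketch of the runtime selection mechanism mentioned above (not part of the PR).
// The value "org.example.MyCommitterFactory" is a placeholder, not a real class.
import org.apache.hadoop.conf.Configuration;

public class CommitterFactorySelectionSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Global override: use a specific PathOutputCommitterFactory for all outputs.
    conf.set("mapreduce.outputcommitter.factory.class",
        "org.example.MyCommitterFactory");

    // Per-filesystem-scheme override, e.g. only for s3a:// destinations.
    conf.set("mapreduce.outputcommitter.factory.scheme.s3a",
        "org.example.MyCommitterFactory");
  }
}
{code}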

    Attachments

    Issue Links

    Activity

    People

      Assignee: Unassigned
      Reporter: Arnaud Nauwynck
      Votes: 0
      Watchers: 3

    Dates

      Created:
      Updated: