  1. Hadoop Map/Reduce
  2. MAPREDUCE-6823

FileOutputFormat to support configurable FileOutputCommitter factory

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0-alpha2
    • Fix Version/s: None
    • Component/s: mrv2
    • Labels:
      None
    • Environment:

      Targeting S3 as the output of work

      Description

      In HADOOP-13786 I'm adding a custom subclass for FileOutputFormat, one which can talk directly to the S3A filesystem for more efficient operations, better failure modes and, most critically, as part of HADOOP-13345, atomic commit of output. The normal committer relies on directory rename() being atomic for this; for S3 we don't have that luxury.

      To support a custom committer, we need to be able to tell FileOutputFormat (and, implicitly, all subclasses which don't have their own custom committer) to use our new S3AOutputCommitter.

      I propose:

      1. FileOutputFormat takes a factory to create committers.
      2. The factory takes a URI and a TaskAttemptContext and returns a committer.
      3. The default implementation always returns a FileOutputCommitter.
      4. A configuration option allows a new factory to be named.
      5. An S3AOutputCommitterFactory returns a FileOutputCommitter or the new S3AOutputCommitter, depending upon the URI of the destination.
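
      As a rough illustration of points 1, 2, 3 and 5, the factory could look something like the sketch below. This is not the code in the patch: every type here is a simplified stand-in rather than the real Hadoop class of the same name, the interface name OutputCommitterFactory and the method name createCommitter are assumptions, and choosing the committer by URI scheme is only one plausible reading of "depending upon the URI of the destination". The configuration option from point 4 is left out.

```java
import java.net.URI;

// Illustrative sketch only: all nested types are simplified stand-ins,
// not the real Hadoop classes of the same name.
class CommitterFactorySketch {

  /** Stand-in for Hadoop's TaskAttemptContext; it carries only an attempt ID here. */
  static final class TaskAttemptContext {
    final String attemptId;
    TaskAttemptContext(String attemptId) { this.attemptId = attemptId; }
  }

  /** Stand-in committer contract; a real committer has setup/commit/abort operations. */
  interface OutputCommitter {
    String name();
  }

  static final class FileOutputCommitter implements OutputCommitter {
    public String name() { return "FileOutputCommitter"; }
  }

  static final class S3AOutputCommitter implements OutputCommitter {
    public String name() { return "S3AOutputCommitter"; }
  }

  /** Points 1 and 2: a factory that maps (destination URI, task context) to a committer. */
  interface OutputCommitterFactory {
    OutputCommitter createCommitter(URI destination, TaskAttemptContext context);
  }

  /** Point 3: the default factory always returns a FileOutputCommitter. */
  static final class DefaultCommitterFactory implements OutputCommitterFactory {
    public OutputCommitter createCommitter(URI destination, TaskAttemptContext context) {
      return new FileOutputCommitter();
    }
  }

  /** Point 5: pick the S3A committer for s3a:// destinations, otherwise fall back. */
  static final class S3AOutputCommitterFactory implements OutputCommitterFactory {
    public OutputCommitter createCommitter(URI destination, TaskAttemptContext context) {
      // Assumption: the decision is made on the URI scheme alone.
      if ("s3a".equalsIgnoreCase(destination.getScheme())) {
        return new S3AOutputCommitter();
      }
      return new FileOutputCommitter();
    }
  }

  public static void main(String[] args) {
    TaskAttemptContext context = new TaskAttemptContext("attempt_0001");
    OutputCommitterFactory factory = new S3AOutputCommitterFactory();
    System.out.println(factory.createCommitter(URI.create("s3a://bucket/output"), context).name());
    System.out.println(factory.createCommitter(URI.create("hdfs://namenode/output"), context).name());
  }
}
```

      In this sketch the S3A factory still hands back a FileOutputCommitter for every other destination, so swapping it in would not change behaviour for non-S3A output.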

      Note that MRv1 already supports configurable committers; this change covers only the V2 API.

          Activity

          stevel@apache.org Steve Loughran added a comment -

          This is the initial HADOOP-13786 001 PoC patch, to give the MR bit of the code some testing too. It adds a new factory for FileOutputFormat to use when creating committers: the default one returns FileOutputCommitter instances as normal; a special S3A one in hadoop-aws handles S3A-specific operations.

          Now, the other way to do this (given the need to keep the s3a code in the s3a module) would be to allow a notion of a new algorithm, one which relayed to an implementation of an interface. That would handle a problem not addressed here: how to support subclasses of FileOutputFormat with custom subclasses of FileOutputCommitter. It would also make it easier to add committers for other non-FS destinations, namely the other object stores.

          However, it would be a more significant change to FileOutputCommitter; I could go that way, but it'd need support before I started.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote  Subsystem    Runtime   Comment
          0     reexec       17m 16s   Docker mode activated.
          +1    @author      0m 0s     The patch does not contain any @author tags.
          +1    test4tests   0m 0s     The patch appears to include 6 new or modified test files.
          0     mvndep       1m 45s    Maven dependency ordering for branch
          +1    mvninstall   7m 39s    HADOOP-13345 passed
          +1    compile      7m 39s    HADOOP-13345 passed
          +1    checkstyle   1m 30s    HADOOP-13345 passed
          +1    mvnsite      0m 57s    HADOOP-13345 passed
          +1    mvneclipse   0m 30s    HADOOP-13345 passed
          +1    findbugs     1m 16s    HADOOP-13345 passed
          +1    javadoc      0m 36s    HADOOP-13345 passed
          0     mvndep       0m 15s    Maven dependency ordering for patch
          +1    mvninstall   0m 43s    the patch passed
          +1    compile      7m 17s    the patch passed
          -1    javac        7m 17s    root generated 2 new + 716 unchanged - 0 fixed = 718 total (was 716)
          -1    checkstyle   1m 35s    root: The patch generated 48 new + 66 unchanged - 1 fixed = 114 total (was 67)
          +1    mvnsite      1m 4s     the patch passed
          +1    mvneclipse   0m 34s    the patch passed
          -1    whitespace   0m 0s     The patch has 43 line(s) that end in whitespace. Use git apply --whitespace=fix.
          +1    xml          0m 1s     The patch has no ill-formed XML file.
          +1    findbugs     1m 50s    the patch passed
          -1    javadoc      0m 16s    hadoop-aws in the patch failed.
          +1    unit         2m 52s    hadoop-mapreduce-client-core in the patch passed.
          +1    unit         0m 38s    hadoop-aws in the patch passed.
          +1    asflicense   0m 22s    The patch does not generate ASF License warnings.
                             57m 54s



          Subsystem Report/Notes
          Docker Image yetus/hadoop:a9ad5d6
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12843640/HADOOP-13786-HADOOP-13345-001.patch
          JIRA Issue MAPREDUCE-6823
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle xml
          uname Linux 313569e92b73 3.13.0-103-generic #150-Ubuntu SMP Thu Nov 24 10:34:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision HADOOP-13345 / c7885de
          Default Java 1.8.0_111
          findbugs v3.0.0
          javac https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6848/artifact/patchprocess/diff-compile-javac-root.txt
          checkstyle https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6848/artifact/patchprocess/diff-checkstyle-root.txt
          whitespace https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6848/artifact/patchprocess/whitespace-eol.txt
          javadoc https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6848/artifact/patchprocess/patch-javadoc-hadoop-tools_hadoop-aws.txt
          Test Results https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6848/testReport/
          modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-tools/hadoop-aws U: .
          Console output https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/6848/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          stevel@apache.org Steve Loughran added a comment -

          Cancelling this PoC; redesigning. In order to support existing subclasses of FileOutputFormat (FOF), e.g. the Parquet one, we'll have to come in lower.

          I propose adding a new algorithm, "3", which really means "plug in a new committer of classname X", with another property to define that classname. We can then add an s3 committer which supports this new protocol.

          This does mean that we will need to define a committer plugin, which we can declare as unstable/limited private, and implement the s3a one against it.
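
          A minimal sketch of how an algorithm value of "3" could select a committer by classname is shown below. It is only an illustration under stated assumptions: mapreduce.fileoutputcommitter.algorithm.version is the existing algorithm-version key, but the classname property example.committer.plugin.classname, the CommitterPlugin interface, the two stand-in committers and the fallback default of "1" are all hypothetical, and java.util.Properties stands in for the Hadoop Configuration.

```java
import java.util.Properties;

// Illustrative sketch only: nothing here is real Hadoop code.
class PluggableCommitterSketch {

  /** Existing Hadoop key that selects the FileOutputCommitter algorithm version. */
  static final String ALGORITHM_KEY = "mapreduce.fileoutputcommitter.algorithm.version";

  /** Hypothetical key naming the plugin class; no such property is defined in this issue. */
  static final String COMMITTER_CLASS_KEY = "example.committer.plugin.classname";

  /** Hypothetical plugin contract that a pluggable committer would implement. */
  interface CommitterPlugin {
    String describe();
  }

  /** Stand-in for the built-in rename-based algorithms. */
  static final class BuiltInCommitter implements CommitterPlugin {
    private final String version;
    BuiltInCommitter(String version) { this.version = version; }
    public String describe() { return "built-in algorithm " + version; }
  }

  /** Stand-in for an object-store committer loaded by classname. */
  static final class ExampleS3Committer implements CommitterPlugin {
    public String describe() { return "example S3 committer plugin"; }
  }

  /** Returns the built-in committer unless the algorithm is "3", then loads the named class. */
  static CommitterPlugin load(Properties conf) throws ReflectiveOperationException {
    // The fallback of "1" is an assumption of this sketch.
    String algorithm = conf.getProperty(ALGORITHM_KEY, "1");
    if (!"3".equals(algorithm)) {
      return new BuiltInCommitter(algorithm);
    }
    String className = conf.getProperty(COMMITTER_CLASS_KEY);
    if (className == null) {
      throw new IllegalArgumentException("Algorithm 3 requires " + COMMITTER_CLASS_KEY);
    }
    // Instantiate the plugin through its no-argument constructor.
    return Class.forName(className)
        .asSubclass(CommitterPlugin.class)
        .getDeclaredConstructor()
        .newInstance();
  }

  public static void main(String[] args) throws Exception {
    Properties conf = new Properties();
    conf.setProperty(ALGORITHM_KEY, "3");
    conf.setProperty(COMMITTER_CLASS_KEY, ExampleS3Committer.class.getName());
    System.out.println(load(conf).describe());
  }
}
```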

          stevel@apache.org Steve Loughran added a comment -

          *2017/06/23 update* No, that's just messy. Best to find when those committers are used and allow them to be more generic. Example: all the Parquet one does is add an optional schema summary file. If you don't want that, any FOF committer can be used.

          Resubmitting the original patch, as it stands, from HADOOP-13786.


            People

            • Assignee:
              stevel@apache.org Steve Loughran
              Reporter:
              stevel@apache.org Steve Loughran
            • Votes:
              0
              Watchers:
              4
