    • Targeting S3 as the output of work


      In HADOOP-13786 I'm adding a custom subclass for FileOutputFormat, one which can talk directly to the S3A Filesystem for more efficient operations, better failure modes, and, most critically, as part of HADOOP-13345, atomic commit of output. The normal committer relies on directory rename() being atomic for this; for S3 we don't have that luxury.

      To support a custom committer, we need to be able to tell FileOutputFormat (and, implicitly, all subclasses which don't have their own custom committer) to use our new S3AOutputCommitter.

      I propose:

      1. FileOutputFormat takes a factory to create committers.
      2. The factory takes a URI and a TaskAttemptContext and returns a committer.
      3. The default implementation always returns a FileOutputCommitter.
      4. A configuration option allows a new factory to be named.
      5. An S3AOutputCommitterFactory returns a FileOutputCommitter or the new S3AOutputCommitter, depending upon the URI of the destination.
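
      The five points above could be sketched roughly as follows. This is only an illustration of the shape of the proposal, not the actual patch: OutputCommitter and TaskAttemptContext here are minimal stand-ins for the Hadoop classes of the same name, the OutputCommitterFactory interface and its createCommitter() signature are assumptions, and the configuration key "example.committer.factory.class" is a hypothetical placeholder.

```java
import java.net.URI;
import java.util.Collections;
import java.util.Map;

final class CommitterFactorySketch {

    /** Stand-in for Hadoop's OutputCommitter; here it only carries a name. */
    interface OutputCommitter {
        String name();
    }

    /** Stand-in for Hadoop's TaskAttemptContext; here it only carries configuration. */
    static final class TaskAttemptContext {
        private final Map<String, String> conf;

        TaskAttemptContext(Map<String, String> conf) {
            this.conf = conf;
        }

        String get(String key, String defaultValue) {
            return conf.getOrDefault(key, defaultValue);
        }
    }

    /** Point 2: the factory takes a URI and a context and returns a committer. */
    interface OutputCommitterFactory {
        OutputCommitter createCommitter(URI destination, TaskAttemptContext context);
    }

    /** Point 3: the default factory always returns a FileOutputCommitter. */
    static final class FileOutputCommitterFactory implements OutputCommitterFactory {
        @Override
        public OutputCommitter createCommitter(URI destination, TaskAttemptContext context) {
            return () -> "FileOutputCommitter";
        }
    }

    /** Point 5: choose the committer from the scheme of the destination URI. */
    static final class S3AOutputCommitterFactory implements OutputCommitterFactory {
        @Override
        public OutputCommitter createCommitter(URI destination, TaskAttemptContext context) {
            if ("s3a".equalsIgnoreCase(destination.getScheme())) {
                return () -> "S3AOutputCommitter";
            }
            // Any other filesystem falls back to the classic committer.
            return new FileOutputCommitterFactory().createCommitter(destination, context);
        }
    }

    /** Point 4: hypothetical configuration key naming the factory class. */
    static final String FACTORY_KEY = "example.committer.factory.class";

    /** Point 1: what FileOutputFormat would do when asked for a committer. */
    static OutputCommitter createCommitter(URI destination, TaskAttemptContext context) {
        String className = context.get(FACTORY_KEY, FileOutputCommitterFactory.class.getName());
        try {
            OutputCommitterFactory factory = Class.forName(className)
                    .asSubclass(OutputCommitterFactory.class)
                    .getDeclaredConstructor()
                    .newInstance();
            return factory.createCommitter(destination, context);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create committer factory " + className, e);
        }
    }

    public static void main(String[] args) {
        TaskAttemptContext defaults = new TaskAttemptContext(Collections.emptyMap());
        TaskAttemptContext s3aEnabled = new TaskAttemptContext(
                Collections.singletonMap(FACTORY_KEY, S3AOutputCommitterFactory.class.getName()));

        System.out.println(createCommitter(URI.create("s3a://bucket/out"), defaults).name());
        System.out.println(createCommitter(URI.create("s3a://bucket/out"), s3aEnabled).name());
        System.out.println(createCommitter(URI.create("hdfs://namenode/out"), s3aEnabled).name());
    }
}
```

      In this sketch, existing jobs see no change unless the factory option is set, and even with the S3A factory configured, non-S3A destinations still get the classic FileOutputCommitter.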

      Note that MRv1 already supports configurable committers; this proposal only concerns the V2 API.


