Thanks for starting this.
1. We're making changes to FileOutputFormat so that it doesn't require an instance of FileOutputCommitter, just any committer which also supplies a working directory. This lets us add new committers alongside the existing one, without playing games trying to subclass what is already a complex game.
1. All work is focused on getting the Netflix "staging" committer out the door first; the other one, which I'd started before Netflix offered theirs, does things inside S3A which could best be viewed as "dark magic". It will offer even more performance, but I'm neglecting it for now. The Netflix one is in use in production, and has all its failure/abort algorithms thought out and implemented.
1. I'm keeping the magic committer tests working, but I'm not going to consider that one ready to use until it passes lots of tests. Consider it a speedup for the future.
The Netflix committer itself has two subclasses, "directory" and "partitioned". The directory one propagates a directory tree; the partitioned one expects paths like "dateint=20161116/hour=14" and has a different conflict policy than the directory one.
The algorithm for the staging committer is:
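To make the partitioned path layout concrete, here is a minimal sketch of pulling the key=value components out of such a path. The function name `partition_of` and the `part-0000` filename are made up for illustration; this is not code from either committer.

```python
def partition_of(path):
    """Return the key=value components of a path as a dict, in path order.

    Components without '=' (for example a trailing filename) are ignored.
    """
    parts = {}
    for component in path.split("/"):
        key, sep, value = component.partition("=")
        if sep and key:
            parts[key] = value
    return parts


# Hypothetical output file under a partitioned destination.
example = partition_of("dateint=20161116/hour=14/part-0000")
```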
- tasks write to a local temp dir
- task abort: delete the files
- task commit: PUT the files as multipart uploads to their final destinations, but do not commit the PUT. Instead, the data needed for the commit is saved to the cluster FS, and committed using the normal algorithm.
- job commit: load in the output of all committed tasks, commit them. Failure to commit triggers revert: delete all files already committed, abort the rest of the list.
- job abort: abort the output of all uncommitted tasks by reading in the files and aborting those uploads.
- retry logic? Whatever is implemented by the AWS SDK (multiple attempts to POST/PUT parts) and in S3A (retries of that final commit POST).
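The steps above can be sketched against an in-memory stand-in for the object store. Everything here is an assumption for illustration: `FakeStore`, `commit_task`, `commit_job` and `abort_job` are invented names, not the Hadoop or AWS SDK APIs, and the pending-commit data is passed around as plain lists rather than being saved to the cluster FS.

```python
class FakeStore:
    """In-memory stand-in for an object store with multipart uploads."""

    def __init__(self):
        self.visible = {}   # destination key -> data, readable by everyone
        self.pending = {}   # upload id -> (destination key, data), not yet visible
        self._next_id = 0

    def upload_pending(self, key, data):
        # Upload the data but do not complete the upload: nothing is visible yet.
        self._next_id += 1
        upload_id = f"upload-{self._next_id}"
        self.pending[upload_id] = (key, data)
        return upload_id

    def complete(self, upload_id):
        # The final small request that makes the object appear at its destination.
        key, data = self.pending.pop(upload_id)  # KeyError if the upload is unknown
        self.visible[key] = data
        return key

    def abort(self, upload_id):
        self.pending.pop(upload_id, None)

    def delete(self, key):
        self.visible.pop(key, None)


def commit_task(store, local_files, dest_prefix):
    """Task commit: start a pending upload per local file, return the commit data."""
    return [store.upload_pending(dest_prefix + name, data)
            for name, data in sorted(local_files.items())]


def commit_job(store, committed_tasks):
    """Job commit: complete every pending upload; on failure revert and abort."""
    upload_ids = [uid for task in committed_tasks for uid in task]
    done = []
    for i, uid in enumerate(upload_ids):
        try:
            done.append(store.complete(uid))
        except Exception:
            for key in done:               # revert: delete files already committed
                store.delete(key)
            for rest in upload_ids[i:]:    # abort the rest of the list
                store.abort(rest)
            raise


def abort_job(store, tasks):
    """Job abort: abort every pending upload recorded by the given tasks."""
    for task in tasks:
        for uid in task:
            store.abort(uid)


store = FakeStore()
task1 = commit_task(store, {"part-0000": b"a"}, "out/")
task2 = commit_task(store, {"part-0001": b"b"}, "out/")
visible_before_job_commit = dict(store.visible)   # still empty at this point
commit_job(store, [task1, task2])
```

Task abort is not shown because in this model it only deletes local files; the store is never touched before task commit.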
Nothing is visible until job commit; there's still a window of non-atomicity there, but it's the time for N POSTs, where N = number of files; this can be parallelised easily as it uses little bandwidth per POST (unlike the uploads).
In tests, the directory committer works for the intermediate output of MR jobs saving data to part-000x directories; the partitioned one is good for Spark output, which doesn't save the intermediate data and wants to output partitioned style.
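As a rough illustration of that parallelisation, the completion requests can be handed to a thread pool. This is only a sketch: `complete_all` and the `complete_one` callback are hypothetical names, and the callback stands in for whatever issues the final commit POST for one pending upload.

```python
from concurrent.futures import ThreadPoolExecutor


def complete_all(pending_ids, complete_one, max_workers=8):
    """Run the small per-file completion calls in parallel.

    Results come back in input order; the first exception raised by
    complete_one propagates to the caller.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(complete_one, pending_ids))


# Stand-in callback: a real one would send the final commit request.
completed = complete_all(["u1", "u2", "u3"], lambda upload_id: upload_id.upper())
```

A thread pool fits here because each call is a short, network-bound request, so the threads spend their time waiting rather than competing for CPU.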