[SPARK-23977] Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.4.0
Fix Version/s: 3.0.0
Component/s: SQL
Labels: None

Description

Hadoop 3.1 adds a mechanism for job-specific and store-specific committers (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, the S3A committers (HADOOP-13786).

These committers deliver high-performance output of MR and Spark jobs to S3, and offer the key semantics which Spark depends on: no output is visible until job commit, and a failure of a task at any stage, including partway through task commit, can be handled by executing and committing another task attempt.

In contrast, the FileOutputFormat commit algorithms on S3 have issues:

Awful performance because files are copied by rename
FileOutputFormat v1: weak task commit failure recovery semantics, because the v1 expectation that "directory renames are atomic" doesn't hold on S3.
S3 metadata eventual consistency can cause rename to miss files or fail entirely (SPARK-15849)

Note also that the FileOutputFormat "v2" commit algorithm doesn't offer any of these commit semantics with respect to observability of, or recovery from, task commit failure, on any filesystem.

The S3A committers address these issues by uploading all data to the destination through multipart uploads, which are only completed in job commit.
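The deferred-completion idea can be illustrated with a toy in-memory model. This is only a sketch: `ToyObjectStore` and its methods are hypothetical names invented for illustration, not the S3A committer code or any AWS API. It shows that data written by task attempts stays invisible until the uploads are completed at job commit, and that a failed attempt can simply be aborted and retried.

```python
from dataclasses import dataclass, field


@dataclass
class ToyObjectStore:
    """Hypothetical in-memory stand-in for a store with multipart uploads."""
    visible: dict = field(default_factory=dict)   # objects readers can see
    pending: dict = field(default_factory=dict)   # upload_id -> (key, data)
    next_id: int = 0

    def start_upload(self, key: str, data: bytes) -> int:
        """Upload the data but leave the upload incomplete: nothing is visible."""
        upload_id = self.next_id
        self.next_id += 1
        self.pending[upload_id] = (key, data)
        return upload_id

    def complete_upload(self, upload_id: int) -> None:
        """Completing the upload is what makes the object appear."""
        key, data = self.pending.pop(upload_id)
        self.visible[key] = data

    def abort_upload(self, upload_id: int) -> None:
        """Discard an incomplete upload, e.g. from a failed task attempt."""
        self.pending.pop(upload_id, None)


store = ToyObjectStore()

# First attempt of a task writes its file, then fails; its upload is aborted.
failed = store.start_upload("out/part-0000", b"partial")
store.abort_upload(failed)

# Retried attempt and a second task both succeed; their uploads stay pending.
committed_tasks = [
    store.start_upload("out/part-0000", b"rows-0"),
    store.start_upload("out/part-0001", b"rows-1"),
]
visible_before_job_commit = sorted(store.visible)

# Job commit: only now are the pending uploads completed.
for upload_id in committed_tasks:
    store.complete_upload(upload_id)
visible_after_job_commit = sorted(store.visible)
```

In this model no rename ever happens: the only work left for job commit is completing uploads whose data is already in the store.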

The new PathOutputCommitter factory mechanism allows applications to work with the S3A committers and any others, by adding a plugin mechanism into the MRv2 FileOutputFormat class, where job configuration and filesystem configuration options can dynamically choose the output committer.
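A rough sketch of how such a lookup could work is given below. The function `choose_committer_factory` is a hypothetical stand-in written for illustration, not Hadoop code; the configuration key names and class names follow the Hadoop 3.1 committer documentation as best I understand it, and should be checked against the Hadoop release in use.

```python
from urllib.parse import urlparse

# Key and class names as I understand them from the Hadoop 3.1 docs (assumption).
FACTORY_CLASS_KEY = "mapreduce.outputcommitter.factory.class"
FACTORY_SCHEME_KEY = "mapreduce.outputcommitter.factory.scheme.{scheme}"
DEFAULT_FACTORY = (
    "org.apache.hadoop.mapreduce.lib.output.FileOutputCommitterFactory"
)


def choose_committer_factory(output_path: str, conf: dict) -> str:
    """Pick a factory class name: job-wide setting, then per-scheme, then default."""
    explicit = conf.get(FACTORY_CLASS_KEY, "").strip()
    if explicit:
        return explicit
    scheme = urlparse(output_path).scheme
    per_scheme = conf.get(FACTORY_SCHEME_KEY.format(scheme=scheme), "").strip()
    return per_scheme or DEFAULT_FACTORY


conf = {
    "mapreduce.outputcommitter.factory.scheme.s3a":
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
}
s3a_choice = choose_committer_factory("s3a://bucket/output", conf)
hdfs_choice = choose_committer_factory("hdfs://namenode/output", conf)
```

With this configuration only `s3a://` destinations get the S3A factory; every other filesystem falls back to the classic file output committer.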

Spark can use these committers through some binding classes:

Add a subclass of HadoopMapReduceCommitProtocol which uses the MRv2 classes and PathOutputCommitterFactory to create the committers.
Add a BindingParquetOutputCommitter, which extends ParquetOutputCommitter, to wire up Parquet output even when code requires the committer to be a subclass of ParquetOutputCommitter.
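As an illustration of how a job might select these binding classes, the sketch below builds `spark-submit` arguments from a settings map. The fully qualified class names and option keys are assumptions taken from Spark's cloud integration documentation as I recall it; this issue itself names only BindingParquetOutputCommitter, not its package or the commit protocol subclass. The helper `as_spark_submit_args` is hypothetical.

```python
# Assumed settings; verify the names against the Spark and Hadoop versions in use.
COMMITTER_SETTINGS = {
    # Which S3A committer to use (passed through to the Hadoop configuration).
    "spark.hadoop.fs.s3a.committer.name": "directory",
    # Commit protocol that creates committers via PathOutputCommitterFactory.
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    # Parquet committer that delegates to the factory-created committer.
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
}


def as_spark_submit_args(settings: dict) -> list:
    """Render settings as a flat list of `--conf key=value` arguments."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = as_spark_submit_args(COMMITTER_SETTINGS)
```

The same key/value pairs could instead go into `spark-defaults.conf`; either way the module containing the binding classes has to be on the classpath.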

This patch builds on SPARK-23807 for setting up the dependencies.

Issue Links

is related to

HADOOP-15421 Stabilise/formalise the JSON _SUCCESS format used in the S3A committers

Resolved

links to

[Github] Pull Request #21066 (steveloughran)

GitHub Pull Request #24970

People

Assignee: Steve Loughran

Reporter: Steve Loughran

Votes: 2

Watchers: 17

Dates

Created: 13/Apr/18 12:39

Updated: 16/Feb/22 15:28

Resolved: 15/Aug/19 17:16