[SPARK-41551] Improve/complete PathOutputCommitProtocol support for dynamic partitioning - ASF JIRA

Details

Type: Improvement
Status: In Progress
Priority: Minor
Resolution: Unresolved
Affects Version/s: 3.3.1
Fix Version/s: None
Component/s: SQL
Labels: None

Description

that is incomplete as it doesn't record the partitions
As long as the job doesn't call `newTaskTempFileAbsPath()`, and slow renames are acceptable, both s3a committers are actually OK to use.

Only the `newTaskTempFileAbsPath()` operation is unsupported in the s3a committers; the post-job directory rename is O(data), but a file-by-file rename is correct for a non-atomic job commit.

Cut PathOutputCommitProtocol.newTaskTempFile; to update super partitionPaths (needs a setter). The superclass can't just test whether the committer is an instance of PathOutputCommitter, as spark-core needs to compile with older Hadoop versions.
Downgrade the failure in setup to a log message (info?).
Retain the failure in the newTaskTempFileAbsPath call.
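
The three proposed changes can be pictured with a small toy model. This is a sketch in Python, not Spark or Hadoop code: the class name `PathOutputProtocolSketch`, its constructor arguments, and the log message are all hypothetical, and recording the partition directory in `new_task_temp_file` is only one reading of "update super partitionPaths".

```python
import logging

log = logging.getLogger("commit_protocol_sketch")


class PathOutputProtocolSketch:
    """Toy model of the proposed behaviour; not Spark's actual class."""

    def __init__(self, work_path, dynamic_partition_overwrite):
        self.work_path = work_path
        self.dynamic_partition_overwrite = dynamic_partition_overwrite
        self.partition_paths = set()  # partitions this job has written to

    def setup(self):
        # Proposed: log instead of failing when dynamic partitioning is on.
        if self.dynamic_partition_overwrite:
            log.info("dynamic partition overwrite requested; renames may be slow")

    def new_task_temp_file(self, partition_dir, filename):
        # Assumed reading: record the partition so job commit knows about it.
        if self.dynamic_partition_overwrite and partition_dir is not None:
            self.partition_paths.add(partition_dir)
        parts = [self.work_path]
        if partition_dir is not None:
            parts.append(partition_dir)
        parts.append(filename)
        return "/".join(parts)

    def new_task_temp_file_abs_path(self, absolute_dir, filename):
        # Proposed: this call keeps failing, as it is the unsupported one.
        raise NotImplementedError(
            "absolute-path output is not supported by this committer")
```

In this model a job that only ever calls `new_task_temp_file` runs to completion, and the failure is deferred to the one operation that cannot be supported.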

Testing: yes

Issue Links

requires

SPARK-40034 PathOutputCommitters to work with dynamic partition overwrite

Resolved

links to

[Github] Pull Request #39185 (steveloughran)

[Github] Pull Request #40221 (steveloughran)

Activity

Steve Loughran added a comment - 20/Dec/22 14:05

So there's an interesting little "feature" of HadoopMapReduceCommitProtocol.newTaskTempFile() which is:

If you call newTaskTempFile(tac, None, ext) when dynamicPartitionOverwrite is true, and spark-core was compiled with -Xelide-below set to a level which excludes assert(), then in job commit the entire directory tree is destroyed: both the output and (implicitly) the .spark-staging dir. This makes for a fairly messy job failure.

The good news: spark builds don't do that, and since spark-core/spark-sql itself doesn't seem to invoke newTaskTempFile(_, None, _) in dynamic partition mode, it's not a serious risk. Is it worth hardening?
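
One way to picture the hardening is to replace the assertion with an explicit check that cannot be compiled away. The following is a toy model in Python, not the Spark source: `CommitProtocolModel`, its fields, and the error message are hypothetical, and only the `.spark-staging` naming is borrowed from the comment above. Python's own `assert` is likewise stripped under `python -O`, so the sketch raises an exception directly.

```python
class CommitProtocolModel:
    """Toy model of a commit protocol with dynamic partition overwrite."""

    def __init__(self, path, job_id, dynamic_partition_overwrite):
        self.path = path
        self.staging_dir = f"{path}/.spark-staging-{job_id}"
        self.dynamic_partition_overwrite = dynamic_partition_overwrite
        self.partition_paths = set()

    def new_task_temp_file(self, partition_dir, filename):
        if self.dynamic_partition_overwrite:
            # Explicit check: still enforced when assertions are disabled.
            if partition_dir is None:
                raise ValueError(
                    "a partition directory is required when "
                    "dynamic partition overwrite is enabled")
            self.partition_paths.add(partition_dir)
        parts = [self.staging_dir]
        if partition_dir is not None:
            parts.append(partition_dir)
        parts.append(filename)
        return "/".join(parts)
```

With this shape the bad call fails in the task, before job commit has any chance to delete anything.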

Apache Spark added a comment - 22/Dec/22 18:47

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/39185

Apache Spark added a comment - 22/Dec/22 18:48

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/39185

Steve Loughran added a comment - 22/Dec/22 18:51

PR up. PathOutputCommitProtocol stops anyone trying to use a parent dir as the absolute path in dynamic update mode, as HadoopMapReduceCommitProtocol.commitJob() will blindly delete the entire dir tree at that point. I'm not convinced that feature is particularly safe.
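
The exact check in the PR is not reproduced in this ticket. As a rough illustration only, a guard of this kind could reject any absolute target directory that is the job's output directory or one of its ancestors. The function name `check_absolute_dir` and its parameters are hypothetical, and this Python sketch is one possible reading of "parent dir", not the PR's code.

```python
from pathlib import PurePosixPath


def check_absolute_dir(output_path, absolute_dir, dynamic_partition_overwrite):
    """Reject absolute dirs that job commit's recursive delete would wipe."""
    out = PurePosixPath(output_path)
    target = PurePosixPath(absolute_dir)
    if dynamic_partition_overwrite and (target == out or target in out.parents):
        raise ValueError(
            f"{target} is the output directory or one of its parents; "
            "refusing to use it as an absolute path")
    return target
```

For example, with an output path of `/data/table`, the targets `/data` and `/data/table` are rejected in dynamic mode while `/data/other/p=1` is accepted.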

Apache Spark added a comment - 28/Feb/23 16:15

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/40221

Apache Spark added a comment - 28/Feb/23 16:16

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/40221

People

Assignee: Unassigned

Reporter: Steve Loughran

Votes: 0

Watchers: 4

Dates

Created: 16/Dec/22 17:45

Updated: 28/Feb/23 16:16