Details
- Type: Improvement
- Status: In Progress
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: 3.3.1
- Fix Version/s: None
- Component/s: None
Description
Followup to SPARK-40034, as:
- that fix is incomplete: it doesn't record the partitions
- as long as the job doesn't call `newTaskTempFileAbsPath()`, and slow renames are acceptable, both s3a committers are actually OK to use.
It's only the `newTaskTempFileAbsPath()` operation which is unsupported in the s3a committers; the post-job directory rename is O(data), but the file-by-file rename is correct for a non-atomic job commit.
Proposed:
- Cut `PathOutputCommitProtocol.newTaskTempFile()` to update the superclass's partitionPaths (needs a setter); see the sketch after this list. The superclass can't just say `if (committer instanceof PathOutputCommitter)`, as spark-core needs to compile against older hadoop versions.
- downgrade the failure in setup to a log message (info?)
- retain the failure in the `newTaskTempFileAbsPath()` call.
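A minimal sketch of the first and last items, using stand-in classes and simplified signatures rather than the real Spark/Hadoop types; `addPartitionPath` is a hypothetical protected setter standing in for whatever accessor the real (private) partitionPaths field would need:

```scala
import scala.collection.mutable

// Stand-in for HadoopMapReduceCommitProtocol, so the sketch is self-contained.
class BaseCommitProtocol(jobId: String, dest: String,
    val dynamicPartitionOverwrite: Boolean) {

  // Private in the real class, hence "needs a setter" above.
  private val partitionPaths = mutable.Set[String]()

  // Hypothetical setter the subclass would call.
  protected def addPartitionPath(dir: String): Unit = partitionPaths += dir

  def newTaskTempFile(dir: Option[String], ext: String): String =
    s"$dest/.spark-staging-$jobId/${dir.getOrElse("")}/part-00000$ext"
}

// Stand-in for PathOutputCommitProtocol.
class PathOutputProtocolSketch(jobId: String, dest: String, dynamic: Boolean)
  extends BaseCommitProtocol(jobId, dest, dynamic) {

  override def newTaskTempFile(dir: Option[String], ext: String): String = {
    if (dynamicPartitionOverwrite) {
      // The piece SPARK-40034 missed: record the partition so that job
      // commit knows which destination directories to delete and replace.
      dir.foreach(addPartitionPath)
    }
    super.newTaskTempFile(dir, ext)
  }

  // Retained failure: absolute-path files still can't be committed
  // correctly by the s3a committers, so keep failing fast here.
  def newTaskTempFileAbsPath(absoluteDir: String, ext: String): String =
    throw new UnsupportedOperationException(
      s"newTaskTempFileAbsPath($absoluteDir) not supported")
}
```

Pushing the recording up through a setter keeps spark-core free of any compile-time reference to PathOutputCommitter, which is what the "older hadoop versions" constraint above requires.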
Testing: yes
Issue Links
- requires: SPARK-40034 "PathOutputCommitters to work with dynamic partition overwrite" (Resolved)
- links to
So there's an interesting little "feature" of `HadoopMapReduceCommitProtocol.newTaskTempFile()`, which is:
If you call `newTaskTempFile(tac, None, ext)` when dynamicPartitionOverwrite is true, and spark-core was compiled with `-Xelide-below` set at a level which elides `assert()`, then in job commit the entire directory tree is destroyed: both the output and (implicitly) the .spark-staging dir. That makes for a fairly messy job failure.
The good news: spark builds don't do that, and since spark-core/spark-sql itself doesn't seem to invoke `newTaskTempFile(_, None, _)` in dynamic partition mode, it's not a serious risk. Is it worth hardening?
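One possible hardening, illustrative only and not a patch from this ticket: Scala's `require()` is not `@elidable`, so moving the check from `assert()` to `require()` means it survives `-Xelide-below`, and the bad call fails in the task, long before job commit can delete anything. Signature simplified from the real method:

```scala
// assert() and assume() are @elidable(ASSERTION) and can be compiled out
// with -Xelide-below; require() is not elidable, so this check always runs.
// Simplified signature; the real method also takes a TaskAttemptContext.
def newTaskTempFile(dir: Option[String],
    dynamicPartitionOverwrite: Boolean, ext: String): String = {
  if (dynamicPartitionOverwrite) {
    require(dir.isDefined,
      "The dataset must be partitioned when dynamicPartitionOverwrite is true")
  }
  // ...build and return the staging-directory path as before...
  s".spark-staging/${dir.getOrElse("")}/part-00000$ext"
}
```

The trade-off is that `require()` throws IllegalArgumentException rather than AssertionError, but for a check guarding against wholesale deletion of the destination tree, an exception type change seems a small price.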