Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-41551

Improve/complete PathOutputCommitProtocol support for dynamic partitioning

Details

    • Improvement
    • Status: In Progress
    • Minor
    • Resolution: Unresolved
    • 3.3.1
    • None
    • SQL
    • None

    Description

      Followup to SPARK-40034 as

      • that is incomplete as it doesn't record the partitions
      • as long at the job doesn't call `newTaskTempFileAbsPath()`, and slow renames are ok, both s3a committers are actually OK to use.

      It's only that newTaskTempFileAbsPath operation which is unsupported in s3a committers; the post-job dir rename is O(data) but file by file rename is correct for a non-atomic job commit.

      1. Cut PathOutputCommitProtocol.newTaskTempFile; to update super partitionPaths (needs a setter). The superclass can't just say if (committer instance of PathOutputCommitter as spark-core needs to compile with older hadoop versions)
      2. downgrade failure in setup to log (info?)
      3. retain failure in the newTaskTempFileAbsPath call.

      Testing: yes

      Attachments

        Issue Links

          Activity

            stevel@apache.org Steve Loughran added a comment -

            So there's an interesting little "feature" of HadoopMapReduceCommitProtocol.newTaskTempFile() which is:

            If you call newTaskTempFile(tac, None, ext) when dynamicPartitionOverwrite is true, and spark-core was compiled with assertions -Xelide-below at a level which excludes assert(), then in job commit the entire directory tree is destroyed -both output and (implicitly) the .spark-staging dir. makes for a fairly messy job failure.

            The good news: spark builds don't do that, and since spark-core/spark-sql itself doesn't seem to invoke newTaskTempFile(_, None, _) in dynamic partition mode, it's not a serious risk. Is it worth hardening?

            stevel@apache.org Steve Loughran added a comment - So there's an interesting little "feature" of HadoopMapReduceCommitProtocol.newTaskTempFile() which is: If you call newTaskTempFile(tac, None, ext) when dynamicPartitionOverwrite is true, and spark-core was compiled with assertions -Xelide-below at a level which excludes assert(), then in job commit the entire directory tree is destroyed -both output and (implicitly) the .spark-staging dir. makes for a fairly messy job failure. The good news: spark builds don't do that, and since spark-core/spark-sql itself doesn't seem to invoke newTaskTempFile(_, None, _) in dynamic partition mode, it's not a serious risk. Is it worth hardening?
            apachespark Apache Spark added a comment -

            User 'steveloughran' has created a pull request for this issue:
            https://github.com/apache/spark/pull/39185

            apachespark Apache Spark added a comment - User 'steveloughran' has created a pull request for this issue: https://github.com/apache/spark/pull/39185
            apachespark Apache Spark added a comment -

            User 'steveloughran' has created a pull request for this issue:
            https://github.com/apache/spark/pull/39185

            apachespark Apache Spark added a comment - User 'steveloughran' has created a pull request for this issue: https://github.com/apache/spark/pull/39185
            stevel@apache.org Steve Loughran added a comment -

            PR up. PathOutputCommitProtocol stops anyone trying to use a parent dir as the absolute path in dynamic update mode, as HadoopMapReduceCommitProtocol.commitJob() will blindly delete the entire dir tree at that point. I'm not convinced that feature is particularly safe.

            stevel@apache.org Steve Loughran added a comment - PR up. PathOutputCommitProtocol stops anyone trying to use a parent dir as the absolute path in dynamic update mode, as HadoopMapReduceCommitProtocol.commitJob() will blindly delete the entire dir tree at that point. I'm not convinced that feature is particularly safe.
            apachespark Apache Spark added a comment -

            User 'steveloughran' has created a pull request for this issue:
            https://github.com/apache/spark/pull/40221

            apachespark Apache Spark added a comment - User 'steveloughran' has created a pull request for this issue: https://github.com/apache/spark/pull/40221
            apachespark Apache Spark added a comment -

            User 'steveloughran' has created a pull request for this issue:
            https://github.com/apache/spark/pull/40221

            apachespark Apache Spark added a comment - User 'steveloughran' has created a pull request for this issue: https://github.com/apache/spark/pull/40221

            People

              Unassigned Unassigned
              stevel@apache.org Steve Loughran
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated: