Spark / SPARK-41551

Improve/complete PathOutputCommitProtocol support for dynamic partitioning


Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.3.1
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Follow-up to SPARK-40034, because:

      • that change is incomplete: it doesn't record the partitions
      • as long as the job doesn't call `newTaskTempFileAbsPath()`, and slow renames are acceptable, both s3a committers are actually OK to use.

      Only the `newTaskTempFileAbsPath` operation is unsupported in the s3a committers. The post-job directory rename is O(data), but a file-by-file rename is correct for a non-atomic job commit.

      1. Cut `PathOutputCommitProtocol.newTaskTempFile`; update the superclass's `partitionPaths` (needs a setter). The superclass can't just check whether the committer is an instance of `PathOutputCommitter`, because spark-core needs to compile with older Hadoop versions.
      2. Downgrade the failure in setup to a log message (info level?).
      3. Retain the failure in the `newTaskTempFileAbsPath` call.
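
A minimal sketch of how the three steps might behave together, written as a toy Python model rather than Spark code. The class `CommitProtocolSketch`, the flag `committer_supports_dynamic`, and the staging-path layout are all hypothetical names chosen for illustration; the real change would live in `PathOutputCommitProtocol` and its superclass.

```python
import logging
import posixpath
from typing import Optional, Set

log = logging.getLogger("commit_protocol_sketch")


class CommitProtocolSketch:
    """Toy model of the proposed behaviour; not Spark's actual class."""

    def __init__(self, staging_dir: str, dynamic_partition_overwrite: bool,
                 committer_supports_dynamic: bool) -> None:
        self.staging_dir = staging_dir
        self.dynamic_partition_overwrite = dynamic_partition_overwrite
        # Hypothetical flag: whether the committer declares full support.
        self.committer_supports_dynamic = committer_supports_dynamic
        # Stands in for the superclass's partitionPaths.
        self.partition_paths: Set[str] = set()

    def setup_job(self) -> None:
        # Step 2: an unsupported committer is logged, not rejected.
        if self.dynamic_partition_overwrite and not self.committer_supports_dynamic:
            log.info("Committer does not declare dynamic partitioning support; "
                     "absolute-path task files will be rejected")

    def new_task_temp_file(self, partition_dir: Optional[str], filename: str) -> str:
        # Step 1: record the partition so job commit knows what was written.
        if self.dynamic_partition_overwrite and partition_dir is not None:
            self.partition_paths.add(partition_dir)
        parts = [self.staging_dir]
        if partition_dir is not None:
            parts.append(partition_dir)
        parts.append(filename)
        return posixpath.join(*parts)

    def new_task_temp_file_abs_path(self, absolute_dir: str, filename: str) -> str:
        # Step 3: this call still fails when the committer cannot handle it.
        if not self.committer_supports_dynamic:
            raise NotImplementedError(
                "absolute-path output is not supported by this committer")
        return posixpath.join(absolute_dir, filename)
```

In this model, a job that only calls `new_task_temp_file` runs to completion with any committer, and only a call to `new_task_temp_file_abs_path` raises.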

      Testing: yes


            People

              Assignee: Unassigned
              Reporter: Steve Loughran (stevel@apache.org)
              Votes: 0
              Watchers: 4
