SPARK-41551

Improve/complete PathOutputCommitProtocol support for dynamic partitioning


Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.3.1
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Follow-up to SPARK-40034, because:

      • that work is incomplete: it doesn't record the partitions
      • as long as the job doesn't call `newTaskTempFileAbsPath()`, and slow renames are acceptable, both s3a committers are actually OK to use.

      Only the `newTaskTempFileAbsPath()` operation is unsupported in the s3a committers. The post-job directory rename is O(data), but a file-by-file rename is correct for a non-atomic job commit.

      1. Cut `PathOutputCommitProtocol.newTaskTempFile` so that the superclass's `partitionPaths` gets updated (needs a setter). The superclass can't simply check whether the committer is an instance of `PathOutputCommitter`, because spark-core needs to compile with older Hadoop versions.
      2. Downgrade the failure in setup to a log message (info level?).
      3. Retain the failure in the `newTaskTempFileAbsPath()` call.
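
The three steps above can be pictured with a small toy model. This is only a sketch of one reading of the plan, not Spark or Hadoop code: the classes `BaseProtocolSketch` and `PathOutputSketch`, the `addPartitionPath` setter, the log text, and the choice to record partitions only when dynamic partition overwrite is enabled are all assumptions made for illustration.

```java
import java.util.Collections;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;
import java.util.logging.Logger;

// Toy model only: these classes are hypothetical and are not Spark or Hadoop code.
final class CommitProtocolSketch {

    /** Stand-in for the superclass that owns the partition bookkeeping. */
    static class BaseProtocolSketch {
        private final Set<String> partitionPaths = new TreeSet<>();

        /** The "setter" from step 1: lets a subclass record a partition. */
        protected void addPartitionPath(String partitionDir) {
            partitionPaths.add(partitionDir);
        }

        Set<String> recordedPartitions() {
            return Collections.unmodifiableSet(partitionPaths);
        }
    }

    /** Stand-in for the protocol that wraps a path output committer. */
    static final class PathOutputSketch extends BaseProtocolSketch {
        private static final Logger LOG = Logger.getLogger("CommitProtocolSketch");
        private final boolean dynamicPartitionOverwrite;
        private final String taskAttemptDir;

        PathOutputSketch(boolean dynamicPartitionOverwrite, String taskAttemptDir) {
            this.dynamicPartitionOverwrite = dynamicPartitionOverwrite;
            this.taskAttemptDir = taskAttemptDir;
        }

        /** Step 2: setup logs at info level instead of failing. */
        void setupJob() {
            if (dynamicPartitionOverwrite) {
                LOG.info("Dynamic partition overwrite requested; "
                        + "absolute-path output files remain unsupported");
            }
        }

        /** Step 1: record the partition, then return a path under the task attempt dir. */
        String newTaskTempFile(Optional<String> partitionDir, String fileName) {
            if (dynamicPartitionOverwrite) {
                partitionDir.ifPresent(this::addPartitionPath);
            }
            return partitionDir
                    .map(dir -> taskAttemptDir + "/" + dir + "/" + fileName)
                    .orElse(taskAttemptDir + "/" + fileName);
        }

        /** Step 3: the absolute-path variant still fails. */
        String newTaskTempFileAbsPath(String absoluteDir, String fileName) {
            throw new UnsupportedOperationException(
                    "Absolute output locations are not supported: " + absoluteDir);
        }
    }

    public static void main(String[] args) {
        PathOutputSketch protocol = new PathOutputSketch(true, "/tmp/attempt_0");
        protocol.setupJob();
        System.out.println(protocol.newTaskTempFile(Optional.of("year=2022"), "part-0000"));
        System.out.println(protocol.recordedPartitions());
    }
}
```

In this model a job that only ever calls `newTaskTempFile` proceeds and leaves its partitions recorded for the post-job rename, while any call to `newTaskTempFileAbsPath` still fails fast.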

      Testing: yes


          People

            Assignee: Unassigned
            Reporter: Steve Loughran (stevel@apache.org)
