Spark / SPARK-38445

Are hadoop committers used in Structured Streaming?


Details

    • Type: Question
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.1
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

      At the company I work at, we use Spark Structured Streaming to sink messages from Kafka to HDFS. We're in the late stages of migrating this component to sink messages to AWS S3 instead, and in connection with that we ran into a couple of questions regarding Hadoop committers.

      I've come to understand that the default "file" committer (documented here) is unsafe to use with S3, which is why this page in the Spark documentation recommends using the "directory" (i.e. staging) committer, and for later versions of Hadoop it also recommends the "magic" committer.
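      For context, switching a batch job to one of those committers is usually done through Spark configuration. The sketch below assumes Hadoop 3.1+ with the hadoop-aws and spark-hadoop-cloud modules on the classpath; the property names come from the Spark cloud-integration and Hadoop S3A committer documentation:

      ```properties
      # Sketch: enable the S3A "directory" (staging) committer for batch writes.
      spark.hadoop.fs.s3a.committer.name                directory
      spark.sql.sources.commitProtocolClass             org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
      spark.sql.parquet.output.committer.class          org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter

      # For the "magic" committer instead (later Hadoop versions), use:
      # spark.hadoop.fs.s3a.committer.name              magic
      # spark.hadoop.fs.s3a.committer.magic.enabled     true
      ```

      Whether any of this applies to a Structured Streaming file sink is exactly what's unclear to me.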

      However, it's not clear whether Spark Structured Streaming even uses committers. There's no "_SUCCESS" file in the destination (unlike with normal Spark jobs), and documentation on which committers are used in streaming is non-existent.

      Can anyone please shed some light on this?

          People

            Assignee: Unassigned
            Reporter: beregon87 (Martin Andersson)

