[SPARK-33230] FileOutputWriter jobs have duplicate JobIDs if launched in same second


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.7, 3.0.1
    • Fix Version/s: 2.4.8, 3.0.2, 3.1.0
    • Component/s: SQL
    • Labels: None

    Description

      The Hadoop S3A staging committer has problems when more than one Spark SQL query is launched at the same time, because it uses the JobID for the path on the cluster filesystem through which commit information is passed from the tasks to the job committer.

      If two queries are launched in the same second, their JobIDs collide: the output of job 1 ends up including all of job 2's files written so far, and job 2 then fails with a FileNotFoundException.
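
      To illustrate the collision, here is a minimal Scala sketch of how the Hadoop JobID for a write job is derived from a second-resolution timestamp plus a stage-local counter; the helper names are approximations of Spark's internal utilities, not the exact implementation.

      import java.text.SimpleDateFormat
      import java.util.{Date, Locale}
      import org.apache.hadoop.mapreduce.JobID

      // Second-resolution job tracker ID: two jobs started within the same
      // wall-clock second produce the same string.
      def createJobTrackerID(time: Date): String =
        new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(time)

      def createJobID(time: Date, id: Int): JobID =
        new JobID(createJobTrackerID(time), id)

      // Two queries launched 900 ms apart, each starting from stage id 0,
      // get identical JobIDs, so the staging committer stores both jobs'
      // pending-commit data under the same cluster-filesystem path.
      val t1 = new Date(1603400000000L)
      val t2 = new Date(1603400000000L + 900)
      assert(createJobID(t1, 0) == createJobID(t2, 0))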

      Proposed:
      have the job configuration set "spark.sql.sources.writeJobUUID" to the value of WriteJobDescription.uuid (see the sketch below)

      That was the property name which used to serve this purpose; any committer already written against this property will pick up the value without needing any changes.
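
      As a hypothetical sketch of that proposal (the method and variable names below are illustrative, not Spark's actual code), the write path would copy the per-query UUID into the Hadoop configuration under the old property name, and a committer keyed on that property would then build a unique staging path even when two jobs share a JobID timestamp:

      import org.apache.hadoop.conf.Configuration

      // Spark side: propagate the unique write-job UUID into the job conf.
      // `jobUuid` stands in for FileFormatWriter's WriteJobDescription.uuid.
      def propagateWriteJobUUID(hadoopConf: Configuration, jobUuid: String): Unit =
        hadoopConf.set("spark.sql.sources.writeJobUUID", jobUuid)

      // Committer side: prefer the UUID, fall back to the JobID string,
      // when choosing the cluster-filesystem directory for pending commits.
      def stagingSubdir(hadoopConf: Configuration, jobIdString: String): String = {
        val unique = Option(hadoopConf.get("spark.sql.sources.writeJobUUID"))
          .getOrElse(jobIdString)
        s"staging-uploads/$unique"
      }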

      Attachments

        Issue Links

        Activity


          People

            Assignee:
            Steve Loughran (stevel@apache.org)
            Reporter:
            Steve Loughran (stevel@apache.org)
            Votes:
            0
            Watchers:
            3

            Dates

              Created:
              Updated:
              Resolved:
