Spark / SPARK-43152

User-defined output metadata path (_spark_metadata)


Details

    Description

      Currently, the path of the output checkpoint metadata is hardcoded: the metadata is saved in a _spark_metadata folder inside the output path. This constrains the path structure, and the constraint could easily be relaxed by making the output metadata path configurable. A configurable path would help with issues such as changing the output directory of a Spark streaming job, two jobs writing to the same output path, and partition discovery. It would also allow metadata to be separated from data in the path structure.

      The main target of the change is the getMetadataLogPath method in FileStreamSink. It has access to sqlConf, so this method can override the default _spark_metadata path if one is defined in the config. Introducing a parametrised metadata path also requires reconsidering the meaning of the hasMetadata method in FileStreamSink.
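
      The override described above could look roughly like the following minimal sketch. It is not Spark's implementation: it uses java.nio paths and a plain Map in place of Hadoop's Path and SQLConf so that it runs on its own, and the config key spark.sql.streaming.fileSink.metadataPath is a hypothetical name invented for illustration.

```scala
import java.nio.file.{Path, Paths}

object MetadataPathSketch {
  // Directory name currently hardcoded under the output path.
  val DefaultMetadataDir = "_spark_metadata"

  // Hypothetical config key (assumption): no such key exists in Spark today.
  val MetadataPathKey = "spark.sql.streaming.fileSink.metadataPath"

  // Return the user-defined metadata path when one is configured,
  // otherwise fall back to <outputPath>/_spark_metadata.
  def getMetadataLogPath(outputPath: Path, conf: Map[String, String]): Path =
    conf.get(MetadataPathKey).map(_.trim).filter(_.nonEmpty) match {
      case Some(custom) => Paths.get(custom)
      case None         => outputPath.resolve(DefaultMetadataDir)
    }
}
```

      With such an override, a check like hasMetadata could no longer rely only on looking for _spark_metadata under the output path. It would presumably need to consult the same setting, which is why its meaning has to be reconsidered.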


          People

            Assignee: Unassigned
            Reporter: Wojciech Indyk (woj_in)
