Spark / SPARK-43152

User-defined output metadata path (_spark_metadata)

Details

    Description

      Currently the path of the output checkpoint metadata is hardcoded: the metadata is saved in the _spark_metadata folder under the output path. This constrains the path structure, and the constraint could easily be relaxed by making the output metadata path configurable. That would help with issues such as changing the output directory of a Spark streaming job, two jobs writing to the same output path, and partition discovery. It would also allow metadata to be separated from data in the path structure.

      The main target of the change is the getMetadataLogPath method in FileStreamSink. It has access to sqlConf, so the method can override the default _spark_metadata path if one is defined in the config. Introducing a parametrised metadata path also requires reconsidering the meaning of the hasMetadata method in FileStreamSink.
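
      A minimal sketch of the proposed resolution logic, written as stand-alone Scala so it runs without Spark. It is not the actual FileStreamSink code: the config key spark.sql.streaming.fileSink.metadataPath is hypothetical (no such option exists today), a plain Map stands in for sqlConf, and the `exists` function stands in for a file system lookup.

```scala
// Stand-alone illustration; it does not depend on Spark or Hadoop.
object MetadataPathSketch {
  // Default subdirectory name, as stated in the description.
  val DefaultMetadataDir = "_spark_metadata"

  // HYPOTHETICAL config key, invented for this sketch; not an existing Spark option.
  val MetadataPathKey = "spark.sql.streaming.fileSink.metadataPath"

  // Returns the user-defined metadata path if one is configured,
  // otherwise falls back to <outputPath>/_spark_metadata.
  // `conf` is a plain Map standing in for sqlConf (assumption).
  def getMetadataLogPath(outputPath: String, conf: Map[String, String]): String =
    conf.get(MetadataPathKey).map(_.trim).filter(_.nonEmpty) match {
      case Some(custom) => custom
      case None         => outputPath.stripSuffix("/") + "/" + DefaultMetadataDir
    }

  // With a configurable path, the metadata check has to look at the
  // resolved location rather than only at <outputPath>/_spark_metadata.
  // `exists` stands in for a file system existence check (assumption).
  def hasMetadata(
      outputPath: String,
      conf: Map[String, String],
      exists: String => Boolean): Boolean =
    exists(getMetadataLogPath(outputPath, conf))
}
```

      With an empty config this resolves "/data/out" to "/data/out/_spark_metadata", so existing jobs would keep their current layout. One open question the sketch makes visible: a reader of the output directory that does not know the configured path can no longer discover the metadata from the output path alone, which is why the meaning of hasMetadata needs reconsidering.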


            People

              Assignee: Unassigned
              Reporter: Wojciech Indyk (woj_in)
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated: