[SPARK-43152] User-defined output metadata path (_spark_metadata) - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.4.0
Fix Version/s: None
Component/s: Structured Streaming
Labels:
- pull-request-available

Language:
- scala

Description

Currently, the path of the output checkpoint metadata is hardcoded: the metadata is saved under the output path, in the _spark_metadata folder. This constrains the path structure, and the constraint could easily be relaxed by making the output metadata path configurable. That would help with issues such as changing the output directory of a Spark streaming job, two jobs writing to the same output path, and partition discovery. It would also allow metadata to be separated from data in the path structure.

The main target of the change is the getMetadataLogPath method in FileStreamSink. It has access to sqlConf, so it can override the default _spark_metadata path when a custom path is defined in the config. Introducing a parameterised metadata path also requires reconsidering the meaning of the hasMetadata method in FileStreamSink.
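
A minimal sketch of the proposed lookup, not the actual Spark code: it uses a plain Map in place of sqlConf and java.nio paths in place of Hadoop paths so that it runs standalone. The config key spark.sql.streaming.fileSink.metadataPath and the helper resolveMetadataLogPath are hypothetical names chosen for illustration; the issue does not specify them.

```scala
import java.nio.file.{Path, Paths}

object MetadataPathSketch {
  // Folder name that the issue describes as currently hardcoded.
  val DefaultMetadataDir = "_spark_metadata"

  // Hypothetical config key (assumption, not an existing Spark option).
  val MetadataPathKey = "spark.sql.streaming.fileSink.metadataPath"

  // Returns the custom metadata path when the key is set to a non-blank
  // value; otherwise falls back to <outputPath>/_spark_metadata.
  def resolveMetadataLogPath(outputPath: Path, conf: Map[String, String]): Path =
    conf.get(MetadataPathKey).map(_.trim).filter(_.nonEmpty) match {
      case Some(custom) => Paths.get(custom)
      case None         => outputPath.resolve(DefaultMetadataDir)
    }

  def main(args: Array[String]): Unit = {
    val out = Paths.get("/data/out")
    // No override: metadata stays inside the output directory.
    println(resolveMetadataLogPath(out, Map.empty))
    // Override: metadata lives outside the output directory.
    println(resolveMetadataLogPath(out, Map(MetadataPathKey -> "/meta/job1")))
  }
}
```

With an override in place, the metadata log may no longer sit under the output path, so a check like hasMetadata could not rely on looking for _spark_metadata inside the output directory alone. This is the part of the design the description flags as needing reconsideration.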