[SPARK-35565] Add a config for ignoring metadata directory of file stream sink - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.2.0
Fix Version/s: 3.2.0
Component/s: Structured Streaming
Labels: None

Description

FileStreamSink produces a metadata directory which logs output files per micro-batch. When we read from the output path, Spark will look at the metadata and ignore other files not in the log.

Normally this works well. But for some use cases, we may need to ignore the metadata when reading the output path. For example, when we change the streaming query and must run it with a new checkpoint directory, we cannot use the previous metadata. If we create new metadata as well, then when we later read the output path, Spark reads only the files listed in the new metadata. The files written before we switched to the new checkpoint and metadata are ignored by Spark.
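The scenario above can be illustrated with a small toy model. This is plain Python, not Spark code, and every name in it (`SinkMetadataLog`, `list_readable_files`, the file names) is hypothetical. It only sketches the idea that a reader which trusts the newest metadata log no longer sees files committed under an earlier log.

```python
# Toy model of the behaviour described in this issue; not Spark code.
# All names here (SinkMetadataLog, list_readable_files) are hypothetical.
from dataclasses import dataclass, field


@dataclass
class SinkMetadataLog:
    """Stand-in for the sink's metadata directory: batch id -> files written."""
    batches: dict = field(default_factory=dict)

    def commit(self, batch_id, files):
        self.batches[batch_id] = list(files)

    def all_files(self):
        return sorted(f for files in self.batches.values() for f in files)


def list_readable_files(files_on_disk, metadata_log, ignore_metadata=False):
    """Return the files a batch read of the output path would see."""
    if metadata_log is None or ignore_metadata:
        # No log, or the log is ignored: fall back to a plain directory listing.
        return sorted(files_on_disk)
    # Otherwise only files recorded in the log are visible.
    on_disk = set(files_on_disk)
    return [f for f in metadata_log.all_files() if f in on_disk]


# The first run of the query wrote two files under the old metadata.
files_on_disk = ["part-0000.parquet", "part-0001.parquet"]

# The query is changed and restarted with a new checkpoint and a fresh log.
new_log = SinkMetadataLog()
new_log.commit(0, ["part-0002.parquet"])
files_on_disk.append("part-0002.parquet")

visible_default = list_readable_files(files_on_disk, new_log)
visible_ignoring = list_readable_files(files_on_disk, new_log, ignore_metadata=True)
print(visible_default)   # ['part-0002.parquet']
print(visible_ignoring)  # ['part-0000.parquet', 'part-0001.parquet', 'part-0002.parquet']
```

With the log honoured, only the file from the new run is visible. With the log ignored, the reader falls back to a directory listing and sees all three files.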

Although we could write to a different output directory every time, that is a bad idea because it produces many directories unnecessarily.

It seems we need a config for ignoring the metadata of FileStreamSink when reading the output path.
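A minimal sketch of how such a config might be used from PySpark follows. The issue text does not name the key. The key below is the one believed to have shipped with the fix in Spark 3.2.0 and should be treated as an assumption to verify against the linked pull request or `SQLConf`. The helper name `read_ignoring_sink_metadata` is hypothetical.

```python
# Assumed key name; verify against the linked pull request / SQLConf
# for the Spark version in use.
IGNORE_SINK_METADATA_KEY = "spark.sql.streaming.fileStreamSink.ignoreMetadata"


def read_ignoring_sink_metadata(spark, output_path, fmt="parquet"):
    """Read all files under output_path, bypassing the sink's metadata log.

    `spark` is expected to be a pyspark.sql.SparkSession. The setting stays
    enabled on the session after this call returns.
    """
    # Ask Spark to treat the path as a normal directory rather than
    # consulting the metadata log written by FileStreamSink.
    spark.conf.set(IGNORE_SINK_METADATA_KEY, "true")
    return spark.read.format(fmt).load(output_path)
```

A call such as `read_ignoring_sink_metadata(spark, "/data/stream-output")` (the path is a placeholder) would then return a DataFrame that also covers files written under an earlier checkpoint, assuming the key is supported by the running Spark version.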

Issue Links

links to

[Github] Pull Request #32702 (viirya)

People

Assignee: L. C. Hsieh

Reporter: L. C. Hsieh

Votes: 0

Watchers: 3

Dates

Created: 30/May/21 08:57

Updated: 18/Jun/21 23:21

Resolved: 18/Jun/21 23:21