[SPARK-27188] FileStreamSink: provide a new option to have retention on output files


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.0
    • Component/s: Structured Streaming
    • Labels: None

    Description

      In SPARK-24295 we noted that various end users are struggling with huge FileStreamSink metadata logs. Unfortunately, given that arbitrary readers leverage the metadata log to determine which files can be safely read (to ensure 'exactly-once' semantics), pruning the metadata log is not trivial to implement.

      While we might be able to check for deleted output files in FileStreamSink and drop them when compacting the metadata, that check would add overhead to the running query. (I'll try to address this via another issue though.)

      We can still accept a time-to-live (TTL) for output files from end users and filter out expired files from the metadata, so that the metadata does not grow without bound. Files filtered out this way will also no longer be visible to reader queries which leverage File(Stream)Source.
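      A minimal sketch of how such a TTL might be set on a file sink, assuming the `retention` sink option that this issue ultimately added for 3.1.0; the paths, trigger, and "72h" value below are illustrative only:

      {code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object FileSinkRetentionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("file-sink-retention-sketch")
      .getOrCreate()

    // Any streaming DataFrame works; the rate source is just a toy input.
    val stream = spark.readStream
      .format("rate")
      .option("rowsPerSecond", 10)
      .load()

    val query = stream.writeStream
      .format("parquet")
      .option("path", "/tmp/output")                  // output files tracked in the sink metadata log
      .option("checkpointLocation", "/tmp/checkpoint")
      .option("retention", "72h")                     // TTL: older entries are eventually excluded from the metadata log (illustrative value)
      .trigger(Trigger.ProcessingTime("1 minute"))
      .start()

    query.awaitTermination()
  }
}
      {code}

      With a retention configured, reader queries that rely on the metadata log (File(Stream)Source) would simply stop seeing files whose batches were committed longer ago than the TTL, instead of the log growing linearly forever.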

      Attachments

        Issue Links

        Activity


          People

            Assignee: Jungtaek Lim (kabhwan)
            Reporter: Jungtaek Lim (kabhwan)
            Votes: 1
            Watchers: 6

            Dates

              Created:
              Updated:
              Resolved:
