[SPARK-30900] FileStreamSource: Avoid reading compact metadata log twice if the query stops from compact batch and restarts - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 3.1.0
Fix Version/s: 3.1.0
Component/s: Structured Streaming
Labels: None

Description

When a query restarts, it may start from a compaction batch for which a source metadata file already exists and has to be read. One such case is when the previous run successfully read from its inputs but did not finalize the batch for some reason.

In this case, FileStreamSource reads the compact metadata file twice: once to retrieve all files and build the seen-file map, and once more to retrieve the entries in the batch. If the query has processed a huge number of inputs so far, the compact metadata file is considerably large, so the second read adds unnecessary latency to processing the startup batch.
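The double read can be illustrated with a small toy model. This is only a sketch and not Spark's code or the change made in the linked pull request: `ToyCompactLog`, `restart_naive`, and `restart_single_read` are hypothetical names. It assumes that a compact file holds every entry recorded so far, and that batch `b` is a compaction batch when `(b + 1) % compact_interval == 0`. The counter shows that both restart strategies return the same result, while the second one opens the compact file only once.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FileEntry:
    """One source file recorded in the metadata log (toy model)."""
    path: str
    timestamp: int
    batch_id: int


class ToyCompactLog:
    """Hypothetical in-memory stand-in for a compactible metadata log."""

    def __init__(self, compact_interval: int) -> None:
        self.compact_interval = compact_interval
        self._all: List[FileEntry] = []
        self._files: Dict[int, List[FileEntry]] = {}
        self.compact_reads = 0  # how many times a compact file was read

    def is_compaction_batch(self, batch_id: int) -> bool:
        # Assumption: every compact_interval-th batch is a compaction batch.
        return (batch_id + 1) % self.compact_interval == 0

    def add(self, batch_id: int, entries: List[FileEntry]) -> None:
        self._all.extend(entries)
        if self.is_compaction_batch(batch_id):
            # A compact file carries all entries seen so far.
            self._files[batch_id] = list(self._all)
        else:
            self._files[batch_id] = list(entries)

    def read(self, batch_id: int) -> List[FileEntry]:
        if self.is_compaction_batch(batch_id):
            self.compact_reads += 1
        return list(self._files[batch_id])


Result = Tuple[Dict[str, int], List[FileEntry]]


def restart_naive(log: ToyCompactLog, batch_id: int) -> Result:
    """Reads the compact file twice: seen-file map, then batch entries."""
    seen = {e.path: e.timestamp for e in log.read(batch_id)}
    batch = [e for e in log.read(batch_id) if e.batch_id == batch_id]
    return seen, batch


def restart_single_read(log: ToyCompactLog, batch_id: int) -> Result:
    """Reads the compact file once and derives both results from it."""
    entries = log.read(batch_id)
    seen = {e.path: e.timestamp for e in entries}
    batch = [e for e in entries if e.batch_id == batch_id]
    return seen, batch


if __name__ == "__main__":
    log = ToyCompactLog(compact_interval=3)
    for i, path in enumerate(["a", "b", "c"]):
        log.add(i, [FileEntry(path, i + 1, i)])  # batch 2 is the compact batch

    restart_naive(log, 2)
    print("naive compact reads:", log.compact_reads)

    log.compact_reads = 0
    restart_single_read(log, 2)
    print("single-read compact reads:", log.compact_reads)
```

In this model the saving is a single extra pass over a three-entry list, but the cost of that pass grows with the total number of files the query has ever processed, which is the latency the description refers to.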

This issue tracks the effort to address this case.

Issue Links

links to: GitHub Pull Request #27649

People

Assignee: Jungtaek Lim

Reporter: Jungtaek Lim

Votes: 0

Watchers: 1

Dates

Created: 20/Feb/20 14:04

Updated: 01/Dec/20 04:11

Resolved: 01/Dec/20 04:11