[SPARK-44924] Add configurations for FileStreamSource cached files - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 3.1.0
Fix Version/s: 4.0.0
Component/s: Structured Streaming
Labels:
- pull-request-available

Description

SPARK-30866 (https://issues.apache.org/jira/browse/SPARK-30866) added caching of listed files to Structured Streaming to reduce the cost of relisting the filesystem on every batch. The settings that control this cache are currently hardcoded, and there is no way to change them.

This hurts some of our workloads, which process large datasets where it's unknown how "heavy" individual files are, so a single batch can take a long time. When maxFilesPerTrigger is set to 100k files, the subsequent batch is served from the cache, which holds at most 10k files. That batch makes the job take longer overall, because the cluster can handle 100k files but is stuck doing 10% of the workload. For these jobs, the benefit of caching doesn't outweigh the performance cost to the rest of the job.

If config settings were available for this, we could either accept some increased driver memory usage to cache the next 100k files, or disable caching entirely by setting the cache size to 0 and simply relisting files each batch.
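To make the batch-size effect concrete, here is a rough Python sketch of the behavior described above. It is a simplified model, not Spark's implementation: the function name `simulate_batch_sizes` and its parameters are hypothetical, and it assumes a fixed backlog, that a non-empty cache is used instead of relisting, and that the cache keeps only the first `max_cached_files` leftover files from a listing.

```python
def simulate_batch_sizes(total_files, max_files_per_trigger, max_cached_files, num_batches):
    """Return the number of files processed in each batch (simplified model)."""
    remaining = total_files  # files in the backlog not yet processed
    cached = 0               # unread files carried over from the previous listing
    sizes = []
    for _ in range(num_batches):
        if remaining == 0:
            break
        # Assumption: a non-empty cache is used as-is; otherwise relist everything.
        available = cached if cached > 0 else remaining
        batch = min(available, max_files_per_trigger)
        leftover = available - batch
        # Only the first max_cached_files leftover files are kept for the next batch.
        cached = min(leftover, max_cached_files)
        remaining -= batch
        sizes.append(batch)
    return sizes


# Hardcoded-style cache of 10k files with maxFilesPerTrigger = 100k:
# every other batch is limited to the cache size.
hardcoded = simulate_batch_sizes(300_000, 100_000, 10_000, 4)

# Cache size 0: caching is disabled and each batch relists.
no_cache = simulate_batch_sizes(300_000, 100_000, 0, 4)

# Cache sized to match maxFilesPerTrigger: full batches without relisting every time.
big_cache = simulate_batch_sizes(300_000, 100_000, 100_000, 4)

print(hardcoded)  # [100000, 10000, 100000, 10000]
print(no_cache)   # [100000, 100000, 100000]
print(big_cache)  # [100000, 100000, 100000]
```

In this model, the 10k cache alternates full batches with batches at 10% of the configured size, while either disabling the cache or sizing it to match maxFilesPerTrigger keeps every batch full. The trade-off is more frequent relisting in the first case and more driver memory in the second.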

Issue Links

links to

[Github] Pull Request #42623 (ragnarok56)

GitHub Pull Request #45362

People

Assignee: kevin nacios

Reporter: kevin nacios

Votes: 0

Watchers: 3

Dates

Created: 23/Aug/23 02:32

Updated: 20/May/24 01:51

Resolved: 20/May/24 01:51