Details
-
Improvement
-
Status: Resolved
-
Minor
-
Resolution: Fixed
-
3.1.0
Description
With https://issues.apache.org/jira/browse/SPARK-30866, caching of listed files was added for structured streaming to reduce cost of relisting from filesystem each batch. The settings that drive this are currently hardcoded and there is no way to change them.
This impacts some of our workloads where we process large datasets where its unknown how "heavy" some files are, so a single batch can take a long period of time. When we set maxFilesPerTrigger to 100k files, a subsequent batch using the cached max of 10k files is causing the job to take longer since the cluster is capable of handling the 100k files but is stuck doing 10% of the workload. The benefit of the caching doesn't outweigh the cost of the performance on the rest of the job.
With config settings available for this, we could either absorb some increased driver memory usage for caching the next 100k files, or opt to disable caching entirely and just relist files each batch by setting the cache amount to 0.