Beam / BEAM-4783

Add bundleSize parameter to control splitting of Spark sources (useful for Dynamic Allocation)

Details

    • Type: Improvement
    • Status: Triage Needed
    • Priority: P2
    • Resolution: Fixed
    • Affects Version/s: 2.8.0, 2.9.0
    • Fix Version/s: 2.11.0
    • Component/s: runner-spark
    • Labels: None

    Description

      When the Spark runner is used with spark.dynamicAllocation.enabled=true, the SourceRDD does not detect this. It then falls back to Spark's defaultParallelism, which is calculated as described below:
      // when running on YARN/SparkDeploy it's the result of max(totalCores, 2).
      // when running on Mesos it's 8.
      // when running local it's the total number of cores (local = 1, local[N] = N,
      // local[*] = estimation of the machine's cores).
      // ** the configuration "spark.default.parallelism" takes precedence over all of the above **
      So in most cases this default is quite small. This is an issue when reading a very large input file, as it will only be split into two bundles: max(totalCores, 2) evaluates to 2 when few executors are registered at submission time, which is typical under dynamic allocation.
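
      As a stopgap, the precedence rule quoted above suggests raising spark.default.parallelism explicitly. A minimal sketch, assuming a self-managed SparkConf (the value 200 is illustrative, not a recommendation):

      import org.apache.spark.SparkConf;

      public class ParallelismWorkaround {
        public static void main(String[] args) {
          // spark.default.parallelism takes precedence over the computed
          // defaults, so SourceRDD will split the input into ~200 partitions.
          SparkConf conf = new SparkConf()
              .setAppName("beam-on-spark")
              .set("spark.default.parallelism", "200");
        }
      }

      The same setting can also be passed at submission time via spark-submit --conf spark.default.parallelism=200.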

      I believe that when Dynamic Allocation is enabled, the SourceRDD should use DEFAULT_BUNDLE_SIZE, and possibly expose an option on SparkPipelineOptions that allows changing this DEFAULT_BUNDLE_SIZE.
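
      A minimal sketch of what such an option could look like (the interface below is illustrative, assuming Beam's standard PipelineOptions getter/setter conventions; it is not the actual SparkPipelineOptions definition):

      import org.apache.beam.sdk.options.Default;
      import org.apache.beam.sdk.options.Description;
      import org.apache.beam.sdk.options.PipelineOptions;

      public interface SparkPipelineOptions extends PipelineOptions {
        @Description("Bundle size (in bytes) used when splitting a BoundedSource; "
            + "0 means fall back to Spark's defaultParallelism.")
        @Default.Long(0)
        Long getBundleSize();
        void setBundleSize(Long value);
      }

      A pipeline could then opt in explicitly, for example:

      SparkPipelineOptions options =
          PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
      options.setBundleSize(64L * 1024 * 1024); // hypothetical 64 MB bundles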

People

  Assignee: Kyle Winkelman
  Reporter: Kyle Winkelman
  Votes: 0
  Watchers: 4

Dates

  Created:
  Updated:
  Resolved:

Time Tracking

  Estimated: Not Specified
  Remaining: 0h
  Logged: 8h 20m