[BEAM-1074] Set default-partitioner in SourceRDD.Unbounded. - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: P2
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.0.0
Component/s: runner-spark
Labels: None

Description

The SparkRunner uses mapWithState to read and manage CheckpointMarks, and this stateful operation will be followed by a shuffle:
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/MapWithStateDStream.scala#L159

Since the stateful read maps "splitSource" -> "partition of a list of read values", the shuffle that follows brings no benefit (the list of read values has not been flatMapped yet). To avoid the shuffle, we need to set the partitioner of the input RDD (SourceRDD.Unbounded) to a default HashPartitioner: mapWithState uses the same partitioner and skips the shuffle when the partitioners match.
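The partitioner-matching rule described above can be sketched with a small, Spark-free model. This is only an illustration under stated assumptions: the class PartitionerMatchSketch, its nested HashPartitioner, and the needsShuffle helper are hypothetical names written for this example, not Beam or Spark code. The nested class imitates how Spark's HashPartitioner compares equal when partition counts are equal.

```java
import java.util.Objects;
import java.util.Optional;

public class PartitionerMatchSketch {

    /** Toy stand-in (hypothetical) for a hash partitioner: equality depends only on the partition count. */
    static final class HashPartitioner {
        final int numPartitions;

        HashPartitioner(int numPartitions) {
            this.numPartitions = numPartitions;
        }

        /** Maps a key to a partition index using a non-negative modulo of its hash code. */
        int getPartition(Object key) {
            return Math.floorMod(Objects.hashCode(key), numPartitions);
        }

        @Override
        public boolean equals(Object other) {
            return other instanceof HashPartitioner
                    && ((HashPartitioner) other).numPartitions == numPartitions;
        }

        @Override
        public int hashCode() {
            return Integer.hashCode(numPartitions);
        }
    }

    /**
     * Models the decision made before a stateful step: the input must be
     * repartitioned (shuffled) unless it already carries an equal partitioner.
     */
    static boolean needsShuffle(Optional<HashPartitioner> inputPartitioner,
                                HashPartitioner statePartitioner) {
        return !inputPartitioner.isPresent()
                || !inputPartitioner.get().equals(statePartitioner);
    }

    public static void main(String[] args) {
        HashPartitioner statePartitioner = new HashPartitioner(4);

        // Input RDD with no partitioner set: a shuffle is required.
        System.out.println(needsShuffle(Optional.empty(), statePartitioner));

        // Input RDD already using an equal HashPartitioner: the shuffle is skipped.
        System.out.println(needsShuffle(Optional.of(new HashPartitioner(4)), statePartitioner));

        // Same partitioner type but a different partition count: still a shuffle.
        System.out.println(needsShuffle(Optional.of(new HashPartitioner(8)), statePartitioner));
    }
}
```

In this model, giving the input a partitioner equal to the one used by the stateful step is what turns the repartitioning into a no-op, which is the effect this issue asks for on SourceRDD.Unbounded.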

Issue Links

relates to

BEAM-848 Shuffle input read-values to get maximum parallelism. (Resolved)

links to

GitHub Pull Request #1500

GitHub Pull Request #2288

People

Assignee: Aviem Zur

Reporter: Amit Sela

Votes: 0

Watchers: 3

Dates

Created: 02/Dec/16 17:06

Updated: 16/May/20 13:45

Resolved: 26/Mar/17 03:53