Details
- Type: Bug
- Status: Resolved
- Priority: P2
- Resolution: Fixed
- 2.6.0
Description
Currently, when using GroupByKey, all values for a single key need to fit in memory at once.
The following issues need to be addressed:
a) We cannot use Spark's groupByKey, because it requires all values for a single key to fit in memory (it is implemented as a "list combiner").
b) ReduceFnRunner iterates over the values multiple times in order to also group them by window.
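To make issue a) concrete, here is a minimal plain-Python sketch (not Beam or Spark code; the data and names are illustrative) contrasting a "list combiner" grouping, which materializes every value for a key at once, with a sort-based grouping that can stream values one at a time:

```python
from itertools import groupby
from operator import itemgetter

records = [("a", 3), ("b", 1), ("a", 1), ("b", 2), ("a", 2)]

# List-combiner style (what Spark's groupByKey does): memory per key
# grows with the number of values for that key.
combined = {}
for k, v in records:
    combined.setdefault(k, []).append(v)  # whole value list held in memory

# Sort-based style: once records are sorted by key, the values for each
# key can be consumed as a stream. They are collected into lists here
# only for comparison; a real runner would consume the iterator lazily.
streamed = {k: [v for _, v in vs]
            for k, vs in groupby(sorted(records), key=itemgetter(0))}

assert combined == {"a": [3, 1, 2], "b": [1, 2]}
assert streamed == {"a": [1, 2, 3], "b": [1, 2]}
```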
Solution:
In the Dataflow Worker code, there are optimized versions of ReduceFnRunner that can take advantage of the elements for a single key being sorted by timestamp.
We can use Spark's `repartitionAndSortWithinPartitions` to meet this constraint.
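The shuffle this relies on can be sketched in plain Python (no Spark; the hash partitioner, partition count, and data are illustrative): elements are partitioned by key and then sorted by (key, timestamp) within each partition, so a downstream sorted-input ReduceFnRunner sees each key's values in timestamp order without materializing them all:

```python
import zlib
from itertools import groupby
from operator import itemgetter

elements = [("user1", 30, "c"), ("user2", 10, "x"),
            ("user1", 10, "a"), ("user1", 20, "b")]

# Partition by key, like Spark's hash partitioner (crc32 used here so the
# sketch is deterministic).
num_partitions = 2
partitions = [[] for _ in range(num_partitions)]
for key, ts, value in elements:
    partitions[zlib.crc32(key.encode()) % num_partitions].append((key, ts, value))

# The "sort within partitions" step: order by (key, timestamp).
for p in partitions:
    p.sort(key=itemgetter(0, 1))

# Downstream, values for one key arrive already ordered by timestamp;
# they are collected into lists here only to show the order.
ordered = {}
for p in partitions:
    for key, group in groupby(p, key=itemgetter(0)):
        ordered[key] = [v for _, _, v in group]

assert ordered["user1"] == ["a", "b", "c"]
assert ordered["user2"] == ["x"]
```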
For non-merging windows, we can put the window itself into the key, resulting in smaller groupings.
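A minimal sketch of folding the window into the key, again in plain Python with an assumed fixed-window width (the window size, data, and helper name are illustrative, not Beam APIs): each (key, window) pair becomes its own grouping, so no single grouping has to hold all of a key's values across every window:

```python
WINDOW_SIZE = 10  # assumed fixed-window width, in arbitrary time units

def fixed_window(ts):
    # Non-merging fixed windows: each timestamp maps to exactly one window.
    start = (ts // WINDOW_SIZE) * WINDOW_SIZE
    return (start, start + WINDOW_SIZE)

events = [("k", 1, "a"), ("k", 12, "b"), ("k", 3, "c"), ("k", 15, "d")]

# Shuffle key is (key, window) instead of just key.
grouped = {}
for key, ts, value in events:
    grouped.setdefault((key, fixed_window(ts)), []).append(value)

# Two smaller groupings instead of one large per-key grouping.
assert grouped[("k", (0, 10))] == ["a", "c"]
assert grouped[("k", (10, 20))] == ["b", "d"]
```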
This approach has already been tested at ~100 TB input scale on Spark 2.3.x (custom Spark runner).
I'll submit a patch once the Dataflow Worker code donation is complete.