[BEAM-12153] OOM on GBK in SparkRunner and SpartStructuredStreamingRunner - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: P3
Resolution: Not A Problem
Affects Version/s: None
Fix Version/s: Not applicable
Component/s: runner-spark
Labels:
None

Description

We have being experiencing OOM errors on GroupByKey in batch mode in both Spark runners even if behind the woods spark spills data to disk in such cases: taking a look at the translation in the two runners, it might be due to using ReduceFnRunner for merging windows in GBK translation. ReduceFnRunner.processElements expects to have all elements to merge the windows between each other.:

RDD spark runner:

https://github.com/apache/beam/blob/752798e3b0e6911fa84f8d138dacccdb6fcc85ef/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkGroupAlsoByWindowViaOutputBufferFn.java#L99

structured streaming spark: runner: https://github.com/apache/beam/blob/752798e3b0e6911fa84f8d138dacccdb6fcc85ef/runners/spark/2/src/main/java/org/apache/beam/runners/spark/structuredstreaming/translation/batch/functions/GroupAlsoByWindowViaOutputBufferFn.java#L74

Even replacing the Iterable with an Iterator in ReduceFnRunner to avoid materialization does not work because deep in ReduceFnRunner.processElements(), the collection is iterated twice.

It could be better to do what flink runner does and translate GBK as CombinePerKey with a Concatenate combine fn and thus avoid elements materialization.

Attachments

Issue Links

links to

GitHub Pull Request #15267

GitHub Pull Request #15508

Activity

People

Assignee:: Etienne Chauchot

Reporter:: Etienne Chauchot

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 12/Apr/21 13:47

Updated:: 15/Sep/21 11:46

Resolved:: 15/Sep/21 11:46

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

10h 10m