Details
- Type: Bug
- Status: Resolved
- Priority: P2
- Resolution: Fixed
- 2.6.0
Description
Currently, when using GroupByKey, all values for a single key need to fit in memory at once.
The following issues need to be addressed:
a) We cannot use Spark's groupByKey, because it requires all values for a single key to fit in memory (it is implemented as a "list combiner").
b) ReduceFnRunner iterates over the values multiple times in order to also group them by window.
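To make issue a) concrete, here is a minimal plain-Python sketch (not Beam or Spark code; the data and names are illustrative) contrasting a "list combiner" grouping, which materializes every value for a key at once, with a sort-based grouping that can stream values one at a time:

```python
from itertools import groupby
from operator import itemgetter

records = [("a", 3), ("b", 1), ("a", 1), ("b", 2), ("a", 2)]

# List-combiner style (what Spark's groupByKey does): memory per key
# grows with the number of values for that key.
combined = {}
for k, v in records:
    combined.setdefault(k, []).append(v)  # whole value list held in memory

# Sort-based style: once records are sorted by key, the values for each
# key can be consumed as a stream. They are collected into lists here
# only for comparison; a real runner would consume the iterator lazily.
streamed = {k: [v for _, v in vs]
            for k, vs in groupby(sorted(records), key=itemgetter(0))}

assert combined == {"a": [3, 1, 2], "b": [1, 2]}
assert streamed == {"a": [1, 2, 3], "b": [1, 2]}
```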
Solution:
In the Dataflow Worker code, there are optimized versions of ReduceFnRunner that can take advantage of the elements for a single key being sorted by timestamp.
We can use Spark's `repartitionAndSortWithinPartitions` to meet this constraint.
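The shuffle this relies on can be sketched in plain Python (no Spark; the hash partitioner, partition count, and data are illustrative): elements are partitioned by key and then sorted by (key, timestamp) within each partition, so a downstream sorted-input ReduceFnRunner sees each key's values in timestamp order without materializing them all:

```python
import zlib
from itertools import groupby
from operator import itemgetter

elements = [("user1", 30, "c"), ("user2", 10, "x"),
            ("user1", 10, "a"), ("user1", 20, "b")]

# Partition by key, like Spark's hash partitioner (crc32 used here so the
# sketch is deterministic).
num_partitions = 2
partitions = [[] for _ in range(num_partitions)]
for key, ts, value in elements:
    partitions[zlib.crc32(key.encode()) % num_partitions].append((key, ts, value))

# The "sort within partitions" step: order by (key, timestamp).
for p in partitions:
    p.sort(key=itemgetter(0, 1))

# Downstream, values for one key arrive already ordered by timestamp;
# they are collected into lists here only to show the order.
ordered = {}
for p in partitions:
    for key, group in groupby(p, key=itemgetter(0)):
        ordered[key] = [v for _, _, v in group]

assert ordered["user1"] == ["a", "b", "c"]
assert ordered["user2"] == ["x"]
```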
For non-merging windows, we can put the window itself into the key, resulting in smaller groupings.
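A minimal sketch of folding the window into the key, again in plain Python with an assumed fixed-window width (the window size, data, and helper name are illustrative, not Beam APIs): each (key, window) pair becomes its own grouping, so no single grouping has to hold all of a key's values across every window:

```python
WINDOW_SIZE = 10  # assumed fixed-window width, in arbitrary time units

def fixed_window(ts):
    # Non-merging fixed windows: each timestamp maps to exactly one window.
    start = (ts // WINDOW_SIZE) * WINDOW_SIZE
    return (start, start + WINDOW_SIZE)

events = [("k", 1, "a"), ("k", 12, "b"), ("k", 3, "c"), ("k", 15, "d")]

# Shuffle key is (key, window) instead of just key.
grouped = {}
for key, ts, value in events:
    grouped.setdefault((key, fixed_window(ts)), []).append(value)

# Two smaller groupings instead of one large per-key grouping.
assert grouped[("k", (0, 10))] == ["a", "c"]
assert grouped[("k", (10, 20))] == ["b", "d"]
```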
This approach has already been tested at ~100 TB input scale on Spark 2.3.x (custom Spark runner).
I'll submit a patch once the Dataflow Worker code donation is complete.