Details
- Type: Improvement
- Status: Resolved
- Priority: P2
- Resolution: Fixed
- None
Description
In the current batch runner, all values for a single key need to fit in memory, because the resulting GBK iterable is materialized as an in-memory List.
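For illustration, a minimal sketch of that pattern (hypothetical names, not the actual Flink runner code): every value for a key is drained into an in-memory list before the grouped iterable is handed downstream, so the heap footprint grows with the largest key group.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class MaterializedGrouping {
  // Buffers the whole key group on the heap; a single hot (skewed) key can
  // exhaust memory. The returned list is re-iterable, which is the property
  // the current implementation preserves.
  static <V> Iterable<V> materializeValues(Iterator<V> valuesForKey) {
    List<V> values = new ArrayList<>();
    while (valuesForKey.hasNext()) {
      values.add(valuesForKey.next());
    }
    return values;
  }
}
```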
Implications:
- It blocks users from using custom sharding in most of the IOs, because the whole shard needs to fit in memory.
- It causes frequent OOM failures on skewed data (the pipeline should slow down rather than fail). These failures are very hard to debug for inexperienced users.
We can do much better for non-merging windows, the same way we do for the Spark runner. The only drawback is that this implementation does not support re-iterating the result. We will support turning this implementation on and off, so users can trade re-iteration support for memory efficiency.
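A minimal sketch of the single-pass alternative, assuming the shuffled input arrives sorted by key (class and method names are illustrative, not the actual runner API): values are streamed straight from the sorted iterator instead of being copied into a List, so per-key memory stays constant, but each group can only be consumed once.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Objects;
import java.util.function.BiConsumer;

final class SortedGroupingReader<K, V> {
  private final Iterator<Map.Entry<K, V>> sorted;
  private Map.Entry<K, V> lookahead;

  SortedGroupingReader(Iterator<Map.Entry<K, V>> sortedByKey) {
    this.sorted = sortedByKey;
    this.lookahead = sortedByKey.hasNext() ? sortedByKey.next() : null;
  }

  private void advance() {
    lookahead = sorted.hasNext() ? sorted.next() : null;
  }

  /** Streams each (key, values) group to the consumer; the values are single-pass. */
  void forEachGroup(BiConsumer<K, Iterator<V>> consumer) {
    while (lookahead != null) {
      final K key = lookahead.getKey();
      // The values iterator reads directly from the underlying sorted stream,
      // so nothing is buffered per key.
      Iterator<V> values =
          new Iterator<V>() {
            @Override
            public boolean hasNext() {
              return lookahead != null && Objects.equals(lookahead.getKey(), key);
            }

            @Override
            public V next() {
              if (!hasNext()) {
                throw new NoSuchElementException();
              }
              V value = lookahead.getValue();
              advance();
              return value;
            }
          };
      consumer.accept(key, values);
      // Skip whatever the consumer did not read, so the next group starts cleanly.
      while (lookahead != null && Objects.equals(lookahead.getKey(), key)) {
        advance();
      }
    }
  }
}
```

Because unread values are skipped as soon as the consumer returns, a group cannot be re-iterated without re-reading the shuffle data or falling back to the List-based materialization, which is why the behavior should be switchable.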
Attachments
Issue Links
- depends upon
  - BEAM-8608 Chain DoFns in Flink batch runner when possible. (Resolved)
- relates to
  - BEAM-5392 GroupByKey on Spark: All values for a single key need to fit in-memory at once (Resolved)
  - BEAM-10164 Flink: Memory efficient combine implementation for batch runner (Resolved)
  - BEAM-4228 The FlinkRunner shouldn't require all of the values for a key to fit in memory (Triage Needed)
- requires
  - BEAM-8850 Flink: Batch translation context respects provided PTransform during Input / Output lookups. (Resolved)