[FLINK-18203] Reduce objects usage in redistributing union states - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 1.10.1, 1.11.0, 1.11.1
Fix Version/s: 1.13.0
Component/s: Runtime / Checkpointing
Labels:
- stale-major

Description

RoundRobinOperatorStateRepartitioner#repartitionUnionState creates a new OperatorStreamStateHandle instance for every StreamStateHandle instance used in every execution, so the number of new OperatorStreamStateHandle instances can reach m * n (job vertex parallelism * total count of StreamStateHandle instances across all executions).

In fact, all executions can share the same collection of StreamStateHandle instances, which reduces the number of OperatorStreamStateHandle instances to the total count of StreamStateHandle instances across all executions.
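The difference between the two allocation patterns can be sketched as follows. This is a simplified model rather than Flink's actual code: Handle and Wrapper are hypothetical stand-ins for StreamStateHandle and OperatorStreamStateHandle, the class and method names are invented for illustration, and the sketch assumes the wrapped objects are immutable and therefore safe to share between subtasks.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class UnionRepartitionSketch {

    /** Hypothetical stand-in for a StreamStateHandle (not Flink's interface). */
    static final class Handle {
        final String name;
        Handle(String name) { this.name = name; }
    }

    /** Hypothetical stand-in for an OperatorStreamStateHandle wrapping one Handle. */
    static final class Wrapper {
        static long created = 0; // allocation counter, for illustration only
        final Handle delegate;
        Wrapper(Handle delegate) {
            this.delegate = delegate;
            created++;
        }
    }

    /** Per-subtask wrapping: each of the m subtasks gets n fresh wrappers (m * n objects). */
    static List<List<Wrapper>> perSubtask(List<Handle> handles, int parallelism) {
        List<List<Wrapper>> result = new ArrayList<>(parallelism);
        for (int i = 0; i < parallelism; i++) {
            List<Wrapper> own = new ArrayList<>(handles.size());
            for (Handle h : handles) {
                own.add(new Wrapper(h));
            }
            result.add(own);
        }
        return result;
    }

    /** Shared wrapping: the n wrappers are built once and handed to every subtask (n objects). */
    static List<List<Wrapper>> shared(List<Handle> handles, int parallelism) {
        List<Wrapper> once = new ArrayList<>(handles.size());
        for (Handle h : handles) {
            once.add(new Wrapper(h));
        }
        // Assumption: wrappers are never mutated after creation, so one read-only list suffices.
        List<Wrapper> readOnly = Collections.unmodifiableList(once);
        List<List<Wrapper>> result = new ArrayList<>(parallelism);
        for (int i = 0; i < parallelism; i++) {
            result.add(readOnly);
        }
        return result;
    }
}
```

With 3 handles and a parallelism of 4, perSubtask allocates 12 wrappers while shared allocates 3; at parallelism=10k the gap grows by the same factor.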

I ran into this problem in production while testing a job with parallelism=10k. The memory problem becomes more serious when YARN containers die and the job starts failing over.

Issue Links

relates to

FLINK-21436 Speed up the restore of UnionListState

Closed

People

Assignee: Unassigned
Reporter: Jiayi Liao
Votes: 0
Watchers: 8

Dates

Created: 09/Jun/20 07:27
Updated: 16/Nov/22 08:30
Resolved: 22/Apr/21 11:39