[FLINK-26315] Streaming job with multiple regions does not recover when the connection to ZooKeeper/HBase is lost - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Not a Priority
Resolution: Unresolved
Affects Version/s: 1.12.0
Fix Version/s: None
Component/s: Runtime / Coordination
Labels:

Flags: Patch

Description

Our platform uses failure-rate (failure-rate-interval: 5min, max-failures-per-interval: 6) as the default restart-strategy, and the failover-strategy is region.
Assume a job with a parallelism of 10 in which every edge in the stream graph is FORWARD, so the region count equals the job parallelism. If more than 5 tasks fail because the connection between the TaskManager and an external system such as ZooKeeper or HBase is lost, the failure rate is exceeded immediately, because each region's failure counts separately even though all of them share one cause. The job therefore never recovers from such a situation, which is very common when ZooKeeper is used for HA.
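The arithmetic above can be sketched with a simplified sliding-window counter. This is only an illustration of the behaviour described in this ticket, not Flink's actual restart strategy class; the class name `FailureRateWindow`, its methods, and the 10 ms spacing between region failures are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Simplified sliding-window failure-rate check (hypothetical, not Flink's class). */
class FailureRateWindow {
    private final int maxFailuresPerInterval;
    private final long intervalMillis;
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    FailureRateWindow(int maxFailuresPerInterval, long intervalMillis) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMillis = intervalMillis;
    }

    /** Records one failure and returns whether a restart is still allowed. */
    boolean recordFailureAndCheck(long nowMillis) {
        // Drop failures that have fallen out of the rate window.
        while (!failureTimestamps.isEmpty()
                && nowMillis - failureTimestamps.peekFirst() > intervalMillis) {
            failureTimestamps.pollFirst();
        }
        failureTimestamps.addLast(nowMillis);
        // Assumption: the budget is exhausted once the window holds the maximum.
        return failureTimestamps.size() < maxFailuresPerInterval;
    }

    /**
     * Simulates one failure per region, 10 ms apart, with the settings from this
     * ticket (6 failures per 5 min). Returns the 1-based index of the first
     * failure that exhausts the budget, or -1 if every restart is allowed.
     */
    static int firstRejectedFailure(int regions) {
        FailureRateWindow window = new FailureRateWindow(6, 5 * 60_000L);
        for (int i = 1; i <= regions; i++) {
            if (!window.recordFailureAndCheck(i * 10L)) {
                return i;
            }
        }
        return -1;
    }
}
```

Under these assumptions, a single connection loss that hits all 10 regions uses up the whole budget at the sixth region failure, while a job with 5 or fewer regions would still be allowed to restart.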

Possible solution:

Improve the failure-rate strategy: record the cause and timestamp of the last task failure. If the same failure cause occurs multiple times within a short period, the repeated occurrences are ignored.

I have already implemented this and it works well.

Usage:

restart-strategy: failure-rate
restart-strategy.failure-rate.cause.insensitive: true
restart-strategy.failure-rate.cause.insensitive-interval: 1min

This configuration ignores continuously repeating exceptions within 1min.
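A minimal sketch of how such a cause-insensitive failure rate could behave is shown below. It is not the code from the attached patch. The class name `CauseInsensitiveFailureRate`, the use of a plain string as the failure cause, and the choice to refresh the remembered timestamp on every repeat are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical failure-rate counter that ignores quickly repeating causes. */
class CauseInsensitiveFailureRate {
    private final int maxFailuresPerInterval;
    private final long intervalMillis;
    private final long insensitiveIntervalMillis;
    private final Deque<Long> countedFailures = new ArrayDeque<>();
    private String lastCause;          // null until the first failure arrives
    private long lastCauseTimestamp;

    CauseInsensitiveFailureRate(int maxFailuresPerInterval,
                                long intervalMillis,
                                long insensitiveIntervalMillis) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMillis = intervalMillis;
        this.insensitiveIntervalMillis = insensitiveIntervalMillis;
    }

    /** Records a failure with its cause and returns whether a restart is allowed. */
    boolean recordFailureAndCheck(String cause, long nowMillis) {
        // A repeat is the same cause as the previous failure, seen again
        // within the insensitive interval.
        boolean repeat = cause.equals(lastCause)
                && nowMillis - lastCauseTimestamp <= insensitiveIntervalMillis;
        // Assumption: every failure, counted or not, becomes the new "last" one.
        lastCause = cause;
        lastCauseTimestamp = nowMillis;

        // Drop counted failures that have fallen out of the rate window.
        while (!countedFailures.isEmpty()
                && nowMillis - countedFailures.peekFirst() > intervalMillis) {
            countedFailures.pollFirst();
        }
        if (!repeat) {
            countedFailures.addLast(nowMillis);
        }
        return countedFailures.size() < maxFailuresPerInterval;
    }

    /** Number of failures currently counted against the rate limit. */
    int countedFailureCount() {
        return countedFailures.size();
    }
}
```

With the values from this ticket (6 failures per 5min, insensitive interval of 1min), ten region failures that share one cause and arrive within a second would count as a single failure in this sketch, so the job could still restart.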

Attachments

[FLINK-26315]_improve_failure-rate_restart_strategy,_support_same_failure_cause_ignore.patch
24/Feb/22 03:25
22 kB
Janick Wu

People

Assignee: Unassigned

Reporter: Janick Wu

Votes: 0

Watchers: 2

Dates

Created: 23/Feb/22 03:55

Updated: 20/Aug/23 22:35