[KAFKA-13600] Rebalances while streams is in degraded state can cause stores to be reassigned and restore from scratch - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.1.0, 2.8.1, 3.0.0
Fix Version/s: 3.2.0
Component/s: streams
Labels:
None

Description

Consider this scenario:

A node is lost from the cluster.
A rebalance is kicked off with a new "target assignment"'s(ie the rebalance is attempting to move a lot of tasks - see https://issues.apache.org/jira/browse/KAFKA-10121).
The kafka cluster is now a bit more sluggish from the increased load.
A Rolling Deploy happens triggering rebalances, during the rebalance processing continues but offsets can't be committed(Or nodes are restarted but fail to commit offsets)
The most caught up nodes now aren't within `acceptableRecoveryLag` and so the task is started in it's "target assignment" location, restoring all state from scratch and delaying further processing instead of using the "almost caught up" node.

We've hit this a few times and having lots of state (~25TB worth) and being heavy users of IQ this is not ideal for us.

While we can increase `acceptableRecoveryLag` to larger values to try get around this that causes other issues (ie a warmup becoming active when its still quite far behind)

The solution seems to be to balance "balanced assignment" with "most caught up nodes".

We've got a fork where we do just this and it's made a huge difference to the reliability of our cluster.

Our change is to simply use the most caught up node if the "target node" is more than `acceptableRecoveryLag` behind.
This gives up some of the load balancing type behaviour of the existing code but in practise doesn't seem to matter too much.

I guess maybe an algorithm that identified candidate nodes as those being within `acceptableRecoveryLag` of the most caught up node might allow the best of both worlds.

Our fork is

https://github.com/apache/kafka/compare/trunk...tim-patterson:fix_balance_uncaughtup?expand=1
(We also moved the capacity constraint code to happen after all the stateful assignment to prioritise standby tasks over warmup tasks)

Ideally we don't want to maintain a fork of kafka streams going forward so are hoping to get a bit of discussion / agreement on the best way to handle this.
More than happy to contribute code/test different algo's in production system or anything else to help with this issue

Attachments

Issue Links

links to

GitHub Pull Request #11760

Activity

People

Assignee:: Unassigned

Reporter:: Tim Patterson

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 19/Jan/22 00:05

Updated:: 28/Mar/22 14:57

Resolved:: 28/Mar/22 14:57