I'd like to propose a slightly different approach to fix 7114 while making no-worse tradeoffs between stickiness and sub-topology balance. The key idea is to adjust the assignment so that each client's task distribution gets as close as possible to the sub-topologies' num.tasks distribution.
Here is a detailed workflow:
1. At the beginning, we first calculate, for each client C, how many tasks it should ideally be assigned: num.total_tasks / num.total_capacity * C_capacity, rounded down; call it C_a. Note that since we round this number down, the sum of C_a across all C would be <= num.total_tasks, but this does not matter.
2. Then for each client C, based on its number of previously assigned tasks C_p, we calculate how many tasks it should take over or give up as C_a - C_p (if it is positive, it should take over some; otherwise it should give up some).
Note that because of the rounding down, when we calculate C_a - C_p for each client we need to make sure that the total number of give-ups and the total number of take-overs are equal; some ad-hoc heuristics can be used for this.
3. Then we calculate the task distribution across the sub-topologies as a whole. For example, if we have three sub-topologies, st0, st1 and st2, where st0 has 4 total tasks, st1 has 4 total tasks, and st2 has 8 total tasks, then the distribution between st0, st1 and st2 should be 1:1:2. Let's call it the global distribution, and note that currently, since num.tasks per sub-topology never changes, this distribution should NEVER change.
4. Then for each client that should give up some, we decide which tasks it should give up so that the distribution of its remaining tasks is proportional to the above global distribution.
For example, if a client previously owned 4 tasks of st0, no tasks of st1, and 2 tasks of st2, and now it needs to give up 3 tasks, it should give up 3 of st0 (it owns no st1 tasks to give up), so that the remaining 1 task of st0 and 2 tasks of st2 are closer to 1:1:2.
5. Now we've collected a list of given-up tasks plus the ones that do not have any prev active assignment (in normal operation this should not happen, since all tasks should have been created since day one). We then migrate them to the clients that need to take over some, similarly proportional to the global distribution.
For example, if a client previously owned 1 task of st0 and nothing of st1 and st2, and now it needs to take over 3 tasks, we would try to give it 1 task of st1 and 2 tasks of st2, so that the resulting distribution becomes 1:1:2. And we ONLY consider prev-standby tasks when we decide which one of st1 / st2 to give to that client.
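To make steps 3 to 5 concrete, here is a minimal Python sketch of one possible greedy selection rule. It is only an illustration under several assumptions: the function names `pick_give_ups` and `pick_take_overs` are hypothetical, tasks are reduced to per-sub-topology counts rather than task ids, and the prev-standby preference from step 5 is left out. For give-ups it repeatedly releases a task from the sub-topology with the largest surplus over its ideal share; for take-overs it repeatedly picks from the sub-topology with the largest deficit.

```python
from collections import Counter

# Step 3: the global distribution, using the task counts from the example (1:1:2).
GLOBAL_TASKS = {"st0": 4, "st1": 4, "st2": 8}


def pick_give_ups(owned, n_give_up, weights=GLOBAL_TASKS):
    """Step 4 (sketch): how many tasks of each sub-topology a client releases."""
    held = Counter(owned)
    if n_give_up > sum(held.values()):
        raise ValueError("cannot give up more tasks than the client owns")
    remaining = sum(held.values()) - n_give_up
    total_weight = sum(weights.values())
    given_up = Counter()
    for _ in range(n_give_up):
        # Surplus = tasks held now minus the ideal share of the final count.
        candidates = [st for st in sorted(weights) if held[st] > 0]
        st = max(candidates,
                 key=lambda s: held[s] - remaining * weights[s] / total_weight)
        held[st] -= 1
        given_up[st] += 1
    return dict(given_up)


def pick_take_overs(owned, n_take, pool, weights=GLOBAL_TASKS):
    """Step 5 (sketch): how many tasks of each sub-topology a client receives.

    `pool` holds the per-sub-topology counts of tasks waiting for an owner.
    """
    held = Counter(owned)
    free = Counter(pool)
    final = sum(held.values()) + n_take
    total_weight = sum(weights.values())
    taken = Counter()
    for _ in range(n_take):
        candidates = [st for st in sorted(weights) if free[st] > 0]
        if not candidates:
            break  # nothing left in the pool
        # Deficit = ideal share of the final count minus tasks held now.
        st = max(candidates,
                 key=lambda s: final * weights[s] / total_weight - held[s])
        held[st] += 1
        free[st] -= 1
        taken[st] += 1
    return dict(taken)
```

With these assumptions, a client holding 4 / 0 / 2 tasks of st0 / st1 / st2 that must give up 3 releases 3 tasks of st0, and a client holding a single st0 task that must take over 3 from a sufficiently large pool receives 1 task of st1 and 2 of st2. Ties are broken by sub-topology name purely to keep the sketch deterministic.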
Now, consider the following scenarios:
a) This is a clean start and there is no prev-assignment at all: step 4 would be a no-op, and the result should still be fine.
b) A client leaves the group: no client needs to give up and all clients may need to take over some, so step 4 is a no-op, and the list accumulated for step 5 only contains the tasks of the departed client.
c) A new client joins the group: all existing clients need to give up some, and only the new client needs to take over all the given-up ones. Hence step 5 is straightforward.
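The three scenarios can be checked against the quota arithmetic of steps 1 and 2 with a small Python sketch. The names `ideal_quotas` and `task_deltas` are hypothetical, and the balancing rule is just one possible ad-hoc heuristic (an assumption, not something fixed by the proposal): tasks lost to rounding down are handed out one at a time to the clients with the largest remainder, so that total take-overs equal total give-ups plus any tasks without a previous owner.

```python
def ideal_quotas(capacities, total_tasks):
    """Step 1 (sketch): C_a = floor(total_tasks / total_capacity * C_capacity),
    then spread the tasks lost to rounding (assumed largest-remainder rule)."""
    total_capacity = sum(capacities.values())
    quotas = {c: total_tasks * cap // total_capacity
              for c, cap in capacities.items()}
    leftover = total_tasks - sum(quotas.values())
    # Clients with the largest rounding remainder get one extra task each;
    # ties are broken by client name to keep the result deterministic.
    by_remainder = sorted(
        capacities,
        key=lambda c: (-(total_tasks * capacities[c] % total_capacity), c))
    for c in by_remainder[:leftover]:
        quotas[c] += 1
    return quotas


def task_deltas(capacities, prev_counts, total_tasks):
    """Step 2 (sketch): C_a - C_p per client; positive means take over,
    negative means give up."""
    quotas = ideal_quotas(capacities, total_tasks)
    return {c: quotas[c] - prev_counts.get(c, 0) for c in capacities}


# a) clean start: nobody gives anything up
print(task_deltas({"A": 4, "B": 4}, {}, 16))
# b) client C (which owned 5 tasks) has left: A and B only take over
print(task_deltas({"A": 4, "B": 4}, {"A": 6, "B": 5}, 16))
# c) client C joins: A and B give up, C takes over everything given up
print(task_deltas({"A": 4, "B": 4, "C": 4}, {"A": 8, "B": 8}, 16))
```

In scenario c) the deltas sum to zero, so the given-up tasks exactly cover what the new client needs; in scenario b) the positive deltas sum to the 5 tasks of the departed client.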