There seems to be a race condition that can now cause a rejoining member to get stuck indefinitely while initiating a rejoin. The relevant client logs are attached (streams.log.tgz; all other attachments are broker logs), but basically the client repeats this message (and nothing else) continuously until it is killed or shut down:
This issue was uncovered by the Streams version probing upgrade test, which fails intermittently. Here are the failure rates for the system test runs so far:
trunk (cooperative): 1/1 and 2/10 failures
2.4 (cooperative): 0/10 and 1/15 failures
trunk (eager): 0/10 failures
I've kicked off some high-repeat runs to complete overnight, which will hopefully shed more light on this.
Note that I have also kicked off runs of both 2.4 and trunk with the PR for KAFKA-8104 reverted. Both of them saw 2/10 failures, due to hitting the bug that was fixed by PR 7460. It is therefore unclear whether PR 7460 introduced a new race condition/bug, or merely uncovered an existing one that was previously hidden because the test would have failed first due to KAFKA-8104.