[IGNITE-20441] ItRebalanceRecoveryTest is flaky - ASF JIRA

XML

Word

Printable

JSON

org.apache.ignite.internal.rebalance.ItRebalanceRecoveryTest is flaky on TC, see TC history.

After some invsetigation, I found out that the problem is the following:

Imagine two threads: A and B.
Thread A executes an update from the test. Primary replica generates a timestamp, creates a Raft command and tries to apply it, but the thread stalls for any reason.
Thread B performs idle safe time sync. Primary replica generates a timestamp (larger than the timestamp from the previous step), creates a Raft command and successfully applies it.
Thread A resumes its execution and applies its command. This means that the Raft command from thread B will be applied before the command from thread A, despite their timestamps being ordered differently.
According to the test protocol, a node gets restarted and needs to apply the missing Raft log part, which means re-applying the two commands above. However, the second command (which inserts data) will be ignored, because there's code in PartitionListener that ignores storage updates if their timestamp is smaller than the current timestamp (which got updated by the first command).

Therefore, this bug is definitely caused by ~~IGNITE-20116~~

is caused by

IGNITE-20116 Linearize storage updates with safeTime adjustment rules