Description
Currently, historical rebalance (CheckpointHistory#searchEarliestWalPointer) searches for the newest checkpoint whose counter is less than the lowest entry that has to be rebalanced.
Unfortunately, more than one checkpoint may have the same counter, in which case the newest one cannot be used as the rebalance start point.
For example, suppose we have a partition with LWM=100, some gaps, and HWM=200.
A checkpoint will have counter == 100.
Then we may close some of the gaps, excluding 101 (to keep LWM == 100).
Again, the checkpoint will have counter == 100.
The newest checkpoint (marked with counter 100) will not contain all committed entries with counter > 100.
Then let's close the rest of the gaps to make historical rebalance possible.
After the rebalance finishes, we'll see the warning "Some partition entries were missed during historical rebalance" and an inconsistent cluster state.
See the reproducer in HistoricalRebalanceCheckpointTest.java.
A possible solution is to use the HWM instead of the LWM during the search.
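The scenario above can be sketched with a toy model. This is only an illustration, not Ignite code: the class CheckpointSearchSketch and all of its methods are hypothetical, the WAL is modelled as a plain list of update counters, and an extra initial checkpoint taken at HWM == 100 is assumed so that the HWM-based search has something to find. "Use HWM" is read here as "pick the newest checkpoint whose HWM is below the lowest entry to rebalance", which is one possible interpretation of the proposal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class CheckpointSearchSketch {
    /** Toy checkpoint: WAL position at which it was taken plus the counters at that moment. */
    static final class Checkpoint {
        final int walPos;
        final long lwm;
        final long hwm;

        Checkpoint(int walPos, long lwm, long hwm) {
            this.walPos = walPos;
            this.lwm = lwm;
            this.hwm = hwm;
        }
    }

    private final List<Long> wal = new ArrayList<>();          // update counters in write order
    private final TreeSet<Long> applied = new TreeSet<>();     // counters applied so far
    private final List<Checkpoint> checkpoints = new ArrayList<>();

    /** Applies updates with counters from..to (inclusive) and logs them to the toy WAL. */
    void apply(long from, long to) {
        for (long c = from; c <= to; c++) {
            wal.add(c);
            applied.add(c);
        }
    }

    /** Records a checkpoint with the current LWM (end of the gap-free prefix) and HWM (max counter). */
    void checkpoint() {
        long lwm = 0;
        while (applied.contains(lwm + 1))
            lwm++;
        long hwm = applied.isEmpty() ? 0 : applied.last();
        checkpoints.add(new Checkpoint(wal.size(), lwm, hwm));
    }

    /** Newest checkpoint whose LWM (or HWM, if useHwm) is less than the lowest needed counter. */
    Checkpoint search(long lowestNeeded, boolean useHwm) {
        for (int i = checkpoints.size() - 1; i >= 0; i--) {
            Checkpoint cp = checkpoints.get(i);
            if ((useHwm ? cp.hwm : cp.lwm) < lowestNeeded)
                return cp;
        }
        return null;
    }

    /** Counts needed entries written before the start point, i.e. entries a replay would miss. */
    long missed(Checkpoint start, long lowestNeeded) {
        return wal.subList(0, start.walPos).stream().filter(c -> c >= lowestNeeded).count();
    }

    /** Builds the scenario from the description (plus the assumed initial checkpoint). */
    static CheckpointSearchSketch scenario() {
        CheckpointSearchSketch p = new CheckpointSearchSketch();
        p.apply(1, 100);
        p.checkpoint();                 // assumed: LWM=100, HWM=100
        p.apply(150, 150);
        p.apply(200, 200);
        p.checkpoint();                 // LWM=100, gaps, HWM=200
        p.apply(102, 149);              // close some gaps, excluding 101
        p.checkpoint();                 // still LWM=100, HWM=200
        p.apply(101, 101);
        p.apply(151, 199);              // close the rest of the gaps
        return p;
    }

    public static void main(String[] args) {
        CheckpointSearchSketch p = scenario();
        long lowestNeeded = 101;        // the demander already has everything up to 100
        System.out.println("LWM search misses: " + p.missed(p.search(lowestNeeded, false), lowestNeeded));
        System.out.println("HWM search misses: " + p.missed(p.search(lowestNeeded, true), lowestNeeded));
    }
}
```

In this model the LWM-based search lands on the newest checkpoint with counter 100, so every entry above 100 that was written before it is skipped, while the HWM-based search starts early enough to cover all of them. The price is a potentially longer WAL replay.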
Attachments
Issue Links
- Dependency: IGNITE-17908 AssertionError LWM after reserved on data insertion after the cluster restart (Resolved)
- is related to: IGNITE-18343 Refactor partition counters API. (Open)
- links to