  1. Ignite
  2. IGNITE-17845 Make Ignite Consistent Again
  3. IGNITE-17793

Historical rebalance must use HWM instead of LWM to seek the proper checkpoint to avoid the data loss

Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 2.15
    • None
    • Fixed potential data loss on historical rebalance
    • Docs Required, Release Notes Required

    Description

      Currently, historical rebalance in CheckpointHistory#searchEarliestWalPointer seeks the newest checkpoint whose counter is less than that of the lowest entry that has to be rebalanced.

      Unfortunately, we may have more than one checkpoint with the same counter, and in that case it's impossible to use the newest one as a rebalance start point.

      For example, we have a partition with LWM=100, some gaps and HWM=200.
      A checkpoint will have counter == 100.
      Then we may close some gaps, excluding 101 (to keep LWM == 100).
      And again, a checkpoint will have counter == 100.
      The newest checkpoint (marked with counter 100) will not contain all committed entries with counter > 100.
      Then let's close the rest of the gaps to make historical rebalance possible.
      After the rebalance finishes, we'll see the warning "Some partition entries were missed during historical rebalance" and an inconsistent cluster state.

      See the reproducer in HistoricalRebalanceCheckpointTest.java.

      A possible solution is to use HWM instead of LWM during the search.
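
The scenario can be replayed on a toy model. The sketch below is not Ignite code: the class, the Checkpoint type and the search/replay helpers are hypothetical, the WAL is reduced to a list of update counters in write order, and the counters are shrunk to 100..105 instead of 100..200. It also assumes one reading of the proposal: a checkpoint is a safe start point only if its HWM does not exceed the demander's counter, because then no entry the demander needs could have been written before it.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the scenario above; none of these types are Ignite classes. */
class HistoricalRebalanceSketch {
    /** A checkpoint: the WAL offset where it was taken plus the partition counters at that moment. */
    static final class Checkpoint {
        final int walOffset;
        final long lwm;
        final long hwm;

        Checkpoint(int walOffset, long lwm, long hwm) {
            this.walOffset = walOffset;
            this.lwm = lwm;
            this.hwm = hwm;
        }
    }

    /** Newest checkpoint whose LWM (useHwm = false) or HWM (useHwm = true) is <= the demander's counter. */
    static Checkpoint search(List<Checkpoint> history, long demanderCntr, boolean useHwm) {
        Checkpoint found = null;

        for (Checkpoint cp : history) { // History is ordered from oldest to newest.
            long key = useHwm ? cp.hwm : cp.lwm;

            if (key <= demanderCntr)
                found = cp;
        }

        return found;
    }

    /** Replays the WAL from the given checkpoint and collects the counters above demanderCntr. */
    static List<Long> replay(List<Long> wal, Checkpoint from, long demanderCntr) {
        List<Long> sent = new ArrayList<>();

        for (int i = from.walOffset; i < wal.size(); i++) {
            if (wal.get(i) > demanderCntr)
                sent.add(wal.get(i));
        }

        return sent;
    }

    /** Updates in write order: 103 and 105 open gaps, 102 and 104 close some, 101 closes the last one. */
    static final List<Long> WAL = List.of(103L, 105L, 102L, 104L, 101L);

    /** Checkpoints taken before any update, after the first two updates and after the first four. */
    static final List<Checkpoint> HISTORY = List.of(
        new Checkpoint(0, 100, 100),
        new Checkpoint(2, 100, 105),
        new Checkpoint(4, 100, 105));

    public static void main(String[] args) {
        long demanderCntr = 100; // The demander already has everything up to 100.

        // Prints "LWM search: [101]": entries 102..105 are missed.
        System.out.println("LWM search: " + replay(WAL, search(HISTORY, demanderCntr, false), demanderCntr));

        // Prints "HWM search: [103, 105, 102, 104, 101]": nothing is missed.
        System.out.println("HWM search: " + replay(WAL, search(HISTORY, demanderCntr, true), demanderCntr));
    }
}
```

In this model the LWM-based search picks the newest of the checkpoints marked with counter 100 and skips every update written before it, while the HWM-based search falls back to the older checkpoint taken before any of the required updates were logged.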

      Attachments

        1. HistoricalRebalanceCheckpointTest.java
          10 kB
          Anton Vinogradov

            People

              vladsz83 Vladimir Steshin
              av Anton Vinogradov
