Ignite / IGNITE-12617

PME-free switch should wait for recovery only at affected nodes.


Details

    • Type: Task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: None
    • Release Notes Required

    Description

      Since IGNITE-9913, new-topology operations are allowed immediately after the cluster-wide recovery has finished.

      But is there any reason to wait for cluster-wide recovery if only one node failed?
      In this case, only the failed node's backups need to be recovered.
      Unfortunately, RendezvousAffinityFunction tends to spread a node's backup partitions across the whole cluster, so in this case we obviously have to wait for cluster-wide recovery on switch.
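      To illustrate the spreading effect, here is a minimal sketch of plain rendezvous (highest-random-weight) hashing, assuming 16 nodes, 1024 partitions, and 2 backups. It is not Ignite's RendezvousAffinityFunction, which uses its own hashing and node-filtering logic; the class and method names are hypothetical. It counts how many distinct nodes hold backups of the partitions whose primary is the failed node.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of rendezvous (highest-random-weight) hashing,
// not Ignite's RendezvousAffinityFunction.
class RendezvousSpread {
    // SplitMix64 finalizer: a cheap hash that mixes all input bits.
    static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    // Owners of a partition: every node is weighted by hash(partition, node) and
    // the (backups + 1) heaviest nodes win. Index 0 is the primary, the rest are backups.
    static List<Integer> owners(int part, int nodes, int backups) {
        List<Integer> all = new ArrayList<>();
        for (int n = 0; n < nodes; n++)
            all.add(n);
        all.sort(Comparator.comparingLong((Integer node) -> mix(((long) part << 32) | node)).reversed());
        return new ArrayList<>(all.subList(0, backups + 1));
    }

    // Nodes holding backups of partitions whose primary is the failed node.
    static Set<Integer> backupsOfPrimaries(int failed, int nodes, int parts, int backups) {
        Set<Integer> affected = new HashSet<>();
        for (int p = 0; p < parts; p++) {
            List<Integer> own = owners(p, nodes, backups);
            if (own.get(0) == failed)
                affected.addAll(own.subList(1, own.size()));
        }
        return affected;
    }

    public static void main(String[] args) {
        int nodes = 16, parts = 1024, backups = 2;
        Set<Integer> affected = backupsOfPrimaries(0, nodes, parts, backups);
        System.out.println("Nodes holding backups of node 0's primaries: "
            + affected.size() + " of " + (nodes - 1));
    }
}
```

      With a well-mixed hash, the count typically comes out at or near all 15 surviving nodes, so practically every node takes part in recovering the failed node's transactions.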

      But what if each primary's backups were placed on only a small, fixed group of nodes?

      If nodes are combined into virtual cells such that, for each partition, the backups are located in the same cell as the primary, it's possible to finish the switch outside the affected cell before tx recovery finishes.
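      A minimal sketch of such a cell layout, assuming a hypothetical CellAffinity class (not part of Ignite) that groups nodes into cells of backups + 1 nodes and places every copy of a partition inside one cell. It computes which nodes share at least one partition with a failed node; under this layout, only the failed node's cell-mates are affected.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical "cell" affinity, not an Ignite AffinityFunction implementation:
// nodes [0..cellSize) form cell 0, [cellSize..2*cellSize) form cell 1, and so on,
// and all copies (primary + backups) of a partition live in a single cell.
class CellAffinity {
    final int cellSize;
    final int cells;

    CellAffinity(int nodes, int backups) {
        cellSize = backups + 1;
        if (nodes % cellSize != 0)
            throw new IllegalArgumentException("nodes must be a multiple of backups + 1");
        cells = nodes / cellSize;
    }

    int cellOf(int node) {
        return node / cellSize;
    }

    // Owners of a partition; index 0 is the primary. The primary role rotates
    // inside the cell so that primaries are spread evenly across its nodes.
    List<Integer> owners(int part) {
        int cell = part % cells;
        int shift = (part / cells) % cellSize;
        List<Integer> res = new ArrayList<>();
        for (int i = 0; i < cellSize; i++)
            res.add(cell * cellSize + (shift + i) % cellSize);
        return res;
    }

    // Nodes sharing at least one partition with the failed node: the only nodes
    // whose tx recovery depends on the failure.
    Set<Integer> affectedBy(int failed, int parts) {
        Set<Integer> res = new HashSet<>();
        for (int p = 0; p < parts; p++) {
            List<Integer> own = owners(p);
            if (own.contains(failed))
                res.addAll(own);
        }
        res.remove(failed);
        return res;
    }

    public static void main(String[] args) {
        CellAffinity aff = new CellAffinity(12, 2); // 4 cells of 3 nodes each
        // Node 4 belongs to cell 1 = {3, 4, 5}; only its cell-mates are affected.
        System.out.println(aff.affectedBy(4, 1024)); // prints [3, 5]
    }
}
```

      The rotating primary is only a convenience for balance; the property that matters here is that no partition has copies in two different cells.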

      This optimization allows new operations outside the failed cell to start, and even finish, without waiting for the cluster-wide switch to finish (that is, for the broken cell's recovery).

      In other words, a switch (triggered when a baseline node leaves or fails in a fully rebalanced cluster) will have little effect on the latency of operations not related to the failed cell.

      To summarize:

      • Only nodes in the broken cell should wait for tx recovery before finishing the switch.
      • Tx recovery of replicated caches should be awaited on every node, since every node is a backup of the failed one.
      • Upcoming operations related to the broken cell (including all operations on replicated caches) will require the cluster-wide switch to finish before they can be processed.
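      The three rules above can be sketched as two predicates. This is a hypothetical model, not Ignite internals: the SwitchRules class, its method names, and the node-to-cell map are illustrative assumptions, and the nested CacheMode enum merely mirrors the PARTITIONED and REPLICATED values of org.apache.ignite.cache.CacheMode.

```java
import java.util.Map;

// Hypothetical model of the switch rules, not Ignite internals.
class SwitchRules {
    // Stand-in mirroring org.apache.ignite.cache.CacheMode.PARTITIONED / REPLICATED.
    enum CacheMode { PARTITIONED, REPLICATED }

    // Rules 1 and 2: must this node wait for tx recovery of a cache with the given
    // mode before finishing the switch? cellOf maps a node id to its cell id.
    static boolean waitsForRecovery(int node, int failedNode, CacheMode mode, Map<Integer, Integer> cellOf) {
        if (mode == CacheMode.REPLICATED)
            return true; // every node is a backup of the failed one
        return cellOf.get(node).equals(cellOf.get(failedNode)); // only the broken cell waits
    }

    // Rule 3: must an operation wait for the cluster-wide switch to finish?
    static boolean needsClusterWideFinish(CacheMode mode, int partitionCell, int brokenCell) {
        return mode == CacheMode.REPLICATED || partitionCell == brokenCell;
    }

    public static void main(String[] args) {
        // Two cells of two nodes each; node 1 fails, so cell 0 is broken.
        Map<Integer, Integer> cellOf = Map.of(0, 0, 1, 0, 2, 1, 3, 1);
        int failed = 1;
        for (int node : new int[] {0, 2, 3}) {
            System.out.println("node " + node
                + ": waits on PARTITIONED = " + waitsForRecovery(node, failed, CacheMode.PARTITIONED, cellOf)
                + ", waits on REPLICATED = " + waitsForRecovery(node, failed, CacheMode.REPLICATED, cellOf));
        }
        int brokenCell = cellOf.get(failed);
        System.out.println("PARTITIONED op in cell 1 needs cluster-wide finish: "
            + needsClusterWideFinish(CacheMode.PARTITIONED, 1, brokenCell));
    }
}
```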


    People

    • Anton Vinogradov (avinogradov; account obsolete, current username is "av")
    • Alexey Scherbakov


    Time Tracking

    • Original Estimate: Not Specified
    • Remaining Estimate: 0h
    • Time Spent: 20m