Solr / SOLR-7033

RecoveryStrategy should not publish any state when closed / cancelled.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10.4, 5.0, 6.0
    • Component/s: None
    • Labels: None

      Description

      Currently, RecoveryStrategy can publish a recovery failed state even after it has been closed / cancelled. In a bad loop (such as when no one can become leader because no one had a last state of active), this can cause that state to be published to ZooKeeper in a very fast loop.

      It's an outstanding item to improve that specific scenario anyway, but regardless, we should fix the close / cancel path so that it never publishes any state to ZooKeeper.
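
A minimal sketch of the close / cancel guard described above. It assumes a hypothetical StatePublisher callback in place of the real ZooKeeper publishing code; the class and method names are illustrative and are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;

class CloseAwareRecovery {
    /** Hypothetical stand-in for whatever writes replica state to ZooKeeper. */
    interface StatePublisher {
        void publish(String state);
    }

    // volatile so a close() from another thread is seen by the recovery thread
    private volatile boolean closed = false;
    private final StatePublisher publisher;

    CloseAwareRecovery(StatePublisher publisher) {
        this.publisher = publisher;
    }

    /** Marks this recovery as closed / cancelled. */
    void close() {
        closed = true;
    }

    /** Publishes only while the recovery is still open; returns whether it published. */
    boolean publishIfOpen(String state) {
        if (closed) {
            return false; // closed / cancelled: never publish anything
        }
        publisher.publish(state);
        return true;
    }

    /** Failure path: routed through the same guard as every other publish. */
    void recoveryFailed() {
        publishIfOpen("recovery_failed");
    }

    public static void main(String[] args) {
        List<String> published = new ArrayList<>();
        CloseAwareRecovery recovery = new CloseAwareRecovery(published::add);
        recovery.recoveryFailed(); // still open: state is published
        recovery.close();
        recovery.recoveryFailed(); // closed: nothing is published
        System.out.println(published); // prints [recovery_failed]
    }
}
```

Routing every publish through one guarded method is a design choice for the sketch: it keeps the failure path from bypassing the closed check.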

      1. SOLR-7033.patch (4 kB, Mark Miller)
      2. SOLR-7033.patch (13 kB, Mark Miller)
      3. SOLR-7033.patch (11 kB, Mark Miller)
      4. SOLR-7033.patch (9 kB, Mark Miller)
      5. SOLR-7033.patch (3 kB, Mark Miller)

          Activity

          Mark Miller added a comment -

          Here is a new patch that also attempts to make sure that stopped / started recoveries will pause when the last attempt started less than 10 seconds ago.

          Mark Miller added a comment -

          Need to move that to a per core location.

          Mark Miller added a comment -

          Better patch attached.

          Shalin Shekhar Mangar added a comment -

          +1 LGTM

          Mark Miller added a comment -

          I've added another throttle on how fast a core will attempt to become the leader in the latest patch.

          Mark Miller added a comment -

          + private final ActionThrottle recoveryThrottle = new ActionThrottle("recovery attempt", 10000);

          + private final ActionThrottle leaderThrottle = new ActionThrottle("leader attempt", 5000);

          + log.info("Throttling {} attempts - waiting for {} ms", name, sleep);

          I'll consolidate the attempts usage.
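
The throttle quoted in the lines above could look roughly like the following. This is an illustrative sketch and not the code in the attached patch: the class name ThrottleSketch, the remainingWaitMs helper, and the injectable clock are assumptions added so the behaviour can be shown without real sleeping. The constructor arguments and the log message follow the quoted lines.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

class ThrottleSketch {
    private final String name;
    private final long minMsBetweenActions;
    private final LongSupplier nanoClock; // hypothetical injection point for tests
    private boolean started = false;      // explicit flag instead of a sentinel timestamp
    private long lastActionStartedAt;

    ThrottleSketch(String name, long minMsBetweenActions) {
        this(name, minMsBetweenActions, System::nanoTime);
    }

    ThrottleSketch(String name, long minMsBetweenActions, LongSupplier nanoClock) {
        this.name = name;
        this.minMsBetweenActions = minMsBetweenActions;
        this.nanoClock = nanoClock;
    }

    /** Records the start time of the attempt that is about to run. */
    void markAttemptingAction() {
        lastActionStartedAt = nanoClock.getAsLong();
        started = true;
    }

    /** Milliseconds left before the next attempt may start; 0 if none. */
    long remainingWaitMs() {
        if (!started) {
            return 0;
        }
        long elapsedNanos = nanoClock.getAsLong() - lastActionStartedAt;
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(elapsedNanos);
        return Math.max(0, minMsBetweenActions - elapsedMs);
    }

    /** Sleeps for whatever part of the minimum interval has not yet elapsed. */
    void minimumWaitBetweenActions() throws InterruptedException {
        long sleep = remainingWaitMs();
        if (sleep > 0) {
            System.out.printf("Throttling %s attempts - waiting for %d ms%n", name, sleep);
            Thread.sleep(sleep);
        }
    }

    public static void main(String[] args) {
        long[] fakeNow = {0L}; // fake clock reading, in nanoseconds
        ThrottleSketch recoveryThrottle =
            new ThrottleSketch("recovery attempt", 10000, () -> fakeNow[0]);
        recoveryThrottle.markAttemptingAction();
        fakeNow[0] = TimeUnit.SECONDS.toNanos(4); // 4 s after the last attempt started
        System.out.println(recoveryThrottle.remainingWaitMs()); // prints 6000
    }
}
```

A caller would invoke minimumWaitBetweenActions() and then markAttemptingAction() before each attempt, so that rapidly stopped and restarted recoveries are still spaced at least the minimum interval apart.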

          Mark Miller added a comment -

          Given the reports in SOLR-5961 and other cases of this I've seen, I think this is as important as the corrupted index issue to put in 5.0.

          ASF subversion and git services added a comment -

          Commit 1658236 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1658236 ]

          SOLR-7033, SOLR-5961: RecoveryStrategy should not publish any state when closed / cancelled and there should always be a pause between recoveries even when recoveries are rapidly stopped and started as well as when a node attempts to become the leader for a shard.

          ASF subversion and git services added a comment -

          Commit 1658237 from Mark Miller in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1658237 ]

          SOLR-7033, SOLR-5961: RecoveryStrategy should not publish any state when closed / cancelled and there should always be a pause between recoveries even when recoveries are rapidly stopped and started as well as when a node attempts to become the leader for a shard.

          ASF subversion and git services added a comment -

          Commit 1658251 from Mark Miller in branch 'dev/branches/lucene_solr_5_0'
          [ https://svn.apache.org/r1658251 ]

          SOLR-7033, SOLR-5961: RecoveryStrategy should not publish any state when closed / cancelled and there should always be a pause between recoveries even when recoveries are rapidly stopped and started as well as when a node attempts to become the leader for a shard.

          Gregory Chanan added a comment -
          if (lastActionStartedAt == 0) {
          

          Probably unlikely to happen, but I believe the call to nanoTime could return any value, even 0. So it's possible you could markAttemptingAction and then all calls to minimumWaitBetweenActions just return.
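
The concern above can be illustrated with a small sketch. System.nanoTime() is documented as counting from an arbitrary origin, so a reading of exactly 0 is legal. The two helper methods below are hypothetical and only contrast the quoted sentinel check with one possible alternative, an explicit flag; the check actually added to the patch may differ.

```java
class SentinelDemo {
    /** Wait (in ns) using 0 as the "never started" sentinel, as in the quoted check. */
    static long waitWithSentinel(long lastActionStartedAt, long now, long minNanos) {
        if (lastActionStartedAt == 0) {
            return 0; // also taken when a real start time happened to be 0
        }
        return Math.max(0, minNanos - (now - lastActionStartedAt));
    }

    /** Same computation, but "never started" is tracked by an explicit flag. */
    static long waitWithFlag(boolean started, long lastActionStartedAt, long now, long minNanos) {
        if (!started) {
            return 0;
        }
        return Math.max(0, minNanos - (now - lastActionStartedAt));
    }

    public static void main(String[] args) {
        long minNanos = 10_000_000_000L; // 10 s minimum interval
        long startedAt = 0L;             // hypothetical nanoTime() reading of exactly 0
        long now = 1_000_000_000L;       // 1 s later

        System.out.println(waitWithSentinel(startedAt, now, minNanos));   // prints 0
        System.out.println(waitWithFlag(true, startedAt, now, minNanos)); // prints 9000000000
    }
}
```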

          Mark Miller added a comment -

          I'll add a check. See anything else?

          Gregory Chanan added a comment -

          Besides RecoveryStrategy.java, the rest looked fine. I'm not really familiar with the RecoveryStrategy code, so I'm not sure I can say anything intelligent about it at this point.

          Mark Miller added a comment -

          Here is the patch. If we don't end up doing an rc3, I'll spin it off into a new issue.

          Anshum Gupta added a comment -

          Bulk close after 5.0 release.

          Steve Rowe added a comment -

          Mark Miller, looks like your last patch on this issue was never committed?

          Steve Rowe added a comment -

          Reopening to backport to 4.10.4

          ASF subversion and git services added a comment -

          Commit 1662784 from Steve Rowe in branch 'dev/branches/lucene_solr_4_10'
          [ https://svn.apache.org/r1662784 ]

          SOLR-7033, SOLR-5961: RecoveryStrategy should not publish any state when closed / cancelled and there should always be a pause between recoveries even when recoveries are rapidly stopped and started as well as when a node attempts to become the leader for a shard. (merged branch_5x r1658237)

          Steve Rowe added a comment -

          Committed to lucene_solr_4_10.

          Michael McCandless added a comment -

          Bulk close for 4.10.4 release


            People

            • Assignee: Steve Rowe
            • Reporter: Mark Miller
            • Votes: 0
            • Watchers: 7
