
SOLR-6405: ZooKeeper calls can easily not be retried enough on ConnectionLoss.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10, 6.0
    • Component/s: SolrCloud
    • Labels: None

      Description

      The current design requires that we reliably retry on connection loss until session expiration.
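
      A minimal sketch of the invariant at stake, in Java - hypothetical names, since the real logic lives in Solr's ZkCmdExecutor and may differ in detail: keep retrying on ConnectionLoss, let SessionExpired (and other errors) propagate, and only give up once we have waited longer than the session timeout.

          import java.util.concurrent.Callable;
          import org.apache.zookeeper.KeeperException;

          // Sketch only, not the actual Solr code: retry a ZooKeeper call on
          // ConnectionLoss; SessionExpired and other exceptions propagate, since
          // retrying past session expiration is pointless.
          public static <T> T retryOnConnectionLoss(Callable<T> op, int retryCount,
              long retryDelayMs) throws Exception {
            KeeperException.ConnectionLossException lastLoss = null;
            for (int attempt = 0; attempt < retryCount; attempt++) {
              try {
                return op.call();
              } catch (KeeperException.ConnectionLossException e) {
                lastLoss = e;
                Thread.sleep((attempt + 1) * retryDelayMs); // linear back-off
              }
            }
            // Only reached if retryCount/retryDelayMs were sized too small -
            // the failure mode this issue fixes.
            throw lastLoss;
          }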

          Activity

          ASF subversion and git services added a comment -

          Commit 1619810 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1619810 ]

          SOLR-6405: ZooKeeper calls can easily not be retried enough on ConnectionLoss.

          ASF subversion and git services added a comment -

          Commit 1619811 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1619811 ]

          SOLR-6405: ZooKeeper calls can easily not be retried enough on ConnectionLoss.

          ASF subversion and git services added a comment -

          Commit 1619817 from Mark Miller in branch 'dev/branches/lucene_solr_4_10'
          [ https://svn.apache.org/r1619817 ]

          SOLR-6405: ZooKeeper calls can easily not be retried enough on ConnectionLoss.

          Jessica Cheng Mallet added a comment -

          Should the "if (attemptCount > 0)" check be removed in retryDelay, now that the sleep is (attemptCount + 1) * retryDelay?

          I think in practice we'd never miss the initial 1.5s sleep, since the padding on the retryDelay is enough to make up for it, but it's slightly harder to reason about. (The way I thought it worked: the retry count is calculated so that roughly 1+2+3+4+...+retryCount ~= timeoutSec, and when that's multiplied by 1.5x (the retryDelay), the timeout is covered. Is this right?)
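
          For reference, that reading worked out for the default 15s session timeout - a sketch of the arithmetic only, not the actual ZkCmdExecutor source:

              // timeoutSec = 15: pick retryCount so 1 + 2 + ... + retryCount ~= timeoutSec.
              // Solving n(n+1)/2 = 15 gives n = 5, i.e. 1+2+3+4+5 = 15.
              int timeoutSec = 15;
              int retryCount = (int) Math.round((Math.sqrt(8.0 * timeoutSec + 1) - 1) / 2); // 5
              long totalSleepMs = 0;
              for (int attempt = 0; attempt < retryCount; attempt++) {
                totalSleepMs += (attempt + 1) * 1500L; // sleep = (attemptCount + 1) * retryDelay
              }
              // totalSleepMs == 22500: 1.5x the 15s timeout, so the session timeout is covered.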

          Mark Miller added a comment -

          Is this right?

          Yeah, that was basically the idea - a fallback timeout with some padding to ensure we retry longer than the session timeout, but not so long that we are waiting for no reason and tying up threads.

          Mark Miller added a comment -

          Should the "if (attemptCount > 0)" check be removed in retryDelay, now that the sleep is (attemptCount + 1) * retryDelay?

          Yeah, good catch.

          I think in practice we'd never miss the initial 1.5s sleep since the padding on the retryDelay is enough to make up for it,

          Agreed.
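
          The change being agreed on, sketched - the method shape is inferred from this thread, not copied from the source, and retryDelay is the 1500ms field:

              // Before: the guard meant the first retry (attemptCount == 0) did not sleep at all.
              protected void retryDelay(int attemptCount) throws InterruptedException {
                if (attemptCount > 0) {
                  Thread.sleep((attemptCount + 1) * retryDelay);
                }
              }

              // After: attemptCount 0 now sleeps one retryDelay (1.5s); the check is gone
              // and the back-off is easier to reason about.
              protected void retryDelay(int attemptCount) throws InterruptedException {
                Thread.sleep((attemptCount + 1) * retryDelay);
              }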

          ASF subversion and git services added a comment -

          Commit 1619879 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1619879 ]

          SOLR-6405: Remove unnecessary attemptCount > 0 check.

          ASF subversion and git services added a comment -

          Commit 1619886 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1619886 ]

          SOLR-6405: Remove unnecessary attemptCount > 0 check.

          Jessica Cheng Mallet added a comment -

          but not so long as we are waiting for no reason and tying up threads

          This is actually the other thing that I was worried about. With the padding being on a multiplier, for the default 15s timeout, we're already doing 7.5s of total extra sleep (1.5+3+4.5+6+7.5=22.5). Is that too much?

          With your change of the comment "// 1500 ms over for padding", did you actually mean to do something like (attemptCount + 1) * 1000 + retryDelay?
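
          The two padding schemes, worked out side by side for timeoutSec = 15 and retryCount = 5 - illustrative arithmetic only, reading the additive form's 500ms (the amount over 1000) as the per-attempt padding:

              long multiplierTotal = 0, additiveTotal = 0;
              for (int attempt = 0; attempt < 5; attempt++) {
                multiplierTotal += (attempt + 1) * 1500L;        // padding scales with attempt
                additiveTotal   += (attempt + 1) * 1000L + 500L; // flat 500ms padding per attempt
              }
              // multiplierTotal == 22500ms: 7.5s over the 15s timeout (1.5+3+4.5+6+7.5)
              // additiveTotal   == 17500ms: 2.5s over the 15s timeout (1.5+2.5+3.5+4.5+5.5)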

          Mark Miller added a comment -

          Is that too much?

          I'm much more worried about doing too little than too much; that is why it was done that way. Perhaps the padding could be brought down (it was probably brought up to combat a sign of this bug - it was 1200 initially, I think), but I think we want to err on the long-wait side.

          // 1500 ms over for padding

          Hmm, actually 500 over was right. It's meant to be a second, so anything over 1000 is the padding.
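
          In other words, retryDelay = 1500ms = 1000ms (the nominal one-second step) + 500ms of padding. Under the multiplier scheme that 500ms share compounds to (1+2+3+4+5) x 0.5s = 7.5s of total padding for the default 15s timeout - the same 7.5s computed above.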

          ASF subversion and git services added a comment -

          Commit 1619891 from Mark Miller in branch 'dev/branches/lucene_solr_4_10'
          [ https://svn.apache.org/r1619891 ]

          SOLR-6405: Remove unnecessary attemptCount > 0 check.

          ASF subversion and git services added a comment -

          Commit 1619892 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1619892 ]

          SOLR-6405: Update comment around retry pause padding.

          Mark Miller added a comment - edited

          I'm much more worried about doing too little than too much

          Also, keep in mind, it only keeps retrying on connection loss - it's going to end up getting the session expiration, which will usually cause it to bail before waiting the full timeout.

          ASF subversion and git services added a comment -

          Commit 1619895 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1619895 ]

          SOLR-6405: Update comment around retry pause padding.

          Jessica Cheng Mallet added a comment - edited

          Right - most likely the first time it hits ConnectionLoss it's not at time=0 of the connection loss, so by loop i=4 it would have slept for 15s since i=0 and therefore hit a SessionExpired.

          But then, thinking about it again, why be clever at all about the padding or back-off?

          Not to propose that we change this now, but let's pretend we don't do back-off and just sleep 1s between each loop. If we get ConnectionLoss back on the next attempt, there's no harm in trying at all, because if we're disconnected the attempt isn't hitting zookeeper anyway. If we get SessionExpired back, great, we can break out now and throw the exception. If we've reconnected, then yay, we succeeded.

          Because each call is expected to return either success, failure (SessionExpired), or "in progress" (ConnectionLoss), we can really just retry "forever" without limiting the loop count (unless we're worried that somehow we'll keep getting ConnectionLoss even though the session has expired, but that would be a pretty serious zookeeper client bug - and if we're really worried about it, we can always do, say, 10 more loops after we've slept a total of the timeout already). The advantage of this approach is that we never sleep too long before finding out the definitive answer of success or SessionExpired, while if the answer is ConnectionLoss, we're not really incurring any extra load on zookeeper anyway.

          In the end, it's really weird that this method should ever semantically be allowed to throw a ConnectionLoss exception just because we got the math wrong, when the intent is to retry until we get a SessionExpired, isn't it? (Oh, or success, of course.)
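
          A sketch of that alternative, purely hypothetical - this is not what Solr does, and the names here are illustrative: flat 1s sleeps, no retry-count math, loop until success or SessionExpired, with a bounded escape hatch once total sleep passes the session timeout.

              import java.util.concurrent.Callable;
              import org.apache.zookeeper.KeeperException;

              // Hypothetical alternative from the comment above - not Solr's implementation:
              // flat 1s sleep, retry on ConnectionLoss until success or SessionExpired,
              // plus a paranoia bound of 10 extra loops once we've slept a full timeout.
              public static <T> T retryFlat(Callable<T> op, long timeoutMs) throws Exception {
                long sleptMs = 0;
                int extraLoops = 0;
                while (true) {
                  try {
                    return op.call(); // success; SessionExpired and other errors propagate
                  } catch (KeeperException.ConnectionLossException e) {
                    if (sleptMs >= timeoutMs && ++extraLoops > 10) {
                      throw e; // the session should have expired long before this point
                    }
                    Thread.sleep(1000L);
                    sleptMs += 1000L;
                  }
                }
              }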

          Mark Miller added a comment -

          we can really just retry "forever" without limiting the loop count

          You can get other exceptions as well, though. The original implementation waited forever, I think, and it caused a lot of issues in some cases - perhaps mainly in failure cases. I don't really remember the details, and a lot has changed since then.

          The advantage of this approach is to never sleep for too long before finding out the definitive answer of success or SessionExpired

          Yeah, I guess it just hasn't been a big concern of mine, since the extra wait is not really enough to matter and the worst case is neither common nor that bad. This should be an exceptional case, and it's either a quick blip or a really expensive recovery anyway. I didn't really see getting out as fast as possible as a goal, because there didn't seem to be much benefit - the goal was just to ensure we waited longer than expiration.

          If you want to explore this alternative approach, let's open a new JIRA for it.

          Mark Miller added a comment -

          SOLR-6409 Improve ZkCmdExecutor retry implementation.

          Jessica Cheng Mallet added a comment -

          OK, thanks for the explanations! They're really helpful.

          Mark Miller added a comment -

          Thanks Jessica!


             People

            • Assignee: Mark Miller
            • Reporter: Mark Miller
            • Votes: 0
            • Watchers: 4