Solr / SOLR-8371

Try and prevent too many recovery requests from stacking up and clean up some faulty logic.

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.5, 6.0
    • Component/s: SolrCloud
    • Labels: None
    Attachments

    1. SOLR-8371.patch
      12 kB
      Mark Miller
    2. SOLR-8371.patch
      11 kB
      Mark Miller
    3. SOLR-8371.patch
      10 kB
      Mark Miller
    4. SOLR-8371.patch
      10 kB
      Mark Miller
    5. SOLR-8371.patch
      10 kB
      Mark Miller
    6. SOLR-8371.patch
      6 kB
      Mark Miller
    7. SOLR-8371.patch
      5 kB
      Mark Miller
    8. SOLR-8371.patch
      4 kB
      Mark Miller
    9. SOLR-8371-2.patch
      6 kB
      Ramkumar Aiyengar


        Activity

        Mark Miller added a comment -

        Patch I've started playing with, attached.

        Mark Miller added a comment -

        This is a lot uglier than it once was. I don't think cancel is actually async in this case, so I'm not sure that we are bailing on recovery early to try the next one, and we have the 10-second throttle between recoveries. If you stack up a bunch of them, it can take a long time to unwind.

        Not sure this patch is quite right yet, but I'm trying to make it so that if cancel is actually async, and if there is a recovery attempt waiting and more come in, we try to do only one, or as few as we can.

        Mark Miller added a comment -

        Updated patch, fixed to work correctly.

        Mark Miller added a comment -

        And another pass to make parts that were async, async again.

        Mark Miller added a comment -

        Still doing test runs to see if anything random falls out, but I think the latest patch is good.

        Mark Miller added a comment -

        New patch. We have actually been using the update executor for recovery threads - those threads can actually lead to IO (in tests I'm mostly seeing it as JMX getStats calls on close) and we now interrupt the update executor. I've made a new 'recoveryExecutor' to handle the main recovery threads.
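
        For illustration, a minimal, hypothetical sketch of what a dedicated recovery executor could look like using plain java.util.concurrent (the class and method names are made up here, not the committed Solr code):

          // Hypothetical sketch: keep recovery work off the update executor so that
          // interrupting the update executor on close does not touch threads that
          // are still running a recovery (and possibly doing IO).
          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;
          import java.util.concurrent.atomic.AtomicInteger;

          public class RecoveryExecutorSketch {
            private static final AtomicInteger threadNumber = new AtomicInteger();

            public static ExecutorService newRecoveryExecutor() {
              return Executors.newCachedThreadPool(r -> {
                Thread t = new Thread(r, "recoveryExecutor-" + threadNumber.incrementAndGet());
                t.setDaemon(true);
                return t;
              });
            }
          }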

        ASF subversion and git services added a comment -

        Commit 1720718 from Mark Miller in branch 'dev/trunk'
        [ https://svn.apache.org/r1720718 ]

        SOLR-8371: Try and prevent too many recovery requests from stacking up and clean up some faulty cancel recovery logic.

        Mark Miller added a comment -

        I'll subject this to Jenkins for a bit before backport. Reviews welcome.

        Mark Miller added a comment -

        The future could be a local variable now instead of a field.

        ASF subversion and git services added a comment -

        Commit 1721158 from Mark Miller in branch 'dev/trunk'
        [ https://svn.apache.org/r1721158 ]

        SOLR-8371: The Future field should now be a local variable.

        Ramkumar Aiyengar added a comment -

        Looks good functionally. But I don't fully understand why you need the locking/queue checks. Can't you have a bounded recovery queue (of, say, 1 or 2; not sure of the semantics), and just return if the queue limit is exceeded?
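
        For illustration, a minimal sketch of the kind of bounded recovery queue suggested here (a sketch of the idea as described, not the attached SOLR-8371-2.patch): a single worker thread, a work queue that holds at most one pending recovery, and a discard policy so any further submission simply returns:

          import java.util.concurrent.ArrayBlockingQueue;
          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.ThreadPoolExecutor;
          import java.util.concurrent.TimeUnit;

          // Hypothetical sketch of a bounded recovery queue: one recovery runs at a
          // time, at most one more waits behind it, and anything beyond that is
          // silently dropped by DiscardPolicy.
          public class BoundedRecoveryQueueSketch {
            static ExecutorService newBoundedRecoveryExecutor() {
              return new ThreadPoolExecutor(
                  1, 1,                                    // a single recovery worker
                  0L, TimeUnit.MILLISECONDS,
                  new ArrayBlockingQueue<>(1),             // at most one pending recovery
                  new ThreadPoolExecutor.DiscardPolicy()); // extra requests are dropped
            }
          }

        With a queue of size one there is always at least one attempt pending behind the running recovery, which is the property debated in the later comments.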

        Mark Miller added a comment -

        Perhaps, I have not worked through it. I came more from an approach of molding what was there, going down a couple paths until I hit something that worked right. Anyhow, I'd be happy to examine an alternate impl patch, but my effort has been on testing and is still on working out any changes in behavior that are affecting our tests on trunk (either because of this or to rule this out).

        Ramkumar Aiyengar added a comment -

        Here's a patch of what I meant, on top of your changes. I haven't tested it extensively though, beyond a run or two of the existing TestRecovery.

        Mark Miller added a comment -

        Thanks, I'll merge this back to 5x and take a look.

        Mark Miller added a comment -

        "Here's a patch of what I meant, on top of your changes. I haven't tested it extensively though, beyond a run or two of the existing TestRecovery."

        Hmm...perhaps I'm missing something, I just browsed over it, but does this fit the requirements?

        When a recovery is running, we don't want to drop any recovery attempts necessarily. We need the last recovery request to actually trigger a recovery, not just be sure one is in progress.

        Mark Miller added a comment -

        We also want, if 1000 recovery requests come in really quickly, for most of them to be dropped - but we still need to be sure a recovery starts after the last one hits. This looks like it won't drop recoveries? It will block and execute all of them?

        ASF subversion and git services added a comment -

        Commit 1726076 from Mark Miller in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1726076 ]

        SOLR-8371: Try and prevent too many recovery requests from stacking up and clean up some faulty cancel recovery logic.

        ASF subversion and git services added a comment -

        Commit 1726077 from Mark Miller in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1726077 ]

        SOLR-8371: The Future field should now be a local variable.

        ASF subversion and git services added a comment -

        Commit 1726079 from Mark Miller in branch 'dev/trunk'
        [ https://svn.apache.org/r1726079 ]

        SOLR-8371: Move 6.0 changes entry to 5.5

        Stephan Lagraulet added a comment -

        Can you update Fix Version to 5.5 if it is going to be shipped with this version?

        Stephan Lagraulet added a comment -

        I'm trying to gather all issues related to SolrCloud that affect Solr 5.4. Can you add the SolrCloud component to this issue?

        Shalin Shekhar Mangar added a comment -

        Mark Miller – Looking at the following code in doRecovery:

                  // if we can't get the lock, another recovery is running
                  // we check to see if there is already one waiting to go
                  // after the current one, and if there is, bail
                  boolean locked = recoveryLock.tryLock();
                  try {
                    if (!locked) {
                      if (recoveryWaiting.get() > 0) {
                        return;
                      }
                      recoveryWaiting.incrementAndGet();
                    } else {
                      recoveryWaiting.incrementAndGet();
                      cancelRecovery();
                    }
                    
                    recoveryLock.lock();
                    try {
                      recoveryWaiting.decrementAndGet();
                      ...
                      ...
        

        In the case where the tryLock fails, you bail out if recoveryWaiting > 0, but if it is not, we should increment recoveryWaiting and then again bail out if recoveryWaiting > 1 (and decrement recoveryWaiting). The idea is to run only the latest recovery request which has come in and no more. What do you think?
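
        A hypothetical rendering of that suggestion (illustrative only, not committed code), assuming recoveryLock is a ReentrantLock as in the snippet above:

          // Hypothetical variant of the tryLock-failure branch: register this request
          // first, then bail out if another request is already waiting behind the
          // currently running recovery, so only the latest request gets run.
          boolean locked = recoveryLock.tryLock();
          if (!locked) {
            if (recoveryWaiting.incrementAndGet() > 1) {
              recoveryWaiting.decrementAndGet();
              return; // a recovery is running and one is already queued after it
            }
          } else {
            recoveryWaiting.incrementAndGet();
            cancelRecovery();
          }
          recoveryLock.lock();
          try {
            recoveryWaiting.decrementAndGet();
            // ... run the recovery ...
          } finally {
            recoveryLock.unlock();
            if (locked) {
              recoveryLock.unlock(); // release the earlier tryLock acquisition as well
            }
          }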

        Keith Laban added a comment -

        I see that this part of the code is being ported from the previous implementation, but what are the effects of canceling recovery and then throttling the next recovery? Would it be more efficient to let the original recovery finish and have the next pending recovery fire once the original one is done? It seems counterintuitive to cancel the current progress and then wait to start it again if the throttling strategy says so. With these changes there should always be one pending recovery as long as there has been a recovery request since the currently running one started.

        My depth of knowledge here is pretty limited, so I'm not sure if finishing the current recovery and then picking up the missing pieces would be better than stopping and starting; just throwing in some thoughts.

        Mark Miller added a comment -

        Yeah, I think we can work on improvements, but let's open some new JIRAs and relate them. Gregory Chanan and Ramkumar Aiyengar have also both expressed some ideas. These changes take a fair amount of testing and time, and this improvement over what we had just went out in 5.5. I'll help implement and test any further ideas in new improvement JIRAs.

        "throttling strategy"

        Recovery throttling was put in as a stopgap to handle recovery stack-up overload. We should reconsider it and its implementation in light of these other improvements.

        Mark Miller added a comment -

        "more efficient to let the original recovery finish"

        I think it really depends. If you cancel before doing what could be a long replication (on a large index, or when updates are coming in fast relative to how fast they can be replayed), it can be a huge time saving. But yes, again, we should re-examine throttling with this change and related improvements.

        Ramkumar Aiyengar added a comment -

        Sorry Mark, I completely missed your comments on my patch. The intent there was to have at least one in the queue after the currently processing one (hence a fixed queue of size 1), so yes, there will be at least one more attempt after the last request. And any other attempt will just be rejected. But as I said, I haven't tested it; I know the ArrayBlockingQueue will block any queueing when it hits capacity, but the ThreadPoolExecutor docs say that when a bounded queue's capacity is hit, an exception is thrown. The exception was my intention (so all jobs won't be run); perhaps testing will show otherwise.
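
        For reference, ThreadPoolExecutor does behave the way the docs describe: execute() uses the queue's non-blocking offer(), so once a bounded queue is full (and the pool is at its maximum size) the default AbortPolicy throws RejectedExecutionException rather than blocking the caller. A small standalone check with a placeholder task, shown only to illustrate the rejection semantics:

          import java.util.concurrent.ArrayBlockingQueue;
          import java.util.concurrent.RejectedExecutionException;
          import java.util.concurrent.ThreadPoolExecutor;
          import java.util.concurrent.TimeUnit;

          public class RejectionCheck {
            public static void main(String[] args) {
              ThreadPoolExecutor pool = new ThreadPoolExecutor(
                  1, 1, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(1));
              Runnable slowTask = () -> {
                try { Thread.sleep(5000); } catch (InterruptedException ignored) { }
              };
              pool.execute(slowTask); // runs immediately on the single worker thread
              pool.execute(slowTask); // sits in the size-1 queue
              try {
                pool.execute(slowTask); // queue full -> default AbortPolicy throws
              } catch (RejectedExecutionException e) {
                System.out.println("third submission rejected, not blocked");
              }
              pool.shutdownNow();
            }
          }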

        Mark Miller added a comment -

        Thanks Ramkumar Aiyengar. I'll open a new issue shortly and ping those interested.

        Mark Miller added a comment -

        I created SOLR-8702, Improve our strategy for preventing recovery requests from stacking up.


          People

          • Assignee: Mark Miller
          • Reporter: Mark Miller
          • Votes: 0
          • Watchers: 6
