SOLR-7203: NoHttpResponseException handling in HttpSolrClient is wrong

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1
    • Component/s: None
    • Labels: None

      Description

      We've got logic in HttpSolrClient to catch NoHttpResponseException and retry. However, this logic appears to be in the wrong place - it's in the createMethod function, which doesn't actually execute any HTTP requests at all. It ought to be in executeMethod.

      Fixing this might help sort out the persistent Jenkins failures as well.
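
      For illustration, a minimal sketch of the misplacement (class and method shapes are simplified, not the actual Solr source): the retry loop guards request construction, which performs no I/O, so NoHttpResponseException can never reach its catch block.

      {code:java}
      import java.io.IOException;
      import org.apache.http.HttpResponse;
      import org.apache.http.NoHttpResponseException;
      import org.apache.http.client.HttpClient;
      import org.apache.http.client.methods.HttpGet;

      class RetryPlacementSketch {
        private final HttpClient httpClient;
        private final int maxRetries;

        RetryPlacementSketch(HttpClient httpClient, int maxRetries) {
          this.httpClient = httpClient;
          this.maxRetries = maxRetries;
        }

        // Where the retry sits today: building the request object does no network
        // I/O, so NoHttpResponseException can never be thrown here and the loop
        // retries nothing.
        HttpGet createMethod(String url) {
          for (int attempt = 0; ; attempt++) {
            try {
              return new HttpGet(url); // construction only - no I/O, no stale connections
            } catch (RuntimeException e) { // stands in for the unreachable catch in the real code
              if (attempt >= maxRetries) throw e;
            }
          }
        }

        // Where the retry needs to live: execute() is what borrows a (possibly
        // stale) pooled connection and can fail with NoHttpResponseException.
        HttpResponse executeMethod(HttpGet method) throws IOException {
          for (int attempt = 0; ; attempt++) {
            try {
              return httpClient.execute(method);
            } catch (NoHttpResponseException e) {
              if (attempt >= maxRetries) throw e;
            }
          }
        }
      }
      {code}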

      Attachments

      1. SOLR-7203.patch
        14 kB
        Alan Woodward
      2. SOLR-7203.patch
        20 kB
        Alan Woodward
      3. SOLR-7203.patch
        13 kB
        Alan Woodward

          Activity

          Alan Woodward added a comment -

          Quick n dirty patch moving the retry logic. Would be good to get some more eyes on this though.

          Mark Miller added a comment -

          We have to be careful here - we can't auto retry on NoHttpResponseException with updates - it means you don't know if the update was accepted or not.

          Alan Woodward added a comment -

          we can't auto retry on NoHttpResponseException with updates

          I was thinking about this - I'm not sure this is actually true. Updates in Solr are idempotent, aren't they? If you add a document twice, then the second addition just overwrites the first. And if you delete a document twice, then the second deletion just has no effect. So automatically retrying is actually fine.
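
          A minimal SolrJ sketch of the idempotency argument (the URL and field names are assumptions):

          {code:java}
          import org.apache.solr.client.solrj.impl.HttpSolrClient;
          import org.apache.solr.common.SolrInputDocument;

          public class IdempotentAddSketch {
            public static void main(String[] args) throws Exception {
              // SolrJ 5.x-era client; URL and field names are made up for the example
              HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", "doc-1");
              doc.addField("title_s", "hello");
              client.add(doc);
              client.add(doc); // same uniqueKey: the second add just replaces the first
              client.commit(); // the index ends up in the same state as after one add
              client.close();
            }
          }
          {code}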

          Alan Woodward added a comment -

          Better patch, with a test. I've refactored executeMethod into two parts, executeMethod and parseResponse, mainly to make testing easier.

          I've noticed some other problems here, like the fact that we're wrapping IOExceptions in SolrServerExceptions, which sort of defeats the point of the SolrClient.request() contract, but that can be dealt with separately, I think.
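
          Roughly this shape (simplified signatures, not the committed code):

          {code:java}
          import java.io.IOException;
          import org.apache.http.HttpResponse;
          import org.apache.http.client.HttpClient;
          import org.apache.http.client.methods.HttpRequestBase;
          import org.apache.solr.client.solrj.ResponseParser;
          import org.apache.solr.client.solrj.SolrServerException;
          import org.apache.solr.common.util.NamedList;

          // Sketch of the split: executeMethod does the I/O, parseResponse handles
          // the result, and the latter can be unit-tested with a canned HttpResponse,
          // no live server needed.
          class SplitSketch {
            private final HttpClient httpClient;

            SplitSketch(HttpClient httpClient) {
              this.httpClient = httpClient;
            }

            HttpResponse executeMethod(HttpRequestBase method) throws IOException {
              return httpClient.execute(method); // network I/O only
            }

            NamedList<Object> parseResponse(HttpResponse response, ResponseParser processor)
                throws SolrServerException {
              try {
                // a pure function of the response object
                return processor.processResponse(response.getEntity().getContent(), "UTF-8");
              } catch (IOException e) {
                throw new SolrServerException(e);
              }
            }
          }
          {code}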

          Greg Solovyev added a comment -

          This is practically a duplicate of https://issues.apache.org/jira/browse/SOLR-6724

          Mark Miller added a comment -

          No, there will be races, and chaos monkey test runs will fail. At most it can be an option that defaults to false.

          Alan Woodward added a comment -

          maxRetries is set to 0 by default, so it does default to false (at the moment). I'm still not clear on where the races are though - can you give an example of a failure that this could cause?

          Greg Solovyev hah, synchronicity! I'll merge your patch into this one, thanks.

          Alan Woodward added a comment -

          OK, I've had a closer look at the partition tests that are failing, and they already try and handle these errors with custom retry logic. Which suggests that the HttpClient persists in using dead connections even for retries.

          What I think we need to do is to call ClientConnectionManager.closeExpiredConnections() when we hit a NoHttpResponseException. That way the retry should be guaranteed not to use a dead connection.
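
          Something along these lines, assuming HttpClient 4.x's ClientConnectionManager API (a sketch, not the patch):

          {code:java}
          import java.io.IOException;
          import org.apache.http.HttpResponse;
          import org.apache.http.NoHttpResponseException;
          import org.apache.http.client.methods.HttpRequestBase;
          import org.apache.http.impl.client.DefaultHttpClient;

          class StaleConnectionRetrySketch {
            // On NoHttpResponseException, flush expired pooled connections before
            // the retry so it cannot be handed the same dead connection again.
            static HttpResponse executeWithRetry(DefaultHttpClient httpClient,
                                                 HttpRequestBase method,
                                                 int maxRetries) throws IOException {
              for (int attempt = 0; ; attempt++) {
                try {
                  return httpClient.execute(method);
                } catch (NoHttpResponseException e) {
                  httpClient.getConnectionManager().closeExpiredConnections();
                  if (attempt >= maxRetries) throw e;
                }
              }
            }
          }
          {code}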

          Mark Miller added a comment -

          can you give an example of a failure that this could cause?

          You can't have requests retrying in the background between nodes, into nodes, whatever when there is more than one client, unless you can guarantee the behavior of the clients. That's why this must be on the user.

          And even if we default to it false and allow one to turn it on, it must come with the warning that the user must guarantee this or else risk inconsistency and data loss.

          Mark Miller added a comment -

          and they already try and handle these errors with custom retry logic

          Yes, in the user code! Where it needs to be.

          Mark Miller added a comment -

          (at the moment).

          I'd have to -1 anything else since it breaks the system

          Alan Woodward added a comment -

          I'd have to -1 anything else since it breaks the system

          Fussy, fussy.

          The patch as it stands just moves the retry logic to the right place, and leaves the default number of retries as 0. I think that should be OK to go in?

          Mark Miller added a comment -

          I'm just not okay with introducing data loss. It's kind of a big deal in the hardening of SolrCloud. If that is fussy, sue me. Especially when it's an issue that has already been fixed.

          I think that should be OK to go in?

          I think it's fine to have as an option that defaults to false - for many users, a simple retry is the option they will pick - but I'm going to add javadoc to it that explains using it is not safe unless you can guarantee your clients won't fight each other.

          Leaving it off by default will also ensure inter node communication doesn't use it, which is important to avoid replica inconsistency (which is the same as data loss).

          Mark Miller added a comment -

          Also, as I said, if the default was not off, random legit chaos monkey tests will fail - also a no-no.

          Mark Miller added a comment -

          To define what I mean by data loss:

          This means that data you added to the system and got an ack for goes away. It means we can lie to the user and they can't trust important data to our system. The only way that should happen is if a user says it's okay or if there is a bug.

          Alan Woodward added a comment -

          Sorry, British sense of humour not coming across in text there. I agree with everything you've said.

          Mark Miller added a comment -

          I took it as, "mention -1, don't be a dick". Not a lot of text there to go on.

          In either case, I meant it as, "I'm really positive this is a problem!"

          Mark Miller added a comment - edited

          Since we seem to have gotten past that, moving on:

          In some cases, if we can properly detect certain things, we do want an auto retry. On read-only operations, this makes complete sense. I also think, if it's an update or a delete and optimistic locking is being used, that may be okay too. The part with holes is when a new doc comes in, some other client doing periodic updates to docs that match certain criteria updates it, and then retry number 3 hits and undoes the update that was acked against a doc that was already part of the system according to client 2. If you do that type of auto retry on inter-node communication, it may also vary replica to replica which update survives - and once a replica gets out of sync, unless you have a very simple add-only system, fixing the issue without data loss is almost impossible in all cases.
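
          A minimal sketch of the optimistic-locking case (field names and values are assumptions): Solr's optimistic concurrency uses the _version_ field, so a stale retry is rejected with a version conflict instead of silently undoing a newer update.

          {code:java}
          import org.apache.solr.client.solrj.SolrClient;
          import org.apache.solr.common.SolrInputDocument;

          class ConditionalUpdateSketch {
            // With _version_ set to the version the client last saw, the add only
            // succeeds if the stored version still matches; a retry that lost a race
            // with another client fails with a conflict (HTTP 409) rather than
            // clobbering the newer update.
            static void conditionalUpdate(SolrClient client, long lastSeenVersion) throws Exception {
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", "doc-1");
              doc.addField("counter_i", 42);
              doc.addField("_version_", lastSeenVersion); // must match, or the update is rejected
              client.add(doc);
              client.commit();
            }
          }
          {code}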

          Finally, overall, we probably want the retry behavior centered in one spot - the retry handler used by the HttpClient. Having retry logic there and at a higher level gets confusing.
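
          For example, using HttpClient's own retry hook (the policy shown is purely illustrative):

          {code:java}
          import java.io.IOException;
          import org.apache.http.NoHttpResponseException;
          import org.apache.http.client.HttpRequestRetryHandler;
          import org.apache.http.impl.client.DefaultHttpClient;
          import org.apache.http.protocol.HttpContext;

          class RetryHandlerSketch {
            // One place that decides what gets retried and how many times; here:
            // only stale-connection failures, and at most three attempts.
            static DefaultHttpClient newClient() {
              DefaultHttpClient httpClient = new DefaultHttpClient();
              httpClient.setHttpRequestRetryHandler(new HttpRequestRetryHandler() {
                @Override
                public boolean retryRequest(IOException e, int executionCount, HttpContext ctx) {
                  return e instanceof NoHttpResponseException && executionCount <= 3;
                }
              });
              return httpClient;
            }
          }
          {code}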

          Alan Woodward added a comment -

          Maybe the best way forward here then is to just remove the broken code, and deprecate setMaxRetries() on HttpSolrClient (seeing as it doesn't do anything now anyway)?

          Greg Solovyev added a comment -

          +1

          Shawn Heisey added a comment -

          and deprecate setMaxRetries() on HttpSolrClient (seeing as it doesn't do anything now anyway)

          I admit that my grasp of the code may not be very good, but in both branch_5x and trunk, maxRetries does seem to be used to control a while loop that contains the http request. When I read this, it looks to me like setMaxRetries does do something, so what am I missing?

          Alan Woodward added a comment -

          Patch removing the dead code and deprecating setMaxRetries().

          Shawn: the retry logic is wrapping code that creates an HTTP method, but it doesn't actually execute it. So it's essentially a no-op.

          Greg Solovyev added a comment - edited

          Shawn, if you look closely at what is happening in the while loop, you will see that the request object is being created in it, but is not actually being sent. The request is being sent in HttpSolrClient::executeMethod, while maxRetries is being used in createMethod. FWIW, there isn't anything in that while loop that can throw NoHttpResponseException.

          Shawn Heisey added a comment -

          Thanks for the explanation. I did not grok it fully, because I'm only familiar with HttpClient on a very shallow basis. I should remedy that, but there's never enough time to learn everything.

          ASF subversion and git services added a comment -

          Commit 1669313 from Alan Woodward in branch 'dev/trunk'
          [ https://svn.apache.org/r1669313 ]

          SOLR-7203: Remove buggy no-op retries in HttpSolrClient

          ASF subversion and git services added a comment -

          Commit 1669316 from Alan Woodward in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1669316 ]

          SOLR-7203: Remove buggy no-op retries in HttpSolrClient

          Timothy Potter added a comment -

          Bulk close after 5.1 release


            People

            • Assignee: Alan Woodward
            • Reporter: Alan Woodward
            • Votes: 0
            • Watchers: 6
