Solr
  1. Solr
  2. SOLR-7130

Make stale state notification work without failing the requests

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1, 6.0
    • Component/s: clients - java, SolrCloud, SolrJ
    • Labels:
      None

      Description

      I wasn't sure whether to label this a bug or an improvement.

      For collections with stateFormat=2, we now fail requests (because of stale state) which we didn't previously. The client having stale cached cluster state is not a sufficient reason to fail and retry the entire request because in most of such cases, the node receiving the request is still perfectly capable of returning a valid response (either from local replicas or remote ones).

      We should find a better way to notify clients that they have stale state. Perhaps we can modify the response and add a "routing" section instead of outright exception.

      1. SOLR-7130.patch
        16 kB
        Noble Paul
      2. SOLR-7130.patch
        14 kB
        Noble Paul

        Issue Links

          Activity

          Hide
          Noble Paul added a comment -

          When the node is capable of serving the requests locally, We can add the missing details into the payload. But, when we are proxying the requests to a remote server, we can't afford to unmarshal the response and add the extra information. So , for such requests, out options are

          1. Send an exception as it does today
          2. Send the information in the response headers. I'm not sure about using HTTP headers because if we move to a different transport mechanism in the future it is bad

          Another point we need to keep in mind is backward compatibility . Should we start sending the version number of SolrJ along with the request? So, based on that request parameter we can choose to throw an exception or send the information in the payload itself. This will also help us avoiding breaking backcompat in javabin format

          Show
          Noble Paul added a comment - When the node is capable of serving the requests locally, We can add the missing details into the payload. But, when we are proxying the requests to a remote server, we can't afford to unmarshal the response and add the extra information. So , for such requests, out options are Send an exception as it does today Send the information in the response headers. I'm not sure about using HTTP headers because if we move to a different transport mechanism in the future it is bad Another point we need to keep in mind is backward compatibility . Should we start sending the version number of SolrJ along with the request? So, based on that request parameter we can choose to throw an exception or send the information in the payload itself. This will also help us avoiding breaking backcompat in javabin format
          Hide
          Noble Paul added a comment -

          The response NamedList contains information about the znode version of the collection in the server . After the response is received, the client would try to update local colection cache to the version at the server . If another thread updated the version this node expects, then the ZK call is avoided

          Show
          Noble Paul added a comment - The response NamedList contains information about the znode version of the collection in the server . After the response is received, the client would try to update local colection cache to the version at the server . If another thread updated the version this node expects, then the ZK call is avoided
          Hide
          Shalin Shekhar Mangar added a comment -

          Thanks Noble. Some comments:

          1. Maybe we should use the responseHeader to send this routing information? That way we can avoid having to parse the entire response from remote servers too?
          2. This patch still throws an exception if the request is proxied to a remote node. IMO, that isn't acceptable.
          3. Can you add some java docs and define the behaviour/result of ZkStateReader.compareStateVersions method?
          4. Please take care of formatting in the new/changed code.
          Show
          Shalin Shekhar Mangar added a comment - Thanks Noble. Some comments: Maybe we should use the responseHeader to send this routing information? That way we can avoid having to parse the entire response from remote servers too? This patch still throws an exception if the request is proxied to a remote node. IMO, that isn't acceptable. Can you add some java docs and define the behaviour/result of ZkStateReader.compareStateVersions method? Please take care of formatting in the new/changed code.
          Hide
          Noble Paul added a comment -

          Maybe we should use the responseHeader to send this routing information? That way we can avoid having to parse the entire response from remote servers too?

          I don't think it is a good idea. We don't use it anywhere as of now and if we use another transport mechanism it will be a problem

          This patch still throws an exception if the request is proxied to a remote node. IMO, that isn't acceptable.

          That is a rare case . It happened when a core went down completely (and the node is still up).. We can afford to resend the request. Comparing the cost of proxying a request to resending a request , we are not saving much

          This is a first patch , Not fully cleaned up or anything.

          Show
          Noble Paul added a comment - Maybe we should use the responseHeader to send this routing information? That way we can avoid having to parse the entire response from remote servers too? I don't think it is a good idea. We don't use it anywhere as of now and if we use another transport mechanism it will be a problem This patch still throws an exception if the request is proxied to a remote node. IMO, that isn't acceptable. That is a rare case . It happened when a core went down completely (and the node is still up).. We can afford to resend the request. Comparing the cost of proxying a request to resending a request , we are not saving much This is a first patch , Not fully cleaned up or anything.
          Hide
          Shalin Shekhar Mangar added a comment -

          I don't think it is a good idea. We don't use it anywhere as of now and if we use another transport mechanism it will be a problem

          I am referring to the response header which is part of the response. Any transport mechanism that we choose will likely have a responseHeader section.

          That is a rare case . It happened when a core went down completely (and the node is still up).. We can afford to resend the request. Comparing the cost of proxying a request to resending a request , we are not saving much

          Okay but in that case we should change the following logging to INFO level because it will flood the logs needlessly:

          log.error("Request to collection {} failed due to ("+errorCode+
                    ") {}, retry? "+retryCount, collection, rootCause.toString());
          
          Show
          Shalin Shekhar Mangar added a comment - I don't think it is a good idea. We don't use it anywhere as of now and if we use another transport mechanism it will be a problem I am referring to the response header which is part of the response. Any transport mechanism that we choose will likely have a responseHeader section. That is a rare case . It happened when a core went down completely (and the node is still up).. We can afford to resend the request. Comparing the cost of proxying a request to resending a request , we are not saving much Okay but in that case we should change the following logging to INFO level because it will flood the logs needlessly: log.error( "Request to collection {} failed due to (" +errorCode+ ") {}, retry? " +retryCount, collection, rootCause.toString());
          Hide
          Noble Paul added a comment -

          I am referring to the response header which is part of the response. Any transport mechanism that we choose will likely have a responseHeader section.

          We don't/can't parse just the header part of a payload

          Okay but in that case we should change the following logging to INFO level because it will flood the logs needlessly:

          Sure

          Show
          Noble Paul added a comment - I am referring to the response header which is part of the response. Any transport mechanism that we choose will likely have a responseHeader section. We don't/can't parse just the header part of a payload Okay but in that case we should change the following logging to INFO level because it will flood the logs needlessly: Sure
          Hide
          ASF subversion and git services added a comment -

          Commit 1662439 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1662439 ]

          SOLR-7130: Make stale state notification work without failing the requests

          Show
          ASF subversion and git services added a comment - Commit 1662439 from Noble Paul in branch 'dev/trunk' [ https://svn.apache.org/r1662439 ] SOLR-7130 : Make stale state notification work without failing the requests
          Hide
          ASF subversion and git services added a comment -

          Commit 1662445 from Noble Paul in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1662445 ]

          SOLR-7130: Make stale state notification work without failing the requests

          Show
          ASF subversion and git services added a comment - Commit 1662445 from Noble Paul in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1662445 ] SOLR-7130 : Make stale state notification work without failing the requests
          Hide
          Mark Miller added a comment -

          + if(col !=null) {

          These if's should have a space after them. Please install the correct lucene / solr code formating rules into your IDE.

          Show
          Mark Miller added a comment - + if(col !=null) { These if's should have a space after them. Please install the correct lucene / solr code formating rules into your IDE.
          Hide
          Mark Miller added a comment -

          if(col !=null) {

          This one in particular needs a space before null as well. It's easier to let your IDE do the work of proper formatting rather than introducing a lot of sloppy formatting as has been happening. We have a code standard - it's not the law, but everyone should at least make a half hearted attempt at following it.

          Show
          Mark Miller added a comment - if(col !=null) { This one in particular needs a space before null as well. It's easier to let your IDE do the work of proper formatting rather than introducing a lot of sloppy formatting as has been happening. We have a code standard - it's not the law, but everyone should at least make a half hearted attempt at following it.
          Hide
          Noble Paul added a comment -

          sure mark

          Show
          Noble Paul added a comment - sure mark
          Hide
          Noble Paul added a comment -

          tests failing

          Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Linux/11887/
          Java: 64bit/jdk1.8.0_40-ea-b22 -XX:-UseCompressedOops -XX:+UseConcMarkSweepGC
          
          1 tests failed.
          FAILED:  org.apache.solr.cloud.BasicDistributedZk2Test.test
          
          Error Message:
          ._stateVer_:{collection1=16}!=null
          
          Stack Trace:
          junit.framework.AssertionFailedError: ._stateVer_:{collection1=16}!=null
                  at __randomizedtesting.SeedInfo.seed([827245A9BAAE634A:A267A7314520EB2]:0)
                  at junit.framework.Assert.fail(Assert.java:50)
                  at org.apache.solr.BaseDistributedSearchTestCase.compareSolrResponses(BaseDistributedSearchTestCase.java:873)
                  at org.apache.solr.BaseDistributedSearchTestCase.compareResponses(BaseDistributedSearchTestCase.java:892)
                  at org.apache.solr.BaseDistributedSearchTestCase.query(BaseDistributedSearchTestCase.java:593)
                  at org.apache.solr.BaseDistributedSearchTestCase.query(BaseDistributedSearchTestCase.java:573)
                  at org.apache.solr.BaseDistributedSearchTestCase.query(BaseDistributedSearchTestCase.java:552)
                  at org.apache.solr.cloud.BasicDistributedZk2Test.brindDownShardIndexSomeDocsAndRecover(BasicDistributedZk2Test.java:280)
          
          Show
          Noble Paul added a comment - tests failing Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Linux/11887/ Java: 64bit/jdk1.8.0_40-ea-b22 -XX:-UseCompressedOops -XX:+UseConcMarkSweepGC 1 tests failed. FAILED: org.apache.solr.cloud.BasicDistributedZk2Test.test Error Message: ._stateVer_:{collection1=16}!=null Stack Trace: junit.framework.AssertionFailedError: ._stateVer_:{collection1=16}!=null at __randomizedtesting.SeedInfo.seed([827245A9BAAE634A:A267A7314520EB2]:0) at junit.framework.Assert.fail(Assert.java:50) at org.apache.solr.BaseDistributedSearchTestCase.compareSolrResponses(BaseDistributedSearchTestCase.java:873) at org.apache.solr.BaseDistributedSearchTestCase.compareResponses(BaseDistributedSearchTestCase.java:892) at org.apache.solr.BaseDistributedSearchTestCase.query(BaseDistributedSearchTestCase.java:593) at org.apache.solr.BaseDistributedSearchTestCase.query(BaseDistributedSearchTestCase.java:573) at org.apache.solr.BaseDistributedSearchTestCase.query(BaseDistributedSearchTestCase.java:552) at org.apache.solr.cloud.BasicDistributedZk2Test.brindDownShardIndexSomeDocsAndRecover(BasicDistributedZk2Test.java:280)
          Hide
          Noble Paul added a comment -

          If I reformat the entire file I end up making the diffs too much because a lot of the existing code does not follow the standards. So I have to selectively make these changes . Some times they slip through

          Show
          Noble Paul added a comment - If I reformat the entire file I end up making the diffs too much because a lot of the existing code does not follow the standards. So I have to selectively make these changes . Some times they slip through
          Hide
          ASF subversion and git services added a comment -

          Commit 1662546 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1662546 ]

          SOLR-7130: formatting mistakes

          Show
          ASF subversion and git services added a comment - Commit 1662546 from Noble Paul in branch 'dev/trunk' [ https://svn.apache.org/r1662546 ] SOLR-7130 : formatting mistakes
          Hide
          ASF subversion and git services added a comment -

          Commit 1662547 from Noble Paul in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1662547 ]

          SOLR-7130: formatting mistakes

          Show
          ASF subversion and git services added a comment - Commit 1662547 from Noble Paul in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1662547 ] SOLR-7130 : formatting mistakes
          Hide
          ASF subversion and git services added a comment -

          Commit 1662554 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1662554 ]

          SOLR-7130: remove the special entry from response after reading, so that tests don't see them

          Show
          ASF subversion and git services added a comment - Commit 1662554 from Noble Paul in branch 'dev/trunk' [ https://svn.apache.org/r1662554 ] SOLR-7130 : remove the special entry from response after reading, so that tests don't see them
          Hide
          ASF subversion and git services added a comment -

          Commit 1662555 from Noble Paul in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1662555 ]

          SOLR-7130: remove the special entry from response after reading, so that tests don't see them

          Show
          ASF subversion and git services added a comment - Commit 1662555 from Noble Paul in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1662555 ] SOLR-7130 : remove the special entry from response after reading, so that tests don't see them
          Hide
          Timothy Potter added a comment -

          Bulk close after 5.1 release

          Show
          Timothy Potter added a comment - Bulk close after 5.1 release

            People

            • Assignee:
              Noble Paul
              Reporter:
              Shalin Shekhar Mangar
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development