SOLR-9512

CloudSolrClient's cluster state cache can break direct updates to leaders

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.4
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels: None

      Description

      This is the root cause of SOLR-9305 and (at least some of) SOLR-9390. The process goes something like this:

      Documents are added to the cluster via a CloudSolrClient, with directUpdatesToLeadersOnly set to true. CSC caches its view of the DocCollection. The leader then goes down, and is reassigned. Next time documents are added, CSC checks its cache again, and gets the old view of the DocCollection. It then tries to send the update directly to the old, now down, leader, and we get ConnectionRefused.
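      For context, a minimal sketch of a client setup that hits this path, assuming the SolrJ 6.x Builder options added in SOLR-9090; the ZK address and collection name are placeholders:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DirectLeaderUpdateExample {
  public static void main(String[] args) throws Exception {
    // Route updates straight to the cached shard leader, never to another replica.
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("localhost:9983")                // placeholder ZK address
        .sendDirectUpdatesToShardLeadersOnly()
        .build()) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "1");
      // If the cached leader has since died and been reassigned, this call fails with
      // "Connection refused" instead of being routed to the new leader.
      client.add("collection1", doc);
      client.commit("collection1");
    }
  }
}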

      Attachments

      1. SOLR-9512.patch
        46 kB
        Noble Paul
      2. SOLR-9512.patch
        7 kB
        Noble Paul
      3. SOLR-9512.patch
        15 kB
        Alan Woodward


          Activity

          romseygeek Alan Woodward added a comment -

          Possibly related: SOLR-9090, SOLR-6312.

          I think the best approach here is to catch a ConnectionRefused exception on a routed update and expire the collection cache entry before retrying?

          noble.paul Noble Paul added a comment -
          • Catch the ConnectionRefused exception
          • Check how old the cached state is; if it is older than a few seconds, invalidate the cache and retry (a rough sketch follows this list)
          • If it is fresh, throw the error back
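
          A minimal sketch of that flow, assuming hypothetical sendUpdate/cachedStateTimestamp/invalidateCollectionState helpers (none of these are existing CloudSolrClient methods) and an illustrative threshold:

import java.io.IOException;
import java.net.ConnectException;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.util.NamedList;

abstract class LeaderRetrySketch {
  private static final long STALE_STATE_THRESHOLD_MS = 2_000; // "older than a few seconds"

  NamedList<Object> sendWithRetry(String collection, UpdateRequest req)
      throws SolrServerException, IOException {
    try {
      return sendUpdate(collection, req);             // goes to the cached leader
    } catch (SolrServerException e) {
      boolean refused = e.getRootCause() instanceof ConnectException;
      long stateAge = System.currentTimeMillis() - cachedStateTimestamp(collection);
      if (refused && stateAge > STALE_STATE_THRESHOLD_MS) {
        invalidateCollectionState(collection);        // drop the stale cache entry
        return sendUpdate(collection, req);           // retry once against the refreshed state
      }
      throw e;                                        // cache is fresh: surface the error
    }
  }

  // Assumed helpers, not real SolrJ API:
  abstract NamedList<Object> sendUpdate(String collection, UpdateRequest req)
      throws SolrServerException, IOException;
  abstract long cachedStateTimestamp(String collection);
  abstract void invalidateCollectionState(String collection);
}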
          romseygeek Alan Woodward added a comment -

          Having played with this a bit, I think adding extra retry logic to CloudSolrClient isn't the best solution; instead, I think we should make directUpdatesToLeaders a hint, rather than a directive, and just make sure that the leader is the first URL in the list passed to the load-balancer. We can then check in the response if the leader was in fact the shard that served that particular request, and if not, then we invalidate the collection cache. Christine Poerschke you worked on SOLR-9090, does this make sense to you?

          noble.paul Noble Paul added a comment -

          It's just hiding the problem. The fact that there is no leader for that shard means that the request will fail eventually.

          romseygeek Alan Woodward added a comment -

          I don't think so - there is a leader, it's just that we locally have the wrong one cached. And because we're sending our updates only to the previous (and now down) leader, we get a failure, when instead we ought to be sending this update on to the next leader.

          noble.paul Noble Paul added a comment -

          there is a leader, it's just that we locally have the wrong one cached

          When we get an error, we must invalidate the cache. That's part of the solution. But the larger problem is that leader election takes a while, and state.json will only have that information after some time. The solution of sending the doc to another replica in the shard is just kicking the can down the road.

          romseygeek Alan Woodward added a comment -

          It's kicking the can to a live server, though, which will be able to tell us either a) yes, everything's fine now, you just need to invalidate your cache, or b) there's no leader for this shard, so fail the update. Either way, we get useful information, whereas at the moment we're trying to find out what's happening from a dead server.

          romseygeek Alan Woodward added a comment -

          (apologies for the terrible metaphor there, it's early on a Saturday)

          noble.paul Noble Paul added a comment -

          So in your solution, here is what happens:

          1. Instead of just passing one server, we pass all the nodes to LBHttpSolrClient (LBHSC). The shard leader should be the first in the list
          2. LBHSC knows that the leader is a dead node (or it will soon know that). So it would pick the next server in the list and make a request there
          3. The request would come back with an error (no leader)
          4. CSC returns the call with an error "no leader"

          Is that right?

          romseygeek Alan Woodward added a comment -

          Right, there are two scenarios:
          1. The leader is down, and there's no replacement voted for yet, in which case things happen much as you describe above
          2. The old leader is down, a new leader has been selected, but the cache hasn't updated yet. In this case the update actually succeeds, as it's passed to the next node in the list and then forwarded on to the relevant leader.
          In both cases, we need to invalidate the cache.

          Separately, there's a bit of cleanup we can do in the directUpdate() method call - at the moment we have two paths, depending on whether we're using parallel updates or not, and they end up doing things like throwing slightly different exceptions for the same failure types. I'll open up another JIRA for that.

          noble.paul Noble Paul added a comment -

          The old leader is down, a new leader has been selected, but the cache hasn't updated yet. In this case the update actually succeeds, as it's passed to the next node in the list and then forwarded on to the relevant leader.

          This is already taken care of. When SolrJ makes a request to the node, the node responds with a flag to invalidate the cache, so the next request will get the updated state.

          romseygeek Alan Woodward added a comment -

          Patch, changing the 'buildUrlMap()' method to always include a full list of replica URLs, with the leader in front. In reality, this makes 'directUpdateToLeadersOnly' a no-op, so maybe we should just deprecate and remove it?
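
          Conceptually, the patch makes the per-shard URL list look something like the sketch below; this is illustrative only, not the committed code, and the map key and exact shape may differ from the real buildUrlMap:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkCoreNodeProps;

class UrlMapSketch {
  // For each slice, list every active replica's core URL with the leader first, so the
  // load-balancer prefers the leader but can fall back to the other replicas.
  static Map<String, List<String>> buildUrlMap(DocCollection col) {
    Map<String, List<String>> urlMap = new HashMap<>();
    for (Slice slice : col.getActiveSlices()) {
      Replica leader = slice.getLeader();
      if (leader == null) {
        return null;                                  // no known leader; let the caller decide
      }
      List<String> urls = new ArrayList<>();
      urls.add(new ZkCoreNodeProps(leader).getCoreUrl());        // leader goes first
      for (Replica replica : slice.getReplicas()) {
        if (!replica.getName().equals(leader.getName())
            && replica.getState() == Replica.State.ACTIVE) {
          urls.add(new ZkCoreNodeProps(replica).getCoreUrl());   // live replicas as fallbacks
        }
      }
      urlMap.put(slice.getName(), urls);
    }
    return urlMap;
  }
}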

          When SolrJ makes a request to the node, it responds with a flag to invalidate the cache. So the next request will get the updated state

          I don't think this is the case for update requests? At least, when running the test in the patch, a breakpoint in ZkStateReader.compareStateVersions is never triggered. Looking at HttpSolrCall, it appears that it's only used for /select requests, which may be a bug by itself, of course.

          noble.paul Noble Paul added a comment -

          Looking at HttpSolrCall it appears that it's only used in /select requests,

          It must be a serious bug. Without proper invalidation, the caching is useless

          jira-bot ASF subversion and git services added a comment -

          Commit f96017d9e10c665e7ab6b9161f2af08efc491946 in lucene-solr's branch refs/heads/branch_6x from Alan Woodward
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f96017d ]

          SOLR-9512: CloudSolrClient tries other replicas if a cached leader is down

          jira-bot ASF subversion and git services added a comment -

          Commit 3d130097b7768a8d753476ffe26b83db070c8e20 in lucene-solr's branch refs/heads/master from Alan Woodward
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3d13009 ]

          SOLR-9512: CloudSolrClient tries other replicas if a cached leader is down

          noble.paul Noble Paul added a comment - - edited

          Alan Woodward The patch contains way more changes than you mentioned. You committed it without any review from the people who are collaborating with you.

          If there are other people collaborating on a ticket, the general protocol is that you submit a patch with the changes explained and give some review time before committing stuff.

          You submitted a patch and 3 hours later you committed it

          noble.paul Noble Paul added a comment -

          The following is not the way we should invalidate the cache.

          if (response.getServer().equals(url) == false) {
            // we didn't hit our first-preference server, which means that our cached
            // collection state is no longer valid
            invalidateCollectionState(collection);
          }
          romseygeek Alan Woodward added a comment -

          The patch contains way more changes than you mentioned.

          I ... don't think it does? It makes the changes I described above to CloudSolrClient, and adds a test case. What else is there?

          The following is not the way we should invalidate the cache.

          How so?

          noble.paul Noble Paul added a comment -

          How so?

          You invalidate the cache if the first server did not serve the request. That's a problem. When the next request comes, it gets a fresh state which is exactly the same as the entry that was just invalidated, because the new leader is not elected yet and state.json is not yet updated in ZK. As we discussed before, the cache must be invalidated when a server says the version is stale.

          romseygeek Alan Woodward added a comment -

          because the new leader is not elected yet

          The invalidation clause above is only called when the request was successful, i.e. if a new leader is up. If leader election hasn't happened yet, then LBHttpSolrClient will throw an exception and the whole request will fail. The test case should illustrate this.

          noble.paul Noble Paul added a comment -

          The invalidation clause above is only called when the request was successful,

          It shouldn't need to. The response must have the flag to invalidate the collection. If that code is not working as expected, we should fix that.

          romseygeek Alan Woodward added a comment -

          OK, but let's deal with that in a separate issue, as I think it's a bit more complicated than just fixing cache invalidation for /update requests. The current invalidation and retry logic happens at the level of the whole request, but CSC is splitting updates up into several sub-requests, and we probably don't want to redo the whole thing if a single sub-shard has changed leader.

          noble.paul Noble Paul added a comment - - edited

          Let's first discuss what the fix is and not rush into solutions. If I were you, I would revert the commit, discuss the solution, and get consensus.

          cpoerschke Christine Poerschke added a comment -

          Hi Alan Woodward and Noble Paul - I am only now joining this ticket late, since I have been on vacation.

          The SOLR-9090 directUpdatesToLeadersOnly motivation/intention was for the flag to be not a hint but a directive, and for updates to 'fail fast' if there is (temporarily or otherwise) no shard leader. Fail fast (and let the caller of the CloudSolrClient handle alarming and retries as it sees fit) as opposed to sending or retry-sending to a non-leader, which would then forward to the leader (and potentially still fail, only later rather than fast).

          Marvin Justice and I worked together on SOLR-9090 - Marvin, any thoughts on this ticket here?

          jira-bot ASF subversion and git services added a comment -

          Commit a4293ce7c4e849b171430a34f36b830a84927a93 in lucene-solr's branch refs/heads/branch_6x from Alan Woodward
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a4293ce ]

          Revert "SOLR-9512: CloudSolrClient tries other replicas if a cached leader is down"

          This reverts commit f96017d9e10c665e7ab6b9161f2af08efc491946.

          jira-bot ASF subversion and git services added a comment -

          Commit bd3fc7f43ff54a174660b7ad51f031d2104f84b5 in lucene-solr's branch refs/heads/master from Alan Woodward
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bd3fc7f ]

          Revert "SOLR-9512: CloudSolrClient tries other replicas if a cached leader is down"

          This reverts commit 3d130097b7768a8d753476ffe26b83db070c8e20.

          romseygeek Alan Woodward added a comment -

          OK, have reverted. For the moment, I'll set directUpdatesToLeaders=false on the two regularly failing test cases.

          noble.paul Noble Paul added a comment -

          Does it make any sense to have the explicit flag directUpdatesToLeaders? IMHO it should be the only supported behavior. Why would we ever send a request to another node when the shard leader is down?

          My proposal is as follows.

          • The LBHttpSolrClient is aware of down servers. So, if the leader for the shard is down, we already know it and we can fail fast
          • If we have to fail because the shard leader is dead, we should try to explicitly read state.json for that collection, provided the cached state is older than a certain threshold (say 1 sec?). This means the client may fire a request to ZK every second to refresh the state. For such refreshes, check the version first to optimize ZK reads (a rough sketch follows this list)
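
          A rough sketch of that throttled, version-checked refresh; the state.json path and the getZNodeVersion comparison reflect how collection state is stored, but the class, fields, and threshold here are hypothetical:

import org.apache.solr.common.cloud.DocCollection;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

class StateRefreshSketch {
  private static final long REFRESH_INTERVAL_MS = 1_000;  // the "say 1 sec?" threshold
  private volatile long lastCheckMs = 0;

  // Only consider re-reading state.json if the last check is old enough, and only
  // actually re-read it if the znode version has moved on; a bare exists() call is
  // much cheaper than fetching the whole state.json.
  boolean shouldRefresh(ZooKeeper zk, String collection, DocCollection cached)
      throws KeeperException, InterruptedException {
    long now = System.currentTimeMillis();
    if (now - lastCheckMs < REFRESH_INTERVAL_MS) {
      return false;                                    // checked too recently
    }
    lastCheckMs = now;
    Stat stat = zk.exists("/collections/" + collection + "/state.json", false);
    return stat != null && stat.getVersion() > cached.getZNodeVersion();
  }
}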
          jira-bot ASF subversion and git services added a comment -

          Commit bae66f7cca8cff796d142eb19585d8e79fae34f8 in lucene-solr's branch refs/heads/branch_6x from Alan Woodward
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bae66f7 ]

          SOLR-9305, SOLR-9390: Don't use directToLeaders updates in partition tests (see SOLR-9512)

          jira-bot ASF subversion and git services added a comment -

          Commit d326adc8bfa33432d50293402a39454d60e070e4 in lucene-solr's branch refs/heads/master from Alan Woodward
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d326adc ]

          SOLR-9305, SOLR-9390: Don't use directToLeaders updates in partition tests (see SOLR-9512)

          shalinmangar Shalin Shekhar Mangar added a comment -

          Noble and I discussed this offline. Here is a summary of the problem and the solution:

          There are six cases that we need to tackle. Assuming replica x is leader:

          1. Case 1: x is disconnected from ZK, y becomes leader
            • currently: x throws an error on indexing; the client fails and keeps sending requests to x, which fail. This will continue until x re-connects and the client gets a stale state flag in the response.
          2. Case 2: x is dead, y becomes leader
            • currently: the client gets a connect exception or NoHttpResponseException (for in-flight requests) and keeps retrying the request to x. This will continue until x comes back online.
          3. Case 3: x is disconnected from ZK, no one is leader
            • currently: the client keeps sending requests to x, which fail because x is disconnected from ZK. This will continue until x re-connects and the client gets a stale state flag in the response.
          4. Case 4: x is dead, no one is leader yet
            • currently: the client gets a connect exception or NoHttpResponseException (for in-flight requests) and keeps retrying the request to x. This will continue until x comes back online.
          5. Case 5: x is alive but now y is leader
            • currently: the client gets a stale state flag from x and refreshes the cluster state to see y as the new leader. All further indexing requests are sent to y.
          6. Case 6: client is disconnected from ZK
            • currently: the client keeps indexing to x. If it receives a stale state error, it will try to refresh the cluster state, fail, continue to send further requests to x, keep failing, keep trying to read from ZK, and be stuck in a cycle.

          Cases 1-5 are solved by a single solution: on ConnectException, NoHttpResponseException, or a 'leader disconnected from ZK' error, the client should fetch the state from ZK again (a rough sketch of detecting these errors follows below). If the client fetches from ZK and does not get a new version, this should be marked in a flag, and subsequent retries should only happen after N seconds have elapsed or if we know for a fact that the version has changed since the last ZK fetch was made. N could be something as small as 2 seconds or so.

          Case 6 is more difficult. Either we can keep failing the indexing requests or we can ask a random Solr instance to return the latest cluster state. This is kinda dangerous because it can open us up to very difficult to debug bugs so I am inclined to punt on this for now.
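
          A sketch of how the client might classify the errors that should trigger a re-fetch (cases 1-4); the exception types are the ones named above, while the ZK-disconnect message check is only an assumed heuristic:

import java.net.ConnectException;

import org.apache.http.NoHttpResponseException;

class StaleLeaderDetection {
  // Walk the cause chain looking for failures that suggest the cached leader is gone
  // and the collection state should be re-fetched from ZK before retrying.
  static boolean suggestsStaleLeader(Throwable t) {
    for (Throwable cause = t; cause != null; cause = cause.getCause()) {
      if (cause instanceof ConnectException) {
        return true;                                   // leader process is dead (cases 2 and 4)
      }
      if (cause instanceof NoHttpResponseException) {
        return true;                                   // leader died with the request in flight
      }
      String msg = cause.getMessage();
      if (msg != null && msg.contains("Cannot talk to ZooKeeper")) {
        return true;                                   // assumed message text for the ZK-disconnect case (cases 1 and 3)
      }
    }
    return false;
  }
}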

          cpoerschke Christine Poerschke added a comment -

          Thanks Shalin and Noble for analysing and summarising this.

          The proposed cases 1-5 solution sounds good to me (though I have not actually looked at the code concerned to see what the implementation of that solution might look like).

          Case 6: do I understand it right that we would keep failing the indexing requests, but 'only' until eventually the client manages to reconnect to ZK? I agree that asking a random Solr instance for its latest cluster state could be problematic.

          shalinmangar Shalin Shekhar Mangar added a comment -

          Case 6: do i understand it right that we would keep failing the indexing requests but 'only' until eventually the client manages to reconnect to zk?

          Yes, that is correct.

          noble.paul Noble Paul added a comment -

          no tests yet

          noble.paul Noble Paul added a comment -

          Added tests

          jira-bot ASF subversion and git services added a comment -

          Commit e309f9058985375076cac0ed982a158dd865b86a in lucene-solr's branch refs/heads/branch_6x from Noble Paul
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e309f90 ]

          SOLR-9784: Refactor CloudSolrClient to eliminate direct dependency on ZK
          SOLR-9512: CloudSolrClient's cluster state cache can break direct updates to leaders

          jira-bot ASF subversion and git services added a comment -

          Commit d87ffa4bf82c30e9a6f0bbb6b8c0087a5c07f9d6 in lucene-solr's branch refs/heads/master from Noble Paul
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d87ffa4 ]

          SOLR-9784: Refactor CloudSolrClient to eliminate direct dependency on ZK
          SOLR-9512: CloudSolrClient's cluster state cache can break direct updates to leaders

          jira-bot ASF subversion and git services added a comment -

          Commit 5650939a8d41b7bad584947a2c9dcedf3774b8de in lucene-solr's branch refs/heads/master from Noble Paul
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5650939 ]

          SOLR-9784: Refactor CloudSolrClient to eliminate direct dependency on ZK
          SOLR-9512: CloudSolrClient's cluster state cache can break direct updates to leaders

          jira-bot ASF subversion and git services added a comment -

          Commit 142461b395506efa01ec2509346bae755f1b2726 in lucene-solr's branch refs/heads/master from Noble Paul
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=142461b ]

          SOLR-9512: removed unused import

          jira-bot ASF subversion and git services added a comment -

          Commit 3a3d6afef08e3c49c29cc9a015f4e7cca40cf52d in lucene-solr's branch refs/heads/branch_6x from Noble Paul
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3a3d6af ]

          SOLR-9512: removed unused import


            People

            • Assignee: noble.paul Noble Paul
            • Reporter: romseygeek Alan Woodward
            • Votes: 0
            • Watchers: 9
