SOLR-6511: Fencepost error in LeaderInitiatedRecoveryThread

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10.2, 5.0
    • Component/s: None
    • Labels: None

      Description

      At line 106:

          while (continueTrying && ++tries < maxTries) {
      

      should be

          while (continueTrying && ++tries <= maxTries) {
      

      This is only a problem when called from DistributedUpdateProcessor, as it can have maxTries set to 1, which means the loop is never actually run.
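      A minimal sketch of the off-by-one, for reference (illustrative code only, not the actual LeaderInitiatedRecoveryThread):

          public class FencepostDemo {
              public static void main(String[] args) {
                  int maxTries = 1;
                  int tries = 0;
                  // Pre-increment plus strict comparison: the first check is (1 < 1),
                  // so with maxTries = 1 the body never runs.
                  while (++tries < maxTries) {
                      System.out.println("attempt " + tries);
                  }
                  tries = 0;
                  // With <= the loop runs exactly maxTries times (here, once).
                  while (++tries <= maxTries) {
                      System.out.println("attempt " + tries); // prints "attempt 1"
                  }
              }
          }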

      1. SOLR-6511.patch
        29 kB
        Timothy Potter
      2. SOLR-6511.patch
        11 kB
        Timothy Potter

        Activity

        Alan Woodward added a comment -

        So here's how this manifested:

        • replica1 is busy sending updates to replica2 when it gets a network blip and its ZK connection times out
        • replica2 is then elected leader
        • replica1 also still thinks it's the leader (because it hasn't noticed the ZK timeout yet) and then gets errors back from replica2 saying "I'm the leader, stop sending me these updates!"
        • replica1 interprets these as errors, and attempts to put replica2 into leader-initiated recovery
        • what ought to happen here is that replica2 sends a message back saying "no need, I'm the leader, I'll take it from here, thanks". But because of the fencepost error, the message to replica2 is never actually sent, and replica1 then writes replica2's state as DOWN into the LIRT zk node
        • the two replicas send each other some request-recover messages, trying to work out who is actually leader
        • replica2 then tries to recover, but it can't publish itself as active, because you can't do that if your LIRT state is DOWN, so it eventually goes into RECOVERY_FAILED

        There is a bunch of fairly confusing logging around all this as well. I particularly liked the messages that said "WaitingForState recovering, but I see state: recovering"

        Timothy Potter added a comment -

        yeah, that WaitForState stuff is not actually related to this code and is very confusing, but that's another issue ... for this issue, I'll post a patch shortly, but I think it should also include a check for whether the node trying to start the LIR thread is still the leader

        not hitting this loop at least once is pretty bad actually, so we should try to get this into 4.10.1 IMHO

        Timothy Potter added a comment -

        Here's a patch to address this problem; specifically, it:

        1) tries the recovery command at least once (Alan's fix)

        2) doesn't put a replica into LIR if the node is not currently the leader anymore (added safeguard) ... the LIR thread also checks to make sure it is still running on the leader before continuing to nag the replica

        Timothy Potter added a comment -

        what ought to happen here is that replica2 sends a message back saying "no need, I'm the leader, I'll take it from here, thanks". But because of the fencepost error, the message to replica2 is never actually sent, and replica1 then writes replica2's state as DOWN into the LIRT zk node

        The more I think about this, I don't see how the fencepost error gets hit here: maxTries will be 120 if replica1 is setting replica2 to DOWN.

        So I think the real fix is to do what Alan suggests - have the new leader respond with "no need, I'm the leader, I'll take it from here, thanks".

        The patch I posted earlier has some good improvements in it, but I think we need a unit test that proves the code works correctly for the scenario described above.

        Alan Woodward added a comment -

        I think you might have some extra stuff on the end of the patch?

        Digging a bit further into the logs, maxTries is set to 1 because ensureReplicaInLeaderInitiatedRecovery throws a SessionExpiredException (presumably because ZK has noticed the network blip and removed the relevant ephemeral node). Maybe maxTries should always be set to 120?

        One thing that might be nice here would be to add a utility method to ZkController called something like ensureLeadership(CloudDescriptor cd), which checks if the core described by the CloudDescriptor really is the current leader according to ZK, and throws an exception if it isn't.
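        A rough sketch of what such a helper might look like - the exact ZkStateReader/CloudDescriptor calls are assumptions based on the 4.x APIs, not a confirmed implementation:

            import org.apache.solr.cloud.CloudDescriptor;
            import org.apache.solr.common.SolrException;
            import org.apache.solr.common.cloud.Replica;
            import org.apache.solr.common.cloud.ZkStateReader;

            // Hypothetical helper, roughly as suggested above; not actual Solr code.
            class LeadershipCheck {
                private final ZkStateReader zkStateReader;

                LeadershipCheck(ZkStateReader zkStateReader) {
                    this.zkStateReader = zkStateReader;
                }

                void ensureLeadership(CloudDescriptor cd) throws Exception {
                    // Ask ZooKeeper who the current leader of this core's shard is.
                    Replica leader = zkStateReader.getLeaderRetry(cd.getCollectionName(), cd.getShardId());
                    if (leader == null || !cd.getCoreNodeName().equals(leader.getName())) {
                        throw new SolrException(SolrException.ErrorCode.SERVICE_UNAVAILABLE,
                            "Core " + cd.getCoreNodeName() + " is no longer the leader of shard "
                                + cd.getShardId() + " according to ZooKeeper");
                    }
                }
            }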

        Timothy Potter added a comment -

        I now have a test case that duplicates Alan's scenario exactly, which is good. In devising a fix, the following problem has come up: the request has been accepted locally on the used-to-be leader and is failing on one of the replicas because of the leader change ("Request says it is coming from leader, but we are the leader").

        So does the old leader (the one receiving the error back from the new leader) try to be clever and forward the request to the leader as any replica would do under normal circumstances? Keep in mind that this request has already been accepted locally and possibly on other replicas. Or does the old leader just propagate the failure back to the client and let it decide what to do? I guess it comes down to whether we think it's safe to just re-process a request. It seems like it would be, but I wanted feedback before assuming that.

        Alan Woodward added a comment -

        I think the safest response is to return the error to the client. Updates are idempotent, right? An ADD will just overwrite the previous ADD, DELETE doesn't necessarily have to delete anything to be successful, etc. So if the client gets a 503 back again it can just resend.

        The only tricky bit might be what happens if a replica finds itself ahead of its leader, as would be the case here. Does it automatically try and send updates on, or does it roll back?
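        A minimal SolrJ (4.x) sketch of the resend-on-503 idea above - the URL, field names, retry count, and sleep are arbitrary assumptions, and it relies on an ADD simply overwriting the same uniqueKey:

            import org.apache.solr.client.solrj.impl.HttpSolrServer;
            import org.apache.solr.common.SolrException;
            import org.apache.solr.common.SolrInputDocument;

            public class ResendOn503 {
                public static void main(String[] args) throws Exception {
                    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1"); // assumed URL
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", "doc-1");          // same uniqueKey on every attempt
                    doc.addField("title_s", "retry me");  // assumed dynamic field

                    for (int attempt = 1; attempt <= 3; attempt++) {
                        try {
                            server.add(doc);   // re-adding the same id just overwrites it
                            server.commit();
                            break;
                        } catch (SolrException e) {
                            if (e.code() == 503 && attempt < 3) {
                                Thread.sleep(2000); // leader change in progress, try again
                            } else {
                                throw e;
                            }
                        }
                    }
                    server.shutdown();
                }
            }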

        Timothy Potter added a comment -

        Here's an updated patch. It'll need to be updated again after SOLR-6530 is committed. Key things in this patch are:

        1) HttpPartitionTest.testLeaderZkSessionLoss: reproduces the scenario described in this ticket

        2) DistributedUpdateProcessor now checks to see if the reason for a failure is because of a leader change and if so, the request fails and an error is sent to the client

        I had to add a way to pass through some additional context information about an error from server to client; I'll do that work in another ticket, but this patch shows the approach I'm taking.

        Lastly, HttpPartitionTest continues to be a problem - I beasted it 10 times and it failed after 6 runs locally (sometimes fewer), so that will need to be resolved before committing this patch too. It consistently fails in testRf3WithLeaderFailover, but for different reasons. My thinking is that I'll break the problem test case (testRf3WithLeaderFailover) out into its own test class, as the other tests in this class work well and cover a lot of important functionality.

        Timothy Potter added a comment -

        I'm ready to commit this solution, but before I do, I'd like some feedback on SOLR-6550, which is how I'm implementing Alan's "no need, I'm the leader, I'll take it from here, thanks" recommendation.

        ASF subversion and git services added a comment -

        Commit 1627347 from Timothy Potter in branch 'dev/trunk'
        [ https://svn.apache.org/r1627347 ]

        SOLR-6511: Fencepost error in LeaderInitiatedRecoveryThread; refactor HttpPartitionTest to resolve jenkins failures.

        ASF subversion and git services added a comment -

        Commit 1628203 from Timothy Potter in branch 'dev/trunk'
        [ https://svn.apache.org/r1628203 ]

        SOLR-6511: adjust test logic to account for timing issues in zk session expiration scenario.

        Shalin Shekhar Mangar added a comment -

        Tim, I've committed SOLR-6530 on trunk. I'll merge it to branch_5x after you merge these changes.

        ASF subversion and git services added a comment -

        Commit 1628989 from Timothy Potter in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1628989 ]

        SOLR-6511: Fencepost error in LeaderInitiatedRecoveryThread; refactor HttpPartitionTest to resolve jenkins failures.

        Shalin Shekhar Mangar added a comment -

        Digging a bit further into the logs, maxTries is set to 1 because ensureReplicaInLeaderInitiatedRecovery throws a SessionExpiredException (presumably because ZK has noticed the network blip and removed the relevant ephemeral node).

        It's not just SessionExpiredException. Sometimes it might throw a ConnectionLossException, which should also be handled in the same way. I got the following stack trace in my testing when a node was partitioned from ZooKeeper for a long time:

        7984566 [qtp1600876769-17] ERROR org.apache.solr.update.processor.DistributedUpdateProcessor  – Leader failed to set replica http://n4:8983/solr/collection_5x3_shard4_replica3/ state to DOWN due to: org.apache.solr.common.SolrException: Failed to update data to down for znode: /collections/collection_5x3/leader_initiated_recovery/shard4/core_node10
        org.apache.solr.common.SolrException: Failed to update data to down for znode: /collections/collection_5x3/leader_initiated_recovery/shard4/core_node10
                at org.apache.solr.cloud.ZkController.updateLeaderInitiatedRecoveryState(ZkController.java:1959)
                at org.apache.solr.cloud.ZkController.ensureReplicaInLeaderInitiatedRecovery(ZkController.java:1841)
                at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:837)
                at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1679)
                at org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:179)
                at org.apache.solr.update.processor.UpdateRequestProcessor.finish(UpdateRequestProcessor.java:76)
                at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
                at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
                at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
                at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
                at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
                at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
                at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
                at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
                at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
                at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
                at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
                at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
                at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
                at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
                at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
                at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
                at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
                at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
                at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
                at org.eclipse.jetty.server.Server.handle(Server.java:368)
                at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
                at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
                at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
                at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
                at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:953)
                at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
                at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
                at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
                at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
                at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
                at java.lang.Thread.run(Thread.java:745)
        Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/collection_5x3/leader_initiated_recovery/shard4/core_node10
                at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
                at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
                at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
                at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkClient.java:256)
                at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkClient.java:253)
                at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:74)
                at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:253)
                at org.apache.solr.cloud.ZkController.updateLeaderInitiatedRecoveryState(ZkController.java:1949)
                ... 36 more
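
        A minimal sketch of treating both failure modes alike when deciding whether this node can still trust its own ZK state (illustrative only, not the committed change):

            import org.apache.zookeeper.KeeperException;

            // Illustrative only: both exceptions mean this node's ZooKeeper session or
            // connection (and therefore its claim to leadership) cannot be trusted, so
            // the decision to start the LIR nag thread should handle them identically.
            class ZkFailureCheck {
                static boolean zkStateUnreliable(Throwable t) {
                    while (t != null) {
                        if (t instanceof KeeperException.SessionExpiredException
                                || t instanceof KeeperException.ConnectionLossException) {
                            return true;
                        }
                        t = t.getCause();
                    }
                    return false;
                }
            }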
        
        Timothy Potter added a comment -

        Re-opening this to address the problem Shalin noticed.

        ASF subversion and git services added a comment -

        Commit 1629720 from Timothy Potter in branch 'dev/trunk'
        [ https://svn.apache.org/r1629720 ]

        SOLR-6511: Better handling of ZooKeeper related exceptions when deciding to start the leader-initiated recovery nag thread

        ASF subversion and git services added a comment -

        Commit 1629966 from Timothy Potter in branch 'dev/trunk'
        [ https://svn.apache.org/r1629966 ]

        SOLR-6511: keep track of the node that created the leader-initiated recovery znode, helpful for debugging

        ASF subversion and git services added a comment -

        Commit 1630137 from Timothy Potter in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1630137 ]

        SOLR-6511: backport latest changes to branch_5x

        ASF subversion and git services added a comment -

        Commit 1630164 from Timothy Potter in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1630164 ]

        SOLR-6550: backport to 4_10 branch so we can backport SOLR-6511

        ASF subversion and git services added a comment -

        Commit 1630196 from Timothy Potter in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1630196 ]

        SOLR-6511: backport to 4.10 branch

        Shalin Shekhar Mangar added a comment -

        Tim, the change to keep the LIR state as a map is not back-compatible. For example, I saw the following error upon upgrading a cluster (which already had some LIR state written in ZK) to trunk. Looking at the code, the same exception can happen on upgrading to the latest lucene_solr_4_10 branch too.

        41228 [RecoveryThread] ERROR org.apache.solr.cloud.RecoveryStrategy   Error while trying to recover. core=coll_5x3_shard5_replica3:java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map
                at org.apache.solr.cloud.ZkController.getLeaderInitiatedRecoveryStateObject(ZkController.java:1993)
                at org.apache.solr.cloud.ZkController.getLeaderInitiatedRecoveryState(ZkController.java:1958)
                at org.apache.solr.cloud.ZkController.publish(ZkController.java:1105)
                at org.apache.solr.cloud.ZkController.publish(ZkController.java:1075)
                at org.apache.solr.cloud.ZkController.publish(ZkController.java:1071)
                at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:355)
                at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:235)
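
        A hedged sketch of the kind of back-compat read this needs, assuming the legacy znode held a bare state string and the new format holds a JSON map; the JSON helper below is a hypothetical stand-in, not the committed fix:

            import java.nio.charset.StandardCharsets;
            import java.util.Collections;
            import java.util.Map;

            class LirStateReader {
                Map<String, Object> readLirState(byte[] znodeData) {
                    String raw = new String(znodeData, StandardCharsets.UTF_8).trim();
                    if (raw.startsWith("{")) {
                        // Newer map-based format: parse the JSON payload.
                        return parseJsonMap(raw);
                    }
                    // Legacy format: wrap the bare string so callers always see a map.
                    return Collections.singletonMap("state", (Object) raw);
                }

                // Hypothetical stand-in for whatever JSON utility the codebase uses;
                // the actual parsing is deliberately elided in this sketch.
                private Map<String, Object> parseJsonMap(String json) {
                    throw new UnsupportedOperationException("JSON parsing elided");
                }
            }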
        
        
        ASF subversion and git services added a comment -

        Commit 1631439 from Timothy Potter in branch 'dev/trunk'
        [ https://svn.apache.org/r1631439 ]

        SOLR-6511: fix back compat issue when reading existing data from ZK

        ASF subversion and git services added a comment -

        Commit 1631444 from Timothy Potter in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1631444 ]

        SOLR-6511: fix back compat issue when reading existing data from ZK

        ASF subversion and git services added a comment -

        Commit 1631447 from Timothy Potter in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1631447 ]

        SOLR-6511: fix back compat issue when reading existing data from ZK

        Timothy Potter added a comment -

        Good catch, Shalin, thanks. That's my bad for changing functionality as part of this ticket.

        Shalin Shekhar Mangar added a comment -

        Hi Tim, sorry for the late review, but it'd be more awesome if we put the replica core, in addition to the node name, inside the createdByNode key (or maybe just use coreNodeName). Otherwise, if there is more than one replica for the same shard on the node, we don't know which one created the LIR.


          People

          • Assignee: Timothy Potter
          • Reporter: Alan Woodward
          • Votes: 0
          • Watchers: 4
