SOLR-3274: ZooKeeper related SolrCloud problems

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: None
    • Component/s: SolrCloud
    • Labels: None
    • Environment: Any

      Description

      Same setup as in SOLR-3273. To tell the entire truth, we have 7 Solr servers running 28 slices of the same collection (collA) - all slices have one replica (two shards in all - leader + replica) - 56 cores in total (8 shards on each Solr instance). But anyways...

      Besides the problem reported in SOLR-3273, the system seems to run fine under high load for several hours, but eventually errors like the ones shown below start to occur. I might be wrong, but they all seem to indicate some kind of instability in the collaboration between Solr and ZooKeeper. I have to say that I haven't been there to check ZooKeeper at the moment those exceptions occur, but basically I don't believe the exceptions occur because ZooKeeper itself is running unstably - at least when I go and check ZooKeeper through other "channels" (e.g. my Eclipse ZK plugin) it is always accepting my connection and generally seems to be doing fine.

      Exception 1) Often the first error we see in solr.log is something like this

      Mar 22, 2012 5:06:43 AM org.apache.solr.common.SolrException log
      SEVERE: org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates are disabled.
              at org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:678)
              at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:250)
              at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:140)
              at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:80)
              at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:59)
              at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
              at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
              at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:407)
              at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
              at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
              at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
              at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
              at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
              at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
              at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
              at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
              at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
              at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
              at org.mortbay.jetty.Server.handle(Server.java:326)
              at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
              at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
              at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
              at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
              at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
              at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
              at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
      

      I believe this error basically occurs because SolrZkClient.isConnected reports false, which means that its internal "keeper.getState" does not return ZooKeeper.States.CONNECTED. I'm pretty sure that it has been CONNECTED for a long time, since this error starts occurring after several hours of processing without this problem showing. But why is it suddenly not connected anymore?!

      Exception 2) We also see errors like the following, and if I'm not mistaken, they start occurring shortly after "Exception 1)" (above) shows up for the first time

      Mar 22, 2012 5:07:26 AM org.apache.solr.common.SolrException log
      SEVERE: org.apache.solr.common.SolrException: no servers hosting shard: 
              at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:149)
              at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:123)
              at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
              at java.util.concurrent.FutureTask.run(FutureTask.java:138)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
              at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
              at java.util.concurrent.FutureTask.run(FutureTask.java:138)
              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
              at java.lang.Thread.run(Thread.java:662)
      

      Please note that the exception says "no servers hosting shard: <blank>". Looking at the code, a "shard"-string was actually supposed to be written at <blank>. Basically this means that HttpShardHandler.submit was called with an empty "shard"-string parameter. But who does this? CoreAdminHandler.handleDistribUrlAction or SearchHandler.handleRequestBody or SyncStrategy or PeerSync or... I don't know, and maybe it is not that relevant, because I guess they all get the "shard"-string from ZooKeeper. Again something pointing in the direction of unstable collaboration between Solr and ZooKeeper.

      Exception 3) We also see exceptions like this

      Mar 25, 2012 3:05:38 PM org.apache.solr.common.cloud.ZkStateReader$3 process
      WARNING: ZooKeeper watch triggered, but Solr cannot talk to ZK
      Mar 25, 2012 3:05:38 PM org.apache.solr.cloud.LeaderElector$1 process
      WARNING: 
      org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /collections/collA/leader_elect/slice26/election
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
              at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1249)
              at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:266)
              at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:263)
              at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
              at org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:263)
              at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:92)
              at org.apache.solr.cloud.LeaderElector.access$000(LeaderElector.java:57)
              at org.apache.solr.cloud.LeaderElector$1.process(LeaderElector.java:121)
              at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
              at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
      

      Maybe this will be usable for some bug-fixing or for making the code more stable. I know 4.0 is not stable/released yet, and that we should therefore expect these kinds of errors at the moment. So this is not negative criticism - just a report of issues observed when using SolrCloud features under high load for several days. Any feedback is more than welcome.

      Regards, Per Steffensen

        Activity

        Mark Miller added a comment -

        This happens because the connection between solr and zookeeper is lost - perhaps because the load on the box is too high. I think we may default to a fairly low timeout that could be raised (by default and manually).

        Per Steffensen added a comment -

        Thanks a lot, Mark!

        Can all the exceptions be explained by "connection loss between solr and zookeeper"?

        I'm not sure I totally buy the explanation, because even though there is a fairly high update/search load on the machines in the cluster, the machines do not actually seem to be exhausted (CPU idle well above 0% - more like 50% on average - and not very high IO wait, etc.). So I would expect plenty of resources to be available for ZK to respond fast. But let's see what happens if we set the timeout higher. Can you point me in the direction of how to set it manually?

        Regards, Per Steffensen

        Mark Miller added a comment -

        Can all the exceptions be explained by "connection loss between solr and zookeeper"?

        SessionExpiredException

        This indicates the connection with ZooKeeper was lost.

        org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates are disabled.

        If there is no connection to ZooKeeper, you will see this if you send an update.

        org.apache.solr.common.SolrException: no servers hosting shard:

        Sami Siren has a JIRA issue about improving this message I believe - but normally it means that the cluster does not see a single node hosting a given shard. Not sure if this is related to the above - not the same smoking gun.

        Can you point me in the direction of how to set it manually?

        The default is only 10 seconds. I'd try 30 seconds perhaps? You don't want it too low, but you also don't want it too high if you can help it. I can't remember what the "zookeeper" default is, but I've seen it set as high as 60 seconds looking around some hbase usage...

        You should be able to set it in solr.xml as a cores attribute: zkClientTimeout="30000" or whatever.

        That is: <cores adminPath="/admin/cores" zkClientTimeout="30000" ...>

        You'd want to do it for each node.
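
        For illustration, a minimal solr.xml along these lines should do it - everything here other than zkClientTimeout is just a placeholder taken from the stock example config:

        <?xml version="1.0" encoding="UTF-8" ?>
        <solr persistent="true">
          <!-- zkClientTimeout is in milliseconds, so 30000 = 30 seconds;
               the core name/instanceDir below are just example placeholders -->
          <cores adminPath="/admin/cores" zkClientTimeout="30000">
            <core name="collection1" instanceDir="collection1" />
          </cores>
        </solr>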

        Mark Miller added a comment -

        not the same smoking gun.

        Sorry - actually this does make sense with the other errors - if the zk connection is lost, that node is no longer considered live - if that happens to each node hosting a shard (say you have 1 replica and this happened to both nodes) then searches would fail with this.

        Per Steffensen added a comment -

        Uhhhh 10 secs is A LOT OF TIME. I really wouldn't want to set it higher than that. If ZK is not able to answer within 10 secs, I need to correct something else in my setup.

        I still believe that Solr might end up in this state (where it "believes" that the connection to ZK is lost) some other way than actually experiencing a 10+ sec response time from ZK, but I can't prove it (yet). So for now I will just thank you for your kind help, and assume that it is correct. Then basically my options are to set up a more responsive ZK cluster or maybe raise the ZK timeout on the Solr side.

        Thanks, again.

        Regards, Per Steffensen

        Per Steffensen added a comment -

        But why not just try to reconnect if/when this situation has occurred, so that Solr can continue doing its work? I guess Solr does not do that, because it seems like once this error first appears, there is no "recovering", and certainly (I'm close to 100% positive) ZK will not keep giving 10+ sec response times on all requests, even though it might give a 10+ sec response once in a while.

        Regards, Per Steffensen

        Mark Miller added a comment -

        Uhhhh 10 secs is A LOT OF TIME

        It really depends - I've seen that timeout broken on a heavily loaded machine more than a few times. Then you have to add in any network delays. But yeah, on a fast machine under normal to high load, I have not really run into a problem with this timeout.

        Then basically my options are to set up a more responsive ZK cluster or maybe raise the ZK timeout on the Solr side.

        That's all I can suggest. If the ZooKeeper client loses the connection, it has up to the session timeout to reconnect. Once it reconnects, if more than the session timeout has passed, you will get the SessionExpiredException. If that happens, the node will go into recovery. If it's in recovery, it won't serve search requests until recovery is finished - so that could also contribute to the "no servers hosting shard" issue.

        Let me know how it goes and if you can pin point any problems.

        Mark Miller added a comment -

        But why not just try to reconnect if/when this situation has occurred, so that Solr can continue doing its work? I guess Solr does not do that, because it seems like once this error first appears, there is no "recovering", and certainly (I'm close to 100% positive) ZK will not keep giving 10+ sec response times on all requests, even though it might give a 10+ sec response once in a while.

        Solr does try to reconnect - but there can be no recovering due to the other issue you posted - because you have changed the core admin url.

        Per Steffensen added a comment -

        Ahhh I see. So correcting the adminPath will allow it to reconnect AND recover and then get back to work. Of course - why didn't I conclude that myself? We will try to correct adminPath, run the performance test again and see how it goes. If we see any problems we will report them here and try to pinpoint (and potentially fix) them. Thanks a lot for your help!

        Regards, Per Steffensen

        Per Steffensen added a comment -

        Our performance test (with adminPath="/admin/cores") ran successfully for a while, but then similar errors started to occur, and it seems like whenever a SolrCloud cluster (or some of the Solr instances in it) gets into the "no contact to ZK" state, it has a hard time getting out of this state again. The CPU is like 70-80% idle on the machines while this is going on, so I have a hard time accepting that ZK is not responding within 10 secs. But basically I cannot really pinpoint what goes wrong yet, and we will shift focus for a while, so it will probably be a while before I can get back with solid proof or concrete cases.

        The only thing I can do for now is to encourage you guys at Apache Solr to do your own stability/robustness/endurance tests where you run with a fairly high concurrent load (not so high that the machines get exhausted) for many days - hopefully you will then see the problems occur yourselves.

        Thanks for your collaboration!

        Regards, Per Steffensen

        Mark Miller added a comment -

        If you don't solve the issue of the ZK expirations, this is no real surprise. The larger the index gets, the longer the recoveries can take - until you end up in a situation similar to the one you had. The key is understanding why the connection to ZooKeeper is dropping.

        Jay Hacker added a comment - edited

        Not sure if it's the same problem, but I have seen similar issues with the 4.0.0 release. I get errors like:

        ClusterState says we are the leader, but locally we don't think so
        There was a problem finding the leader in zk
        forwarding update to http://solr83:4000/solr/main/ failed - retrying ...
        Cannot open channel to 3 at election address solr84/X.X.X.X:5002
        Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
        

        I'm running zookeeper embedded, and the problem turns out to be long garbage collection pauses. During a stop-the-world collection, zookeeper times out. It's especially bad if the system has to page in a bunch of memory from disk. This would explain why things run fine for a while, until memory fills up and you need to do a big GC. This is quite repeatable for us; just index until memory is pretty full, wait for a long GC or trigger one manually with VisualVM.

        For us, the solution was to do an autoSoftCommit every million documents. This essentially clears the heap, so doing it regularly prevents the heap from getting too big and causing long garbage collection pauses. It also had the nice side benefit of speeding things up a bit.
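
        Presumably that is the autoSoftCommit setting in solrconfig.xml - a rough sketch (the maxDocs value is just the "every million documents" figure from above; maxTime in milliseconds is the other available knob):

        <updateHandler class="solr.DirectUpdateHandler2">
          <!-- reopen the searcher (soft commit, no fsync) after every 1,000,000 added docs -->
          <autoSoftCommit>
            <maxDocs>1000000</maxDocs>
          </autoSoftCommit>
        </updateHandler>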

        You can also try different garbage collectors (I've had some luck with -XX:+UseConcMarkSweepGC), running ZooKeeper in an independent JVM, and/or turning up the ZooKeeper timeouts.
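
        If you go the standalone-ZooKeeper route, a minimal zoo.cfg could look something like this (paths and port are placeholders; the session-timeout bound only matters if you want Solr's zkClientTimeout above ZooKeeper's default ceiling of 20*tickTime):

        # standalone ZooKeeper - adjust paths/ports to your environment
        tickTime=2000
        dataDir=/var/lib/zookeeper
        clientPort=2181
        # server-side bounds on client session timeouts
        # (defaults: min = 2*tickTime, max = 20*tickTime)
        maxSessionTimeout=60000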


          People

          • Assignee: Unassigned
          • Reporter: Per Steffensen
          • Votes: 0
          • Watchers: 7
