SOLR-3993

SolrCloud leader election on a single node stalls initialization

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0
    • Fix Version/s: 4.1, 6.0
    • Component/s: SolrCloud
    • Labels: None
    • Environment: Windows 7, Tomcat 6

      Description

      setup:
      1 node, 4 cores, 2 shards.
      15 documents indexed.

      problem:
      init stage times out.

      probable cause:
      According to the init flow, cores are initialized one by one, synchronously.
      As a result, the main thread blocks in ShardLeaderElectionContext.waitForReplicasToComeUp until the retry threshold is reached, while the replica cores have not yet been initialized; in other words, there is no chance for the other replicas to come up in the meantime.
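      A minimal, self-contained Java sketch of that flow (illustrative only, not Solr code; the class, the 5-second timeout, and the bookkeeping are assumptions) shows why the single loading thread can only move on after each shard's wait times out:

      import java.util.Map;
      import java.util.Set;
      import java.util.concurrent.ConcurrentHashMap;
      import java.util.concurrent.TimeUnit;

      public class SerialInitSketch {

          // shard name -> cores that have finished registering so far
          static final Map<String, Set<String>> liveReplicas = new ConcurrentHashMap<>();

          // Stand-in for ShardLeaderElectionContext.waitForReplicasToComeUp: poll until the
          // expected replica count is visible or the (made-up) timeout expires.
          static void waitForReplicasToComeUp(String shard, int expected, long timeoutMs)
                  throws InterruptedException {
              long deadline = System.currentTimeMillis() + timeoutMs;
              while (liveReplicas.get(shard).size() < expected
                      && System.currentTimeMillis() < deadline) {
                  System.out.printf("Waiting until we see more replicas up: shard=%s total=%d found=%d%n",
                          shard, expected, liveReplicas.get(shard).size());
                  TimeUnit.SECONDS.sleep(1); // mirrors the Thread.sleep in the stack trace below
              }
          }

          // Stand-in for CoreContainer.register -> ZkController.register for one core.
          static void registerCore(String core, String shard, int replicasPerShard)
                  throws InterruptedException {
              Set<String> replicas = liveReplicas.computeIfAbsent(shard, s -> ConcurrentHashMap.newKeySet());
              if (replicas.isEmpty()) {
                  // The first replica of the shard becomes the leader candidate and blocks here,
                  // but its sibling replica can only register in a later iteration of the loop in main().
                  waitForReplicasToComeUp(shard, replicasPerShard, 5_000);
              }
              replicas.add(core);
          }

          public static void main(String[] args) throws InterruptedException {
              // 1 node, 4 cores, 2 shards, loaded strictly one after another (the serial init flow):
              registerCore("core1", "shard1", 2);
              registerCore("core2", "shard2", 2);
              registerCore("core3", "shard1", 2);
              registerCore("core4", "shard2", 2);
              System.out.println("init finished, but only after both shard leaders waited out the timeout");
          }
      }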
      stack trace:
      Thread [main] (Suspended)
      owns: HashMap<K,V> (id=3876)
      owns: StandardContext (id=3877)
      owns: HashMap<K,V> (id=3878)
      owns: StandardHost (id=3879)
      owns: StandardEngine (id=3880)
      owns: Service[] (id=3881)
      Thread.sleep(long) line: not available [native method]
      ShardLeaderElectionContext.waitForReplicasToComeUp(boolean, String) line: 298
      ShardLeaderElectionContext.runLeaderProcess(boolean) line: 143
      LeaderElector.runIamLeaderProcess(ElectionContext, boolean) line: 152
      LeaderElector.checkIfIamLeader(int, ElectionContext, boolean) line: 96
      LeaderElector.joinElection(ElectionContext) line: 262
      ZkController.joinElection(CoreDescriptor, boolean) line: 733
      ZkController.register(String, CoreDescriptor, boolean, boolean) line: 566
      ZkController.register(String, CoreDescriptor) line: 532
      CoreContainer.registerInZk(SolrCore) line: 709
      CoreContainer.register(String, SolrCore, boolean) line: 693
      CoreContainer.load(String, InputSource) line: 535
      CoreContainer.load(String, File) line: 356
      CoreContainer$Initializer.initialize() line: 308
      SolrDispatchFilter.init(FilterConfig) line: 107
      ApplicationFilterConfig.getFilter() line: 295
      ApplicationFilterConfig.setFilterDef(FilterDef) line: 422
      ApplicationFilterConfig.<init>(Context, FilterDef) line: 115
      StandardContext.filterStart() line: 4072
      StandardContext.start() line: 4726
      StandardHost(ContainerBase).addChildInternal(Container) line: 799
      StandardHost(ContainerBase).addChild(Container) line: 779
      StandardHost.addChild(Container) line: 601
      HostConfig.deployDescriptor(String, File, String) line: 675
      HostConfig.deployDescriptors(File, String[]) line: 601
      HostConfig.deployApps() line: 502
      HostConfig.start() line: 1317
      HostConfig.lifecycleEvent(LifecycleEvent) line: 324
      LifecycleSupport.fireLifecycleEvent(String, Object) line: 142
      StandardHost(ContainerBase).start() line: 1065
      StandardHost.start() line: 840
      StandardEngine(ContainerBase).start() line: 1057
      StandardEngine.start() line: 463
      StandardService.start() line: 525
      StandardServer.start() line: 754
      Catalina.start() line: 595
      NativeMethodAccessorImpl.invoke0(Method, Object, Object[]) line: not available [native method]
      NativeMethodAccessorImpl.invoke(Object, Object[]) line: not available
      DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: not available
      Method.invoke(Object, Object...) line: not available
      Bootstrap.start() line: 289
      Bootstrap.main(String[]) line: 414

      After a while, the session times out and the following exception appears:
      Oct 25, 2012 1:16:56 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
      INFO: Waiting until we see more replicas up: total=2 found=0 timeoutin=-95
      Oct 25, 2012 1:16:56 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
      INFO: Was waiting for replicas to come up, but they are taking too long - assuming they won't come back till later
      Oct 25, 2012 1:16:56 PM org.apache.solr.common.SolrException log
      SEVERE: Errir checking for the number of election participants:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /collections/collection1/leader_elect/shard2/election
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
      at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1249)
      at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:227)
      at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:224)
      at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:63)
      at org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:224)
      at org.apache.solr.cloud.ShardLeaderElectionContext.waitForReplicasToComeUp(ElectionContext.java:276)
      at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:143)
      at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:152)
      at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:96)
      at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:262)
      at org.apache.solr.cloud.ZkController.joinElection(ZkController.java:733)
      at org.apache.solr.cloud.ZkController.register(ZkController.java:566)
      at org.apache.solr.cloud.ZkController.register(ZkController.java:532)
      at org.apache.solr.core.CoreContainer.registerInZk(CoreContainer.java:709)
      at org.apache.solr.core.CoreContainer.register(CoreContainer.java:693)
      at org.apache.solr.core.CoreContainer.load(CoreContainer.java:535)
      at org.apache.solr.core.CoreContainer.load(CoreContainer.java:356)
      at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:308)
      at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:107)
      at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295)
      at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422)
      at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:115)
      at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4072)
      at org.apache.catalina.core.StandardContext.start(StandardContext.java:4726)
      at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:799)
      at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:779)
      at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:601)
      at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:675)
      at org.apache.catalina.startup.HostConfig.deployDescriptors(HostConfig.java:601)
      at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:502)
      at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1317)
      at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:324)
      at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:142)
      at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1065)
      at org.apache.catalina.core.StandardHost.start(StandardHost.java:840)
      at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1057)
      at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:463)
      at org.apache.catalina.core.StandardService.start(StandardService.java:525)
      at org.apache.catalina.core.StandardServer.start(StandardServer.java:754)
      at org.apache.catalina.startup.Catalina.start(Catalina.java:595)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
      at java.lang.reflect.Method.invoke(Unknown Source)
      at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289)
      at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414)

      Followed by:
      Oct 25, 2012 1:17:27 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
      SEVERE: Recovery failed - trying again... core=collection1
      Oct 25, 2012 1:18:32 PM org.apache.solr.common.SolrException log
      SEVERE: Error while trying to recover. core=collection1
      Oct 25, 2012 1:18:32 PM org.apache.solr.common.SolrException log
      SEVERE: Error while trying to recover. core=collection1:org.apache.solr.common.SolrException: No registered leader was found, collection:collection1 slice:shard1
      at org.apache.solr.common.cloud.ZkStateReader.getLeaderProps(ZkStateReader.java:413)
      at org.apache.solr.common.cloud.ZkStateReader.getLeaderProps(ZkStateReader.java:399)
      at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:318)
      at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:220)

        Activity

        Markus Jelsma added a comment -

        We're seeing this too, using a current trunk on a 10 node test cluster. We can trigger this state if we restart all servlet containers sequentially or roughly at the same time; most nodes are stuck in this state, while others sometimes change to a different state, throwing all kinds of exceptions (see list). We can only get out of this state by restarting some servlet containers again. It does not happen at all with Zookeeper's data directory wiped clean; the cluster then starts very nicely.

        Mark Miller added a comment -

        Hey guys - I've been out of power for the week and am headed to ApacheCon for next week, but I'll try and look into this soon. 4.1 is coming soon and I'd like all these issues fixed in it.

        Mark Miller added a comment -

        I think a good way to solve this might be SOLR-4063: CoreContainer should be able to load multiple SolrCore's in parallel rather than just serially.

        Then just require a value of at least 2 when running in SolrCloud mode.
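        As a rough illustration of that idea (not the SOLR-4063 patch itself; loadAndRegisterCore and the pool sizing below are assumptions), core loading would be handed to a small executor so a core blocked in leader election no longer holds up the sibling replica on the same node:

        import java.util.List;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.TimeUnit;

        public class ParallelCoreLoadSketch {

            // Placeholder for CoreContainer.create(...) plus ZkController.register(...), which may
            // block in waitForReplicasToComeUp until the shard's other replicas show up.
            static void loadAndRegisterCore(String coreName) {
                System.out.println("registering " + coreName + " on " + Thread.currentThread().getName());
            }

            public static void main(String[] args) throws InterruptedException {
                List<String> cores = List.of("core1", "core2", "core3", "core4");

                // "at least 2" loader threads in SolrCloud mode, as suggested above, so the replica
                // of a waiting leader candidate can register concurrently instead of afterwards.
                int threads = Math.max(2, Runtime.getRuntime().availableProcessors());
                ExecutorService coreLoadExecutor = Executors.newFixedThreadPool(threads);
                for (String core : cores) {
                    coreLoadExecutor.submit(() -> loadAndRegisterCore(core));
                }
                coreLoadExecutor.shutdown();
                coreLoadExecutor.awaitTermination(10, TimeUnit.MINUTES);
            }
        }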

        Mark Miller added a comment -

        I've written a test for that issue that fails before SOLR-4063 and passes after it.

        I'll commit it shortly.

        Commit Tag Bot added a comment -

        [trunk commit] Mark Robert Miller
        http://svn.apache.org/viewvc?view=revision&revision=1408313

        SOLR-3993: If multiple SolrCore's for a shard coexist on a node, on cluster restart, leader election would stall until timeout, waiting to see all of the replicas come up.

        Commit Tag Bot added a comment -

        [branch_4x commit] Mark Robert Miller
        http://svn.apache.org/viewvc?view=revision&revision=1408323

        SOLR-3993: If multiple SolrCore's for a shard coexist on a node, on cluster restart, leader election would stall until timeout, waiting to see all of the replicas come up.

        Mark Miller added a comment -

        Thanks Alexey!

        Alexey Kudinov added a comment -

        Thanks Mark!
        I will test it as soon as possible

        Werner Maier added a comment -

        Thanks Mark, works better.
        Tested nightly 2012-11-16 on a 1-shard-3-node cluster (hardware, each core on a different server, jetty container).

        I can still reproduce a problem:
        1) simulate a power outage on the cluster (kill -9 on all servers), leaving the zookeeper ensemble running.
        2) start only ONE core that has NOT been the former leader.

        Result: loop in Running recovery...

        3) restart that core (clean shutdown with kill instead of kill -9):

        Result: all will be fine.
        (The node connects to zookeeper, registers as leader, runs all changes, cleans the zookeeper queue, and finally
        waits for other replicas to come up. After that timeout it declares itself leader and all is fine.)

        Second problem (might be at least similar):
        1) setup: 1 shard, 3 nodes, zookeeper ensemble on three servers. Node 1 is leader.
        2) this time killall -9 java (shuts down the zookeeper ensemble and the solr cores - a simulated power outage on all three servers)
        3) start the solr cores on servers 2 and 3 (which had NOT been leader); they try to connect to zookeeper, but can't
        4) start zookeeper on servers 2 and 3 (this still simulates a hardware failure of server #1)

        Sometimes both cores loop in recovery.
        Sometimes a core gets stuck in "shutting down": "INFO: Client->ZooKeeper status change trigger but we are already closed".

        Restarting the cores helps every time.

        Kind regards.

        Mark Miller added a comment -

        Bummer - hard to make a test that simulates kill -9. I imagine the key issue is that the nodes going down are having the leader ephemerals timeout - perhaps when I get some time I can work on simulating that.

        Po Rui added a comment -

        To Werner:

        2) start only ONE core that has NOT been the former leader.
        ……
        Result: loop in Running recovery...

        This single core will become the leader eventually. It takes a long time because waitForReplicasToComeUp() only ends on timeout. The core will cancelRecovery() after it becomes the leader; it won't keep looping, since cancelRecovery() stops the recovery thread. But before this core becomes the leader, the recovery thread tries to connect to the old leader and do recovery, and it throws a lot of exceptions along the way due to the dead leader. That's fine, since the recovery thread will die too.

        Second problem (might be at least similar):
        A live zookeeper is a precondition for a solr cluster, and the zookeeper service must be ready before you start a solr instance. You'd better run the zookeeper service separately.
        Werner Maier added a comment -

        To Po: (thanks for your comment)

        1) [...] This single core will become the leader eventually. It takes a long time because waitForReplicasToComeUp() [...]

        I don't see a waitForReplicasToComeUp in the logs in that case. Maybe I misinterpreted the many exceptions in the log.
        After a power failure or kill -9 it shows (for me) a recovery loop. Maybe this loop will end eventually (I stopped watching
        after some 10-15 min). I'll try that again when I have a few minutes.

        2) Zookeeper:
        I know. But I'm a sysadmin dealing with real hardware (and failures) - not a programmer who just assumes failure-free hardware and proposes preconditions.
        I have seen a ROW of 19"-cabinets going down - even though each of them had TWO redundant power lines and the computing center had an N+2 UPS. So I'm just trying all the scenarios I can imagine - and one of these is a power outage for zookeeper and solr at the same time. Some things can be controlled with startup order (first zookeeper, then solr) on single machines, but if multiple machines are involved this gets difficult.
        If such a problem can easily be circumvented by the solr instance reconnecting to zookeeper, then the solr instance should just do that.

        So I tried these things and wrote the bug report, as the developers (who do a very good job!) might just not have considered these cases.

        Po Rui added a comment -

        To Werner:

        1) Are you sure your solr is 4.0 rather than 4.0-ALPHA or 4.0-BETA? Generally, the situation you described produces a lot of waitForReplicasToComeUp log output. If there is no exception it will stop 3 minutes later. Those logs are due to zookeeper keeping the old servers' info in the cluster state.

        2) Both situations above involve zookeeper leader re-election. No solr service can work before that election is over. A solr node may die if the election takes longer than solr's zookeeper connection retry time, so the problem is hard to deal with, since the zookeeper leader re-election procedure is uncontrollable.

        Werner Maier added a comment -

        1) Yes, I'm sure. As I wrote, I "tested nightly 2012-11-16", after Mark's fix for the leader election. Sorry, I wrote "nightly" instead of "nightly build", which would have been more specific.
        The nodes were running:
        solr-impl 4.1-2012-11-16_01-01-43 1410134 - hudson - 2012-11-16 01:10:49
        lucene-impl 4.1-2012-11-16_01-01-43 1410134 - hudson - 2012-11-16 01:08:46
        (and zookeeper 3.4.4)

        2) I understand. Then a solution for us might be to simply test the zookeeper connections before starting the solr nodes. This could be done using 4-letter commands like 'ruok' + 'stat' in the startup script for solr, if I'm not wrong.
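        For reference, a small Java sketch of such a check (the hostnames, ports, and timeouts are made up): it sends ZooKeeper's 'ruok' four-letter command over a plain socket and expects 'imok' back, roughly what a startup script would do with nc. Note that newer ZooKeeper versions may require whitelisting four-letter commands; with 3.4.x they work out of the box.

        import java.io.InputStream;
        import java.io.OutputStream;
        import java.net.InetSocketAddress;
        import java.net.Socket;
        import java.nio.charset.StandardCharsets;

        public class ZkRuokCheck {

            // Returns true if the ZooKeeper server at host:port answers "ruok" with "imok".
            static boolean isZkOk(String host, int port) {
                try (Socket socket = new Socket()) {
                    socket.connect(new InetSocketAddress(host, port), 3000);
                    socket.setSoTimeout(3000);
                    OutputStream out = socket.getOutputStream();
                    out.write("ruok".getBytes(StandardCharsets.US_ASCII));
                    out.flush();
                    socket.shutdownOutput(); // signal end of request; the server replies and closes
                    InputStream in = socket.getInputStream();
                    String reply = new String(in.readAllBytes(), StandardCharsets.US_ASCII);
                    return "imok".equals(reply.trim());
                } catch (Exception e) {
                    return false;
                }
            }

            public static void main(String[] args) {
                // Hypothetical ensemble addresses; replace with your own hosts/ports.
                String[] ensemble = {"zk1:2181", "zk2:2181", "zk3:2181"};
                for (String hostPort : ensemble) {
                    String[] hp = hostPort.split(":");
                    boolean ok = isZkOk(hp[0], Integer.parseInt(hp[1]));
                    System.out.println(hostPort + " -> " + (ok ? "imok" : "not reachable"));
                }
            }
        }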

        Mark Miller added a comment -

        So I just tested out this scenario, and it seemed to work as expected for me. I killed all the replicas in a shard and then started one that had not been the leader before - it waited to see the other replicas come back - it does this for safety reasons. Once the timeout went by (I think it's 5 minutes), or I started the other replicas, everything went green.

        You can lower the timeout or turn it off if you want. Just set the leaderVoteWait attribute on 'cores' in solr.xml.
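        For example, a solr.xml fragment along these lines (legacy 4.x format; the 10000 value, which I believe is milliseconds, and the surrounding attributes are placeholders - keep whatever your solr.xml already has):

        <solr persistent="true">
          <cores adminPath="/admin/cores" defaultCoreName="collection1" leaderVoteWait="10000">
            <core name="collection1" instanceDir="collection1" shard="shard1" />
          </cores>
        </solr>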


          People

          • Assignee: Mark Miller
          • Reporter: Alexey Kudinov
          • Votes: 0
          • Watchers: 7
