Solr
  1. Solr
  2. SOLR-6159

cancelElection fails on uninitialized ElectionContext

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 4.8.1
    • Fix Version/s: 4.10, 6.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      I had a solr collection that basically was out of memory (no exception, just continuous 80-90 second full GCs). This of course is not a good state, but when in this state ever time you come out of the GC your zookeeper session has expired causing all kinds of havoc. Anyway I found bug in the condition where during LeaderElector.setup() if you get a Zookeeper error the LeaderElector.context get set to a context that is not fully initialized (ie hasn't called joinElection..

      Anyway once this happens the node can no longer attempt to join elections because every time the LeaderElector attempts to call cancelElection() on the previous ElectionContext..

      Some logs below and I've attached a patch that does 2 things:

      • Move the setting of LeaderElector.context in the setup call to then of the call so it is only set if the setup completes.
      • Added a check to see if leaderSeqPath is null in ElectionContext.cancelElection
      • Made leaderSeqPath volatile as it is being directly accessed by multiple threads.
      • set LeaderElector.context = null when joinElection fails

      There may be other issues.. the patch is focused on breaking the failure loop that occurs when initialization of the ElectionContext fails.

      2014-06-08 23:14:57.805 INFO  ClientCnxn [main-SendThread(host1:1234)] - Opening socket connection to server host1/10.122.142.31:1234. Will not attempt to authenticate using SASL (unknown error)
      2014-06-08 23:14:57.806 INFO  ClientCnxn [main-SendThread(host1:1234)] - Socket connection established to host1/10.122.142.31:1234, initiating session
      2014-06-08 23:14:57.810 INFO  ClientCnxn [main-SendThread(host1:1234)] - Unable to reconnect to ZooKeeper service, session 0x2467d956c8d0446 has expired, closing socket connection
      2014-06-08 23:14:57.816 INFO  ConnectionManager [main-EventThread] - Watcher org.apache.solr.common.cloud.ConnectionManager@7fe8e0f1 name:ZooKeeperConnection Watcher:host4:1234,host1:1234,host3:1234,host2:1234/engines/solr/collections/XXXXXX got event WatchedEvent state:Expired type:None path:null path:null type:None
      2014-06-08 23:14:57.817 INFO  ConnectionManager [main-EventThread] - Our previous ZooKeeper session was expired. Attempting to reconnect to recover relationship with ZooKeeper...
      2014-06-08 23:14:57.817 INFO  DefaultConnectionStrategy [main-EventThread] - Connection expired - starting a new one...
      2014-06-08 23:14:57.817 INFO  ZooKeeper [main-EventThread] - Initiating client connection, connectString=host4:1234,host1:1234,host3:1234,host2:1234/engines/solr/collections/XXXXXX sessionTimeout=15000 watcher=org.apache.solr.common.cloud.ConnectionManager@7fe8e0f1
      2014-06-08 23:14:57.857 INFO  ConnectionManager [main-EventThread] - Waiting for client to connect to ZooKeeper
      2014-06-08 23:14:57.859 INFO  ClientCnxn [main-SendThread(host4:1234)] - Opening socket connection to server host4/172.17.14.107:1234. Will not attempt to authenticate using SASL (unknown error)
      2014-06-08 23:14:57.891 INFO  ClientCnxn [main-SendThread(host4:1234)] - Socket connection established to host4/172.17.14.107:1234, initiating session
      2014-06-08 23:14:57.906 INFO  ClientCnxn [main-SendThread(host4:1234)] - Session establishment complete on server host4/172.17.14.107:1234, sessionid = 0x4467d8d79260486, negotiated timeout = 15000
      2014-06-08 23:14:57.907 INFO  ConnectionManager [main-EventThread] - Watcher org.apache.solr.common.cloud.ConnectionManager@7fe8e0f1 name:ZooKeeperConnection Watcher:host4:1234,host1:1234,host3:1234,host2:1234/engines/solr/collections/XXXXXX got event WatchedEvent state:SyncConnected type:None path:null path:null type:None
      2014-06-08 23:14:57.909 INFO  ConnectionManager [main-EventThread] - Client is connected to ZooKeeper
      2014-06-08 23:14:57.909 INFO  ConnectionManager [main-EventThread] - Connection with ZooKeeper reestablished.
      2014-06-08 23:14:57.911 ERROR ZkController [Thread-203] - :org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/election
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
        at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:226)
        at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:223)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
        at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:223)
        at org.apache.solr.common.cloud.ZkCmdExecutor.ensureExists(ZkCmdExecutor.java:100)
        at org.apache.solr.common.cloud.ZkCmdExecutor.ensureExists(ZkCmdExecutor.java:94)
        at org.apache.solr.cloud.LeaderElector.setup(LeaderElector.java:324)
        at org.apache.solr.cloud.ZkController$1.command(ZkController.java:235)
        at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
      2014-06-08 23:14:57.911 WARN  ConnectionManager [Thread-203] - Exception running onReconnect command
      org.apache.solr.common.cloud.ZooKeeperException:
        at org.apache.solr.cloud.ZkController$1.command(ZkController.java:267)
        at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
      Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/election
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
        at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:226)
        at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:223)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
        at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:223)
        at org.apache.solr.common.cloud.ZkCmdExecutor.ensureExists(ZkCmdExecutor.java:100)
        at org.apache.solr.common.cloud.ZkCmdExecutor.ensureExists(ZkCmdExecutor.java:94)
        at org.apache.solr.cloud.LeaderElector.setup(LeaderElector.java:324)
        at org.apache.solr.cloud.ZkController$1.command(ZkController.java:235)
        ... 1 more
      2014-06-08 23:14:57.917 INFO  DefaultConnectionStrategy [main-EventThread] - Reconnected to ZooKeeper
      2014-06-08 23:14:57.917 INFO  ConnectionManager [main-EventThread] - Connected:true
      2014-06-08 23:14:57.917 INFO  ClientCnxn [main-EventThread] - EventThread shut down
      2014-06-08 23:14:57.917 INFO  ZkController [Thread-205] - publishing core=XXXXXX state=down collection=XXXXXX
      2014-06-08 23:14:57.920 INFO  ZkController [Thread-205] - numShards not found on descriptor - reading it from system property
      2014-06-08 23:14:58.033 INFO  ElectionContext [Thread-205] - canceling election null
      2014-06-08 23:14:58.083 ERROR ZkController [Thread-205] - :java.lang.IllegalArgumentException: Path cannot be null
        at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:45)
        at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:851)
        at org.apache.solr.common.cloud.SolrZkClient$2.execute(SolrZkClient.java:177)
        at org.apache.solr.common.cloud.SolrZkClient$2.execute(SolrZkClient.java:174)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
        at org.apache.solr.common.cloud.SolrZkClient.delete(SolrZkClient.java:174)
        at org.apache.solr.cloud.ElectionContext.cancelElection(ElectionContext.java:72)
        at org.apache.solr.cloud.OverseerElectionContext.cancelElection(ElectionContext.java:447)
        at org.apache.solr.cloud.ZkController$1.command(ZkController.java:232)
        at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
      2014-06-08 23:14:58.083 WARN  ConnectionManager [Thread-205] - Exception running onReconnect command
      org.apache.solr.common.cloud.ZooKeeperException:
        at org.apache.solr.cloud.ZkController$1.command(ZkController.java:267)
        at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166)
      Caused by: java.lang.IllegalArgumentException: Path cannot be null
        at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:45)
        at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:851)
        at org.apache.solr.common.cloud.SolrZkClient$2.execute(SolrZkClient.java:177)
        at org.apache.solr.common.cloud.SolrZkClient$2.execute(SolrZkClient.java:174)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
        at org.apache.solr.common.cloud.SolrZkClient.delete(SolrZkClient.java:174)
        at org.apache.solr.cloud.ElectionContext.cancelElection(ElectionContext.java:72)
        at org.apache.solr.cloud.OverseerElectionContext.cancelElection(ElectionContext.java:447)
        at org.apache.solr.cloud.ZkController$1.command(ZkController.java:232)
        ... 1 more
      
      1. SOLR-6159.patch
        7 kB
        Shalin Shekhar Mangar
      2. SOLR-6159.patch
        2 kB
        Steven Bower

        Issue Links

          Activity

          Hide
          Shalin Shekhar Mangar added a comment -

          Nice catch! Thanks Steven. Let me see if I can re-create the problem in a test.

          Show
          Shalin Shekhar Mangar added a comment - Nice catch! Thanks Steven. Let me see if I can re-create the problem in a test.
          Hide
          Shalin Shekhar Mangar added a comment -

          Added a new test called TestLeaderElectionZkExpiry which expires zk sessions in a tight loop. This test is able to reproduce the problem reported in this issue 6 out of 10 times on my machine. The patch given by Steven fixes the problems.

          Show
          Shalin Shekhar Mangar added a comment - Added a new test called TestLeaderElectionZkExpiry which expires zk sessions in a tight loop. This test is able to reproduce the problem reported in this issue 6 out of 10 times on my machine. The patch given by Steven fixes the problems.
          Hide
          ASF subversion and git services added a comment -

          Commit 1606996 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1606996 ]

          SOLR-6159: A ZooKeeper session expiry during setup can keep LeaderElector from joining elections

          Show
          ASF subversion and git services added a comment - Commit 1606996 from shalin@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1606996 ] SOLR-6159 : A ZooKeeper session expiry during setup can keep LeaderElector from joining elections
          Hide
          ASF subversion and git services added a comment -

          Commit 1606997 from shalin@apache.org in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1606997 ]

          SOLR-6159: A ZooKeeper session expiry during setup can keep LeaderElector from joining elections

          Show
          ASF subversion and git services added a comment - Commit 1606997 from shalin@apache.org in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1606997 ] SOLR-6159 : A ZooKeeper session expiry during setup can keep LeaderElector from joining elections
          Hide
          Shalin Shekhar Mangar added a comment -

          Thanks Steven!

          Show
          Shalin Shekhar Mangar added a comment - Thanks Steven!

            People

            • Assignee:
              Shalin Shekhar Mangar
              Reporter:
              Steven Bower
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development