SOLR-4734

Leader election fails with an NPE if there is no UpdateLog.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.2.1, 4.3
    • Fix Version/s: 4.3.1, 4.4, 6.0
    • Component/s: SolrCloud
    • Labels:
      None
    • Environment:

      Linux 64bit on 3.2.0-33-generic kernel
      Solr: 4.2.1
      ZooKeeper: 3.4.5
      Tomcat 7.0.27

      Description

      The following setup and steps always lead to the same error:
      app01: ZooKeeper
      app02: ZooKeeper, Solr (in Tomcat)
      app03: ZooKeeper, Solr (in Tomcat)

      *) Start ZooKeeper as an ensemble on all machines.
      *) Start Tomcat on app02/app03

      clusterstate.json
      null
      cZxid = 0x100000014
      ctime = Thu Apr 18 10:59:24 CEST 2013
      mZxid = 0x100000014
      mtime = Thu Apr 18 10:59:24 CEST 2013
      pZxid = 0x100000014
      cversion = 0
      dataVersion = 0
      aclVersion = 0
      ephemeralOwner = 0x0
      dataLength = 0
      numChildren = 0
      

      *) Upload the configuration (on app02) for the collection via the following command:

          zkcli.sh -cmd upconfig --zkhost app01:4181,app02:4181,app03:4181 --confdir config/solr/storage/conf/ --confname storage-conf 
      

      *) Link the configuration (on app02) via the following command:

          zkcli.sh -cmd linkconfig --collection storage --confname storage-conf --zkhost app01:4181,app02:4181,app03:4181
      

      *) Create the collection via:

      http://app02/solr/admin/collections?action=CREATE&name=storage&numShards=1&replicationFactor=2&collection.configName=storage-conf
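The CREATE request above can also be assembled programmatically, which keeps the parameter names in one place. This is a minimal sketch only: the helper name create_collection_url is hypothetical, and the base URL and parameter values are simply those from the report.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards, replication_factor, config_name):
    """Build a Collections API CREATE URL (hypothetical helper, for illustration)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        # Names the config set in ZooKeeper that the new collection should use.
        "collection.configName": config_name,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


url = create_collection_url("http://app02/solr", "storage", 1, 2, "storage-conf")
print(url)
```

With the values from the report, this reproduces the URL shown in the step above.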
      
      clusterstate.json
      {"storage":{
          "shards":{"shard1":{
              "range":"80000000-7fffffff",
              "state":"active",
              "replicas":{
                "app02:9985_solr_storage_shard1_replica2":{
                  "shard":"shard1",
                  "state":"down",
                  "core":"storage_shard1_replica2",
                  "collection":"storage",
                  "node_name":"app02:9985_solr",
                  "base_url":"http://app02:9985/solr"},
                "app03:9985_solr_storage_shard1_replica1":{
                  "shard":"shard1",
                  "state":"down",
                  "core":"storage_shard1_replica1",
                  "collection":"storage",
                  "node_name":"app03:9985_solr",
                  "base_url":"http://app03:9985/solr"}}}},
          "router":"compositeId"}}
      cZxid = 0x100000014
      ctime = Thu Apr 18 10:59:24 CEST 2013
      mZxid = 0x100000047
      mtime = Thu Apr 18 11:04:06 CEST 2013
      pZxid = 0x100000014
      cversion = 0
      dataVersion = 2
      aclVersion = 0
      ephemeralOwner = 0x0
      dataLength = 847
      numChildren = 0
      

      This creates a replica of the shard on app02 and on app03, but neither is marked as leader; both are marked as down.
      Afterwards I cannot access the collection.
      In the browser I get:

      "SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:"
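The symptom described above (a shard whose replicas are all down and none of which is the leader) can be spotted directly in clusterstate.json. The sketch below assumes the 4.x layout, in which the elected replica carries a "leader":"true" property. The function name shards_without_leader is hypothetical, and the sample is a trimmed copy of the state from this report.

```python
import json


def shards_without_leader(clusterstate_text):
    """Return (collection, shard) pairs in which no replica is flagged as leader."""
    state = json.loads(clusterstate_text)
    missing = []
    for coll_name, coll in state.items():
        for shard_name, shard in coll.get("shards", {}).items():
            replicas = shard.get("replicas", {}).values()
            # Assumption: the leader replica is marked with "leader": "true".
            if not any(r.get("leader") == "true" for r in replicas):
                missing.append((coll_name, shard_name))
    return missing


# Trimmed version of the clusterstate.json from the report: both replicas are down.
SAMPLE = """
{"storage": {
  "shards": {"shard1": {
    "range": "80000000-7fffffff",
    "state": "active",
    "replicas": {
      "app02:9985_solr_storage_shard1_replica2": {"state": "down", "core": "storage_shard1_replica2"},
      "app03:9985_solr_storage_shard1_replica1": {"state": "down", "core": "storage_shard1_replica1"}}}},
  "router": "compositeId"}}
"""

print(shards_without_leader(SAMPLE))  # [('storage', 'shard1')]
```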
      

      The logs contain the following stack trace:

      Apr 18, 2013 11:04:05 AM org.apache.solr.common.SolrException log
      SEVERE: org.apache.solr.common.SolrException: Error CREATEing SolrCore 'storage_shard1_replica2': 
              at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:483)
              at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:140)
              at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
              at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:591)
              at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:192)
              at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
              at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
              at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
              at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:225)
              at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
              at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
              at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
              at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
              at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
              at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:999)
              at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:565)
              at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
              at java.lang.Thread.run(Thread.java:722)
      Caused by: org.apache.solr.common.cloud.ZooKeeperException: 
              at org.apache.solr.core.CoreContainer.registerInZk(CoreContainer.java:931)
              at org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:892)
              at org.apache.solr.core.CoreContainer.register(CoreContainer.java:841)
              at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:479)
              ... 19 more
      Caused by: java.lang.NullPointerException
              at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:190)
              at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:156)
              at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:100)
              at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:266)
              at org.apache.solr.cloud.ZkController.joinElection(ZkController.java:935)
              at org.apache.solr.cloud.ZkController.register(ZkController.java:761)
              at org.apache.solr.cloud.ZkController.register(ZkController.java:727)
              at org.apache.solr.core.CoreContainer.registerInZk(CoreContainer.java:908)
              ... 22 more
      

      I have attached the minimal set of configuration files needed to reproduce this error, together with the log files for the commands above, in the order I ran them.

      1. config-logs.zip
        38 kB
        Alexander Eibner

        Activity

        Alexander Eibner added a comment -

        Minimal set of configurations for reproducing the error.

        Log files for the steps above

        Mark Miller added a comment -

        Could use a better error message - it's not finding an update log. Looking at a config in your zip file, you have updateLog defined in the wrong place - it needs to be in updateHandler.
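To illustrate the placement issue described in this comment, the sketch below checks whether a solrconfig.xml declares updateLog inside updateHandler. The function name update_log_placement and both sample configs are hypothetical. The files in config-logs.zip are not reproduced here, so the "misplaced" sample only assumes one way the element could end up in the wrong place.

```python
import xml.etree.ElementTree as ET


def update_log_placement(solrconfig_xml):
    """Classify where <updateLog> sits: 'ok', 'misplaced', or 'missing'."""
    root = ET.fromstring(solrconfig_xml)
    # Expected location: a child of <updateHandler>, directly under <config>.
    if root.find("./updateHandler/updateLog") is not None:
        return "ok"
    # Present somewhere else in the file, where it is not picked up.
    if root.find(".//updateLog") is not None:
        return "misplaced"
    return "missing"


GOOD = """
<config>
  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
  </updateHandler>
</config>
"""

# Hypothetical misplacement: updateLog is a sibling of updateHandler.
MISPLACED = """
<config>
  <updateHandler class="solr.DirectUpdateHandler2"/>
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
</config>
"""

print(update_log_placement(GOOD))
print(update_log_placement(MISPLACED))
```

Running a check like this before uploading a config with upconfig would flag the problem earlier than the NullPointerException during leader election does.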

        Shawn Heisey added a comment -

        One other thing in addition to Mark's note - the step where you link the config with zkcli isn't necessary, and at that point, the collection doesn't exist, so it can't be linked.

        The best-case scenario is that the linkconfig step isn't doing anything at all. The worst-case scenario is that linkconfig puts something in zookeeper that prevents the CREATE from working properly.

        The CREATE action takes care of linking the config to the collection, via the collection.configName parameter.

        Shawn Heisey added a comment -

        Another note: The linkconfig might in fact be failing, but you aren't seeing an error message because of SOLR-4807. The fix for that problem will be in 4.3.1 when it gets released. If you download the 4.3.1 source code (or a nightly build) and copy its cloud-scripts directory over to your 4.3.0 install, you'll have logging.

        Mark Miller added a comment -

        "at that point, the collection doesn't exist, so it can't be linked."

        You can link before the collection exists - this feature was added to support some more complicated scenarios. When the collection is actually created, it will find the link.

        Shawn Heisey added a comment -

        "You can link before the collection exists - this feature was added to support some more complicated scenarios. When the collection is actually created, it will find the link."

        Thanks, Mark. I can always learn new things! I think I can envision the scenario - upload the config, link it to the collection that doesn't exist yet, then skip the collections API and manually create each core with CoreAdmin.

        Shalin Shekhar Mangar added a comment -

        I'll backport this to 4.3.1 if there are no objections.

        Shalin Shekhar Mangar added a comment -

        Backported to 4.3.1 r1483669.

        Actually, this fix was already backported to 4.3.1 with SOLR-4829, so I just moved the change log entry to 4.3.1.

        Alexander Eibner added a comment -

        Thanks. Yes, the updateLog was the problem; sorry I did not see this.
        Now the Collections API calls work, but the cores are created on only one node.
        That is a separate problem, which I have posted to the users list.

        Shalin Shekhar Mangar added a comment -

        Bulk close after 4.3.1 release


          People

          • Assignee:
            Mark Miller
          • Reporter:
            Alexander Eibner
          • Votes:
            0
          • Watchers:
            4
