SOLR-7146

MiniSolrCloudCluster based tests can fail with ZooKeeperException NoNode for /live_nodes

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.3, 6.0
    • Component/s: SolrCloud, Tests
    • Labels:
      None

      Description

      MiniSolrCloudCluster based tests can fail with the following exception:

      org.apache.solr.common.cloud.ZooKeeperException: 
      	at __randomizedtesting.SeedInfo.seed([3F3D838A8ADC9385:F153ADFBF163EC6D]:0)
      	at org.apache.solr.client.solrj.impl.CloudSolrClient.connect(CloudSolrClient.java:463)
      	at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:763)
      	at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:752)
      	at org.apache.solr.cloud.MiniSolrCloudCluster.createCollection(MiniSolrCloudCluster.java:193)
      	at org.apache.solr.handler.component.TestTwoPhaseDistributedQuery.testNoExtraFieldsRequestedFromShardsInPhaseOne(TestTwoPhaseDistributedQuery.java:79)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1618)
      	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:827)
      	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:863)
      	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:877)
      	at com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:53)
      	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:50)
      	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:46)
      	at com.carrotsearch.randomizedtesting.rules.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:55)
      	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:49)
      	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:65)
      	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
      	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
      	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:365)
      	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:798)
      	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:458)
      	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:836)
      	at com.carrotsearch.randomizedtesting.RandomizedRunner$3.evaluate(RandomizedRunner.java:738)
      	at com.carrotsearch.randomizedtesting.RandomizedRunner$4.evaluate(RandomizedRunner.java:772)
      	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:783)
      	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
      	at com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:53)
      	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:46)
      	at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:42)
      	at com.carrotsearch.randomizedtesting.rules.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:55)
      	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:39)
      	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:39)
      	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
      	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
      	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
      	at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:54)
      	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
      	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:65)
      	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:55)
      	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
      	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:365)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /live_nodes
      	at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
      	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
      	at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
      	at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:326)
      	at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:323)
      	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
      	at org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:323)
      	at org.apache.solr.common.cloud.ZkStateReader.createClusterStateWatchersAndUpdate(ZkStateReader.java:346)
      	at org.apache.solr.client.solrj.impl.CloudSolrClient.connect(CloudSolrClient.java:455)
      	... 44 more
      

      The reason is that the cluster constructor can return before any of the nodes are initialized. In that case the /live_nodes znode does not exist yet, so the client's getChildren call on it fails with NoNode.

      1. SOLR-7146.patch
        1 kB
        Vamsee Yarlagadda
      2. SOLR-7146v2.patch
        1 kB
        Vamsee Yarlagadda

          Activity

          Mark Miller added a comment -

          Looks like probably a problem that has come up in user land before as well (I think Grant pinged me about it once). I think the cluster is not yet ready for the client. We talked about having it wait around a bit in this case, since it can make some scripting difficult, but I don't think any work was ever done.

          Vamsee Yarlagadda added a comment -

          Maybe we can add a check at this line to wait for the number of znodes under /solr/live_nodes to match the number of servers being started?
          https://github.com/apache/lucene-solr/blob/trunk/solr/test-framework/src/java/org/apache/solr/cloud/MiniSolrCloudCluster.java#L116

          Of course, we need a timeout rather than waiting indefinitely. I can work on it if the above approach sounds good.
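
          The wait proposed above could look roughly like the following. This is a minimal sketch, not the attached patch: LiveNodeWaiter, waitForLiveNodes, and the IntSupplier that reports the live-node count are hypothetical names chosen for illustration, and the real change would read the children of /solr/live_nodes through the cluster's ZooKeeper client instead.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.IntSupplier;

/** Hypothetical helper (not part of Solr): waits until enough nodes are live. */
final class LiveNodeWaiter {

    private LiveNodeWaiter() {}

    /**
     * Polls liveNodeCount until it reports at least expectedNodes.
     * Throws TimeoutException if that has not happened within timeoutMs.
     */
    static void waitForLiveNodes(IntSupplier liveNodeCount,
                                 int expectedNodes,
                                 long timeoutMs,
                                 long pollIntervalMs)
            throws InterruptedException, TimeoutException {
        // nanoTime is monotonic, so the deadline is not affected by clock changes.
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            int live = liveNodeCount.getAsInt();
            if (live >= expectedNodes) {
                return; // every server has registered; stop waiting immediately
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException("Only " + live + " of " + expectedNodes
                        + " nodes were live after " + timeoutMs + " ms");
            }
            Thread.sleep(pollIntervalMs);
        }
    }
}
```

          The timeout is a parameter because the thread tries more than one value: the first patch waits up to 10 seconds, and the committed version uses one minute.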

          Shalin Shekhar Mangar added a comment -

          "Looks like probably a problem that has come up in user land before as well (I think Grant pinged me about it once)."

          Yup, I found SOLR-4044, which describes the same problem.

          "Maybe we can add a check at this line to wait for the number of znodes under /solr/live_nodes to match the number of servers being started?"

          We should do that to solve this particular problem, because this is a test-framework class and it should work regardless of whether you use CloudSolrClient or HttpSolrClient. But IMO we should also find a way to solve it inside CloudSolrServer.

          Vamsee Yarlagadda added a comment -

          Here is the first revision of the patch. I added logic to wait for a maximum of 10 seconds before giving up.

          Vamsee Yarlagadda added a comment -

          Oops. I didn't check whether live_nodes exists in the first place. Let me update the patch.

          Vamsee Yarlagadda added a comment -

          Added a check to see whether /solr/live_nodes exists before getting the list of live Solr servers.
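
          The existence check described here might be sketched as follows. This is an illustration under stated assumptions, not the contents of SOLR-7146v2.patch: ZnodeReader is a hypothetical stand-in for a ZooKeeper client, and LiveNodes.list is an invented helper name.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a ZooKeeper client; not a Solr or ZooKeeper type. */
interface ZnodeReader {
    boolean exists(String path) throws Exception;

    List<String> getChildren(String path) throws Exception;
}

/** Hypothetical helper (not part of Solr): lists live nodes without failing on a missing znode. */
final class LiveNodes {

    private LiveNodes() {}

    /** Returns the children of liveNodesPath, or an empty list if the znode is absent. */
    static List<String> list(ZnodeReader zk, String liveNodesPath) throws Exception {
        if (!zk.exists(liveNodesPath)) {
            // No node has registered yet, so report zero live nodes
            // instead of letting getChildren fail with NoNode.
            return Collections.emptyList();
        }
        return zk.getChildren(liveNodesPath);
    }
}
```

          With a real ZooKeeper client the znode could still disappear between the two calls, so also catching KeeperException.NoNodeException around getChildren would be a reasonable extra guard.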

          Alan Woodward added a comment -

          For CloudSolrClient, does it make sense to create the znode if it doesn't exist in ZkStateReader.createClusterStateWatchersAndUpdate()?

          Shalin Shekhar Mangar added a comment -

          "For CloudSolrClient, does it make sense to create the znode if it doesn't exist in ZkStateReader.createClusterStateWatchersAndUpdate()?"

          This was discussed in SOLR-4044, but I agree with Yonik that our clients should be read-only w.r.t. ZooKeeper. Let's keep the CloudSolrClient-related discussions in SOLR-4044 and just fix MiniSolrCloudCluster here.

          Alan Woodward added a comment -

          Shalin Shekhar Mangar, I think my proposed patch on SOLR-4044 will fix this too?

          ASF subversion and git services added a comment -

          Commit 1682002 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1682002 ]

          SOLR-7146: MiniSolrCloudCluster based tests can fail with ZooKeeperException NoNode for /live_nodes

          ASF subversion and git services added a comment -

          Commit 1682003 from shalin@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1682003 ]

          SOLR-7146: MiniSolrCloudCluster based tests can fail with ZooKeeperException NoNode for /live_nodes

          Shalin Shekhar Mangar added a comment -

          I increased the timeout to a minute and changed the logic to break immediately if numservers is reached.

          Thanks Vamsee!

          Shalin Shekhar Mangar added a comment -

          Bulk close for 5.3.0 release


            People

            • Assignee:
              Shalin Shekhar Mangar
              Reporter:
              Shalin Shekhar Mangar
            • Votes:
              0
              Watchers:
              6
