ZkController.publishAndWaitForDownStates always times out when a new cluster is started

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10.3, 5.0, 6.0
    • Component/s: SolrCloud
    • Labels:

      Description

      With stateFormat=2, our Solr instance always takes a while to start up and logs this warning:

      WARN - 2014-10-08 17:30:24.290; org.apache.solr.cloud.ZkController; Timed out waiting to see all nodes published as DOWN in our cluster state.

      Looking at the code, this is probably because ZkController.publishAndWaitForDownStates is called in ZkController.init, which is called via ZkContainer.initZookeeper in CoreContainer.load, before any of the stateFormat=2 collection watches are set by the CoreContainer.preRegisterInZk call a few lines later.

      1. SOLR-6610.patch
        0.8 kB
        Noble Paul

          Activity

          Noble Paul added a comment - edited

          Is it in the trunk or in your own internal version?

          Jessica Cheng Mallet added a comment -

          We're seeing it manifest in our own build, but it looks like the relevant code in trunk is the same. I did misdescribe it: I said ZkController.init is called in ZkContainer.initZookeeper, but it is actually called in the constructor of ZkController, which is constructed in ZkContainer.initZookeeper.

          Shalin Shekhar Mangar added a comment -

          The way ZkController.publishAndWaitForDownStates is written, it checks whether live_nodes exists and then tries to publish and wait, but that's not really correct. It should check that live_nodes exists and has at least one child. Only then can we be sure that there will be an overseer to process the state requests.

          I've seen this problem on cluster restarts where ZK already has /live_nodes existing but without any children (which are ephemeral of course). But I've never seen this problem on individual node restarts when an overseer exists in the cluster already.
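
          A minimal sketch of the check described above, for illustration only. It is not the attached patch and does not use Solr's or ZooKeeper's real client classes: the ZkView interface, the in-memory stand-in, and the method names are all hypothetical. It only contrasts "the node exists" with "the node exists and has at least one child".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LiveNodesCheck {

    /** Hypothetical read-only view of a ZooKeeper tree (not a real Solr/ZooKeeper API). */
    public interface ZkView {
        boolean exists(String path);
        List<String> getChildren(String path);
    }

    /** Behaviour as described in the comment: wait whenever /live_nodes exists. */
    public static boolean shouldWaitExistsOnly(ZkView zk) {
        return zk.exists("/live_nodes");
    }

    /** Suggested behaviour: wait only if /live_nodes exists and has at least one child. */
    public static boolean shouldWaitForDownStates(ZkView zk) {
        return zk.exists("/live_nodes") && !zk.getChildren("/live_nodes").isEmpty();
    }

    /** In-memory stand-in for ZooKeeper: maps a path to the names of its children. */
    public static ZkView inMemory(final Map<String, List<String>> tree) {
        return new ZkView() {
            @Override
            public boolean exists(String path) {
                return tree.containsKey(path);
            }

            @Override
            public List<String> getChildren(String path) {
                List<String> children = tree.get(path);
                return children == null ? Collections.<String>emptyList() : children;
            }
        };
    }

    public static void main(String[] args) {
        // Full cluster restart: /live_nodes survives, but its ephemeral children are gone.
        Map<String, List<String>> restarted = new HashMap<>();
        restarted.put("/live_nodes", Collections.<String>emptyList());

        // Single node restart: another node (hypothetical name) is still registered.
        Map<String, List<String>> running = new HashMap<>();
        running.put("/live_nodes", Arrays.asList("host1:8983_solr"));

        System.out.println("restarted, exists-only check:   " + shouldWaitExistsOnly(inMemory(restarted)));
        System.out.println("restarted, with-children check: " + shouldWaitForDownStates(inMemory(restarted)));
        System.out.println("running, with-children check:   " + shouldWaitForDownStates(inMemory(running)));
    }
}
```

          In the restarted case the exists-only check still decides to wait, while the with-children check skips the wait because no live node (and therefore no overseer) can be present yet.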

          Jessica Cheng Mallet added a comment -

          Shalin, I think you're right. I misread the code: publishAndWaitForDownStates's call to clusterState.getCollection(collectionName) doesn't actually require a watch, since it calls out to ZooKeeper on demand. This also explains why most of our complaints come from 1-node dev clusters.

          Noble Paul added a comment - edited

          Now that makes sense. The problem Shalin Shekhar Mangar mentioned happens all the time, and it happens whenever the entire cluster is restarted (which is always the case for 1-node clusters).

          Noble Paul added a comment -

          This should fix the problem reported by Shalin Shekhar Mangar.

          ASF subversion and git services added a comment -

          Commit 1635163 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1635163 ]

          SOLR-6610

          ASF subversion and git services added a comment -

          Commit 1635168 from Noble Paul in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1635168 ]

          SOLR-6610

          Shalin Shekhar Mangar added a comment -

          Reopening to backport to 4.10.3.

          ASF subversion and git services added a comment -

          Commit 1642732 from shalin@apache.org in branch 'dev/branches/lucene_solr_4_10'
          [ https://svn.apache.org/r1642732 ]

          SOLR-6610: Slow startup of new clusters because ZkController.publishAndWaitForDownStates always times out

          Anshum Gupta added a comment -

          Bulk close after 5.0 release.


            People

            • Assignee: Noble Paul
            • Reporter: Jessica Cheng Mallet