SOLR-8561

Add fallback to ZkController.getLeaderProps for a mixed 5.4-pre-5.4 deployments

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4.1, 5.5
    • Component/s: SolrCloud
    • Labels:
      None
    • Flags:
      Patch

      Description

      See the last comments in SOLR-7844. That issue changed the structure of the leader path in ZK such that upgrading from pre-5.4 to 5.4 is impossible unless all nodes are taken down. This issue adds fallback logic that looks for the leader properties on the old ZK node, as discussed there.
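
The fallback described above can be sketched as follows. This is a minimal illustration, not the code from the attached patch: an in-memory map stands in for ZooKeeper, and the class name LeaderPropsFallback and the method readLeaderProps are hypothetical. The two paths are the ones reported later in this thread (the old node at leaders/shard1, the new node at leaders/shard1/leader).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Illustrative only: a plain Map stands in for ZooKeeper; this is not Solr's implementation. */
final class LeaderPropsFallback {

    /**
     * Hypothetical helper: reads leader properties from the new (5.4+) location first and,
     * if nothing is stored there, falls back to the old (pre-5.4) location.
     */
    static Optional<String> readLeaderProps(Map<String, String> zk,
                                            String newLeaderPath,
                                            String oldLeaderPath) {
        String props = zk.get(newLeaderPath);
        if (props != null) {
            return Optional.of(props);
        }
        // New node is missing: a pre-5.4 leader may have registered at the old path.
        return Optional.ofNullable(zk.get(oldLeaderPath));
    }

    public static void main(String[] args) {
        String oldPath = "/collections/gettingstarted/leaders/shard1";
        String newPath = oldPath + "/leader";

        Map<String, String> zk = new HashMap<>();
        zk.put(oldPath, "props-written-by-pre-5.4-leader");

        // Only the old node exists, so the lookup falls back to it.
        System.out.println(readLeaderProps(zk, newPath, oldPath).orElse("no leader registered"));
    }
}
```

The new location is checked first so that, once every node runs 5.4 or later, the fallback branch is simply never taken.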

      1. SOLR-8561.patch
        3 kB
        Shai Erera
      2. SOLR-8561.patch
        2 kB
        Shai Erera

          Activity

          Shai Erera added a comment -

          Mark Miller, same patch as I added on SOLR-7844. I will add the CHANGES entry after I know to which section it belongs (i.e. if we make it to 5.4.1).

          Shai Erera added a comment -

          FYI, all tests pass.

          Shai Erera added a comment -

          Patch adds a CHANGES entry, currently under 5.4.1. I'd really like to get it out in 5.4.1 (because currently, in order to upgrade to 5.4.0, you have to take down your entire cluster), and Adrien Grand is about to cut the release soon. I'd appreciate another set of eyes reviewing the change, although it's a simple fallback strategy. Varun Thacker, Noble Paul, Ishan Chattopadhyaya: if you're around, would you mind giving this a quick review?

          Varun Thacker added a comment -

          Hi Shai,

          I'm taking a look at the patch now.

          Adrien Grand added a comment -

          Thank you Varun!

          Anshum Gupta added a comment -

          I read through the last couple of comments on the linked JIRA and the patch; it LGTM!

          Shai Erera added a comment -

          Thanks Anshum Gupta! Varun Thacker, I'll wait for your review as well.

          Ishan Chattopadhyaya added a comment -

          +1, LGTM.

          ASF subversion and git services added a comment -

          Commit 1725209 from Shai Erera in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1725209 ]

          SOLR-8561: Add fallback to ZkController.getLeaderProps for a mixed 5.4-pre-5.4 deployments

          Varun Thacker added a comment -

          I noticed something while testing out manually.

          When you run solr start -e cloud -noprompt, here are the outputs on Solr 5.3.0 and on the lucene_solr_5_4 branch:

          1. solr 5.3.0

          ~/solr-5.3.0 $ ls example/cloud/node1/
          logs	solr
          ~/solr-5.3.0 $ ls example/cloud/node1/solr/
          gettingstarted_shard1_replica2	gettingstarted_shard2_replica2	solr.xml			zoo.cfg				zoo_data
          

          2. lucene_solr_5_4 with the patch

          ~/apache-work/lucene_solr_5_4/solr $ ls  example/cloud/node1/
          logs		solr		solrzoo_data
          ~/apache-work/lucene_solr_5_4/solr $ ls  example/cloud/node1/solr
          gettingstarted_shard1_replica1	gettingstarted_shard2_replica1	solr.xml			zoo.cfg
          

          On the 5_4 branch the directory name seems to be concatenated, i.e. solrzoo_data instead of solr/zoo_data. It's unrelated to the patch, but it tripped me up.

          Here are the steps I used to try out the patch manually:

          • Started Solr 5.3.0 : ./bin/solr start -e cloud -noprompt
          • Node /collections/gettingstarted/leaders/shard1 exists
          • Stopped Solr
          • Ran Solr from the 5.4 branch with the patch applied, after copying over the example/cloud directory from 5.3 (with some manual changes because of the directory structure change mentioned above)
          • The gettingstarted collection came up healthy and a node /collections/gettingstarted/leaders/shard1/leader exists instead.

          So +1 to the patch. I'll file a Jira for the directory structure change in the cloud example.

          ASF subversion and git services added a comment -

          Commit 1725212 from Shai Erera in branch 'dev/branches/lucene_solr_5_4'
          [ https://svn.apache.org/r1725212 ]

          SOLR-8561: Add fallback to ZkController.getLeaderProps for a mixed 5.4-pre-5.4 deployments

          Shai Erera added a comment -

          Thanks Anshum Gupta, Ishan Chattopadhyaya and Varun Thacker for the review!

          And thanks Adrien Grand for being willing to wait until this is committed.

          I've committed to 5x and the 5.4 branch.

          Varun Thacker added a comment -

          Hi Shai,

          Shouldn't this be committed to trunk as well? What will happen if a user upgrades from a 5.x version to 6.0?

          Shai Erera added a comment -

          I thought that it should, but didn't for two reasons:

          1) On the original issue where this problem was introduced (SOLR-7844), writing the leader props to both "shard1" and "shard1/leader" was done in 5x only, suggesting that there was no intention for 6.0 to be backward compatible with 5.x in this regard.

          2) Mark Miller made this comment "Yeah, 5x needs a little bridge back compat that checks the old location if the new one does not exist." which again, made me think that there was no intention to support such a use case.

          Basically I agree with you: if someone upgrades from 5.3 straight to 6.0 and wants to do it as a rolling upgrade, they'll hit this issue. But I wasn't sure whether that scenario is intended to be supported. I'd be happy to port this fix to trunk as well if people think otherwise.

          Varun Thacker added a comment -

          "I'll file a Jira for the directory structure change in the cloud example"

          I created SOLR-8564 for the problem mentioned.

          Noble Paul added a comment -

          Looks fine to me.

          Why are we using the new org.apache.hadoop.fs.Path?

          Shai Erera added a comment -

          I did that because that's how ShardLeaderElectionContextBase did it (see runLeaderProcess()). It's only to extract the parent.
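
For readers unfamiliar with the idiom, "extracting the parent" of a ZK path means dropping its last segment. The sketch below shows that operation with plain string handling instead of org.apache.hadoop.fs.Path; it is a swapped-in illustration, not what the patch or ShardLeaderElectionContextBase does, and the class and method names are hypothetical.

```java
/** Illustrative only: parent extraction for absolute, slash-separated ZK-style paths. */
final class ZkPathParent {

    /**
     * Returns the parent of the given path, or "/" for a top-level node.
     * Rejects paths that are not absolute, are the root itself, or end with a slash.
     */
    static String parent(String path) {
        if (path == null || path.length() < 2 || !path.startsWith("/") || path.endsWith("/")) {
            throw new IllegalArgumentException("not a child path: " + path);
        }
        int idx = path.lastIndexOf('/');
        // idx == 0 means the node sits directly under the root.
        return idx == 0 ? "/" : path.substring(0, idx);
    }

    public static void main(String[] args) {
        // The new leader node from this issue; its parent is the old leader node.
        System.out.println(parent("/collections/gettingstarted/leaders/shard1/leader"));
    }
}
```

Applied to the new leader node, this yields the path that pre-5.4 nodes used for leader registration, which is why a parent lookup is enough for the fallback.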

          Mark Miller added a comment -

          +1, looks good.

          I would say we certainly do not support rolling upgrades over major versions where we don't even promise or deliver back compat, so we should not need this on trunk.

          Yonik Seeley added a comment -

          Do we have any tests for rolling upgrades? Should we?

          Enrico Hartung added a comment -

          Not sure whether this is related, but when doing a rolling upgrade from 5.3.2 to 5.4.1 leader election still fails with the following error:

          ERROR org.apache.solr.cloud.ShardLeaderElectionContext  [c:collection s:shard1 r:core_node1 x:collection_shard1_replica1] – There was a problem trying to register as the leader:org.apache.solr.common.SolrException: Could not register as the leader because creating the ephemeral registration node in ZooKeeper failed
              at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:214)
              at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:406)
              at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:198)
              at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:158)
              at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:59)
              at org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:389)
              at org.apache.solr.common.cloud.SolrZkClient$3$1.run(SolrZkClient.java:264)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:232)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
          Caused by: org.apache.zookeeper.KeeperException$NoChildrenForEphemeralsException: KeeperErrorCode = NoChildrenForEphemerals
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:117)
              at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:949)
              at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
              at org.apache.solr.common.cloud.SolrZkClient$11.execute(SolrZkClient.java:570)
              at org.apache.solr.common.cloud.SolrZkClient$11.execute(SolrZkClient.java:567)
              at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
              at org.apache.solr.common.cloud.SolrZkClient.multi(SolrZkClient.java:567)
              at org.apache.solr.cloud.ShardLeaderElectionContextBase$1.execute(ElectionContext.java:197)
              at org.apache.solr.common.util.RetryUtil.retryOnThrowable(RetryUtil.java:50)
              at org.apache.solr.common.util.RetryUtil.retryOnThrowable(RetryUtil.java:43)
              at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:179)
              ... 12 more

          Should I create a separate ticket for this?
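
As background on the error code in the trace: ZooKeeper does not allow an ephemeral znode to have children, and reports NoChildrenForEphemerals when a create is attempted under one. The toy model below illustrates that rule only. It is an assumption, not something the trace confirms, that the failure above comes from a 5.4.1 node trying to create shard1/leader while an ephemeral shard1 node written by a 5.3.2 leader still exists. The class, its map-based store, and its exception type are all hypothetical stand-ins; the model also skips ZooKeeper's parent-existence check.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of one ZooKeeper rule: ephemeral nodes cannot have children. Not a ZooKeeper client. */
final class EphemeralParentDemo {

    /** Stand-in for ZooKeeper's NoChildrenForEphemerals error. */
    static final class NoChildrenForEphemerals extends RuntimeException {
        NoChildrenForEphemerals(String parent) {
            super("parent is ephemeral: " + parent);
        }
    }

    // Maps each created path to whether it is ephemeral.
    private final Map<String, Boolean> ephemeralByPath = new HashMap<>();

    /** Creates a node, refusing if its direct parent was created as ephemeral. */
    void create(String path, boolean ephemeral) {
        int idx = path.lastIndexOf('/');
        String parent = idx <= 0 ? "/" : path.substring(0, idx);
        if (Boolean.TRUE.equals(ephemeralByPath.get(parent))) {
            throw new NoChildrenForEphemerals(parent);
        }
        ephemeralByPath.put(path, ephemeral);
    }

    public static void main(String[] args) {
        EphemeralParentDemo zk = new EphemeralParentDemo();
        // Assumed pre-5.4 layout: the leader node itself is ephemeral.
        zk.create("/collections/collection/leaders/shard1", true);
        try {
            // 5.4+ layout: the leader node is a child of shard1.
            zk.create("/collections/collection/leaders/shard1/leader", true);
        } catch (NoChildrenForEphemerals e) {
            System.out.println("create failed: " + e.getMessage());
        }
    }
}
```

If that assumption holds, the create can only succeed after the old ephemeral node is gone, for example once the pre-5.4 leader's session ends.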

          ASF subversion and git services added a comment -

          Commit a128bd36b26457c7686be8209d985d2753969766 in lucene-solr's branch refs/heads/branch_5_4 from Shai Erera
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a128bd3 ]

          SOLR-8561: Add fallback to ZkController.getLeaderProps for a mixed 5.4-pre-5.4 deployments

          git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_5_4@1725212 13f79535-47bb-0310-9956-ffa450edef68


            People

            • Assignee: Shai Erera
            • Reporter: Shai Erera
            • Votes: 0
            • Watchers: 9
