Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5.0
    • Fix Version/s: 0.6.0
    • Component/s: service/hbase
    • Labels:
      None

      Description

      Message from Geoff Black on the Github pull request [1]:

      I've updated the cdh services scripts and ZooKeeperClusterActionHandler.java to properly work with CDH3 when setting up an HBase cluster. Tested multiple times with 1 master + 1 region and also 1 master + 5 region on EC2.

      The only issue I ran into was previously documented in https://issues.apache.org/jira/browse/HBASE-1960 where the HBase Master shuts down after only one attempt to access DFS. This is something that should be addressed by the HBase team or a fix integrated by Cloudera into CDH.

      [1] https://github.com/apache/whirr/pull/1

      Attachments

      1. WHIRR-334-test.patch
        1 kB
        Bruno Dumon
      2. WHIRR-334-5.patch
        39 kB
        Bruno Dumon
      3. WHIRR-334-4.patch
        36 kB
        Bruno Dumon
      4. WHIRR-334-3.patch
        36 kB
        Bruno Dumon
      5. WHIRR-334-2.patch
        35 kB
        Bruno Dumon
      6. WHIRR-334.patch
        7 kB
        Andrei Savu
      7. WHIRR-334.patch
        10 kB
        Andrei Savu
      8. WHIRR-334.patch
        37 kB
        Andrei Savu
      9. WHIRR-334.patch
        38 kB
        Andrei Savu
      10. WHIRR-334.patch
        35 kB
        Bruno Dumon


          Activity

          Andrei Savu added a comment -

          I've just committed this. Thanks guys!

          Andrei Savu added a comment -

          +1

          Thanks Bruno for taking the time to fix the remaining issues. I'm planning to commit this tomorrow.

          Bruno Dumon added a comment -

          I updated the patch for CDH3u1:

          • configure_cdh_hbase: removed delayed_restart trick
           • configure_cdh_hbase: add hbase.zookeeper.recoverable.waittime (see the sketch after this list)
           • configure_cdh_hbase: install the daemon package after configuration is performed (otherwise it first starts against the default conf), and don't restart on Debian as it's not necessary.
           • install_cdh_zookeeper/configure_cdh_zookeeper: install the daemon package after configuration is performed. This aligns with how things are done in general, and otherwise it would expire clients' ZK sessions due to changed ZK server identities.
          • configure_cdh_zookeeper: the service is now also called 'hadoop-zookeeper-server' on rpm systems
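
           A minimal sketch of how such a property might be added from the configure script (the property name comes from the list above; the value, file path, and sed approach are illustrative assumptions, not necessarily what the patch does):

           # Hypothetical illustration only: add hbase.zookeeper.recoverable.waittime
           # to hbase-site.xml just before the closing tag. The value (600000 ms) and
           # the config path are assumptions for this sketch, not the patch contents.
           HBASE_SITE=/etc/hbase/conf/hbase-site.xml
           PROP='<property><name>hbase.zookeeper.recoverable.waittime</name><value>600000</value></property>'
           sed -i "s|</configuration>|  $PROP\n</configuration>|" "$HBASE_SITE"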

          I ran the integration tests (of CDH Hadoop, ZooKeeper and HBase) with everything default except for whirr.hardware-id=m1.large. These all ran successfully.

          Then I also ran the integration tests on Amazon's own Linux images, which are rpm based, using these properties:
          whirr.image-id=us-east-1/ami-221fec4b
          jclouds.ec2.ami-owners=137112412989
          whirr.login-user=ec2-user
          whirr.hardware-id=m1.large

           These tests also ran successfully.

           IMHO this patch is ready now. Once this is in, I can look into adjusting my other patches to trunk. Probably WHIRR-240 first?

          Bruno Dumon added a comment -

          CDH has done a new release and is now on HBase 0.90.3. The part of this patch which does the wait-on-hdfs should in theory not be needed anymore. I'll try this out tomorrow.

          Bruno Dumon added a comment -

           > I've updated the patch so that we can apply it using the patch command (Bruno please use git diff --no-prefix next time).

          ok, thanks for the tip.

          I've just done some more tests and noticed that even though all HBase processes started, there was (sometimes, especially with more nodes) a problem with actually using HBase. The configure script is such that it first installs the CDH daemon package (which starts HBase), then changes the configuration, and then restarts HBase. It appears HBase is confused by this configuration change. If I move the daemon package installation after the conf file changes, then it works.
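
           To illustrate the ordering change, a hedged sketch of the configure flow on Debian/Ubuntu (the helper name is an assumption for illustration; the actual script may differ):

           # Hypothetical sketch: write the HBase configuration first, and only then
           # install the daemon package, so the service starts against the final
           # configuration instead of the packaged defaults.
           write_hbase_site_xml                      # assumed helper writing /etc/hbase/conf/hbase-site.xml
           apt-get -y install hadoop-hbase-master    # CDH3 package; the master starts on install
           # no restart needed afterwards, since the daemon never ran with the default conf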

           I'm now able to run the integration tests with the default-selected images in the default (us) zone. I've run them three times in a row successfully. I did use whirr.hardware-id=m1.large. With the default-selected t1.micro, results seem to be less consistent (one run succeeded and one failed).

          I'll provide an updated patch tomorrow.

          Andrei Savu added a comment -

          I've updated the patch so that we can apply it using the patch command (Bruno please use git diff --no-prefix next time).

          +1 for committing it. It seems like it works most of the time and on the long term we should find a better way of handling the order of starting services.

          Bruno Dumon added a comment -

          I'll give it a try with the default image as well and report back (might not be before friday).

          Andrei Savu added a comment -

          I'm unable to get the integration tests to work just by applying the patch. The region server was not running. I'm testing using the default image selected by Whirr (imageId=us-east-1/ami-aef607c7, description=411009282317/RightImage_Ubuntu_10.04_x64_v5.6.8.1_EBS) and maybe this is the problem because I see no other difference.

          Bruno Dumon added a comment -

          Just learned that the first point in my previous comment should not be necessary due to WHIRR-314 (but not if WHIRR-339 is applied too since that patch dropped that setting). So basically the tests should have run with just the original patch applied.

          If you would still experience any failures, it would be helpful to have:

          /tmp/setup-*.sh
          /tmp/logs/*
          /tmp/jclouds*/*
          /var/logs/hbase/**

          and a 'ps aux | grep java'

          I provided the following extra properties when running the test:

          whirr.image-id=eu-west-1/ami-619ea915 (canonical 11.04 instance store EU)
          whirr.hardware-id=m1.large
          whirr.location-id=eu-west-1a

           and I ran this with a jclouds 1.1 snapshot, since jclouds 1.0 has a bug that prevents specifying the zone.

          Bruno Dumon added a comment -

          Hi Andrei, thanks for trying out my patches, much appreciated.

          I added a patch to make the tests work, it has two changes:

           • the delay loop in configure_cdh_hbase.sh also needs to be added before starting the thrift server. Otherwise the thrift server might start quite a bit earlier than the master, and it will only try to establish a connection with the master for a limited amount of time.
           • the instance-templates in whirr-hbase-test.properties: changed similarly to WHIRR-240. This change is actually not strictly necessary (the test runs successfully without it too), so you might want to leave it out. It would be needed, though, once WHIRR-339 gets in (will add a comment there).
          Andrei Savu added a comment -

           Bruno, are the integration tests working for you? I've also tried to run them today using the AMI packaged by Canonical with Ubuntu 10.04 LTS and I had to restart daemons by hand.

          Andrei Savu added a comment -

          The CDH HBase integration test still hangs for me when using the automatically selected Rightscale Ubuntu 10.04 AMI. Am I the only one seeing this? I will update the default image selection strategy to use a plain Canonical AMI.

          Bruno Dumon added a comment -

           I realized the wait loop could easily be adjusted to wait for at least one datanode to be up. Since we're wget'ing the hadoop web ui, we might as well grep it for the number of datanodes. I've adjusted the patch in this regard (WHIRR-334-5.patch).

          Tested on EC2.

           In summary, this patch contains the following:

           • the changes from the patch of June 24
           • fix for ZK service script name: https://github.com/bdumon/whirr/commit/0f9910439c2025240828b34af6442ebedd72bca2
           • HDFS-wait-loop before starting HBase: https://github.com/bdumon/whirr/commit/b160a0f8345524fcb9ddd5301550d9fa48b0b865

           Here's an example of the output of the HDFS wait loop:

          hadoop-hbase-master restart: waiting for HDFS to be available -- Fri Jul 15 08:32:38 UTC 2011
          hadoop-hbase-master restart: waiting for HDFS to be available -- Fri Jul 15 08:32:41 UTC 2011
          hadoop-hbase-master restart: waiting for HDFS to be available -- Fri Jul 15 08:32:44 UTC 2011
          hadoop-hbase-master restart: waiting for HDFS to be available -- Fri Jul 15 08:32:47 UTC 2011
          hadoop-hbase-master restart: waiting for HDFS to be available -- Fri Jul 15 08:32:50 UTC 2011
          hadoop-hbase-master restart: waiting for HDFS to be available -- Fri Jul 15 08:32:53 UTC 2011
          hadoop-hbase-master restart: waiting for HDFS to be available -- Fri Jul 15 08:32:56 UTC 2011
          hadoop-hbase-master restart: waiting for HDFS to be available -- Fri Jul 15 08:32:59 UTC 2011
          Live Datanodes : 2
          Restarting Hadoop HBase master daemon: no master to stop because no pid file /var/run/hbase/hbase-hbase-master.pid
          Starting Hadoop HBase master daemon: starting master, logging to /var/log/hbase/logs/hbase-hbase-master-ip-10-50-37-175.out
          hbase-master.
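
           For reference, a minimal sketch of this kind of HDFS wait loop (hypothetical; the host variable, port, and page markup are assumptions, and the committed script may differ):

           # Hypothetical sketch: poll the NameNode web UI (default port 50070 in CDH3)
           # until it reports at least one live datanode, then let the HBase master start.
           NN_HOST=${NN_HOST:-localhost}   # assumed variable name, for illustration only
           while true; do
             LIVE=$(wget -q -O - "http://$NN_HOST:50070/dfshealth.jsp" \
                      | sed 's/<[^>]*>/ /g' \
                      | grep -o 'Live Nodes *: *[0-9][0-9]*' | grep -o '[0-9][0-9]*$')
             if [ -n "$LIVE" ] && [ "$LIVE" -gt 0 ]; then
               echo "Live Datanodes : $LIVE"
               break
             fi
             echo "hadoop-hbase-master restart: waiting for HDFS to be available -- $(date)"
             sleep 3
           done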
          

           What I said in my previous comment about double Java processes can be ignored; the doubled one was not Java itself but "su mapred -s /usr/lib/jvm/java-6-sun/bin/java".

          Bruno Dumon added a comment - - edited

          I tried to launch a CDH hbase cluster on EC2 using this patch, and I had the opposite problem: the master was not running, the region servers were running.

          The master seemed to have exited because of this:

          2011-07-14 15:48:26,912 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase/hbase.version" - Aborting...
          2011-07-14 15:48:26,913 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
          org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /hbase/hbase.version could only be replicated to 0 nodes, instead of 1
                  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1469)
                  at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:649)
          

           I had the same problem over at WHIRR-240, but there the master survived it (possibly due to improved handling in newer HBase, or maybe due to timing differences). Of course, this was the original topic of this issue ("HBase Master shuts down after only one attempt to access DFS"), but evidently it's not enough for the namenode to be up; there need to be actual datanodes. Maybe I'll go for the ordered role startup after all.

          Also strange was that all hadoop Java processes (datanode, tasktracker) appeared double, as if they were started twice. Will look into this more tomorrow.

          Andrei Savu added a comment -

           The CDH HBase integration test is still failing for me (getting stuck). I've taken a look at the servers and it seems like the HMaster is running but the regionserver is not. Should we make the regionserver check that the master is running before starting (I guess this is the problem)?

          Andrei Savu added a comment -

           Looks good to me! Thanks Bruno for taking the time to work on this. I will test it now and if everything works I will commit. Tom, Adrian, any feedback?

          Bruno Dumon added a comment -

          Patch update: added comment explaining the wait-for-namenode loop.

          Bruno Dumon added a comment -

          With the current state of the patch, I'm able to successfully launch a CDH HBase cluster.

          Bruno Dumon added a comment -

          Patch update: added waiting for namenode availability before starting hbase master/regionserver.

          Bruno Dumon added a comment -

           I found another issue with the configure_cdh_zookeeper.sh script: the service name for zookeeper is actually different in the RPM and Debian packages; therefore, I changed it like this:

           if [ -f /etc/init.d/hadoop-zookeeper ]; then
             service hadoop-zookeeper restart
           else
             service hadoop-zookeeper-server restart
           fi

          Otherwise the patch is the same as the one from June 24.

          Bruno Dumon added a comment -

          Adds fix for zookeeper service name

          Tom White added a comment -

          > I believe we should commit this one before WHIRR-294.

          OK

          > Should we consider fixing WHIRR-221 so that we can have a more predictable cluster launch process?

          It sounds like that would be useful in this case. Really we want a DAG of dependencies, not just a list, but a list is probably good enough to start with.

          Andrei Savu added a comment -

          Should we consider fixing WHIRR-221 so that we can have a more predictable cluster launch process?

          Andrei Savu added a comment -

           Tom, I believe we should commit this one before WHIRR-294. It fixes some existing bugs.

          Andrei Savu added a comment -

           Lars, is it possible to make it retry forever? How about using a watchdog process?

          Andrei Savu added a comment -

           I've been able to track this down and it's also related to HBASE-1960 - it seems like the region server shuts down if it's unable to connect to HDFS. I will look for a workaround.

          Andrei Savu added a comment -

          Updated patch to fix the following issues:

           • naming inconsistency as discussed on the email list (whirr.hadoop-install-function renamed to whirr.hadoop.install-function etc.; see the example after this list)
          • fixed CDH test .properties files
          • fixed ZooKeeper install / configure scripts
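
           To illustrate the renaming in the first item above (the property value shown is an assumption for illustration, not taken from the patch):

           # old, inconsistent key
           whirr.hadoop-install-function=install_cdh_hadoop
           # new dotted form
           whirr.hadoop.install-function=install_cdh_hadoop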

           Unfortunately it's not yet ready. It seems like sometimes the region server does not start as expected. Any ideas? I will keep on debugging this.

          Andrei Savu added a comment -

           Attached an updated version of the patch that addresses some of the minor issues I've noticed. Unfortunately the CDH integration tests are still failing. I will investigate more later today.

          Andrei Savu added a comment -

          I've started a cluster using the provided recipe (an updated version) and everything seems to be working as expected: I've been able to create a table.

          Andrei Savu added a comment -

           Nit: hbase-ec-cdh.properties needs to specify a whirr.cluster-name, and location-id should probably be us-east-1 (testing on trunk with jclouds 1.0.0).

          Andrei Savu added a comment -

          I believe that there is still some work that needs to be done in order to make the integration tests pass.

          Andrei Savu added a comment -

          Created patch from the pull request. Not tested yet. It looks like it applies cleanly on branch-0.5 and trunk.


            People

            • Assignee: Bruno Dumon
            • Reporter: Andrei Savu
            • Votes: 1
            • Watchers: 4
