Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: service/hbase
    • Labels: None

      Description

      We should support HBase 0.92. The integration test will need changing since the Thrift interface has changed between HBase 0.90 and 0.92.

      Attachments

      1. WHIRR-525.patch
        51 kB
        Andrew Bayer
      2. WHIRR-525.patch
        52 kB
        Andrew Bayer
      3. WHIRR-525.patch
        52 kB
        Andrew Bayer
      4. WHIRR-525.patch
        56 kB
        Tom White
      5. WHIRR-525.patch
        56 kB
        Tom White
      6. WHIRR-525.patch
        56 kB
        Tom White
      7. WHIRR-525.patch
        25 kB
        Tom White
      8. WHIRR-525-zk-retry.patch
        2 kB
        Andrew Bayer
      9. WHIRR-525-zk-retry.patch
        0.7 kB
        Andrew Bayer

        Activity

        Tom White added a comment -

        This patch upgrades HBase to 0.92.0. Tests for 0.89 and 0.90 have been removed since the Thrift API changed; however, these versions will still work, since the HBase service code is unchanged.

        I ran HBase and CDH HBase integration tests successfully.

        Andrei Savu added a comment -

        Looks good, but I don't think it's a good idea to remove the integration tests for 0.89 and 0.90 as long as those versions are still relevant. What if we create a subproject just for the old tests, so we can keep using the previous Thrift API?

        Tom White added a comment -

        Thanks for the review Andrei. Here's a new patch which includes integration tests for 0.90.5 in a separate Maven module. I didn't include 0.89 since it isn't available from the mirrors any more (http://www.apache.org/dyn/closer.cgi/hbase/).
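
        For illustration only, the new module would be hooked into the build roughly as sketched below; the module path "services/hbase-0.90" is hypothetical, since the actual name is defined by the attached patch rather than by this issue text.

        <!-- Sketch of the top-level pom.xml module list (module name is assumed) -->
        <modules>
          <!-- other service modules -->
          <module>services/hbase</module>
          <!-- separate module holding the 0.90.x integration tests, built against the old Thrift API -->
          <module>services/hbase-0.90</module>
        </modules>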

        Andrei Savu added a comment -

        +1

        Andrei Savu added a comment -

        I'm going to commit this now.

        Andrei Savu added a comment -

        I've tried a few times to run this, but it always seems to get stuck after the following log lines:

        2012-02-29 01:19:06,150 INFO  [org.apache.whirr.service.hbase.integration.HBaseServiceController] (main) Waiting for master...
        .Warning: Permanently added 'ec2-107-21-87-208.compute-1.amazonaws.com,107.21.87.208' (RSA) to the list of known hosts.
        ..2012-02-29 01:19:09,749 INFO  [org.apache.whirr.service.hbase.integration.HBaseServiceController] (main) Connected to thrift server.
        2012-02-29 01:19:09,749 INFO  [org.apache.whirr.service.hbase.integration.HBaseServiceController] (main) Waiting for .META. table...
        

        Any ideas?

        Tom White added a comment -

        That's odd; it worked when I ran it. I'll try it again.

        Tom White added a comment -

        On retrying I hit the same problem, so I updated the patch so that it no longer uses the Apache archive, and the integration test passed.

        Andrei Savu added a comment -

        Now I'm getting:

        [ERROR] Failed to execute goal on project whirr-hbase: Could not resolve dependencies for project org.apache.whirr:whirr-hbase:bundle:0.8.0-SNAPSHOT: Failed to collect dependencies for [org.apache.whirr:whirr-core:jar:0.8.0-SNAPSHOT (compile), org.apache.whirr:whirr-core:jar:tests:0.8.0-SNAPSHOT (test), org.apache.whirr:whirr-hadoop:jar:0.8.0-SNAPSHOT (compile), org.apache.whirr:whirr-zookeeper:jar:0.8.0-SNAPSHOT (compile), junit:junit:jar:4.8.1 (test), org.hamcrest:hamcrest-all:jar:1.1 (test), commons-configuration:commons-configuration:jar:1.7 (compile), org.slf4j:slf4j-api:jar:1.6.3 (compile), org.slf4j:slf4j-log4j12:jar:1.6.3 (test), com.jcraft:jsch:jar:0.1.44-1 (compile), log4j:log4j:jar:1.2.16 (test), org.apache.zookeeper:zookeeper:jar:3.3.1 (test), dnsjava:dnsjava:jar:2.1.1 (compile), org.apache.hadoop:hadoop-core:jar:0.20.205.0 (test), org.apache.hbase:hbase:jar:0.92.0-SNAPSHOT (test), org.apache.hbase:hbase:jar:tests:0.92.0-SNAPSHOT (test)]: Failed to read artifact descriptor for org.apache.hbase:hbase:jar:0.92.0-SNAPSHOT: Could not transfer artifact org.apache.hbase:hbase:pom:0.92.0-SNAPSHOT from/to cloudera (https://repository.cloudera.com/content/repositories/releases/): Error transferring file: Server returned HTTP response code: 409 for URL: https://repository.cloudera.com/content/repositories/releases/org/apache/hbase/hbase/0.92.0-SNAPSHOT/hbase-0.92.0-SNAPSHOT.pom -> [Help 1]
        
        Andrei Savu added a comment - edited

        I have changed the dependency from 0.92.0-SNAPSHOT to 0.92.0 (as available on Maven Central) and it seems to build as expected. I still have some issues running the integration tests on aws-ec2, but I suspect that's because I'm on a slow connection.
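
        For reference, a sketch of the corresponding change in the whirr-hbase POM; the groupId, artifactId, classifier and scope are taken from the Maven error above, while the surrounding POM layout is assumed.

        <!-- HBase test dependencies, pinned to the 0.92.0 release on Maven Central
             instead of the unpublished 0.92.0-SNAPSHOT -->
        <dependency>
          <groupId>org.apache.hbase</groupId>
          <artifactId>hbase</artifactId>
          <version>0.92.0</version>
          <scope>test</scope>
        </dependency>
        <dependency>
          <groupId>org.apache.hbase</groupId>
          <artifactId>hbase</artifactId>
          <version>0.92.0</version>
          <classifier>tests</classifier>
          <scope>test</scope>
        </dependency>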

        Tom White added a comment -

        I've updated the patch to HBase 0.92.0 (from SNAPSHOT) and it compiles and passes the integration test for me. Andrei, are you OK with this going in now?

        Andrei Savu added a comment -

        Re-testing now on a better connection.

        Andrei Savu added a comment -

        Still stuck for me at "Waiting for .META. table..." - Can someone else test this patch? I will check the logs tomorrow.

        Tom White added a comment -

        I managed to reproduce this once, and looking at the logs it appears that the region server failed to start:

        2012-03-09 05:17:47,116 INFO org.apache.hadoop.hbase.util.RetryCounter: The 3 times to retry  after sleeping 8000 ms
        2012-03-09 05:17:48,126 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server domU-12-31-39-02-BC-72.compute-1.internal/10.248.195.128:2181
        2012-03-09 05:17:48,127 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to domU-12-31-39-02-BC-72.compute-1.internal/10.248.195.128:2181, initiating session
        2012-03-09 05:17:48,133 WARN org.apache.zookeeper.ClientCnxnSocket: Connected to an old server; r-o mode will be unavailable
        2012-03-09 05:17:48,133 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server domU-12-31-39-02-BC-72.compute-1.internal/10.248.195.128:2181, sessionid = 0x135f5e431c50002, negotiated timeout = 40000
        2012-03-09 05:17:58,136 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
        2012-03-09 05:17:58,140 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server domU-12-31-39-03-B9-B2.compute-1.internal,60020,1331270241403: Initialization of RS failed.  Hence aborting RS.
        java.io.IOException: Received the shutdown message while waiting.
        	at org.apache.hadoop.hbase.regionserver.HRegionServer.blockAndCheckIfStopped(HRegionServer.java:587)
        	at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:556)
        	at org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:524)
        	at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:625)
        	at java.lang.Thread.run(Thread.java:636)
        

        The odd thing is that I can't issue an ruok to the ZK node from within the cluster (it connects but returns nothing), whereas I get imok from outside.

        This is a sporadic failure which seems to be 0.92-related, so I'm inclined to commit this and have a follow-up JIRA to fix it. What do you think? (Having this in will help progress WHIRR-391, since there is overlap with some of the CDH refactoring I'm doing there.)

        Tom White added a comment -

        I opened WHIRR-552 and attached the logs from a failed attempt there. HBase experts feel free to take a look!

        Andrei Savu added a comment -

        Adding to the roadmap for 0.8.0 & 0.7.2 - we need to find a way to make the deployment reliable.

        Karel Vervaeke added a comment -

        The top 3 most likely culprits are (IMO):

        • DNS issue
        • Firewall issue
        • Timing issue

        My money's on a timing issue.
        What instance-templates are you using? Does it occur in a single-node cluster, or also when using multiple nodes?

        Amandeep Khurana added a comment -

        It's a timing issue. I looked at the RS logs; they died because the master wasn't up. I started them manually and HBase came up fine.

        Andrew Bayer added a comment -

        I just saw the timing issue with the CDH HBase test as well, due to ZK not being up yet in this case.

        Andrew Bayer added a comment -

        Looks like the ZK issue is that "hbase.zookeeper.recoverable.waittime" is no longer used in HBase 0.92. Instead, we need to set "zookeeper.recovery.retry" and bump it above the default of 3. I'm trying 20 now, but will probably boost that further.
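
        A minimal sketch of the corresponding hbase-site.xml override, using the value of 20 mentioned above; how the property is actually injected by the patch is not shown here.

        <!-- HBase 0.92 ignores hbase.zookeeper.recoverable.waittime; raise the
             ZooKeeper connection retry count instead (the default is 3) -->
        <property>
          <name>zookeeper.recovery.retry</name>
          <value>20</value>
        </property>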

        Andrew Bayer added a comment -

        This patch is to be applied alongside the existing WHIRR-525.patch. Even if WHIRR-525.patch isn't applied, we need this one for CDH4 HBase to actually work properly, and it shouldn't have any compatibility issues with pre-0.92 versions of HBase.

        Andrew Bayer added a comment -

        New version of ZK retry patch that also re-enables the CDH HBase live test.

        Adrian Cole added a comment -

        The original patch needs to be recut.

        Andrew Bayer added a comment -

        That is true. I think Tom needs to handle that, though.

        Andrew Bayer added a comment -

        The tweaked patch now applies cleanly and contains my ZK retry patch as well. It builds fine; running tests now.

        Andrew Bayer added a comment -

        Fixed the tarball URLs for now. Still seeing the timing problem between the HBase master and region server, but I'm not sure what we should do about that.

        Andrew Bayer added a comment -

        Looks like the region server issue may be HBASE-5849, which is fixed in the eventual 0.92.2 and in 0.94. Should we try switching to 0.94, or should we just make sure we move to 0.92.2 as soon as it's available?

        Andrew Bayer added a comment -

        0.94 would require yet more code changes, so I'm trying an old RC of 0.92.2 to see if that works.

        Andrew Bayer added a comment -

        Verified that all tests pass with the old (June 1) 0.92.2 RC, so I think we're best just committing this, doc'ing the issue, and bumping to 0.92.2 as soon as it's out.

        Andrew Bayer added a comment -

        Latest patch removes some defunct test properties files from services/hbase.

        Tom White added a comment -

        > we're best just committing this, doc'ing the issue, and bumping to 0.92.2 as soon as it's out.

        I agree. It could be a while before 0.92.2 comes out. So +1 to committing now. Do you want to add a sentence to src/site/xdoc/known-limitations.xml?
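
        As an illustration of the kind of note being suggested (the wording actually committed is not shown in this issue), an xdoc entry might look like:

        <!-- src/site/xdoc/known-limitations.xml (illustrative wording only) -->
        <p>
          HBase 0.92.0 clusters may occasionally fail to start all region servers
          because of a race between the master and the region servers (HBASE-5849).
          This is expected to be fixed in HBase 0.92.2.
        </p>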

        Andrew Bayer added a comment -

        Committed, with note in known_limitations.xml.


          People

          • Assignee: Andrew Bayer
          • Reporter: Tom White
          • Votes: 0
          • Watchers: 5
