
IMPALA-4733: HBase/Zookeeper continues to be flaky when starting the minicluster on RHEL7


      Description

      HBase startup continues to be problematic on RHEL7. (We recently beefed up our error checking with IMPALA-4684, but this didn't address the underlying flakiness.)

      An example from a recent failure shows both Zookeeper flakiness (note the connection loss/retry) and a failure to launch HBase:

      21:31:20 Connecting to Zookeeper host(s).
      21:31:27 Success: <kazoo.client.KazooClient object at 0x13e2590>
      21:31:27 Waiting for HBase node: /hbase/master
      21:31:33 No handlers could be found for logger "kazoo.client"
      21:31:33 Zookeeper connection loss: retrying connection (1 of 3 attempts)
      21:31:34 Stopping Zookeeper client
      21:31:39 Connecting to Zookeeper host(s).
      21:31:46 Success: <kazoo.client.KazooClient object at 0x7f49109b3210>
      21:31:46 Waiting for HBase node: /hbase/master
      21:31:51 Waiting for HBase node: /hbase/master
      21:31:55 Waiting for HBase node: /hbase/master
      21:31:56 Waiting for HBase node: /hbase/master
      21:32:04 Waiting for HBase node: /hbase/master
      21:32:09 Waiting for HBase node: /hbase/master
      21:32:28 Waiting for HBase node: /hbase/master
      21:32:28 Waiting for HBase node: /hbase/master
      21:32:28 Failed while checking for HBase node: /hbase/master
      21:32:28 Waiting for HBase node: /hbase/rs
      21:32:28 Success: /hbase/rs
      21:32:28 Stopping Zookeeper client
      21:32:28 Could not get one or more nodes. Exiting with errors: 1
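
      For context, the output above comes from a Python check that uses the kazoo client to wait for HBase's znodes in Zookeeper. The following is a minimal sketch of that kind of check, not the actual Impala minicluster script; the host string, timeouts, and retry structure are assumptions for illustration.

      import sys
      import time

      from kazoo.client import KazooClient
      from kazoo.exceptions import ConnectionLoss

      ZK_HOSTS = "localhost:2181"       # assumed minicluster Zookeeper address
      HBASE_NODES = ["/hbase/master", "/hbase/rs"]
      CONNECT_ATTEMPTS = 3              # matches "(1 of 3 attempts)" in the log above
      NODE_TIMEOUT_SEC = 60             # assumed per-node wait budget

      def wait_for_node(zk, path, timeout_sec):
          """Poll Zookeeper until `path` exists or the timeout expires."""
          deadline = time.time() + timeout_sec
          while time.time() < deadline:
              print("Waiting for HBase node: %s" % path)
              if zk.exists(path):
                  print("Success: %s" % path)
                  return True
              time.sleep(2)
          print("Failed while checking for HBase node: %s" % path)
          return False

      def check_hbase_nodes():
          for attempt in range(1, CONNECT_ATTEMPTS + 1):
              print("Connecting to Zookeeper host(s).")
              zk = KazooClient(hosts=ZK_HOSTS)
              try:
                  zk.start()
                  print("Success: %s" % zk)
                  # Check every node, even if an earlier one failed, as the log shows.
                  return all([wait_for_node(zk, p, NODE_TIMEOUT_SEC) for p in HBASE_NODES])
              except ConnectionLoss:
                  print("Zookeeper connection loss: retrying connection "
                        "(%d of %d attempts)" % (attempt, CONNECT_ATTEMPTS))
              finally:
                  print("Stopping Zookeeper client")
                  zk.stop()
          return False

      if __name__ == "__main__":
          if not check_hbase_nodes():
              print("Could not get one or more nodes. Exiting with errors: 1")
              sys.exit(1)

      In the failing run above, the second connection attempt succeeded, but /hbase/master never appeared before the timeout, so the check exited with an error even though /hbase/rs was present.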
      

      When I look at the end of the output from the HBase master, it ends here:

      17/01/04 21:32:17 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /127.0.0.1:51485, server: localhost/127.0.0.1:2181
      17/01/04 21:32:23 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x1596d1c12ee0009, negotiated timeout = 90000
      17/01/04 21:32:23 INFO client.ZooKeeperRegistry: ClusterId read in ZooKeeper is null
      17/01/04 21:32:23 INFO master.ActiveMasterManager: Deleting ZNode for /hbase/backup-masters/localhost,60000,1483594276304 from backup master directory
      

      If we look at a successful HBase startup from another test run, the output looks like this:

      16/12/29 20:27:00 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /127.0.0.1:59932, server: localhost/127.0.0.1:2181
      16/12/29 20:27:00 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x1594dfb07910002, negotiated timeout = 90000
      16/12/29 20:27:00 INFO client.ZooKeeperRegistry: ClusterId read in ZooKeeper is null
      16/12/29 20:27:01 INFO util.FSUtils: Created version file at hdfs://localhost:20500/hbase with version=8
      16/12/29 20:27:02 INFO master.MasterFileSystem: BOOTSTRAP: creating hbase:meta region
      16/12/29 20:27:02 INFO regionserver.HRegion: creating HRegion hbase:meta HTD == 'hbase:meta', {TABLE_ATTRIBUTES => {IS_META => 'true', coprocessor$1 => '|org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint|536870911|'}, {NAME => 'info', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '10', TTL => 'FOREVER', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '8192', IN_MEMORY => 'false', BLOCKCACHE => 'false'} RootDir = hdfs://localhost:20500/hbase Table name == hbase:meta
      16/12/29 20:27:02 INFO hfile.CacheConfig: blockCache=LruBlockCache{blockCount=0, currentSize=6566880, freeSize=6396457504, maxSize=6403024384, heapSize=6566880, minSize=6082873344, minFactor=0.95, multiSize=3041436672, multiFactor=0.5, singleSize=1520718336, singleFactor=0.25}, cacheDataOnRead=false, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false
      [etc...]
      
      Attachments

      1. hbase_logs.tgz (50 kB, David Knupp)


          Activity

          David Knupp added a comment -

          So, unfortunately we only seem to have logs from the latest failure, not prior ones. However, a quick scan shows this error in the Zookeeper output:

          17/01/04 21:31:35 ERROR server.NIOServerCnxn: Unexpected Exception: 
          java.nio.channels.CancelledKeyException
          	at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
          	at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
          	at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
          	at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1075)
          	at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
          	at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:138)
          

          The error seems to be recoverable to an extent. For example, I see it in the saved logs from other RHEL7 test runs, even ones where HBase started successfully. Notably, however, I don't see it in the log files from other platforms. I wonder if there's something about this platform that makes it especially susceptible to hitting this exception?

          This is just an observation. It may or may not turn out to be a red herring.

          Lars Volker added a comment -

          I think I've seen this again in a private Jenkins run. Please ping me and I can provide access to the artifacts.

          David Knupp added a comment -

          Lars Volker, I just added a comment to ZOOKEEPER-2044 to see if someone can help confirm whether this is that bug.

          David Knupp added a comment -

          This seems to be a duplicate of ZOOKEEPER-2044. A patch has been submitted, but is still in review.

          Lars Volker added a comment -

          Reopening, since ZOOKEEPER-2044 does not seem to address the underlying issue (but merely improves the error reporting) and we expect it to show up again on RHEL7.

          Lars Volker added a comment -

          Silvius Rus, Dan Hecht - I'm downgrading this to a P2: it only affects RHEL7, it only affects the test infrastructure (i.e. it makes the tests red, but because of an issue with how we start HBase, not because of a product defect), and it occurs very infrequently. Please let me know if you want us to keep tracking this as a P1.

          Lars Volker added a comment -

          IMPALA-4733: Change HBase ports to non-ephemeral

          We've seen repeated test failures because HBase tries to bind to ports
          in the ephemeral port range, which sometimes would already be occupied
          by outgoing connections of other processes.

          This change switches the ports to the new default HBase ports
          (HBASE-10123):

          HBase Master Port: 60000 -> 16000
          HBase Master Web UI Port: 60010 -> 16010
          HBase RegionServer Port: 60020 -> 16020
          HBase RegionServer Web UI Port: 60030 -> 16030
          HBase Status Multicast Port: 60100 -> 16100

          This made it necessary to change the default KMS port, too
          (HADOOP-12811):

          KMS HTTP port: 16000 -> 9600

          Change-Id: I6f8af325e34b6e352afd75ce5ddd2446ce73d857
          Reviewed-on: http://gerrit.cloudera.org:8080/6524
          Reviewed-by: Lars Volker <lv@cloudera.com>
          Tested-by: Impala Public Jenkins
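
          To illustrate why ephemeral-range ports are a problem (this sketch is not part of the change itself): on Linux, the kernel allocates client-side ports for outgoing connections from the range in /proc/sys/net/ipv4/ip_local_port_range, which commonly starts at 32768 and extends past 60000, so a daemon trying to listen on 60000-60100 can find its port already taken by an outgoing connection. The new 160xx ports sit below that range, and moving KMS to 9600 avoids a clash with the new HBase master port 16000. The hypothetical helper below simply reports which of a set of ports falls inside the local machine's ephemeral range.

          # Hypothetical helper, not part of the Impala change: report which
          # configured ports fall inside the kernel's ephemeral port range.

          OLD_PORTS = {
              "hbase_master": 60000,
              "hbase_master_ui": 60010,
              "hbase_regionserver": 60020,
              "hbase_regionserver_ui": 60030,
              "hbase_status_multicast": 60100,
          }
          NEW_PORTS = {
              "hbase_master": 16000,
              "hbase_master_ui": 16010,
              "hbase_regionserver": 16020,
              "hbase_regionserver_ui": 16030,
              "hbase_status_multicast": 16100,
          }

          def ephemeral_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
              """Return the (low, high) ephemeral port range configured in the kernel."""
              with open(path) as f:
                  low, high = f.read().split()
              return int(low), int(high)

          def report(label, ports):
              low, high = ephemeral_range()
              print("%s (ephemeral range %d-%d):" % (label, low, high))
              for name, port in sorted(ports.items()):
                  clash = low <= port <= high
                  print("  %-25s %5d  %s" % (name, port,
                                             "COLLISION RISK" if clash else "ok"))

          if __name__ == "__main__":
              report("Old HBase ports", OLD_PORTS)
              report("New HBase ports", NEW_PORTS)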

          Taras Bobrovytsky added a comment -

          A similar issue happened again.

          Taras Bobrovytsky added a comment -

          The new HBase issue we are seeing is different from this one. Opened IMPALA-5223.


            People

            • Assignee: Lars Volker
            • Reporter: David Knupp
            • Votes: 0
            • Watchers: 5
