
IMPALA-4733: HBase/Zookeeper continues to be flaky when starting the minicluster on RHEL7


      Description

      HBase startup continues to be problematic on RHEL7. (We recently beefed up our error checking with IMPALA-4684, but this didn't address the underlying flakiness.)

      An example from a recent failure shows both Zookeeper flakiness (note the connection loss/retry) and a failure to launch HBase:

      21:31:20 Connecting to Zookeeper host(s).
      21:31:27 Success: <kazoo.client.KazooClient object at 0x13e2590>
      21:31:27 Waiting for HBase node: /hbase/master
      21:31:33 No handlers could be found for logger "kazoo.client"
      21:31:33 Zookeeper connection loss: retrying connection (1 of 3 attempts)
      21:31:34 Stopping Zookeeper client
      21:31:39 Connecting to Zookeeper host(s).
      21:31:46 Success: <kazoo.client.KazooClient object at 0x7f49109b3210>
      21:31:46 Waiting for HBase node: /hbase/master
      21:31:51 Waiting for HBase node: /hbase/master
      21:31:55 Waiting for HBase node: /hbase/master
      21:31:56 Waiting for HBase node: /hbase/master
      21:32:04 Waiting for HBase node: /hbase/master
      21:32:09 Waiting for HBase node: /hbase/master
      21:32:28 Waiting for HBase node: /hbase/master
      21:32:28 Waiting for HBase node: /hbase/master
      21:32:28 Failed while checking for HBase node: /hbase/master
      21:32:28 Waiting for HBase node: /hbase/rs
      21:32:28 Success: /hbase/rs
      21:32:28 Stopping Zookeeper client
      21:32:28 Could not get one or more nodes. Exiting with errors: 1
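
      For context, the output above comes from a Python check that uses the kazoo client to wait for HBase's znodes in Zookeeper. The following is a minimal sketch of that kind of check, not the actual Impala minicluster script; the host string, timeouts, and retry structure are assumptions for illustration.

      import sys
      import time

      from kazoo.client import KazooClient
      from kazoo.exceptions import ConnectionLoss

      ZK_HOSTS = "localhost:2181"       # assumed minicluster Zookeeper address
      HBASE_NODES = ["/hbase/master", "/hbase/rs"]
      CONNECT_ATTEMPTS = 3              # matches "(1 of 3 attempts)" in the log above
      NODE_TIMEOUT_SEC = 60             # assumed per-node wait budget

      def wait_for_node(zk, path, timeout_sec):
          """Poll Zookeeper until `path` exists or the timeout expires."""
          deadline = time.time() + timeout_sec
          while time.time() < deadline:
              print("Waiting for HBase node: %s" % path)
              if zk.exists(path):
                  print("Success: %s" % path)
                  return True
              time.sleep(2)
          print("Failed while checking for HBase node: %s" % path)
          return False

      def check_hbase_nodes():
          for attempt in range(1, CONNECT_ATTEMPTS + 1):
              print("Connecting to Zookeeper host(s).")
              zk = KazooClient(hosts=ZK_HOSTS)
              try:
                  zk.start()
                  print("Success: %s" % zk)
                  # Check every node, even if an earlier one failed, as the log shows.
                  return all([wait_for_node(zk, p, NODE_TIMEOUT_SEC) for p in HBASE_NODES])
              except ConnectionLoss:
                  print("Zookeeper connection loss: retrying connection "
                        "(%d of %d attempts)" % (attempt, CONNECT_ATTEMPTS))
              finally:
                  print("Stopping Zookeeper client")
                  zk.stop()
          return False

      if __name__ == "__main__":
          if not check_hbase_nodes():
              print("Could not get one or more nodes. Exiting with errors: 1")
              sys.exit(1)

      In the failing run above, the second connection attempt succeeded, but /hbase/master never appeared before the timeout, so the check exited with an error even though /hbase/rs was present.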
      

      When I look at the end of the output from the HBase master, it ends here:

      17/01/04 21:32:17 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /127.0.0.1:51485, server: localhost/127.0.0.1:2181
      17/01/04 21:32:23 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x1596d1c12ee0009, negotiated timeout = 90000
      17/01/04 21:32:23 INFO client.ZooKeeperRegistry: ClusterId read in ZooKeeper is null
      17/01/04 21:32:23 INFO master.ActiveMasterManager: Deleting ZNode for /hbase/backup-masters/localhost,60000,1483594276304 from backup master directory
      

      If we look at a successful HBase startup from another test run, the output looks like this:

      16/12/29 20:27:00 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /127.0.0.1:59932, server: localhost/127.0.0.1:2181
      16/12/29 20:27:00 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x1594dfb07910002, negotiated timeout = 90000
      16/12/29 20:27:00 INFO client.ZooKeeperRegistry: ClusterId read in ZooKeeper is null
      16/12/29 20:27:01 INFO util.FSUtils: Created version file at hdfs://localhost:20500/hbase with version=8
      16/12/29 20:27:02 INFO master.MasterFileSystem: BOOTSTRAP: creating hbase:meta region
      16/12/29 20:27:02 INFO regionserver.HRegion: creating HRegion hbase:meta HTD == 'hbase:meta', {TABLE_ATTRIBUTES => {IS_META => 'true', coprocessor$1 => '|org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint|536870911|'}, {NAME => 'info', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '10', TTL => 'FOREVER', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '8192', IN_MEMORY => 'false', BLOCKCACHE => 'false'} RootDir = hdfs://localhost:20500/hbase Table name == hbase:meta
      16/12/29 20:27:02 INFO hfile.CacheConfig: blockCache=LruBlockCache{blockCount=0, currentSize=6566880, freeSize=6396457504, maxSize=6403024384, heapSize=6566880, minSize=6082873344, minFactor=0.95, multiSize=3041436672, multiFactor=0.5, singleSize=1520718336, singleFactor=0.25}, cacheDataOnRead=false, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false
      [etc...]
      
      Attachments

      1. hbase_logs.tgz (50 kB, David Knupp)


          Activity

          David Knupp added a comment -

          So, unfortunately we only seem to have logs from the latest failure, not prior ones. However, a quick scan shows this error in the Zookeeper output:

          17/01/04 21:31:35 ERROR server.NIOServerCnxn: Unexpected Exception: 
          java.nio.channels.CancelledKeyException
          	at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
          	at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
          	at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
          	at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1075)
          	at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
          	at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:138)
          

          The error seems to be recoverable to an extent. For example, I see it in the saved logs from other RHEL7 test runs, even ones where HBase started successfully. Notably, however, I don't see it in the log files from other platforms. I wonder if there's something about this platform that makes it especially susceptible to hitting this exception?

          This is just an observation. It may or may not turn out to be a red herring.

          Lars Volker added a comment -

          I think I've seen this again in a private Jenkins run. Please ping me and I can provide access to the artifacts.

          David Knupp added a comment -

          Lars Volker, I just added a comment to ZOOKEEPER-2044 to see if someone can help confirm whether this is that bug.

          David Knupp added a comment -

          This seems to be a duplicate of ZOOKEEPER-2044. A patch has been submitted, but is still in review.

          Lars Volker added a comment -

          Reopening, since ZOOKEEPER-2044 does not seem to address the underlying issue (but merely improves the error reporting) and we expect it to show up again on RHEL7.

          Lars Volker added a comment -

          Silvius Rus, Dan Hecht - I'm downgrading this to a P2: it only affects RHEL7, it only affects the test infrastructure (i.e. it makes the tests red, but because of an issue with how we start HBase, not because of a product defect), and it occurs very infrequently. Please let me know if you want us to keep tracking this as a P1.

          Lars Volker added a comment -

          IMPALA-4733: Change HBase ports to non-ephemeral

          We've seen repeated test failures because HBase tries to bind to ports
          in the ephemeral port range, which sometimes would already be occupied
          by outgoing connections of other processes.

          This change switches the ports to the new default HBase ports
          (HBASE-10123):

          HBase Master Port: 60000 -> 16000
          HBase Master Web UI Port: 60010 -> 16010
          HBase RegionServer Port: 60020 -> 16020
          HBase RegionServer Web UI Port: 60030 -> 16030
          HBase Status Multicast Port: 60100 -> 16100

          This made it necessary to change the default KMS port, too
          (HADOOP-12811):

          KMS HTTP port: 16000 -> 9600

          Change-Id: I6f8af325e34b6e352afd75ce5ddd2446ce73d857
          Reviewed-on: http://gerrit.cloudera.org:8080/6524
          Reviewed-by: Lars Volker <lv@cloudera.com>
          Tested-by: Impala Public Jenkins
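
          To illustrate why ephemeral-range ports are a problem (this sketch is not part of the change itself): on Linux, the kernel allocates client-side ports for outgoing connections from the range in /proc/sys/net/ipv4/ip_local_port_range, which commonly starts at 32768 and extends past 60000, so a daemon trying to listen on 60000-60100 can find its port already taken by an outgoing connection. The new 160xx ports sit below that range, and moving KMS to 9600 avoids a clash with the new HBase master port 16000. The hypothetical helper below simply reports which of a set of ports falls inside the local machine's ephemeral range.

          # Hypothetical helper, not part of the Impala change: report which
          # configured ports fall inside the kernel's ephemeral port range.

          OLD_PORTS = {
              "hbase_master": 60000,
              "hbase_master_ui": 60010,
              "hbase_regionserver": 60020,
              "hbase_regionserver_ui": 60030,
              "hbase_status_multicast": 60100,
          }
          NEW_PORTS = {
              "hbase_master": 16000,
              "hbase_master_ui": 16010,
              "hbase_regionserver": 16020,
              "hbase_regionserver_ui": 16030,
              "hbase_status_multicast": 16100,
          }

          def ephemeral_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
              """Return the (low, high) ephemeral port range configured in the kernel."""
              with open(path) as f:
                  low, high = f.read().split()
              return int(low), int(high)

          def report(label, ports):
              low, high = ephemeral_range()
              print("%s (ephemeral range %d-%d):" % (label, low, high))
              for name, port in sorted(ports.items()):
                  clash = low <= port <= high
                  print("  %-25s %5d  %s" % (name, port,
                                             "COLLISION RISK" if clash else "ok"))

          if __name__ == "__main__":
              report("Old HBase ports", OLD_PORTS)
              report("New HBase ports", NEW_PORTS)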

          Taras Bobrovytsky added a comment -

          A similar issue happened again.

          Taras Bobrovytsky added a comment -

          The new HBase issue we are seeing is different from this one. Opened IMPALA-5223.


            People

            • Assignee: Lars Volker
            • Reporter: David Knupp
            • Votes: 0
            • Watchers: 5
