[IMPALA-5223] HBase/Zookeeper continues to be flaky on RHEL7 - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: Impala 2.8.0
Fix Version/s: Impala 2.10.0
Component/s: Infrastructure
Labels:
- broken-build

Epic Color:
ghx-label-5

Description

There were several failures on RHEL7 recently.

Starting cluster services...
Stopping kudu
Stopping kms
Stopping yarn
Stopping hdfs
Starting hdfs (Web UI - http://localhost:5070)
Namenode started
Starting yarn (Web UI - http://localhost:8088)
Starting kms (Web UI - http://localhost:9600)
Starting kudu (Web UI - http://localhost:8051)
The cluster is running
 --> Starting HBase
HBase start attempt: 1/5
Killing any HBase processes possibly lingering from previous start attempts
Stopping Hbase
localhost: -------------------------------------------------------------------------------
localhost: # This image has been automatically created with ComodIT - http://comodit.com #
localhost: -------------------------------------------------------------------------------
localhost: 
localhost: starting zookeeper, logging to /data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/repos/Impala/logs/cluster/hbase/hbase-jenkins-zookeeper-impala-boost-static-burst-slave-el7-0a08.vpc.cloudera.com.out
starting master, logging to /data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/repos/Impala/logs/cluster/hbase/hbase-jenkins-master-impala-boost-static-burst-slave-el7-0a08.vpc.cloudera.com.out
starting regionserver, logging to /data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/repos/Impala/logs/cluster/hbase/hbase-jenkins-1-regionserver-impala-boost-static-burst-slave-el7-0a08.vpc.cloudera.com.out
starting regionserver, logging to /data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/repos/Impala/logs/cluster/hbase/hbase-jenkins-2-regionserver-impala-boost-static-burst-slave-el7-0a08.vpc.cloudera.com.out
starting regionserver, logging to /data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/repos/Impala/logs/cluster/hbase/hbase-jenkins-3-regionserver-impala-boost-static-burst-slave-el7-0a08.vpc.cloudera.com.out
Contents of HDFS root: []
Connecting to Zookeeper host(s).
Success: <kazoo.client.KazooClient object at 0x7f5aecf06e10>
Waiting for HBase node: /hbase/master
Waiting for HBase node: /hbase/master
Waiting for HBase node: /hbase/master
Waiting for HBase node: /hbase/master
Waiting for HBase node: /hbase/master
Waiting for HBase node: /hbase/master
Waiting for HBase node: /hbase/master
Waiting for HBase node: /hbase/master
Failed while checking for HBase node: /hbase/master
Waiting for HBase node: /hbase/rs
Success: /hbase/rs
Stopping Zookeeper client
Could not get one or more nodes. Exiting with errors: 1
Error in /data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/repos/Impala/testdata/bin/run-hbase.sh at line 129: ${CLUSTER_BIN}/check-hbase-nodes.py
Error in /data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/repos/Impala/testdata/bin/run-all.sh at line 51: tee ${IMPALA_CLUSTER_LOGS_DIR}/run-hbase.log
+ echo 'buildall.sh ' -format '-snapshot_file /data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/testdata/test-warehouse-SNAPSHOT/test-warehouse-cdh5-304-SNAPSHOT.tar.gz' '-metastore_snapshot_file /data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/testdata/hive_metastore_snapshot/hive_impala_dump_cdh5-304.txt failed.'
buildall.sh  -format -snapshot_file /data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/testdata/test-warehouse-SNAPSHOT/test-warehouse-cdh5-304-SNAPSHOT.tar.gz -metastore_snapshot_file /data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/testdata/hive_metastore_snapshot/hive_impala_dump_cdh5-304.txt failed.
+ RET_VAL=1

From the HBase master log:

17/04/17 21:32:40 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /0:0:0:0:0:0:0:1:48978, server: localhost/0:0:0:0:0:0:0:1:2181
17/04/17 21:32:46 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x15b7f5567b10000, negotiated timeout = 90000
17/04/17 21:33:34 INFO ipc.RpcServer: RpcServer.responder: starting
17/04/17 21:33:34 INFO ipc.RpcServer: RpcServer.listener,port=16000: starting
17/04/17 21:33:35 INFO mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
17/04/17 21:33:35 INFO http.HttpRequestLog: Http request log for http.requests.master is not defined
17/04/17 21:33:35 INFO http.HttpServer: Added global filter 'safety' (class=org.apache.hadoop.hbase.http.HttpServer$QuotingInputFilter)
17/04/17 21:33:35 INFO http.HttpServer: Added global filter 'clickjackingprevention' (class=org.apache.hadoop.hbase.http.ClickjackingPreventionFilter)
17/04/17 21:33:35 INFO http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.hbase.http.lib.StaticUserWebFilter$StaticUserFilter) to context master
17/04/17 21:33:35 INFO http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.hbase.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
17/04/17 21:33:35 INFO http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.hbase.http.lib.StaticUserWebFilter$StaticUserFilter) to context logs
17/04/17 21:33:35 INFO http.HttpServer: Jetty bound to port 16010
17/04/17 21:33:35 INFO mortbay.log: jetty-6.1.26.cloudera.4
17/04/17 21:33:35 INFO mortbay.log: Extract jar:file:/data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/Impala-Toolchain/cdh_components/hbase-1.2.0-cdh5.12.0-SNAPSHOT/lib/hbase-server-1.2.0-cdh5.12.0-SNAPSHOT.jar!/hbase-webapps/master to /tmp/Jetty_0_0_0_0_16010_master____.6nvknp/webapp
17/04/17 21:33:35 INFO mortbay.log: Started SelectChannelConnector@0.0.0.0:16010
17/04/17 21:33:35 INFO master.HMaster: hbase.rootdir=hdfs://localhost:20500/hbase, hbase.cluster.distributed=true
17/04/17 21:33:35 INFO master.HMaster: Adding backup master ZNode /hbase/backup-masters/localhost,16000,1492489958254

Here's what a successful HBase looks like:

17/04/16 21:32:26 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /0:0:0:0:0:0:0:1:52928, server: localhost/0:0:0:0:0:0:0:1:2181
17/04/16 21:32:28 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x15b7a2ed6860000, negotiated timeout = 90000
17/04/16 21:32:28 INFO ipc.RpcServer: RpcServer.responder: starting
17/04/16 21:32:28 INFO ipc.RpcServer: RpcServer.listener,port=16000: starting
17/04/16 21:32:28 INFO mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
17/04/16 21:32:28 INFO http.HttpRequestLog: Http request log for http.requests.master is not defined
17/04/16 21:32:28 INFO http.HttpServer: Added global filter 'safety' (class=org.apache.hadoop.hbase.http.HttpServer$QuotingInputFilter)
17/04/16 21:32:28 INFO http.HttpServer: Added global filter 'clickjackingprevention' (class=org.apache.hadoop.hbase.http.ClickjackingPreventionFilter)
17/04/16 21:32:28 INFO http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.hbase.http.lib.StaticUserWebFilter$StaticUserFilter) to context master
17/04/16 21:32:28 INFO http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.hbase.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
17/04/16 21:32:28 INFO http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.hbase.http.lib.StaticUserWebFilter$StaticUserFilter) to context logs
17/04/16 21:32:28 INFO http.HttpServer: Jetty bound to port 16010
17/04/16 21:32:28 INFO mortbay.log: jetty-6.1.26.cloudera.4
17/04/16 21:32:28 INFO mortbay.log: Extract jar:file:/data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/Impala-Toolchain/cdh_components/hbase-1.2.0-cdh5.12.0-SNAPSHOT/lib/hbase-server-1.2.0-cdh5.12.0-SNAPSHOT.jar!/hbase-webapps/master to /tmp/Jetty_0_0_0_0_16010_master____.6nvknp/webapp
17/04/16 21:32:29 INFO mortbay.log: Started SelectChannelConnector@0.0.0.0:16010
17/04/16 21:32:29 INFO master.HMaster: hbase.rootdir=hdfs://localhost:20500/hbase, hbase.cluster.distributed=true
17/04/16 21:32:29 INFO master.HMaster: Adding backup master ZNode /hbase/backup-masters/localhost,16000,1492403544599
17/04/16 21:32:29 INFO master.ActiveMasterManager: Deleting ZNode for /hbase/backup-masters/localhost,16000,1492403544599 from backup master directory
17/04/16 21:32:29 INFO master.ActiveMasterManager: Registered Active Master=localhost,16000,1492403544599
17/04/16 21:32:29 INFO Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
17/04/16 21:32:29 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x112a19e connecting to ZooKeeper ensemble=localhost:2181
17/04/16 21:32:29 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=hconnection-0x112a19e0x0, quorum=localhost:2181, baseZNode=/hbase
17/04/16 21:32:29 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL (unknown error)
17/04/16 21:32:29 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /0:0:0:0:0:0:0:1:52933, server: localhost/0:0:0:0:0:0:0:1:2181
17/04/16 21:32:29 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x15b7a2ed6860005, negotiated timeout = 90000
17/04/16 21:32:29 INFO client.ZooKeeperRegistry: ClusterId read in ZooKeeper is null
17/04/16 21:32:30 INFO util.FSUtils: Created version file at hdfs://localhost:20500/hbase with version=8
17/04/16 21:32:31 INFO master.MasterFileSystem: BOOTSTRAP: creating hbase:meta region
17/04/16 21:32:31 INFO regionserver.HRegion: creating HRegion hbase:meta HTD == 'hbase:meta', {TABLE_ATTRIBUTES => {IS_META => 'true', coprocessor$1 => '|org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint|536870911|'}, {NAME => 'info', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '10', TTL => 'FOREVER', MIN_VERSIONS => '0', CACHE_DATA_IN_L1 => 'true', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '8192', IN_MEMORY => 'false', BLOCKCACHE => 'false'} RootDir = hdfs://localhost:20500/hbase Table name == hbase:meta
17/04/16 21:32:31 INFO hfile.CacheConfig: blockCache=LruBlockCache{blockCount=0, currentSize=6559152, freeSize=6388915792, maxSize=6395474944, heapSize=6559152, minSize=6075701248, minFactor=0.95, multiSize=3037850624, multiFactor=0.5, singleSize=1518925312, singleFactor=0.25}, cacheDataOnRead=false, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false
17/04/16 21:32:31 INFO compactions.CompactionConfiguration: size [134217728, 9223372036854775807, 9223372036854775807); files [3, 10); ratio 1.200000; off-peak ratio 5.000000; throttle point 2684354560; major period 604800000, major jitter 0.500000, min locality to compact 0.000000
17/04/16 21:32:31 INFO regionserver.HRegion: Onlined 1588230740; next sequenceid=2
17/04/16 21:32:31 INFO regionserver.HStore: Closed info
17/04/16 21:32:31 INFO regionserver.HRegion: Closed hbase:meta,,1.1588230740
17/04/16 21:32:31 INFO fs.HFileSystem: Added intercepting call to namenode#getBlockLocations so can do block reordering using class class org.apache.hadoop.hbase.fs.HFileSystem$ReorderWALBlocks
17/04/16 21:32:31 INFO coordination.SplitLogManagerCoordination: Found 0 orphan tasks and 0 rescan nodes
17/04/16 21:32:31 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x6c9e069e connecting to ZooKeeper ensemble=localhost:2181
17/04/16 21:32:31 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=hconnection-0x6c9e069e0x0, quorum=localhost:2181, baseZNode=/hbase
17/04/16 21:32:31 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL (unknown error)
17/04/16 21:32:31 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /0:0:0:0:0:0:0:1:52946, server: localhost/0:0:0:0:0:0:0:1:2181
17/04/16 21:32:31 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x15b7a2ed6860009, negotiated timeout = 90000
17/04/16 21:32:31 INFO balancer.StochasticLoadBalancer: loading config
17/04/16 21:32:32 INFO master.HMaster: Server active/primary master=localhost,16000,1492403544599, sessionid=0x15b7a2ed6860000, setting cluster-up flag (Was=false)
17/04/16 21:32:32 INFO regionserver.HRegionServer: ClusterId : 4e7d30ae-2e8f-4ede-8ec8-f62245b8c9f9
17/04/16 21:32:32 INFO procedure.ZKProcedureUtil: Clearing all procedure znodes: /hbase/online-snapshot/acquired /hbase/online-snapshot/reached /hbase/online-snapshot/abort
17/04/16 21:32:32 INFO procedure.ZKProcedureUtil: Clearing all procedure znodes: /hbase/flush-table-proc/acquired /hbase/flush-table-proc/reached /hbase/flush-table-proc/abort
17/04/16 21:32:32 INFO master.MasterCoprocessorHost: System coprocessor loading is enabled
17/04/16 21:32:32 INFO procedure2.ProcedureExecutor: Starting procedure executor threads=9
17/04/16 21:32:32 INFO wal.WALProcedureStore: Starting WAL Procedure Store lease recovery
17/04/16 21:32:32 WARN wal.WALProcedureStore: Log directory not found: File hdfs://localhost:20500/hbase/MasterProcWALs does not exist.
17/04/16 21:32:32 INFO wal.WALProcedureStore: Lease acquired for flushLogId: 1
17/04/16 21:32:32 INFO zookeeper.RecoverableZooKeeper: Process identifier=replicationLogCleaner connecting to ZooKeeper ensemble=localhost:2181
17/04/16 21:32:32 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=replicationLogCleaner0x0, quorum=localhost:2181, baseZNode=/hbase
17/04/16 21:32:32 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL (unknown error)
17/04/16 21:32:32 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /0:0:0:0:0:0:0:1:52949, server: localhost/0:0:0:0:0:0:0:1:2181
17/04/16 21:32:32 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x15b7a2ed686000a, negotiated timeout = 90000
17/04/16 21:32:32 INFO master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
17/04/16 21:32:32 INFO hbase.Server: Auth successful for jenkins (auth:SIMPLE)
17/04/16 21:32:32 INFO hbase.Server: Auth successful for jenkins (auth:SIMPLE)
17/04/16 21:32:32 INFO hbase.Server: Auth successful for jenkins (auth:SIMPLE)
17/04/16 21:32:32 INFO hbase.Server: Connection from 127.0.0.1 port: 43413 with version info: version: "1.2.0-cdh5.12.0-SNAPSHOT" url: "file:///data/jenkins/workspace/generic-binary-tarball-and-maven-deploy/CDH5-Packaging-HBase-2017-03-02_15-47-59/hbase-1.2.0-cdh5.12.0-SNAPSHOT" revision: "Unknown" user: "jenkins" date: "Thu Mar  2 16:13:05 PST 2017" src_checksum: "9ea697f985b9a5740a14d1be418c1115"
17/04/16 21:32:32 INFO hbase.Server: Connection from 127.0.0.1 port: 35012 with version info: version: "1.2.0-cdh5.12.0-SNAPSHOT" url: "file:///data/jenkins/workspace/generic-binary-tarball-and-maven-deploy/CDH5-Packaging-HBase-2017-03-02_15-47-59/hbase-1.2.0-cdh5.12.0-SNAPSHOT" revision: "Unknown" user: "jenkins" date: "Thu Mar  2 16:13:05 PST 2017" src_checksum: "9ea697f985b9a5740a14d1be418c1115"
17/04/16 21:32:32 INFO hbase.Server: Connection from 127.0.0.1 port: 59011 with version info: version: "1.2.0-cdh5.12.0-SNAPSHOT" url: "file:///data/jenkins/workspace/generic-binary-tarball-and-maven-deploy/CDH5-Packaging-HBase-2017-03-02_15-47-59/hbase-1.2.0-cdh5.12.0-SNAPSHOT" revision: "Unknown" user: "jenkins" date: "Thu Mar  2 16:13:05 PST 2017" src_checksum: "9ea697f985b9a5740a14d1be418c1115"
17/04/16 21:32:32 INFO master.ServerManager: Registering server=localhost,16203,1492403547967
17/04/16 21:32:32 INFO master.ServerManager: Registering server=localhost,16201,1492403545540
17/04/16 21:32:32 INFO master.ServerManager: Registering server=localhost,16202,1492403546904
17/04/16 21:32:32 INFO master.ServerManager: Waiting for region servers count to settle; currently checked in 3, slept for 385 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
17/04/16 21:32:34 INFO master.ServerManager: Waiting for region servers count to settle; currently checked in 3, slept for 1895 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
17/04/16 21:32:35 INFO master.ServerManager: Waiting for region servers count to settle; currently checked in 3, slept for 3398 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
17/04/16 21:32:36 INFO master.ServerManager: Finished waiting for region servers count to settle; checked in 3, slept for 4501 ms, expecting minimum of 1, maximum of 2147483647, master is running
17/04/16 21:32:36 INFO master.MasterFileSystem: Log folder hdfs://localhost:20500/hbase/WALs/localhost,16201,1492403545540 belongs to an existing region server
17/04/16 21:32:36 INFO master.MasterFileSystem: Log folder hdfs://localhost:20500/hbase/WALs/localhost,16202,1492403546904 belongs to an existing region server
17/04/16 21:32:36 INFO master.MasterFileSystem: Log folder hdfs://localhost:20500/hbase/WALs/localhost,16203,1492403547967 belongs to an existing region server
17/04/16 21:32:37 INFO zookeeper.MetaTableLocator: Deleting hbase:meta region location in ZooKeeper
17/04/16 21:32:37 INFO master.AssignmentManager: Setting node as OFFLINED in ZooKeeper for region {ENCODED => 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}
17/04/16 21:32:37 INFO zookeeper.ZKTableStateManager: Moving table hbase:meta state from null to ENABLED
17/04/16 21:32:37 INFO master.AssignmentManager: Assigning hbase:meta,,1.1588230740 to localhost,16201,1492403545540
17/04/16 21:32:37 INFO master.RegionStates: Transition {1588230740 state=OFFLINE, ts=1492403557593, server=null} to {1588230740 state=PENDING_OPEN, ts=1492403557616, server=localhost,16201,1492403545540}

This issue is similar to ~~IMPALA-4733~~.

Attachments

Activity

People

Assignee:: Lars Volker

Reporter:: Taras Bobrovytsky

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 18/Apr/17 20:52

Updated:: 13/Jun/17 05:58

Resolved:: 13/Jun/17 05:58