Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-5223

HBase/Zookeeper continues to be flaky on RHEL7

    Details

    • Epic Color:
      ghx-label-5

      Description

      There were several failures on RHEL7 recently.

      Starting cluster services...
      Stopping kudu
      Stopping kms
      Stopping yarn
      Stopping hdfs
      Starting hdfs (Web UI - http://localhost:5070)
      Namenode started
      Starting yarn (Web UI - http://localhost:8088)
      Starting kms (Web UI - http://localhost:9600)
      Starting kudu (Web UI - http://localhost:8051)
      The cluster is running
       --> Starting HBase
      HBase start attempt: 1/5
      Killing any HBase processes possibly lingering from previous start attempts
      Stopping Hbase
      localhost: -------------------------------------------------------------------------------
      localhost: # This image has been automatically created with ComodIT - http://comodit.com #
      localhost: -------------------------------------------------------------------------------
      localhost: 
      localhost: starting zookeeper, logging to /data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/repos/Impala/logs/cluster/hbase/hbase-jenkins-zookeeper-impala-boost-static-burst-slave-el7-0a08.vpc.cloudera.com.out
      starting master, logging to /data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/repos/Impala/logs/cluster/hbase/hbase-jenkins-master-impala-boost-static-burst-slave-el7-0a08.vpc.cloudera.com.out
      starting regionserver, logging to /data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/repos/Impala/logs/cluster/hbase/hbase-jenkins-1-regionserver-impala-boost-static-burst-slave-el7-0a08.vpc.cloudera.com.out
      starting regionserver, logging to /data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/repos/Impala/logs/cluster/hbase/hbase-jenkins-2-regionserver-impala-boost-static-burst-slave-el7-0a08.vpc.cloudera.com.out
      starting regionserver, logging to /data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/repos/Impala/logs/cluster/hbase/hbase-jenkins-3-regionserver-impala-boost-static-burst-slave-el7-0a08.vpc.cloudera.com.out
      Contents of HDFS root: []
      Connecting to Zookeeper host(s).
      Success: <kazoo.client.KazooClient object at 0x7f5aecf06e10>
      Waiting for HBase node: /hbase/master
      Waiting for HBase node: /hbase/master
      Waiting for HBase node: /hbase/master
      Waiting for HBase node: /hbase/master
      Waiting for HBase node: /hbase/master
      Waiting for HBase node: /hbase/master
      Waiting for HBase node: /hbase/master
      Waiting for HBase node: /hbase/master
      Failed while checking for HBase node: /hbase/master
      Waiting for HBase node: /hbase/rs
      Success: /hbase/rs
      Stopping Zookeeper client
      Could not get one or more nodes. Exiting with errors: 1
      Error in /data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/repos/Impala/testdata/bin/run-hbase.sh at line 129: ${CLUSTER_BIN}/check-hbase-nodes.py
      Error in /data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/repos/Impala/testdata/bin/run-all.sh at line 51: tee ${IMPALA_CLUSTER_LOGS_DIR}/run-hbase.log
      + echo 'buildall.sh ' -format '-snapshot_file /data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/testdata/test-warehouse-SNAPSHOT/test-warehouse-cdh5-304-SNAPSHOT.tar.gz' '-metastore_snapshot_file /data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/testdata/hive_metastore_snapshot/hive_impala_dump_cdh5-304.txt failed.'
      buildall.sh  -format -snapshot_file /data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/testdata/test-warehouse-SNAPSHOT/test-warehouse-cdh5-304-SNAPSHOT.tar.gz -metastore_snapshot_file /data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/testdata/hive_metastore_snapshot/hive_impala_dump_cdh5-304.txt failed.
      + RET_VAL=1
      

      From the HBase master log:

      17/04/17 21:32:40 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /0:0:0:0:0:0:0:1:48978, server: localhost/0:0:0:0:0:0:0:1:2181
      17/04/17 21:32:46 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x15b7f5567b10000, negotiated timeout = 90000
      17/04/17 21:33:34 INFO ipc.RpcServer: RpcServer.responder: starting
      17/04/17 21:33:34 INFO ipc.RpcServer: RpcServer.listener,port=16000: starting
      17/04/17 21:33:35 INFO mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
      17/04/17 21:33:35 INFO http.HttpRequestLog: Http request log for http.requests.master is not defined
      17/04/17 21:33:35 INFO http.HttpServer: Added global filter 'safety' (class=org.apache.hadoop.hbase.http.HttpServer$QuotingInputFilter)
      17/04/17 21:33:35 INFO http.HttpServer: Added global filter 'clickjackingprevention' (class=org.apache.hadoop.hbase.http.ClickjackingPreventionFilter)
      17/04/17 21:33:35 INFO http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.hbase.http.lib.StaticUserWebFilter$StaticUserFilter) to context master
      17/04/17 21:33:35 INFO http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.hbase.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
      17/04/17 21:33:35 INFO http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.hbase.http.lib.StaticUserWebFilter$StaticUserFilter) to context logs
      17/04/17 21:33:35 INFO http.HttpServer: Jetty bound to port 16010
      17/04/17 21:33:35 INFO mortbay.log: jetty-6.1.26.cloudera.4
      17/04/17 21:33:35 INFO mortbay.log: Extract jar:file:/data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/Impala-Toolchain/cdh_components/hbase-1.2.0-cdh5.12.0-SNAPSHOT/lib/hbase-server-1.2.0-cdh5.12.0-SNAPSHOT.jar!/hbase-webapps/master to /tmp/Jetty_0_0_0_0_16010_master____.6nvknp/webapp
      17/04/17 21:33:35 INFO mortbay.log: Started SelectChannelConnector@0.0.0.0:16010
      17/04/17 21:33:35 INFO master.HMaster: hbase.rootdir=hdfs://localhost:20500/hbase, hbase.cluster.distributed=true
      17/04/17 21:33:35 INFO master.HMaster: Adding backup master ZNode /hbase/backup-masters/localhost,16000,1492489958254
      

      Here's what a successful HBase looks like:

      17/04/16 21:32:26 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /0:0:0:0:0:0:0:1:52928, server: localhost/0:0:0:0:0:0:0:1:2181
      17/04/16 21:32:28 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x15b7a2ed6860000, negotiated timeout = 90000
      17/04/16 21:32:28 INFO ipc.RpcServer: RpcServer.responder: starting
      17/04/16 21:32:28 INFO ipc.RpcServer: RpcServer.listener,port=16000: starting
      17/04/16 21:32:28 INFO mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
      17/04/16 21:32:28 INFO http.HttpRequestLog: Http request log for http.requests.master is not defined
      17/04/16 21:32:28 INFO http.HttpServer: Added global filter 'safety' (class=org.apache.hadoop.hbase.http.HttpServer$QuotingInputFilter)
      17/04/16 21:32:28 INFO http.HttpServer: Added global filter 'clickjackingprevention' (class=org.apache.hadoop.hbase.http.ClickjackingPreventionFilter)
      17/04/16 21:32:28 INFO http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.hbase.http.lib.StaticUserWebFilter$StaticUserFilter) to context master
      17/04/16 21:32:28 INFO http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.hbase.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
      17/04/16 21:32:28 INFO http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.hbase.http.lib.StaticUserWebFilter$StaticUserFilter) to context logs
      17/04/16 21:32:28 INFO http.HttpServer: Jetty bound to port 16010
      17/04/16 21:32:28 INFO mortbay.log: jetty-6.1.26.cloudera.4
      17/04/16 21:32:28 INFO mortbay.log: Extract jar:file:/data/jenkins/workspace/impala-asf-master-exhaustive-rhel7/Impala-Toolchain/cdh_components/hbase-1.2.0-cdh5.12.0-SNAPSHOT/lib/hbase-server-1.2.0-cdh5.12.0-SNAPSHOT.jar!/hbase-webapps/master to /tmp/Jetty_0_0_0_0_16010_master____.6nvknp/webapp
      17/04/16 21:32:29 INFO mortbay.log: Started SelectChannelConnector@0.0.0.0:16010
      17/04/16 21:32:29 INFO master.HMaster: hbase.rootdir=hdfs://localhost:20500/hbase, hbase.cluster.distributed=true
      17/04/16 21:32:29 INFO master.HMaster: Adding backup master ZNode /hbase/backup-masters/localhost,16000,1492403544599
      17/04/16 21:32:29 INFO master.ActiveMasterManager: Deleting ZNode for /hbase/backup-masters/localhost,16000,1492403544599 from backup master directory
      17/04/16 21:32:29 INFO master.ActiveMasterManager: Registered Active Master=localhost,16000,1492403544599
      17/04/16 21:32:29 INFO Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
      17/04/16 21:32:29 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x112a19e connecting to ZooKeeper ensemble=localhost:2181
      17/04/16 21:32:29 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=hconnection-0x112a19e0x0, quorum=localhost:2181, baseZNode=/hbase
      17/04/16 21:32:29 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL (unknown error)
      17/04/16 21:32:29 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /0:0:0:0:0:0:0:1:52933, server: localhost/0:0:0:0:0:0:0:1:2181
      17/04/16 21:32:29 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x15b7a2ed6860005, negotiated timeout = 90000
      17/04/16 21:32:29 INFO client.ZooKeeperRegistry: ClusterId read in ZooKeeper is null
      17/04/16 21:32:30 INFO util.FSUtils: Created version file at hdfs://localhost:20500/hbase with version=8
      17/04/16 21:32:31 INFO master.MasterFileSystem: BOOTSTRAP: creating hbase:meta region
      17/04/16 21:32:31 INFO regionserver.HRegion: creating HRegion hbase:meta HTD == 'hbase:meta', {TABLE_ATTRIBUTES => {IS_META => 'true', coprocessor$1 => '|org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint|536870911|'}, {NAME => 'info', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '10', TTL => 'FOREVER', MIN_VERSIONS => '0', CACHE_DATA_IN_L1 => 'true', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '8192', IN_MEMORY => 'false', BLOCKCACHE => 'false'} RootDir = hdfs://localhost:20500/hbase Table name == hbase:meta
      17/04/16 21:32:31 INFO hfile.CacheConfig: blockCache=LruBlockCache{blockCount=0, currentSize=6559152, freeSize=6388915792, maxSize=6395474944, heapSize=6559152, minSize=6075701248, minFactor=0.95, multiSize=3037850624, multiFactor=0.5, singleSize=1518925312, singleFactor=0.25}, cacheDataOnRead=false, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false
      17/04/16 21:32:31 INFO compactions.CompactionConfiguration: size [134217728, 9223372036854775807, 9223372036854775807); files [3, 10); ratio 1.200000; off-peak ratio 5.000000; throttle point 2684354560; major period 604800000, major jitter 0.500000, min locality to compact 0.000000
      17/04/16 21:32:31 INFO regionserver.HRegion: Onlined 1588230740; next sequenceid=2
      17/04/16 21:32:31 INFO regionserver.HStore: Closed info
      17/04/16 21:32:31 INFO regionserver.HRegion: Closed hbase:meta,,1.1588230740
      17/04/16 21:32:31 INFO fs.HFileSystem: Added intercepting call to namenode#getBlockLocations so can do block reordering using class class org.apache.hadoop.hbase.fs.HFileSystem$ReorderWALBlocks
      17/04/16 21:32:31 INFO coordination.SplitLogManagerCoordination: Found 0 orphan tasks and 0 rescan nodes
      17/04/16 21:32:31 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x6c9e069e connecting to ZooKeeper ensemble=localhost:2181
      17/04/16 21:32:31 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=hconnection-0x6c9e069e0x0, quorum=localhost:2181, baseZNode=/hbase
      17/04/16 21:32:31 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL (unknown error)
      17/04/16 21:32:31 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /0:0:0:0:0:0:0:1:52946, server: localhost/0:0:0:0:0:0:0:1:2181
      17/04/16 21:32:31 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x15b7a2ed6860009, negotiated timeout = 90000
      17/04/16 21:32:31 INFO balancer.StochasticLoadBalancer: loading config
      17/04/16 21:32:32 INFO master.HMaster: Server active/primary master=localhost,16000,1492403544599, sessionid=0x15b7a2ed6860000, setting cluster-up flag (Was=false)
      17/04/16 21:32:32 INFO regionserver.HRegionServer: ClusterId : 4e7d30ae-2e8f-4ede-8ec8-f62245b8c9f9
      17/04/16 21:32:32 INFO procedure.ZKProcedureUtil: Clearing all procedure znodes: /hbase/online-snapshot/acquired /hbase/online-snapshot/reached /hbase/online-snapshot/abort
      17/04/16 21:32:32 INFO procedure.ZKProcedureUtil: Clearing all procedure znodes: /hbase/flush-table-proc/acquired /hbase/flush-table-proc/reached /hbase/flush-table-proc/abort
      17/04/16 21:32:32 INFO master.MasterCoprocessorHost: System coprocessor loading is enabled
      17/04/16 21:32:32 INFO procedure2.ProcedureExecutor: Starting procedure executor threads=9
      17/04/16 21:32:32 INFO wal.WALProcedureStore: Starting WAL Procedure Store lease recovery
      17/04/16 21:32:32 WARN wal.WALProcedureStore: Log directory not found: File hdfs://localhost:20500/hbase/MasterProcWALs does not exist.
      17/04/16 21:32:32 INFO wal.WALProcedureStore: Lease acquired for flushLogId: 1
      17/04/16 21:32:32 INFO zookeeper.RecoverableZooKeeper: Process identifier=replicationLogCleaner connecting to ZooKeeper ensemble=localhost:2181
      17/04/16 21:32:32 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=replicationLogCleaner0x0, quorum=localhost:2181, baseZNode=/hbase
      17/04/16 21:32:32 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL (unknown error)
      17/04/16 21:32:32 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /0:0:0:0:0:0:0:1:52949, server: localhost/0:0:0:0:0:0:0:1:2181
      17/04/16 21:32:32 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x15b7a2ed686000a, negotiated timeout = 90000
      17/04/16 21:32:32 INFO master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
      17/04/16 21:32:32 INFO hbase.Server: Auth successful for jenkins (auth:SIMPLE)
      17/04/16 21:32:32 INFO hbase.Server: Auth successful for jenkins (auth:SIMPLE)
      17/04/16 21:32:32 INFO hbase.Server: Auth successful for jenkins (auth:SIMPLE)
      17/04/16 21:32:32 INFO hbase.Server: Connection from 127.0.0.1 port: 43413 with version info: version: "1.2.0-cdh5.12.0-SNAPSHOT" url: "file:///data/jenkins/workspace/generic-binary-tarball-and-maven-deploy/CDH5-Packaging-HBase-2017-03-02_15-47-59/hbase-1.2.0-cdh5.12.0-SNAPSHOT" revision: "Unknown" user: "jenkins" date: "Thu Mar  2 16:13:05 PST 2017" src_checksum: "9ea697f985b9a5740a14d1be418c1115"
      17/04/16 21:32:32 INFO hbase.Server: Connection from 127.0.0.1 port: 35012 with version info: version: "1.2.0-cdh5.12.0-SNAPSHOT" url: "file:///data/jenkins/workspace/generic-binary-tarball-and-maven-deploy/CDH5-Packaging-HBase-2017-03-02_15-47-59/hbase-1.2.0-cdh5.12.0-SNAPSHOT" revision: "Unknown" user: "jenkins" date: "Thu Mar  2 16:13:05 PST 2017" src_checksum: "9ea697f985b9a5740a14d1be418c1115"
      17/04/16 21:32:32 INFO hbase.Server: Connection from 127.0.0.1 port: 59011 with version info: version: "1.2.0-cdh5.12.0-SNAPSHOT" url: "file:///data/jenkins/workspace/generic-binary-tarball-and-maven-deploy/CDH5-Packaging-HBase-2017-03-02_15-47-59/hbase-1.2.0-cdh5.12.0-SNAPSHOT" revision: "Unknown" user: "jenkins" date: "Thu Mar  2 16:13:05 PST 2017" src_checksum: "9ea697f985b9a5740a14d1be418c1115"
      17/04/16 21:32:32 INFO master.ServerManager: Registering server=localhost,16203,1492403547967
      17/04/16 21:32:32 INFO master.ServerManager: Registering server=localhost,16201,1492403545540
      17/04/16 21:32:32 INFO master.ServerManager: Registering server=localhost,16202,1492403546904
      17/04/16 21:32:32 INFO master.ServerManager: Waiting for region servers count to settle; currently checked in 3, slept for 385 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
      17/04/16 21:32:34 INFO master.ServerManager: Waiting for region servers count to settle; currently checked in 3, slept for 1895 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
      17/04/16 21:32:35 INFO master.ServerManager: Waiting for region servers count to settle; currently checked in 3, slept for 3398 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
      17/04/16 21:32:36 INFO master.ServerManager: Finished waiting for region servers count to settle; checked in 3, slept for 4501 ms, expecting minimum of 1, maximum of 2147483647, master is running
      17/04/16 21:32:36 INFO master.MasterFileSystem: Log folder hdfs://localhost:20500/hbase/WALs/localhost,16201,1492403545540 belongs to an existing region server
      17/04/16 21:32:36 INFO master.MasterFileSystem: Log folder hdfs://localhost:20500/hbase/WALs/localhost,16202,1492403546904 belongs to an existing region server
      17/04/16 21:32:36 INFO master.MasterFileSystem: Log folder hdfs://localhost:20500/hbase/WALs/localhost,16203,1492403547967 belongs to an existing region server
      17/04/16 21:32:37 INFO zookeeper.MetaTableLocator: Deleting hbase:meta region location in ZooKeeper
      17/04/16 21:32:37 INFO master.AssignmentManager: Setting node as OFFLINED in ZooKeeper for region {ENCODED => 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}
      17/04/16 21:32:37 INFO zookeeper.ZKTableStateManager: Moving table hbase:meta state from null to ENABLED
      17/04/16 21:32:37 INFO master.AssignmentManager: Assigning hbase:meta,,1.1588230740 to localhost,16201,1492403545540
      17/04/16 21:32:37 INFO master.RegionStates: Transition {1588230740 state=OFFLINE, ts=1492403557593, server=null} to {1588230740 state=PENDING_OPEN, ts=1492403557616, server=localhost,16201,1492403545540}
      

      This issue is similar to IMPALA-4733.

        Attachments

          Activity

            People

            • Assignee:
              lv Lars Volker
              Reporter:
              tarasbob Taras Bobrovytsky
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: