[HDFS-7763] fix zkfc hung issue due to not catching exception in a corner case - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.6.0
Fix Version/s: 2.7.0, 2.6.1, 3.0.0-alpha1
Component/s: ha
Labels:
- 2.6.1-candidate

Description

In our production cluster, both zkfc processes hung after a ZooKeeper network outage.

The zkfc log showed:

2015-02-07,17:40:11,875 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3334ms for sessionid 0x4a61bacdd9dfb2, closing socket connection and attempting reconnect
2015-02-07,17:40:11,977 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
2015-02-07,17:40:12,425 INFO org.apache.zookeeper.ZooKeeper: Session: 0x4a61bacdd9dfb2 closed
2015-02-07,17:40:12,425 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
2015-02-07,17:40:12,425 INFO org.apache.hadoop.ipc.Server: Stopping server on 11300
2015-02-07,17:40:12,425 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.HealthMonitor: Stopping HealthMonitor thread
2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 11300

The thread dump is also uploaded as an attachment.
From the dump, we can see that because of some unknown non-daemon threads (pool-N-thread-M), the process did not exit, even though the critical threads, such as the health monitor and RPC threads, had been stopped. As a result, our watchdog (supervisord) did not observe that the zkfc process was down or abnormal, so the subsequent namenode failover could not be done as expected.
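
To illustrate the kind of check a thread dump makes possible, here is a minimal sketch that lists live non-daemon threads from inside a JVM. The class and method names are hypothetical and are not part of Hadoop; it only relies on the standard `Thread.getAllStackTraces()` call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: reports threads that can keep a JVM alive.
final class NonDaemonThreads {
    /** Returns the names of live non-daemon threads, excluding the calling thread. */
    static List<String> liveNonDaemonThreadNames() {
        List<String> names = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            // A JVM exits on its own only once every non-daemon thread has finished.
            if (t.isAlive() && !t.isDaemon() && t != Thread.currentThread()) {
                names.add(t.getName());
            }
        }
        return names;
    }
}
```

Threads created by the default `Executors` thread factory are non-daemon and are named in the pool-N-thread-M pattern, so an executor that is never shut down would show up in this list.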

There are two possible fixes here: 1) figure out where the threads with unset names, like pool-7-thread-1, come from, and either close them or set their daemon property. I tried to search but have found nothing so far. 2) catch the exception from ZKFailoverController.run() so we can continue on to execute System.exit. The attached patch is 2).
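
The two options can be sketched roughly as follows. This is an illustration under assumptions, not the attached patch: the class name, the method names, and the choice of exit code 1 are all hypothetical, and the real change is made in the zkfc entry point.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadFactory;

// Hypothetical sketch of both options; none of these names come from Hadoop.
final class ZkfcExitSketch {
    /** Option 1: a thread factory whose threads cannot keep the JVM alive. */
    static ThreadFactory daemonFactory(String name) {
        return runnable -> {
            Thread t = new Thread(runnable, name);
            t.setDaemon(true); // daemon threads do not block JVM shutdown
            return t;
        };
    }

    /**
     * Option 2: run the controller and turn any Throwable into an exit code,
     * so the caller always reaches System.exit. Exit code 1 is an assumption.
     */
    static int exitCodeOf(Callable<Integer> run) {
        try {
            return run.call();
        } catch (Throwable t) {
            System.err.println("Got a fatal error, exiting now: " + t);
            return 1;
        }
    }
}
```

Under this sketch, a `main` method would end with `System.exit(ZkfcExitSketch.exitCodeOf(...))`, which terminates the process even if stray non-daemon threads are still running.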

Attachments

- HDFS-7763-001.txt (10/Feb/15 08:08, 0.9 kB, Liang Xie)
- HDFS-7763-002.txt (12/Feb/15 03:52, 0.9 kB, Liang Xie)
- jstack.4936 (10/Feb/15 07:54, 12 kB, Liang Xie)

People

Assignee: Liang Xie

Reporter: Liang Xie

Votes: 0

Watchers: 6

Dates

Created: 10/Feb/15 07:26

Updated: 30/Aug/16 01:39

Resolved: 24/Feb/15 23:31