[HDFS-6184] Capture NN's thread dump when it fails over - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.8.0, 3.0.0-alpha1
Component/s: namenode
Labels: None
Hadoop Flags: Reviewed

Description

We have seen several false positives in which ZKFC considers the NN to be unhealthy. Some of these trigger unnecessary failovers. Examples:

1. An SBN checkpoint caused ZKFC's RPC call into the NN to time out. The consequence isn't bad: the SBN just quits ZK membership and rejoins later. But it is unnecessary. The reason is that the checkpoint acquires the NN global write lock, so all RPC requests are blocked. Even though HAServiceProtocol.monitorHealth doesn't need to acquire the NN lock, it still needs to use the service RPC queue.

2. When the ANN is busy, the global lock can sometimes block other requests, and ZKFC's RPC call times out. This triggers a failover. The problem is that even after the failover, the new ANN might run into a similar issue.
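The mechanism behind both examples can be modelled outside Hadoop. The sketch below is only an illustration under stated assumptions: `SharedQueueTimeoutDemo`, `GLOBAL_LOCK`, the two-handler pool, and the 200 ms timeout are all hypothetical stand-ins, not NameNode code. It shows a health check that needs no lock still timing out, because every handler serving the shared queue is blocked behind the write lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class SharedQueueTimeoutDemo {
    // Hypothetical stand-in for the NN global lock (not the real namesystem lock).
    static final ReentrantReadWriteLock GLOBAL_LOCK = new ReentrantReadWriteLock();

    /** Returns true if the lock-free health check answered within timeoutMs. */
    static boolean healthCheck(ExecutorService handlers, long timeoutMs)
            throws InterruptedException {
        // The check itself takes no lock, but it must wait for a free handler.
        Future<Boolean> reply = handlers.submit(() -> true);
        try {
            return reply.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            reply.cancel(true);
            return false;
        }
    }

    /** Runs the scenario once and returns the health check result. */
    static boolean runScenario() throws Exception {
        ExecutorService handlers = Executors.newFixedThreadPool(2); // shared RPC handlers
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // Models a checkpoint (or any long operation) holding the global write lock.
        Thread checkpoint = new Thread(() -> {
            GLOBAL_LOCK.writeLock().lock();
            try {
                lockHeld.countDown();
                release.await();
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            } finally {
                GLOBAL_LOCK.writeLock().unlock();
            }
        });
        checkpoint.start();
        lockHeld.await();

        // Two ordinary requests occupy both handlers while waiting for the read lock.
        for (int i = 0; i < 2; i++) {
            handlers.submit(() -> {
                GLOBAL_LOCK.readLock().lock();
                GLOBAL_LOCK.readLock().unlock();
            });
        }

        // The health check is queued behind them and cannot be served in time.
        boolean healthy = healthCheck(handlers, 200);

        release.countDown();
        checkpoint.join();
        handlers.shutdown();
        return healthy;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("health check answered in time: " + runScenario());
    }
}
```

In this model the process is alive and would recover as soon as the lock is released, yet the caller sees a timeout, which is the kind of false positive described above.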

We can increase the ZKFC-to-NN timeout value to mitigate this to some degree. It would also be useful if ZKFC could judge more accurately whether the NN is healthy and predict whether a failover will help. For example, we can:

1. Have ZKFC make its decision based on an NN thread dump.
2. Have a dedicated RPC pool for ZKFC > NN. Given that the health check doesn't need to acquire the NN global lock, it can go through even if the NN is checkpointing or very busy.
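As a rough illustration of what a thread dump would give ZKFC (or an operator) to work with, here is a minimal sketch that uses only the standard JDK `ThreadMXBean` API. It dumps the threads of the JVM it runs in. The class name `ThreadDumpSketch` and the output format are assumptions made for illustration; how a dump would be transported from the NN to ZKFC is not covered here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {
    /** Builds a plain-text dump of all live threads in the current JVM. */
    static String captureThreadDump() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Request lock details only where the JVM supports collecting them.
        ThreadInfo[] infos = bean.dumpAllThreads(
                bean.isObjectMonitorUsageSupported(),
                bean.isSynchronizerUsageSupported());

        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : infos) {
            sb.append('"').append(info.getThreadName()).append('"')
              .append(" state=").append(info.getThreadState());
            // getLockName() is non-null when the thread is blocked or waiting on a lock.
            if (info.getLockName() != null) {
                sb.append(" waiting on ").append(info.getLockName());
            }
            sb.append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(captureThreadDump());
    }
}
```

A dump like this shows which threads are waiting on which lock, which is the information needed to tell a NN that is merely busy behind its global lock from one that is actually stuck.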

Any comments?

Attachments

- HDFS-6184.patch: 25/Aug/14 22:16, 10 kB, Ming Ma
- HDFS-6184-2.patch: 05/Jan/15 23:54, 10 kB, Ming Ma
- HDFS-6184-3.patch: 08/May/15 01:43, 10 kB, Ming Ma
- HDFS-6184-4.patch: 08/May/15 04:52, 10 kB, Ming Ma
- HDFS-6184-5.patch: 12/May/15 16:28, 10 kB, Ming Ma
- HDFS-6184-6.patch: 12/May/15 21:26, 10 kB, Ming Ma

People

Assignee: Ming Ma
Reporter: Ming Ma
Votes: 0
Watchers: 8

Dates

Created: 02/Apr/14 06:59
Updated: 30/Aug/16 01:42
Resolved: 13/May/15 02:41