[HDFS-9311] Support optional offload of NameNode HA service health checks to a separate RPC server. - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.8.0, 3.0.0-alpha1
Component/s: ha, namenode
Labels: None
Target Version/s: 2.8.0
Hadoop Flags: Reviewed
Release Note:

There is now support for offloading HA health check RPC activity to a separate RPC server endpoint running within the NameNode process. This may improve reliability of HA health checks and prevent spurious failovers in highly overloaded conditions. For more details, please refer to the hdfs-default.xml documentation for properties dfs.namenode.lifeline.rpc-address, dfs.namenode.lifeline.rpc-bind-host and dfs.namenode.lifeline.handler.count.

Description

When a NameNode is overwhelmed with load, its RPC handler pools (both client-facing and service-facing) can become exhausted. Eventually, this blocks the health check RPC issued from ZKFC, which triggers a failover. Depending on the fencing configuration, the former active NameNode may be killed. In an overloaded situation, the new active NameNode is likely to suffer the same fate, because client load patterns don't change after the failover. This can degenerate into flapping between the two NameNodes without real recovery. A NameNode that has been killed by fencing must also transition through safe mode after it restarts, which further delays recovery.

This issue proposes a separate, optional RPC server at the NameNode for isolating the HA health checks. These health checks are lightweight operations that do not suffer from contention issues on the namesystem lock or other shared resources. Isolating the RPC handlers is sufficient to avoid this situation.
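
The three properties named in the release note could be set in hdfs-site.xml roughly as follows. This is a minimal sketch, not a tested configuration: the nameservice ID `mycluster`, the NameNode IDs `nn1` and `nn2`, the host names, port 8050 and the handler count of 10 are all assumptions chosen for illustration. It also assumes that, as with other per-NameNode address keys in an HA deployment, the lifeline address key is suffixed with the nameservice ID and NameNode ID. The hdfs-default.xml documentation remains the authoritative reference.

```xml
<!-- Hypothetical hdfs-site.xml fragment; IDs, hosts, port and count are illustrative assumptions. -->
<configuration>
  <!-- Separate RPC endpoint for HA health checks, one entry per NameNode. -->
  <property>
    <name>dfs.namenode.lifeline.rpc-address.mycluster.nn1</name>
    <value>nn1.example.com:8050</value>
  </property>
  <property>
    <name>dfs.namenode.lifeline.rpc-address.mycluster.nn2</name>
    <value>nn2.example.com:8050</value>
  </property>
  <!-- Interface the lifeline RPC server binds to; 0.0.0.0 listens on all interfaces. -->
  <property>
    <name>dfs.namenode.lifeline.rpc-bind-host</name>
    <value>0.0.0.0</value>
  </property>
  <!-- Size of the handler pool dedicated to the lifeline RPC server. -->
  <property>
    <name>dfs.namenode.lifeline.handler.count</name>
    <value>10</value>
  </property>
</configuration>
```

Because the feature is described as optional, leaving dfs.namenode.lifeline.rpc-address unset should keep health checks on the existing RPC servers.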

Attachments

- HDFS-9311.001.patch, 27/Oct/15 00:04, 39 kB, Chris Nauroth
- HDFS-9311.002.patch, 27/Oct/15 19:23, 38 kB, Chris Nauroth
- HDFS-9311.003.patch, 27/Oct/15 20:50, 38 kB, Chris Nauroth

Issue Links

is related to: HDFS-9239 DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness (Resolved)

People

Assignee: Chris Nauroth

Reporter: Chris Nauroth

Votes: 0

Watchers: 13

Dates

Created: 27/Oct/15 00:02

Updated: 30/Aug/16 01:24

Resolved: 28/Oct/15 06:20