[HDFS-14652] HealthMonitor connection retry times should be configurable - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.3.0
Component/s: None
Labels: None

Description

On our production HDFS cluster, bursts of requests from some clients filled the kernel TCP queue on the NameNode host. Because "net.ipv4.tcp_syn_retries" is set to 1 in our environment, the ZKFC HealthMonitor got a connection error after 3 seconds, like this:

WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at nn_host_name/ip_address:port: Call From zkfc_host_name/ip to nn_host_name:port failed on connection exception: java.net.ConnectException: Connection timed out; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

This error caused a failover and affected the availability of that cluster. We fixed the issue by raising the kernel parameter net.ipv4.tcp_syn_retries to 6.
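
For reference, the 3-second figure can be reproduced with a little arithmetic. The sketch below assumes the usual Linux behavior of a 1-second initial SYN retransmission timeout that doubles on every retry. That assumption is not stated in this issue, and the class and method names are made up for illustration.

```java
public class SynRetryTimeout {
    // Assumption: initial SYN timeout of 1 second, doubled after each retransmission.
    // The connect attempt fails once the wait after the last retry has expired.
    static long totalSeconds(int synRetries) {
        long total = 0;
        long wait = 1;
        // One initial SYN plus synRetries retransmissions.
        for (int attempt = 0; attempt <= synRetries; attempt++) {
            total += wait;
            wait *= 2;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("tcp_syn_retries=1 -> " + totalSeconds(1) + "s");
        System.out.println("tcp_syn_retries=6 -> " + totalSeconds(6) + "s");
    }
}
```

Under that assumption, a value of 1 gives 1 + 2 = 3 seconds before the connect fails, while a value of 6 gives 127 seconds, which is why the larger value rides out a short burst.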

While working on this issue, we found that the connection retry count (ipc.client.connect.max.retries) of the health monitor is hard-coded to 1. I think it should be configurable, so that if we don't want the health monitor to be so sensitive, we can change its behavior through this configuration.
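
A minimal sketch of the idea, not the actual patch: read the retry count from a dedicated key and fall back to the current value of 1 when it is unset. It uses java.util.Properties as a stand-in for Hadoop's Configuration so it runs on its own. The key name ha.health-monitor.rpc.connect.max.retries and the class name are assumptions for illustration; only ipc.client.connect.max.retries comes from the description above.

```java
import java.util.Properties;

public class HealthMonitorRetryConfig {
    // Hypothetical key for illustration; not taken from the issue text.
    static final String HM_RETRIES_KEY = "ha.health-monitor.rpc.connect.max.retries";
    // Keeps today's behavior when the key is not set.
    static final int HM_RETRIES_DEFAULT = 1;
    // Existing IPC client key mentioned in the description.
    static final String IPC_RETRIES_KEY = "ipc.client.connect.max.retries";

    // Returns a copy of the base settings in which the IPC retry count
    // is taken from the health-monitor key instead of a hard-coded 1.
    static Properties forHealthMonitor(Properties base) {
        Properties copy = new Properties();
        copy.putAll(base);
        String raw = base.getProperty(HM_RETRIES_KEY);
        int retries = (raw == null) ? HM_RETRIES_DEFAULT : Integer.parseInt(raw.trim());
        copy.setProperty(IPC_RETRIES_KEY, Integer.toString(retries));
        return copy;
    }

    public static void main(String[] args) {
        Properties site = new Properties();
        site.setProperty(HM_RETRIES_KEY, "3");
        Properties hmConf = forHealthMonitor(site);
        System.out.println(IPC_RETRIES_KEY + " = " + hmConf.getProperty(IPC_RETRIES_KEY));
    }
}
```

Defaulting to 1 means existing clusters see no change unless an operator opts in to a less sensitive health monitor.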

Attachments

- HDFS-14652.003.patch, 06/Aug/19 03:40, 8 kB, Chen Zhang
- HDFS-14652-001.patch, 15/Jul/19 15:53, 7 kB, Chen Zhang
- HDFS-14652-002.patch, 16/Jul/19 03:09, 7 kB, Chen Zhang

People

Assignee: Chen Zhang

Reporter: Chen Zhang

Votes: 0

Watchers: 5

Dates

Created: 15/Jul/19 15:50

Updated: 02/Oct/19 17:15

Resolved: 06/Aug/19 22:25