Hadoop HDFS / HDFS-14652

HealthMonitor connection retry times should be configurable

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.3.0
    • Component/s: None
    • Labels:
      None

      Description

      On our production HDFS cluster, bursts of client requests filled the kernel TCP queue on the NameNode host. Because "net.ipv4.tcp_syn_retries" is set to 1 in our environment, after 3 seconds the HealthMonitor in the ZooKeeper Failover Controller (ZKFC) got a connection error like this:

      WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at nn_host_name/ip_address:port: Call From zkfc_host_name/ip to nn_host_name:port failed on connection exception: java.net.ConnectException: Connection timed out; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
      

      This error triggered a failover and affected the availability of that cluster. We fixed the issue by raising the kernel parameter net.ipv4.tcp_syn_retries to 6.
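
      As a rough check on the 3-second figure above, the following sketch adds up the SYN wait times. It assumes the Linux initial SYN retransmission timeout is 1 second and doubles on each retry; the class and method names are illustrative only and do not come from Hadoop or the kernel.

```java
// Sketch: estimate how long connect() waits before giving up, assuming an
// initial SYN timeout of 1 second that doubles after every retransmission.
final class SynTimeoutEstimate {

    /** Seconds waited across the initial SYN plus synRetries retransmissions. */
    static long totalSeconds(int synRetries) {
        long total = 0;
        long wait = 1; // assumed initial timeout, in seconds
        for (int attempt = 0; attempt <= synRetries; attempt++) {
            total += wait; // wait for a SYN-ACK after this attempt
            wait *= 2;     // exponential backoff before the next attempt
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("tcp_syn_retries=1 -> " + totalSeconds(1) + "s");
        System.out.println("tcp_syn_retries=6 -> " + totalSeconds(6) + "s");
    }
}
```

      Under that assumption, one retry gives 1 + 2 = 3 seconds, which matches the timeout we observed, and six retries give 127 seconds.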

      While working on this issue, we found that the HealthMonitor's connection retry count (ipc.client.connect.max.retries) is hard-coded to 1. I think it should be configurable, so that if we don't want the HealthMonitor to be so sensitive, we can change its behavior through this configuration.
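
      A minimal sketch of the proposed behavior follows. It is not the attached patch: java.util.Properties stands in for Hadoop's Configuration so the example is self-contained, and the HealthMonitor-specific key name and the class name are hypothetical. Only ipc.client.connect.max.retries and the current hard-coded value of 1 come from the description above.

```java
import java.util.Properties;

// Sketch only: java.util.Properties is a stand-in for Hadoop's Configuration.
final class HealthMonitorRetryConfig {

    // Hypothetical key name, chosen for illustration.
    static final String HM_MAX_RETRIES_KEY =
        "ha.health-monitor.rpc.connect.max.retries";

    // Default keeps today's behavior, where the value is hard-coded to 1.
    static final int HM_MAX_RETRIES_DEFAULT = 1;

    // The IPC client setting named in this issue.
    static final String IPC_MAX_RETRIES_KEY = "ipc.client.connect.max.retries";

    /**
     * Returns a copy of conf in which the IPC retry count is taken from the
     * HealthMonitor-specific key, falling back to the default when unset.
     * The caller's configuration is left untouched.
     */
    static Properties forHealthMonitor(Properties conf) {
        Properties copy = new Properties();
        copy.putAll(conf);
        String raw = conf.getProperty(HM_MAX_RETRIES_KEY);
        int retries = (raw == null)
            ? HM_MAX_RETRIES_DEFAULT
            : Integer.parseInt(raw.trim());
        copy.setProperty(IPC_MAX_RETRIES_KEY, Integer.toString(retries));
        return copy;
    }
}
```

      Overriding the value on a copy means the cluster-wide ipc.client.connect.max.retries setting used by other clients is unaffected, and operators who leave the new key unset see no change in behavior.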

        Attachments

        1. HDFS-14652.003.patch
          8 kB
          Chen Zhang
        2. HDFS-14652-001.patch
          7 kB
          Chen Zhang
        3. HDFS-14652-002.patch
          7 kB
          Chen Zhang

            People

            • Assignee: Chen Zhang (zhangchen)
            • Reporter: Chen Zhang (zhangchen)