Hadoop HDFS / HDFS-9311

Support optional offload of NameNode HA service health checks to a separate RPC server.

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: ha, namenode
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed
    • Release Note:
      There is now support for offloading HA health check RPC activity to a separate RPC server endpoint running within the NameNode process. This may improve reliability of HA health checks and prevent spurious failovers in highly overloaded conditions. For more details, please refer to the hdfs-default.xml documentation for properties dfs.namenode.lifeline.rpc-address, dfs.namenode.lifeline.rpc-bind-host and dfs.namenode.lifeline.handler.count.

      Description

      When a NameNode is overwhelmed with load, the RPC handler pools (both client-facing and service-facing) can be exhausted. Eventually, this blocks the health check RPC issued from ZKFC, which triggers a failover. Depending on fencing configuration, the former active NameNode may be killed. In an overloaded situation, the new active NameNode is likely to suffer the same fate, because client load patterns don't change after the failover. This can degenerate into flapping between the two NameNodes without real recovery. If a NameNode has been killed by fencing, it must also transition through safe mode after restart, which further delays recovery.

      This issue proposes a separate, optional RPC server at the NameNode for isolating the HA health checks. These health checks are lightweight operations that do not suffer from contention issues on the namesystem lock or other shared resources. Isolating the RPC handlers is sufficient to avoid this situation.
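      The effect of isolating the handlers can be illustrated with a small, self-contained sketch. It is not the NameNode's RPC server code: it uses plain java.util.concurrent thread pools as stand-ins for the handler pools, and the class name, method names, pool sizes, and timeout are all hypothetical. It only shows that a trivial probe queued behind a saturated pool times out, while the same probe on a dedicated pool returns promptly.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; none of these names come from Hadoop.
public class LifelineIsolationSketch {

    /** Submits a trivial probe and reports whether it completed within the timeout. */
    static boolean healthCheck(ExecutorService handlers, long timeoutMillis) throws Exception {
        Future<Boolean> probe = handlers.submit(() -> true);
        try {
            return probe.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            probe.cancel(true); // the probe never reached a handler thread in time
            return false;
        }
    }

    /** Occupies every handler thread with a call that blocks until the latch is released. */
    static void saturate(ExecutorService handlers, int handlerCount, CountDownLatch release)
            throws InterruptedException {
        CountDownLatch started = new CountDownLatch(handlerCount);
        for (int i = 0; i < handlerCount; i++) {
            handlers.submit(() -> {
                started.countDown();
                try {
                    release.await(); // simulates a slow call holding the handler
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        started.await(); // wait until all handler threads are actually busy
    }

    public static void main(String[] args) throws Exception {
        int handlerCount = 4; // assumed size of the shared pool
        ExecutorService mainHandlers = Executors.newFixedThreadPool(handlerCount);
        ExecutorService lifelineHandlers = Executors.newFixedThreadPool(1); // dedicated pool
        CountDownLatch release = new CountDownLatch(1);
        try {
            saturate(mainHandlers, handlerCount, release);
            System.out.println("shared pool healthy:   " + healthCheck(mainHandlers, 200));
            System.out.println("lifeline pool healthy: " + healthCheck(lifelineHandlers, 200));
        } finally {
            release.countDown(); // unblock the simulated slow calls
            mainHandlers.shutdown();
            lifelineHandlers.shutdown();
        }
    }
}
```

      Run as written, the probe on the shared pool reports false and the probe on the dedicated pool reports true. The sketch relies on the probe being cheap and not needing any resource held by the blocked calls, which mirrors the description's point that the health checks do not contend on the namesystem lock.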

        Attachments

        1. HDFS-9311.003.patch
          38 kB
          Chris Nauroth
        2. HDFS-9311.002.patch
          38 kB
          Chris Nauroth
        3. HDFS-9311.001.patch
          39 kB
          Chris Nauroth

              People

              • Assignee:
                Chris Nauroth (cnauroth)
              • Reporter:
                Chris Nauroth (cnauroth)
              • Votes:
                0
              • Watchers:
                13