Description
The namenode processes RPC requests from clients that are reading/writing to files as well as heartbeats/block reports from datanodes.
Sometimes, for various reasons (Java GC runs, inconsistent performance of the NFS filer that stores the HDFS transaction logs, etc.), the namenode encounters transient slowness. For example, if the device that stores the HDFS transaction logs becomes sluggish, the namenode's ability to process RPCs slows down. During this time, RPCs from clients and RPCs from datanodes suffer equally. If the underlying problem becomes worse, the NN's ability to process heartbeats from a DN is severely impacted, causing the NN to declare the DN dead. The NN then starts replicating the blocks that used to reside on the now-declared-dead datanode, which adds extra load to the NN. When the now-declared-dead datanode finally re-establishes contact with the NN, it sends a block report. Block report processing on the NN is another heavyweight activity, causing still more load on the already overloaded namenode.
My proposal is that the NN should try its best to continue processing RPCs from datanodes and give lower priority to serving client requests. The datanode RPCs are integral to the consistency and performance of the Hadoop file system, and it is better to protect them at all costs. This will ensure that the NN recovers from the hiccup much faster than it does now.
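To make the proposal concrete, here is a minimal sketch of one way a handler could prefer datanode calls over client calls. It is only an illustration under stated assumptions: `PrioritizedCallQueue`, `Source`, and `Call` are hypothetical names invented for this example and are not part of the Hadoop RPC server.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical two-level call queue; NOT part of the Hadoop RPC API. */
class PrioritizedCallQueue {
    /** Origin of a queued call. */
    enum Source { DATANODE, CLIENT }

    /** A queued call: who sent it and a label for the operation. */
    static final class Call {
        final Source source;
        final String name;

        Call(Source source, String name) {
            this.source = source;
            this.name = name;
        }
    }

    private final Queue<Call> datanodeCalls = new ArrayDeque<>();
    private final Queue<Call> clientCalls = new ArrayDeque<>();

    /** Enqueue a call on the queue that matches its origin. */
    synchronized void add(Call call) {
        if (call.source == Source.DATANODE) {
            datanodeCalls.add(call);
        } else {
            clientCalls.add(call);
        }
    }

    /**
     * Return the next call to serve, or null if both queues are empty.
     * Datanode calls are always drained before any client call.
     */
    synchronized Call poll() {
        Call next = datanodeCalls.poll();
        return next != null ? next : clientCalls.poll();
    }

    public static void main(String[] args) {
        PrioritizedCallQueue queue = new PrioritizedCallQueue();
        queue.add(new Call(Source.CLIENT, "create"));
        queue.add(new Call(Source.DATANODE, "heartbeat"));
        queue.add(new Call(Source.CLIENT, "getBlockLocations"));
        queue.add(new Call(Source.DATANODE, "blockReport"));

        // Datanode calls come out first, each group in arrival order.
        for (Call c = queue.poll(); c != null; c = queue.poll()) {
            System.out.println(c.source + " " + c.name);
        }
    }
}
```

Strict priority like this can starve clients while datanode traffic is heavy. That matches the "protect datanode RPCs at all costs" stance above, but a real implementation might prefer a weighted scheme or separate handler pools.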
Attachments
Issue Links
- blocks
  - HDFS-1156 Make Balancer run on the service port (Open)
- is blocked by
  - HADOOP-6469 Multiple RPC Servers using same object instance (Resolved)
- is related to
  - HDFS-1357 HFTP traffic served by DataNode shouldn't use service port on NameNode (Closed)
  - HADOOP-6764 Add number of reader threads and queue length as configuration parameters in RPC.getServer (Closed)
- relates to
  - HDFS-1321 If service port and main port are the same, there is no clear log message explaining the issue. (Resolved)
  - HDFS-1291 Delay start of client RPC server until out of safemode (Open)
  - HDFS-1392 Improve namenode scalability by prioritizing datanode heartbeats over block reports (Resolved)
  - HBASE-2782 QOS for META table access (Closed)