[HADOOP-9640] RPC Congestion Control with FairCallQueue - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.2.0, 3.0.0-alpha1
Fix Version/s: None
Component/s: None
Labels:
- hdfs
- qos
- rpc

Release Note:
Enable optional RPC-level priority to combat congestion and make request latencies more consistent.

Description

For an easy-to-read summary see: http://www.ebaytechblog.com/2014/08/21/quality-of-service-in-hadoop/

In several production Hadoop cluster incidents, the NameNode became overloaded and stopped responding.

We can improve quality of service for users during NameNode peak loads by replacing the FIFO call queue with a Fair Call Queue. (This plan supersedes rpc-congestion-control-draft-plan.)

Excerpted from the communication of one incident, “The map task of a user was creating huge number of small files in the user directory. Due to the heavy load on NN, the JT also was unable to communicate with NN...The cluster became responsive only once the job was killed.”

Excerpted from the communication of another incident, “Namenode was overloaded by GetBlockLocation requests (Correction: should be getFileInfo requests. the job had a bug that called getFileInfo for a nonexistent file in an endless loop). All other requests to namenode were also affected by this and hence all jobs slowed down. Cluster almost came to a grinding halt…Eventually killed jobtracker to kill all jobs that are running.”

Excerpted from HDFS-945, “We've seen defective applications cause havoc on the NameNode, for e.g. by doing 100k+ 'listStatus' on very large directories (60k files) etc.”
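
To illustrate why a single FIFO call queue lets one heavy caller starve everyone else, here is a minimal sketch of a per-caller round-robin queue. It is not the FairCallQueue implementation from the attached patches or plans; the class name `ToyFairQueue`, its methods, and the per-caller round-robin policy are all assumptions made for illustration only.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/**
 * Toy sketch only (hypothetical, not Hadoop code): serves pending calls
 * round-robin across callers instead of in global arrival order.
 */
class ToyFairQueue {
    // One FIFO sub-queue of pending calls per caller.
    private final Map<String, Queue<String>> perCaller = new HashMap<>();
    // Callers that currently have pending calls, in service order.
    private final Queue<String> rotation = new ArrayDeque<>();

    /** Enqueue a call on behalf of the given caller. */
    void put(String caller, String call) {
        Queue<String> q = perCaller.computeIfAbsent(caller, k -> new ArrayDeque<>());
        if (q.isEmpty()) {
            rotation.add(caller); // caller joins (or rejoins) the rotation
        }
        q.add(call);
    }

    /** Return the next call to serve, or null if nothing is pending. */
    String take() {
        String caller = rotation.poll();
        if (caller == null) {
            return null;
        }
        Queue<String> q = perCaller.get(caller);
        String call = q.poll();
        if (!q.isEmpty()) {
            rotation.add(caller); // still has work: go to the back of the line
        }
        return call;
    }
}
```

If a caller `heavy` enqueues five calls `h1` to `h5` and a caller `light` then enqueues `l1`, a plain FIFO queue would serve `l1` sixth. This sketch serves `h1`, then `l1`, then the remaining `heavy` calls, so the light caller waits behind one call instead of five.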

Attachments

- rpc-congestion-control-draft-plan.pdf (20/Jun/13 19:19, 488 kB, Xiaobo Peng)
- NN-denial-of-service-updated-plan.pdf (03/Dec/13 22:36, 2.76 MB, Chris Li)
- MinorityMajorityPerformance.pdf (05/Dec/13 01:25, 72 kB, Chris Li)
- FairCallQueue-PerformanceOnCluster.pdf (23/Apr/14 23:21, 694 kB, Chris Li)
- faircallqueue7_with_runtime_swapping.patch (23/Jan/14 20:57, 134 kB, Chris Li)
- faircallqueue6.patch (16/Dec/13 19:30, 74 kB, Chris Li)
- faircallqueue5.patch (09/Dec/13 21:10, 73 kB, Chris Li)
- faircallqueue4.patch (09/Dec/13 21:00, 74 kB, Chris Li)
- faircallqueue3.patch (06/Dec/13 21:19, 73 kB, Chris Li)
- faircallqueue2.patch (06/Dec/13 19:55, 73 kB, Chris Li)
- faircallqueue.patch (18/Nov/13 22:48, 41 kB, Chris Li)

Issue Links

depends upon

- HADOOP-10508 RefreshCallQueue fails when authorization is enabled (Closed)

is related to

- HADOOP-9194 RPC Support for QoS (Closed)
- HDFS-945 Make NameNode resilient to DoS attacks (malicious or otherwise) (Resolved)

relates to

- HDFS-237 Better handling of dfsadmin command when namenode is slow (Resolved)
- HADOOP-10598 Support configurable RPC fair share (Open)
- HADOOP-10599 Support prioritization of DN RPCs over client RPCs (Open)
- HADOOP-13029 Have FairCallQueue try all lower priority sub queues before backoff (Open)
- HADOOP-15481 Emit FairCallQueue stats as metrics (Resolved)
- HADOOP-10286 Allow RPCCallBenchmark to benchmark calls by different users (Patch Available)

Sub-Tasks

There are no Sub-Tasks for this issue.

People

Assignee: Chris Li

Reporter: Xiaobo Peng

Votes: 5

Watchers: 91

Dates

Created: 11/Jun/13 22:27

Updated: 02/Oct/19 17:14

Resolved: 08/Feb/19 17:40