Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- 0.16.0
- None
- None
Description
This jira is intended to enhance IPC's scalability and robustness.
Currently an IPC server can easily hang because of a disk failure or garbage collection, during which it cannot respond to clients promptly. This has caused many dropped calls and delayed responses, so many running applications fail on timeout. Conversely, if busy clients send many requests to the server in a short period, or too many clients communicate with the server simultaneously, the server may be swamped by requests and become unresponsive.
The proposed changes aim to:
- Provide better client/server coordination.
- The server should be able to throttle clients during a burst of requests.
- A slow client should not prevent the server from serving other clients.
- A temporarily hung server should not cause catastrophic failures in clients.
- Client and server should each detect remote-side failures. Examples of failures include: (1) the remote host has crashed; (2) the remote host has crashed and then rebooted; (3) the remote process has crashed or been shut down by an operator.
- Fairness. Each client should be able to make progress.
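One way to picture the throttling goal above is a bounded call queue: when the queue is full, the server stops accepting more calls instead of buffering without limit, which pushes back on the bursty client. The sketch below is only an illustration under stated assumptions; `BoundedCallQueue`, `Call`, and `tryEnqueue` are hypothetical names and are not taken from the Hadoop IPC code or from the patches attached to this issue.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch: a bounded call queue that throttles bursty clients. */
public class BoundedCallQueue {

    /** Hypothetical call record; not the actual Hadoop IPC call class. */
    public static final class Call {
        public final String clientId;
        public final int callId;

        public Call(String clientId, int callId) {
            this.clientId = clientId;
            this.callId = callId;
        }
    }

    private final BlockingQueue<Call> queue;

    public BoundedCallQueue(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /**
     * Non-blocking enqueue. Returns false when the queue is full, so the
     * caller (for example a socket reader) can stop reading from that
     * client until handlers have drained some calls.
     */
    public boolean tryEnqueue(Call call) {
        return queue.offer(call);
    }

    /** Handler side: returns the next queued call, or null if none is waiting. */
    public Call poll() {
        return queue.poll();
    }

    public int size() {
        return queue.size();
    }

    public static void main(String[] args) {
        BoundedCallQueue q = new BoundedCallQueue(2);
        int accepted = 0;
        int rejected = 0;
        // A busy client sends a burst of five calls into a queue of size two.
        for (int i = 0; i < 5; i++) {
            if (q.tryEnqueue(new Call("busy-client", i))) {
                accepted++;
            } else {
                rejected++;
            }
        }
        System.out.println("accepted=" + accepted + " rejected=" + rejected);

        // Once a handler drains one call, another client can make progress.
        q.poll();
        System.out.println("accepts again: " + q.tryEnqueue(new Call("other-client", 0)));
    }
}
```

A single shared queue like this addresses throttling only. Fairness between clients would need something more, such as per-client limits, which this sketch does not attempt.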
Attachments
Issue Links
- depends upon
  - HADOOP-2870 Datanode.shutdown() and Namenode.stop() should close all rpc connections (Closed)
- incorporates
  - HADOOP-2909 Improve IPC idle connection management (Closed)
  - HADOOP-2975 IPC server should not allocate a buffer for each request (Resolved)
  - HADOOP-2188 RPC should send a ping rather than use client timeouts (Closed)
  - HADOOP-2910 Throttle IPC Client/Server during bursts of requests or server slowdown (Closed)