This JIRA is intended to improve the scalability and robustness of IPC.
Currently, an IPC server can easily hang because of a disk failure or garbage collection, during which it cannot respond to clients promptly. This has caused many dropped calls and delayed responses, so many running applications fail on timeouts. On the other hand, if busy clients send many requests to the server in a short period of time, or if too many clients communicate with the server simultaneously, the server may be swamped by requests and become unresponsive.
The proposed changes aim to
- provide better client/server coordination
- The server should be able to throttle clients during a burst of requests.
- A slow client should not prevent the server from serving other clients.
- A temporarily hung server should not cause catastrophic failures in clients.
- The client and server should each detect remote-side failures. Examples of failures include: (1) the remote host has crashed; (2) the remote host has crashed and then rebooted; (3) the remote process has crashed or been shut down by an operator.
- Fairness. Each client should be able to make progress.
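To make the throttling and fairness goals concrete, here is a minimal sketch of one possible approach. It is not the design proposed in this JIRA and not existing IPC code. It assumes each client can be identified by a string ID and each call represented by a string. The class name PerClientCallQueue and the parameter maxPendingPerClient are hypothetical. Each client gets a bounded queue (throttling), and pending clients are served in round-robin order (fairness).

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical sketch: bounded per-client queues served round-robin. */
class PerClientCallQueue {
    private final int maxPendingPerClient;
    // Insertion-ordered map; the first entry is the client served next.
    // Invariant: every queue stored in the map is non-empty.
    private final LinkedHashMap<String, Queue<String>> pending = new LinkedHashMap<>();

    PerClientCallQueue(int maxPendingPerClient) {
        if (maxPendingPerClient < 1) {
            throw new IllegalArgumentException("maxPendingPerClient must be >= 1");
        }
        this.maxPendingPerClient = maxPendingPerClient;
    }

    /**
     * Queues a call for the given client. Returns false when the client
     * already has the maximum number of pending calls, which signals the
     * caller to make that client back off (throttling).
     */
    synchronized boolean offer(String clientId, String call) {
        Queue<String> q = pending.computeIfAbsent(clientId, k -> new ArrayDeque<>());
        if (q.size() >= maxPendingPerClient) {
            return false;
        }
        q.add(call);
        return true;
    }

    /**
     * Returns the next call to handle, or null if nothing is pending.
     * The client at the head of the map is served once and then moved
     * to the tail if it still has calls waiting (round-robin fairness).
     */
    synchronized String poll() {
        Iterator<Map.Entry<String, Queue<String>>> it = pending.entrySet().iterator();
        if (!it.hasNext()) {
            return null;
        }
        Map.Entry<String, Queue<String>> head = it.next();
        String clientId = head.getKey();
        Queue<String> q = head.getValue();
        it.remove();
        String call = q.poll();
        if (!q.isEmpty()) {
            pending.put(clientId, q); // re-inserted at the tail
        }
        return call;
    }
}
```

With a limit of two pending calls per client, a third call from a busy client is rejected, and a call from a quiet client is served before the busy client's second call even though it arrived later. A real server would also need timeouts and failure detection, which this sketch does not attempt.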