Details
Type: Improvement
Status: Closed
Priority: Minor
Resolution: Implemented
Description
We have seen a couple of situations where a client fetching a large row caused the whole server to go down, because of long GC pauses or an out-of-memory error.
This should be easily avoidable if the client uses a Scan instead of a Get, and/or uses batching to reduce the response size. But this seems difficult to enforce. Moreover,
once in a while there may be genuine outliers or bad clients that cause such large requests.
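To illustrate what batching buys, here is a minimal sketch in plain Java of splitting one wide row into bounded chunks, so that no single response carries the whole row. `ColumnBatcher` and `split` are hypothetical names used only for illustration; in the HBase client, `Scan.setBatch` is the setting that limits how many cells come back per result.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase: shows the idea of returning a wide
// row in bounded batches instead of one large response.
class ColumnBatcher {
    /** Splits the cells of one row into consecutive batches of at most batchSize cells. */
    static <T> List<List<T>> split(List<T> cells, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < cells.size(); start += batchSize) {
            int end = Math.min(start + batchSize, cells.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(cells.subList(start, end)));
        }
        return batches;
    }
}
```

With a batch size of 2, a row of five cells would be returned as three batches, the last one holding a single cell.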
We need to handle such situations gracefully, and not have the RS restart for things that can be prevented. The proposal here is to enforce a maximum response size
at the server end, so that we are not at the mercy of the client's good behavior to keep the server running.
We already log large responses. But if the response is too large, it just kills the server before anything is logged, and the only way to find out what happened is to go through the heap dump.
More importantly, our availability/reliability numbers will go down because the whole region/regionserver fails instead of just the single bad request.
I think it will be useful for the server to maintain a maximum response size that it will serve for a single request. Something large like 2-3G, so that normal operations
are not affected. If a single get/scan operation exceeds that size, we will just throw an exception for the request. This will
a) prevent the RS from going on and on until it runs out of memory, and
b) give the clients, and us, a cleaner way to see what the problem is.
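The proposal above could be sketched roughly as follows: the server keeps a running byte count while it builds a response and fails that one request as soon as the count passes a configured limit. This is only an illustration under assumed names; `ResponseSizeLimiter` and `ResponseTooLargeException` are hypothetical and are not HBase classes, and the real change may account for size differently.

```java
// Hypothetical exception type, used only in this sketch.
class ResponseTooLargeException extends RuntimeException {
    ResponseTooLargeException(String message) {
        super(message);
    }
}

// Hypothetical per-request accumulator: one instance per get/scan call.
class ResponseSizeLimiter {
    private final long maxResponseBytes;
    private long accumulatedBytes = 0;

    ResponseSizeLimiter(long maxResponseBytes) {
        if (maxResponseBytes <= 0) {
            throw new IllegalArgumentException("maxResponseBytes must be positive");
        }
        this.maxResponseBytes = maxResponseBytes;
    }

    /** Accounts for one more cell value; throws once the running total exceeds the limit. */
    void add(byte[] cellValue) {
        accumulatedBytes += cellValue.length;
        if (accumulatedBytes > maxResponseBytes) {
            // Fail only this request, with a message that names the sizes involved.
            throw new ResponseTooLargeException(
                "Response of " + accumulatedBytes + " bytes exceeds the limit of "
                    + maxResponseBytes + " bytes");
        }
    }

    long accumulatedBytes() {
        return accumulatedBytes;
    }
}
```

Throwing from inside the accumulation loop means the server stops collecting cells at the point the limit is crossed, rather than materializing the full response first, and the exception message gives both the client and the operator the sizes involved.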