Details
Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 1.2.0, 1.3.0, 1.4.0, 1.5.0, 2.0.0
Fix Version/s: None
Component/s: None
Labels: Retry, RPC, cascading failure, region server
Description
I recently discovered that the fix for HBASE-14598 does not completely resolve the issue. That fix addressed two aspects. First, when a Scan/Get RPC attempts to allocate a very large array that could cause an out-of-memory (OOM) error, the code now checks the array size before allocation and throws an exception directly, which keeps the region server from crashing and avoids possible cascading failures. Second, the developer intended the client to stop retrying after such a failure, because retrying cannot resolve it.
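To make the first aspect concrete, here is a minimal sketch of a size check performed before a buffer grows. It is not the HBase implementation: `GuardedBuffer`, `DoNotRetrySketchException`, and the `maxSize` parameter are hypothetical names chosen for illustration, and the growth policy (double the capacity, capped at the limit) is an assumption.

```java
import java.io.IOException;
import java.nio.ByteBuffer;

/** Hypothetical stand-in for an exception type that tells the client not to retry. */
class DoNotRetrySketchException extends IOException {
    DoNotRetrySketchException(String msg) {
        super(msg);
    }
}

/** Simplified growable buffer; an illustration only, not HBase's ByteBufferOutputStream. */
class GuardedBuffer {
    private final int maxSize; // assumed upper bound on the backing array
    private ByteBuffer buf;

    GuardedBuffer(int initialSize, int maxSize) {
        this.maxSize = maxSize;
        this.buf = ByteBuffer.allocate(initialSize);
    }

    void write(byte[] data) throws DoNotRetrySketchException {
        // Compute the required size as a long so the sum cannot overflow an int.
        long required = (long) buf.position() + data.length;
        if (required > maxSize) {
            // Fail before allocating, instead of risking an OOM in the server.
            throw new DoNotRetrySketchException(
                "required " + required + " bytes exceeds limit " + maxSize);
        }
        if (required > buf.capacity()) {
            // Grow by doubling, but never beyond the limit.
            long doubled = (long) buf.capacity() * 2;
            int newCapacity = (int) Math.min(maxSize, Math.max(doubled, required));
            ByteBuffer grown = ByteBuffer.allocate(newCapacity);
            buf.flip();
            grown.put(buf);
            buf = grown;
        }
        buf.put(data);
    }

    int size() {
        return buf.position();
    }
}
```

With `new GuardedBuffer(4, 16)`, writing 3 bytes and then 10 bytes succeeds, while a further 4-byte write is rejected before any allocation takes place and leaves the buffer unchanged.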
However, the fix relies on throwing a DoNotRetryException. After ByteBufferOutputStream.write throws the DoNotRetryException, it propagates up the call stack (ByteBufferOutputStream.write --> encoder.write --> encodeCellsTo --> this.cellBlockBuilder.buildCellBlockStream --> call.setResponse) and is ultimately caught in CallRunner.run, which only prints a log message. Consequently, the DoNotRetryException is never sent back to the client. Instead, the client receives a generic exception for the failed RPC request and keeps retrying, which is not the desired behavior. I have reproduced this on a cluster.
The CallRunner code shows that a DoNotRetryException thrown from call.setResponse is swallowed by the error handler, which only writes a LOG message.
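The effect on the client can be sketched with a small, self-contained model. This is not the CallRunner source: every name below (`SwallowedExceptionSketch`, `NoRetryException`, `runSwallowing`, `runPropagating`, `clientAttempts`) is hypothetical, and the client retry loop is a deliberately simplified assumption about how a do-not-retry marker would be honored.

```java
import java.io.IOException;
import java.util.function.Supplier;

class SwallowedExceptionSketch {
    /** Hypothetical marker exception meaning "retrying will not help". */
    static class NoRetryException extends IOException {
        NoRetryException(String msg) {
            super(msg);
        }
    }

    /** Stand-in for building the response; always fails as if the cell block were too large. */
    static void setResponse() throws NoRetryException {
        throw new NoRetryException("cell block too large");
    }

    /** Models the reported behavior: the handler only logs, so the client sees a generic failure. */
    static IOException runSwallowing() {
        try {
            setResponse();
            return null;
        } catch (Throwable t) {
            System.err.println("logged only: " + t);
            return new IOException("call failed"); // the marker type is lost here
        }
    }

    /** Models the intended behavior: the marker exception reaches the client. */
    static IOException runPropagating() {
        try {
            setResponse();
            return null;
        } catch (NoRetryException e) {
            return e;
        }
    }

    /** Simplified client: retries until success, a NoRetryException, or maxRetries attempts. */
    static int clientAttempts(Supplier<IOException> rpc, int maxRetries) {
        int attempts = 0;
        while (attempts < maxRetries) {
            attempts++;
            IOException err = rpc.get();
            if (err == null || err instanceof NoRetryException) {
                break;
            }
        }
        return attempts;
    }
}
```

In this model, `clientAttempts(SwallowedExceptionSketch::runSwallowing, 3)` uses all 3 attempts, whereas `clientAttempts(SwallowedExceptionSketch::runPropagating, 3)` stops after 1, which is the behavior the original fix was aiming for.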