[HBASE-28589] Client Does not Stop Retrying after DoNotRetryException - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 1.2.0, 1.3.0, 1.4.0, 1.5.0, 2.0.0
Fix Version/s: None
Component/s: IPC/RPC
Labels: None

Tags:
Retry, RPC, cascading failure, region server

Description

I recently discovered that the fix for HBASE-14598 does not completely resolve the issue. That fix addressed two aspects. First, when a Scan/Get RPC attempts to allocate a very large array that could lead to an out-of-memory (OOM) error, the size of the array is checked before allocation and an exception is thrown directly, which prevents the region server from crashing and avoids possible cascading failures. Second, the developer intended the client to stop retrying after such a failure, since retrying cannot resolve it.
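To illustrate the first aspect, here is a minimal sketch of a size check performed before a buffer grows. It is not the HBase implementation: `GuardedBuffer`, `DoNotRetrySketchException`, and the `MAX_ARRAY_SIZE` ceiling are hypothetical names and values chosen only for this example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical stand-in for the non-retryable exception named in this report.
class DoNotRetrySketchException extends IOException {
    DoNotRetrySketchException(String message) {
        super(message);
    }
}

// Hypothetical growable buffer; not the real ByteBufferOutputStream.
final class GuardedBuffer {
    // Assumed ceiling for a single array; the limit HBase uses is not stated here.
    static final long MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8L;

    private ByteBuffer buf;

    GuardedBuffer(int initialCapacity) {
        buf = ByteBuffer.allocate(initialCapacity);
    }

    // Validate the requested size BEFORE allocating, so an oversized request
    // fails with a clear exception instead of an OutOfMemoryError.
    static int checkedCapacity(long used, long extra) throws DoNotRetrySketchException {
        long needed = used + extra;
        if (needed > MAX_ARRAY_SIZE) {
            throw new DoNotRetrySketchException(
                "requested " + needed + " bytes, limit is " + MAX_ARRAY_SIZE);
        }
        return (int) needed;
    }

    void write(byte[] data) throws DoNotRetrySketchException {
        if (buf.remaining() < data.length) {
            int needed = checkedCapacity(buf.position(), data.length);
            // Grow to at least the needed size, doubling when that is larger.
            long target = Math.min(Math.max(needed, 2L * buf.capacity()), MAX_ARRAY_SIZE);
            ByteBuffer bigger = ByteBuffer.allocate((int) target);
            buf.flip();
            bigger.put(buf);
            buf = bigger;
        }
        buf.put(data);
    }

    int size() {
        return buf.position();
    }
}
```

The check runs on `long` arithmetic so that the sum cannot overflow `int` before it is compared against the limit.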

However, the fix works by throwing a DoNotRetryException. After ByteBufferOutputStream.write throws the DoNotRetryException, it propagates up the call stack (ByteBufferOutputStream.write --> encoder.write --> encodeCellsTo --> this.cellBlockBuilder.buildCellBlockStream --> call.setResponse) and is ultimately caught in CallRunner.run, which only prints a log message. Consequently, the DoNotRetryException is never sent back to the client. Instead, the client receives a generic exception for the failed RPC request and continues retrying, which is not the desired behavior. I have reproduced this on a cluster.

In the CallRunner code, the error handler around call.setResponse catches the DoNotRetryException and only logs it, so the exception is swallowed and never reaches the client.
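The difference between swallowing and propagating the exception can be sketched as follows. This is a simplified model, not HBase code: `RetrySketch`, `Reply`, `ResponseBuilder`, `runSwallowing`, `runPropagating`, and `callWithRetries` are all hypothetical names, and the assumption that the client sees a plain "generic failure" reply is a simplification of what this report describes.

```java
import java.io.IOException;

final class RetrySketch {
    // Hypothetical stand-in for the non-retryable exception named in this report.
    static class NonRetryableException extends IOException {
        NonRetryableException(String message) {
            super(message);
        }
    }

    // What the client observes for one RPC attempt.
    static final class Reply {
        final boolean ok;
        final boolean doNotRetry;

        Reply(boolean ok, boolean doNotRetry) {
            this.ok = ok;
            this.doNotRetry = doNotRetry;
        }
    }

    // Stand-in for the response-building step (call.setResponse in the report).
    interface ResponseBuilder {
        void build() throws IOException;
    }

    interface Server {
        Reply handle();
    }

    // Models the reported behavior: the handler logs and the marker is lost.
    static Reply runSwallowing(ResponseBuilder builder) {
        try {
            builder.build();
            return new Reply(true, false);
        } catch (Throwable t) {
            System.err.println("logged only: " + t);
            return new Reply(false, false); // client sees a generic failure
        }
    }

    // Models the desired behavior: the non-retryable marker reaches the client.
    static Reply runPropagating(ResponseBuilder builder) {
        try {
            builder.build();
            return new Reply(true, false);
        } catch (NonRetryableException e) {
            return new Reply(false, true);
        } catch (Throwable t) {
            return new Reply(false, false);
        }
    }

    // Simplified client loop: stop on success or on a do-not-retry reply.
    static int callWithRetries(Server server, int maxAttempts) {
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            Reply reply = server.handle();
            if (reply.ok || reply.doNotRetry) {
                break;
            }
        }
        return attempts;
    }
}
```

With a builder that always throws `NonRetryableException`, the swallowing handler makes the client use every allowed attempt, while the propagating handler lets it stop after the first one.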

People

Assignee: Unassigned
Reporter: ZhenyuLi
Votes: 0
Watchers: 2

Dates

Created: 11/May/24 15:03
Updated: 13/May/24 04:12