Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
We recently had an operational incident in which a RegionServer was aborted but failed to exit within a reasonable timeframe. We're going to tune hbase.regionserver.abort.timeout much lower than the 20m default, but even with that, it makes little sense to keep accepting requests while the server is aborting.
In our case, the server was impaired and not processing requests. The call queue was full, so NettyRpcServer kept trying and failing to add requests to the queue. Each failure results in a CallQueueTooBigException, which is not a meta cache clearing exception, so clients get no signal to drop their cached region locations for this server. The server continued throwing these exceptions for multiple minutes until we finally killed it manually.
I'd like to add a check in ServerRpcConnection.processRequest: if regionServer.isAborted() returns true, throw a RegionServerAbortedException rather than attempt to enqueue the request.
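A minimal, self-contained sketch of the proposed check follows. It is not the actual HBase code: the Server interface, the Connection class, the String-based request, and the two exception classes are simplified stand-ins defined here for illustration, and the real ServerRpcConnection.processRequest has a different signature. It only shows the ordering being proposed, where the abort check runs before any attempt to enqueue.

```java
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

public class AbortCheckSketch {

    /** Local stand-in for the exception named in the proposal. */
    static class RegionServerAbortedException extends IOException {
        RegionServerAbortedException(String msg) { super(msg); }
    }

    /** Local stand-in for the exception thrown when the call queue is full. */
    static class CallQueueTooBigException extends IOException {
        CallQueueTooBigException(String msg) { super(msg); }
    }

    /** Hypothetical minimal view of the server: only the abort flag matters here. */
    interface Server {
        boolean isAborted();
    }

    /** Hypothetical, heavily simplified connection with a bounded call queue. */
    static final class Connection {
        private final Server server;
        private final BlockingQueue<String> callQueue;

        Connection(Server server, int queueCapacity) {
            this.server = server;
            this.callQueue = new ArrayBlockingQueue<>(queueCapacity);
        }

        void processRequest(String request) throws IOException {
            // Proposed change: fail fast while aborting, before touching the queue.
            if (server.isAborted()) {
                throw new RegionServerAbortedException("Server is aborting; rejecting " + request);
            }
            // Existing behavior in this sketch: a full queue rejects the call.
            if (!callQueue.offer(request)) {
                throw new CallQueueTooBigException("Call queue is full; rejecting " + request);
            }
        }
    }

    public static void main(String[] args) {
        AtomicBoolean aborted = new AtomicBoolean(false);
        Connection conn = new Connection(aborted::get, 1);
        String[] requests = {"get-1", "get-2", "get-3"};
        for (int i = 0; i < requests.length; i++) {
            if (i == 2) {
                aborted.set(true); // simulate the server entering the aborting state
            }
            try {
                conn.processRequest(requests[i]);
                System.out.println(requests[i] + ": enqueued");
            } catch (IOException e) {
                System.out.println(requests[i] + ": " + e.getClass().getSimpleName());
            }
        }
    }
}
```

In this sketch, a request that arrives while the queue is full but the server is healthy still gets CallQueueTooBigException. Once the abort flag is set, every request is rejected with RegionServerAbortedException, whatever the queue state.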