Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 2.4.10
- None
- None
- Reviewed
Description
Analysis of a recent production incident is not yet complete, but one item of note is an apparent deadlock. Imagine you are gracefully draining a regionserver by way of a flurry of moveRegion requests. The handler for moveRegion submits a TRSP (TransitRegionStateProcedure) and then waits on its future without a timeout. Imagine that there are enough moveRegion requests to tie up the normal priority master RPC pool. Now imagine that all of those requests are waiting on TRSPs pending on a regionserver that is concurrently bounced, or that fails. The TRSPs are blocked in REGION_STATE_TRANSITION_CLOSE because the target regionserver terminated before responding to the close requests. That blocks the moveRegion requests, which in turn block the RPC handlers. The regionserver restarts and tries to check in, but cannot report to the master because there are no free normal priority handlers to handle the report. It seems incorrect to have the regionserver operational dependencies (regionServerStartup, regionServerReport, and reportFatalRSError) contending with normal priority requests.
They should be made ADMIN_QOS priority to avoid this case.
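The starvation described above can be reproduced in miniature with plain `java.util.concurrent` executors. This is only an illustrative sketch, not HBase's RPC scheduler: the class `PriorityPoolSketch`, the method `reportCompletes`, the pool sizes, and the 500 ms wait are all hypothetical. A fixed pool stands in for the normal priority handlers, blocked tasks stand in for moveRegion handlers waiting on a TRSP future, and a trivial task stands in for regionServerReport.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical simulation; none of these names come from HBase.
class PriorityPoolSketch {

    /**
     * Fills every "normal" handler with a task that blocks (like a moveRegion
     * handler waiting on a procedure future with no timeout), then submits a
     * stand-in for regionServerReport. Returns true if the report is handled
     * within 500 ms.
     */
    public static boolean reportCompletes(boolean separateAdminPool) {
        int handlers = 4; // assumed size of the normal priority pool
        ExecutorService normalPool = Executors.newFixedThreadPool(handlers);
        ExecutorService adminPool = Executors.newFixedThreadPool(1);
        // Stands in for the procedure result that never arrives.
        CountDownLatch procedureDone = new CountDownLatch(1);
        try {
            for (int i = 0; i < handlers; i++) {
                normalPool.submit(() -> {
                    try {
                        procedureDone.await(); // blocks the handler thread
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Route the report either to the saturated pool or to its own pool.
            ExecutorService target = separateAdminPool ? adminPool : normalPool;
            Future<String> report = target.submit(() -> "report accepted");
            try {
                report.get(500, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false; // report is stuck behind the blocked handlers
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        } finally {
            procedureDone.countDown(); // release blocked handlers for cleanup
            normalPool.shutdownNow();
            adminPool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("shared pool:   report completes = " + reportCompletes(false));
        System.out.println("separate pool: report completes = " + reportCompletes(true));
    }
}
```

In this sketch the report times out when it shares the saturated pool and completes when it has a dedicated pool, which is the effect the ADMIN_QOS proposal is after. Adding a timeout to the handler's wait on the future would shorten the stall but would not remove the contention between the two kinds of request.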