Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
Reviewed
-
Add backoff. Avoid retrying every 100ms.
Description
We observed this recently on a cluster. I'm still investigating the root cause, but it seems the retries should have special handling for this exception and, separately, probably a cap on the number of retries.
2019-04-20 04:24:27,093 WARN [RSProcedureDispatcher-pool4-t1285] procedure.RSProcedureDispatcher: request to server ,17020,1555742560432 failed due to java.io.IOException: Call to :17020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: :17020, try=26603, retrying...
The corresponding worker is stuck retrying (note try=26603 in the log line above).
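For illustration only, here is a minimal sketch of the two ideas raised above: exponential backoff instead of a fixed 100ms retry interval, and a cap on the number of attempts. This is not the actual HBase patch. The class `RetryBackoff`, its method names, and the 60-second ceiling are hypothetical choices made for this example; only the 100ms base interval comes from this issue.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration; not part of HBase. */
final class RetryBackoff {
    // Base interval: the fixed retry interval mentioned in this issue.
    static final long BASE_DELAY_MS = 100L;
    // Assumed ceiling so the delay does not grow without bound.
    static final long MAX_DELAY_MS = 60_000L;

    private RetryBackoff() {}

    /** Delay before the retry that follows failed attempt number {@code attempt} (0-based). */
    static long delayMillis(int attempt) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        // Clamp the shift so the left shift cannot overflow a long.
        int shift = Math.min(attempt, 20);
        return Math.min(MAX_DELAY_MS, BASE_DELAY_MS << shift);
    }

    /**
     * Runs {@code call}, retrying on IOException with exponential backoff,
     * and gives up after {@code maxAttempts} attempts in total.
     */
    static <T> T callWithRetries(Callable<T> call, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                last = e;
                // Sleep only if another attempt will follow.
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(delayMillis(attempt));
                }
            }
        }
        throw new IOException("giving up after " + maxAttempts + " attempts", last);
    }
}
```

With these assumed constants the delays run 100ms, 200ms, 400ms, and so on, reaching the 60-second ceiling at the eleventh retry. A caller that hits try=26603 under the fixed-interval scheme would instead either fail fast once `maxAttempts` is reached or wait at most the ceiling between attempts.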