[HBASE-22287] infinite retries on failed server in RSProcedureDispatcher - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.0.0-alpha-1, 2.3.0
Component/s: None
Labels: None
Hadoop Flags: Reviewed
Release Note: Add backoff. Avoid retrying every 100ms.

Description

We observed this recently on a cluster. I'm still investigating the root cause; however, it seems the retries should have special handling for this exception and, separately, probably a cap on the number of retries.

2019-04-20 04:24:27,093 WARN  [RSProcedureDispatcher-pool4-t1285] procedure.RSProcedureDispatcher: request to server ,17020,1555742560432 failed due to java.io.IOException: Call to :17020 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: :17020, try=26603, retrying...

The corresponding worker is stuck.
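
To make the two suggestions concrete (a growing delay between retries and a cap on the number of attempts), here is a minimal Java sketch. It is not the code from the actual fix: the class BackoffSketch, its constants, and its method names are hypothetical, and only the 100 ms base delay is taken from the release note. A real dispatcher would presumably also decide per exception type whether retrying makes sense at all.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical illustration only; not part of HBase.
final class BackoffSketch {
    static final long BASE_DELAY_MS = 100;    // retry interval mentioned in the release note
    static final long MAX_DELAY_MS = 60_000;  // assumed upper bound on a single wait
    static final int MAX_ATTEMPTS = 10;       // assumed cap on the number of attempts

    // Exponential backoff: 100 ms, 200 ms, 400 ms, ... capped at MAX_DELAY_MS.
    static long backoffMillis(int attempt) {
        int shift = Math.min(Math.max(attempt, 0), 20); // bound the shift to avoid overflow
        return Math.min(MAX_DELAY_MS, BASE_DELAY_MS << shift);
    }

    // True once the attempt that just failed (0-based) was the last one allowed.
    static boolean shouldGiveUp(int attempt) {
        return attempt + 1 >= MAX_ATTEMPTS;
    }

    // Runs the call, retrying on IOException with backoff until the cap is hit.
    static <T> T callWithBackoff(Callable<T> call) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                if (shouldGiveUp(attempt)) {
                    throw e; // surface the failure instead of retrying forever
                }
                Thread.sleep(backoffMillis(attempt));
            }
        }
    }
}
```

With these assumed values, a server that stays in the failed servers list is contacted at most 10 times over roughly 51 seconds, rather than tens of thousands of times as in the log line above (try=26603).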

Issue Links

links to

GitHub Pull Request #1800

People

Assignee: Michael Stack

Reporter: Sergey Shelukhin

Votes: 0

Watchers: 5

Dates

Created: 22/Apr/19 18:01

Updated: 30/May/20 14:15

Resolved: 29/May/20 17:06