[HBASE-8216] Be able to differentiate Power failures from Rack switch reboot - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.89-fb
Component/s: None
Labels: None

Description

When the master in 0.89-fb sees a correlated failure, such as a rack switch reboot, it waits 5-6 minutes to check whether the region servers (RSes) become accessible again.

The rationale is that assigning and reassigning regions causes churn, which is not worth it when a rack switch reboot is expected to heal itself in 5-6 minutes. In earlier deployments, where this feature was not present, we used to find ourselves in a bad situation for 30 minutes to an hour.

However, correlated failures also happen when a rack loses power. These cases take much longer to heal, so waiting 5-6 minutes is wasted effort.

The master should be able to differentiate the two scenarios by checking whether any RS in the rack is able to communicate. Unless all the servers in the rack are inaccessible, we should proceed with reassigning the regions.
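The rule in the paragraph above can be sketched as a small decision function. This is only an illustration of the stated rule, not the code in hbase-8216.diff: the class `RackFailureClassifier`, the `Action` enum, the `decide` method, and the idea of passing in a precomputed set of reachable servers are all hypothetical names and assumptions introduced here.

```java
import java.util.Set;

/**
 * Hypothetical helper (not from hbase-8216.diff): decides whether the master
 * should keep waiting for a rack to recover or reassign its regions now.
 */
final class RackFailureClassifier {

    /** The two outcomes described in the issue. */
    enum Action { WAIT_FOR_RECOVERY, REASSIGN_NOW }

    private RackFailureClassifier() {}

    /**
     * @param rackServers      all region servers assumed to live in the rack
     * @param reachableServers servers that currently respond to the master
     *                         (how reachability is probed is left open here)
     */
    static Action decide(Set<String> rackServers, Set<String> reachableServers) {
        for (String server : rackServers) {
            if (reachableServers.contains(server)) {
                // At least one server in the rack can still communicate, so
                // not all servers are inaccessible: reassign without waiting.
                return Action.REASSIGN_NOW;
            }
        }
        // Every server in the rack is inaccessible: keep the existing
        // behaviour and wait for the rack to come back.
        return Action.WAIT_FOR_RECOVERY;
    }
}
```

With a rack of `rs1`, `rs2`, and `rs3`, the sketch waits only when none of the three is reachable; if even one of them responds, it chooses reassignment.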

Attachments


hbase-8216.diff
28/Mar/13 19:36
5 kB
Amitanand Aiyer

Issue Links

relates to

HBASE-5843 Improve HBase MTTR - Mean Time To Recover

Closed

People

Assignee: Amitanand Aiyer

Reporter: Amitanand Aiyer

Votes: 0

Watchers: 4

Dates

Created: 28/Mar/13 18:54

Updated: 16/Jun/22 05:55

Resolved: 28/Mar/13 18:59