[HBASE-20087] Periodically attempt redeploy of regions in FAILED_OPEN state - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.4.2
Fix Version/s: 1.5.0
Component/s: master, Region Assignment
Labels: None
Hadoop Flags: Reviewed
Release Note:

The AssignmentManager will periodically attempt to assign regions in FAILED_OPEN state, at an interval set by the configuration property "hbase.assignment.failed.open.retry.period" (in milliseconds; default 300000, i.e. 5 minutes). If a transient condition causes a region to fail to open often enough that it transitions to FAILED_OPEN, such as a temporary inability to satisfy an RSGroups assignment constraint after server failures, these retries may redeploy the region automatically, without operator intervention. Set the property to 0 to disable retries and keep the old behavior, in which regions in FAILED_OPEN state are left for operators to reassign manually.
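
As a rough illustration of how the setting described above might be interpreted, the following sketch reads the period from a plain `java.util.Properties` object. It is not HBase code: the class `FailedOpenRetryConfig` and its methods are hypothetical, `Properties` stands in for the real HBase configuration object, and treating negative values as "disabled" is an assumption (the release note only specifies 0).

```java
import java.util.Properties;

public class FailedOpenRetryConfig {
    // Key and default value are taken from the release note.
    public static final String KEY = "hbase.assignment.failed.open.retry.period";
    public static final long DEFAULT_PERIOD_MS = 300000L; // 5 minutes

    /** Returns the retry period in milliseconds, or the default if the key is unset. */
    public static long retryPeriodMs(Properties conf) {
        String raw = conf.getProperty(KEY);
        return raw == null ? DEFAULT_PERIOD_MS : Long.parseLong(raw.trim());
    }

    /** 0 disables retries; treating negative values the same way is an assumption. */
    public static boolean retriesEnabled(Properties conf) {
        return retryPeriodMs(conf) > 0;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(retryPeriodMs(conf) + " " + retriesEnabled(conf)); // 300000 true

        conf.setProperty(KEY, "0"); // operator opts out, keeping the old behavior
        System.out.println(retryPeriodMs(conf) + " " + retriesEnabled(conf)); // 0 false
    }
}
```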

Description

Because RSGroups can leave regions permanently in transition (RIT) in FAILED_OPEN state, we added logic to the master portion of the RSGroups extension to enumerate RITs and retry assignment of regions in FAILED_OPEN state.

However, this strategy can be applied generally to reduce the need for operator involvement in cluster operations. Currently an operator has to resolve FAILED_OPEN assignments manually, but there is little risk in retrying them automatically after a while. If the cause of the failure has not cleared, the assignment will simply fail again. If it has been resolved, operators need do nothing more for the cluster to fully heal.
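
The retry strategy described above can be sketched roughly as follows. This is a simplified model, not the attached patch: the class `FailedOpenRetrySketch`, the `State` enum, and the `tryAssign` callback are hypothetical names, and region state is reduced to an in-memory map. A scheduled executor runs one retry pass per period; a region whose assignment fails again simply stays in FAILED_OPEN until the next pass.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

public class FailedOpenRetrySketch {
    public enum State { OPEN, FAILED_OPEN }

    /**
     * One retry pass: attempt assignment of every region currently in FAILED_OPEN.
     * Returns the number of regions that were opened by this pass.
     */
    public static int retryPass(Map<String, State> regions, Predicate<String> tryAssign) {
        int reopened = 0;
        for (Map.Entry<String, State> e : regions.entrySet()) {
            if (e.getValue() == State.FAILED_OPEN && tryAssign.test(e.getKey())) {
                e.setValue(State.OPEN);
                reopened++;
            }
            // If the attempt fails, the region stays in FAILED_OPEN for the next pass.
        }
        return reopened;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, State> regions = new ConcurrentHashMap<>();
        regions.put("region-a", State.FAILED_OPEN);
        regions.put("region-b", State.OPEN);

        long periodMs = 100; // shortened for the demo; the documented default is 300000
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        // Hypothetical assignment attempt that always succeeds.
        pool.scheduleAtFixedRate(
            () -> retryPass(regions, name -> true), periodMs, periodMs, TimeUnit.MILLISECONDS);

        Thread.sleep(350);
        pool.shutdownNow(); // stop the executor so the JVM can exit
        System.out.println(regions);
    }
}
```

Keeping the pass itself separate from the scheduling makes it easy to exercise without timers, and the explicit `shutdownNow()` reflects the concern raised in the linked HBASE-20102 about shutting down the scheduled executor.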

Attachments

- HBASE-20087-branch-1.patch, 28/Feb/18 00:11, 8 kB, Andrew Kyle Purtell
- HBASE-20087-branch-1.patch, 27/Feb/18 02:52, 8 kB, Andrew Kyle Purtell
- 0001-W-4723090-Port-the-RIT-FAILED_OPEN-state-hack-from-R.patch, 26/Feb/18 19:22, 11 kB, Andrew Kyle Purtell

Issue Links

is blocked by: HBASE-20102 AssignmentManager#shutdown doesn't shut down scheduled executor (Resolved)

People

Assignee: Andrew Kyle Purtell

Reporter: Andrew Kyle Purtell

Votes: 0

Watchers: 3

Dates

Created: 26/Feb/18 18:27

Updated: 28/Feb/18 02:18

Resolved: 28/Feb/18 02:14