[HBASE-15251] During a cluster restart, Hmaster thinks it is a failover by mistake - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.0
Fix Version/s: 2.0.0
Component/s: master
Labels:
None

Hadoop Flags:

Reviewed

Description

We often need to restart a cluster of more than 1000 nodes as part of a release. We do our best to get a clean shutdown, but about 50% of the time HMaster still treats the restart as a failover. This increases the restart time from 5 min to 30 min and drops locality from 99% to 5%, since we do not use a locality-aware balancer. We previously had a bug for this, ~~HBASE-14129~~, but the fix didn't work.

After adding more logging and inspecting the logs, we identified two things that trigger the failover handling:
1. When HMaster's AssignmentManager detects any dead servers on the server manager during joinCluster(), it decides this is a failover without any further check. I added a check for whether any region is actually assigned to these servers. During a clean restart, no regions are assigned to them.
2. When there are leftover empty folders for the log and split directories, or empty WAL files, this is also treated as a failover. I added a check for that. Although this can be resolved by manual cleanup, that is too tedious when restarting a large cluster.
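
The two added checks can be sketched roughly as below. This is not the attached patch: the class and method names are hypothetical, plain collections stand in for the master's dead-server and region-assignment state, and the local filesystem (java.nio.file) stands in for the cluster's log and split directories.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these names come from HBase itself.
final class FailoverCheckSketch {

    // Check 1: a dead server is evidence of a failover only if
    // at least one region is still recorded as assigned to it.
    static boolean deadServersHoldRegions(Set<String> deadServers,
                                          Map<String, List<String>> regionsByServer) {
        for (String server : deadServers) {
            List<String> regions = regionsByServer.get(server);
            if (regions != null && !regions.isEmpty()) {
                return true;
            }
        }
        return false;
    }

    // Check 2: a leftover directory is evidence of a failover only if
    // it (or a subdirectory) contains at least one non-empty file.
    static boolean hasNonEmptyFile(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry)) {
                    if (hasNonEmptyFile(entry)) {
                        return true;
                    }
                } else if (Files.size(entry) > 0) {
                    return true;
                }
            }
        }
        return false;
    }

    // Combined decision: treat the startup as a failover only if either check fires.
    static boolean looksLikeFailover(Set<String> deadServers,
                                     Map<String, List<String>> regionsByServer,
                                     List<Path> leftoverDirs) throws IOException {
        if (deadServersHoldRegions(deadServers, regionsByServer)) {
            return true;
        }
        for (Path dir : leftoverDirs) {
            if (hasNonEmptyFile(dir)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Simulate a clean restart: one dead server with no regions,
        // and a leftover directory holding only an empty file.
        Path dir = Files.createTempDirectory("leftover-wals");
        Files.createFile(dir.resolve("empty.wal"));
        boolean failover = looksLikeFailover(Set.of("rs1"), Map.of(), List.of(dir));
        System.out.println("failover = " + failover);
    }
}
```

With this shape, a clean restart that leaves behind only dead-server entries with no regions and empty directories or files is not classified as a failover, while a single assigned region or non-empty WAL file still is.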

A patch will follow shortly. The fix has been tested and is now used in production.

Attachments

HBASE-15251-master.patch, 11/Feb/16 20:46, 6 kB, Clara Xiong
HBASE-15251-master-v1.patch, 17/Feb/16 08:44, 7 kB, Clara Xiong

Issue Links

is related to: HBASE-18036 HBase 1.x : Data locality is not maintained after cluster restart or SSH (Resolved)

relates to: HBASE-14129 If any regionserver gets shutdown uncleanly during full cluster restart, locality looks to be lost (Closed)

People

Assignee: Clara Xiong

Reporter: Clara Xiong

Votes: 0

Watchers: 15

Dates

Created: 11/Feb/16 01:08

Updated: 12/May/17 15:14

Resolved: 20/Feb/16 00:28