A small HDFS hiccup causes the master and the meta-hosting RS to fail together. The master goes first:
2019-01-18 08:09:46,790 INFO [KeepAlivePEWorker-311] zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in ZooKeeper as meta-rs,17020,1547824792484
2019-01-18 10:01:16,904 ERROR [PEWorker-11] master.HMaster: ***** ABORTING master master,17000,1547604554447: FAILED [blah] *****
2019-01-18 10:01:17,087 INFO [master/master:17000] assignment.AssignmentManager: Stopping assignment manager
Various activity continues, including procedure retries, which is itself suspect but not the point here:
2019-01-18 10:01:21,598 INFO [PEWorker-3] procedure2.TimeoutExecutorThread: ADDED pid=104031, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, ...
Then the meta RS decides it's time to go:
2019-01-18 10:01:25,319 INFO [RegionServerTracker-0] master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [meta-rs,17020,1547824792484]
2019-01-18 10:01:25,463 INFO [RegionServerTracker-0] assignment.AssignmentManager: Added meta-rs,17020,1547824792484 to dead servers which carryingMeta=false, submitted ServerCrashProcedure pid=104313
Note that the SCP for this server has carryingMeta=false, even though the server is holding meta. That is because, per the "Stopping assignment manager" line above, AM state, including the region map, was already cleared.
This SCP is persisted, so when the next master starts, it waits forever for meta to come online, while no SCP with carryingMeta=true exists to online it.
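A minimal sketch of the race, using hypothetical simplified classes (not actual HBase code): once the aborting master clears the assignment map, the crash handler can no longer see that the dying server held meta, so the persisted SCP records carryingMeta=false:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for AssignmentManager's in-memory region map.
class ToyAssignmentManager {
    private final Map<String, String> regionToServer = new HashMap<>();

    void assign(String region, String server) { regionToServer.put(region, server); }

    // Master abort path: "Stopping assignment manager" clears in-memory state.
    void stop() { regionToServer.clear(); }

    // What a crash handler would ask when a server's ephemeral node expires.
    boolean isCarryingMeta(String server) {
        return server.equals(regionToServer.get("hbase:meta"));
    }
}

public class MetaRace {
    public static void main(String[] args) {
        ToyAssignmentManager am = new ToyAssignmentManager();
        am.assign("hbase:meta", "meta-rs,17020,1547824792484");

        // Before the abort, the crash handler would see carryingMeta=true.
        System.out.println(am.isCarryingMeta("meta-rs,17020,1547824792484"));

        am.stop(); // master aborts first, wiping the region map

        // RS expiration is processed after the abort: meta ownership is gone,
        // so the SCP would be persisted with carryingMeta=false.
        System.out.println(am.isCarryingMeta("meta-rs,17020,1547824792484"));
    }
}
```

The ordering is the whole bug: the same expiration processed before the abort would have produced carryingMeta=true.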
The only workaround is to delete the procv2 WAL. The master has all the information it needs here, as it often does in bugs I've found recently, but split-brain procedure state causes it to get stuck one way or another.
I will file a separate bug about that.