Description
As was reported in HBASE-21788, we have repeatedly seen regions getting stuck in OPENING after master restarts. Here is one scenario that I've observed recently:
1) There is a region is transit (RIT).
2) The active master aborts and begins shutting down.
3) The backup master becomes active quickly, finds the RIT, creates OpenRegionProcedure and send request to some server.
4) The server quickly opens the region and posts OPENED state transition, but it uses its cached master instead of the new one.
5) The old active master which had not completely shutdown its assignment manager yet, notes the OPENED state report and ignores it. Because no corresponding procedure can be found.
6) The new master waits forever for a response to its OPEN region request.
This happens more often with the meta region because it's small and takes a few seconds to open. Below are some related logs:
Previous HMaster:
2019-03-14 13:19:16,310 ERROR [PEWorker-1] master.HMaster: ***** ABORTING master <master-1>,17000,1552438242232: Shutting down HBase cluster: file system not available *****
2019-03-14 13:19:16,310 INFO [PEWorker-1] regionserver.HRegionServer: ***** STOPPING region server '<master-1>,17000,1552438242232' *****
2019-03-14 13:20:54,358 WARN [RpcServer.priority.FPBQ.Fifo.handler=11,queue=1,port=17000] assignment.AssignmentManager: No matching procedure found for rit=OPEN, location=*************,17020,1552561955412, table=hbase:meta, region=1588230740 transition to OPENED
2019-03-14 13:20:55,707 INFO [master/<master-1>:17000] assignment.AssignmentManager: Stopping assignment manager
New HMaster logs:
2019-03-14 13:19:16,907 INFO [master/<master-2>:17000:becomeActiveMaster] master.ActiveMasterManager: Deleting ZNode for /HBaseServerZnodeCommonDir/**************/backup-masters/<master-2>,17000,1552438259871 from backup master directory
2019-03-14 13:19:17,031 INFO [master/<master-2>:17000:becomeActiveMaster] master.ActiveMasterManager: Registered as active master=<master-2>,17000,1552438259871
2019-03-14 13:20:52,017 INFO [PEWorker-12] zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in ZooKeeper as <server-1>,17020,1552536956826
2019-03-14 13:20:52,105 INFO [PEWorker-12] procedure2.ProcedureExecutor: Initialized subprocedures=[\{pid=178230, ppid=178229, state=RUNNABLE, hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
HServer logs:
2019-03-14 13:20:52,708 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0] handler.AssignRegionHandler: Open hbase:meta,,1.1588230740
2019-03-14 13:20:54,353 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0] regionserver.HRegion: Opened 1588230740; next sequenceid=229166
2019-03-14 13:20:54,356 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0] regionserver.HRegionServer: Post open deploy tasks for hbase:meta,,1.1588230740
2019-03-14 13:20:54,358 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0] handler.AssignRegionHandler: Opened hbase:meta,,1.1588230740
Attachments
Attachments
Issue Links
- is a clone of
-
HBASE-21788 OpenRegionProcedure (after recovery?) is unreliable and needs to be improved
- Resolved
- is related to
-
HBASE-21788 OpenRegionProcedure (after recovery?) is unreliable and needs to be improved
- Resolved
- relates to
-
HBASE-22074 Should use procedure store to persist the state in reportRegionStateTransition
- Resolved
-
HBASE-22150 rssStub in HRegionServer is not thread safe and should not directly be used
- Resolved