Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-22060

postOpenDeployTasks could send OPENED region transition state to the wrong master

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Implemented
    • 3.0.0-alpha-1
    • None
    • amv2, proc-v2
    • None

    Description

      As was reported in HBASE-21788, we have repeatedly seen regions getting stuck in OPENING after master restarts. Here is one scenario that I've observed recently:

       

      1) There is a region is transit (RIT).

      2) The active master aborts and begins shutting down.

      3) The backup master becomes active quickly, finds the RIT, creates OpenRegionProcedure and send request to some server.

      4) The server quickly opens the region and posts OPENED state transition, but it uses its cached master instead of the new one. 

      5) The old active master which had not completely shutdown its assignment manager yet, notes the OPENED state report and ignores it. Because no corresponding procedure can be found.

      6) The new master waits forever for a response to its OPEN region request.

       

      This happens more often with the meta region because it's small and takes a few seconds to open. Below are some related logs:

      Previous HMaster:
      2019-03-14 13:19:16,310 ERROR [PEWorker-1] master.HMaster: ***** ABORTING master <master-1>,17000,1552438242232: Shutting down HBase cluster: file system not available *****
      2019-03-14 13:19:16,310 INFO [PEWorker-1] regionserver.HRegionServer: ***** STOPPING region server '<master-1>,17000,1552438242232' *****
      2019-03-14 13:20:54,358 WARN [RpcServer.priority.FPBQ.Fifo.handler=11,queue=1,port=17000] assignment.AssignmentManager: No matching procedure found for rit=OPEN, location=*************,17020,1552561955412, table=hbase:meta, region=1588230740 transition to OPENED
      2019-03-14 13:20:55,707 INFO [master/<master-1>:17000] assignment.AssignmentManager: Stopping assignment manager

      New HMaster logs:

      2019-03-14 13:19:16,907 INFO [master/<master-2>:17000:becomeActiveMaster] master.ActiveMasterManager: Deleting ZNode for /HBaseServerZnodeCommonDir/**************/backup-masters/<master-2>,17000,1552438259871 from backup master directory
      2019-03-14 13:19:17,031 INFO [master/<master-2>:17000:becomeActiveMaster] master.ActiveMasterManager: Registered as active master=<master-2>,17000,1552438259871
      2019-03-14 13:20:52,017 INFO [PEWorker-12] zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in ZooKeeper as <server-1>,17020,1552536956826
      2019-03-14 13:20:52,105 INFO [PEWorker-12] procedure2.ProcedureExecutor: Initialized subprocedures=[\{pid=178230, ppid=178229, state=RUNNABLE, hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]

       

      HServer logs:

      2019-03-14 13:20:52,708 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0] handler.AssignRegionHandler: Open hbase:meta,,1.1588230740
      2019-03-14 13:20:54,353 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0] regionserver.HRegion: Opened 1588230740; next sequenceid=229166
      2019-03-14 13:20:54,356 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0] regionserver.HRegionServer: Post open deploy tasks for hbase:meta,,1.1588230740
      2019-03-14 13:20:54,358 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0] handler.AssignRegionHandler: Opened hbase:meta,,1.1588230740

       

       

      Attachments

        Issue Links

          Activity

            People

              zhangduo Duo Zhang
              bahramch Bahram Chehrazy
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: