HBASE-9665

Region gets lost when balancer & SSH both trying to assign


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Incomplete
    • Affects Version/s: 0.96.0
    • Fix Version/s: None
    • Component/s: Region Assignment
    • Labels: None

    Description

      In summary, a server dies and its regions are re-assigned by SSH (ServerShutdownHandler). Right before SSH runs, however, the balancer starts to move one of the regions on that server to another server.

      The balancer assignment got preempted by the SSH assignment:

      2013-09-25 11:55:32,854 INFO Priority.RpcServer.handler=7,port=60020 regionserver.HRegionServer: Received CLOSE for the region:6deb1bfefe8cbdb443084efe919fdeb7 , which we are already trying to OPEN. Cancelling OPENING.
      

      The SSH assignment (by GeneralBulkAssigner) failed as well, due to:

      2013-09-25 11:55:32,927 WARN  [RS_OPEN_REGION-hor15n09:60020-2] zookeeper.ZKAssign: regionserver:60020-0x14153d449d30ad0 Attempt to transition the unassigned node for 6deb1bfefe8cbdb443084efe919fdeb7 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the server that tried to transition was hor15n09.gq1.ygridcore.net,60020,1380109280320 not the expected hor15n07.gq1.ygridcore.net,60020,1380109890414
      

      In the end, the region 6deb1bfefe8cbdb443084efe919fdeb7 is lost.
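
The failure in the two log lines above can be reproduced in a toy model. This is a minimal sketch, not HBase code: `ToyUnassignedZnodes`, `Znode`, and their methods are hypothetical names. It assumes only what the WARN message states: the unassigned znode records the server expected to open the region, and an OFFLINE to OPENING transition attempted by any other server is rejected. It also assumes the balancer's open is cancelled before it touches the znode.

```python
from dataclasses import dataclass
from typing import Dict

OFFLINE = "M_ZK_REGION_OFFLINE"
OPENING = "RS_ZK_REGION_OPENING"


@dataclass
class Znode:
    state: str
    expected_server: str


class ToyUnassignedZnodes:
    """Hypothetical stand-in for the unassigned znodes kept in ZooKeeper."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Znode] = {}

    def create_offline(self, region: str, destination: str) -> None:
        # Master side: mark the region OFFLINE and record who should open it.
        self.nodes[region] = Znode(OFFLINE, destination)

    def transition_to_opening(self, region: str, server: str) -> bool:
        # Region server side: only the recorded server may move the node on.
        node = self.nodes.get(region)
        if node is None or node.state != OFFLINE:
            return False
        if node.expected_server != server:
            return False
        node.state = OPENING
        return True


region = "6deb1bfefe8cbdb443084efe919fdeb7"
zk = ToyUnassignedZnodes()

# Balancer plan: the offline znode is created for hor15n07.
zk.create_offline(region, "hor15n07")

# Assumption: the balancer's open is cancelled ("Cancelling OPENING"),
# so hor15n07 never attempts the transition.

# SSH plan: hor15n09 is asked to open the region, but the znode still
# names hor15n07, so the transition is rejected.
opened_on_n09 = zk.transition_to_opening(region, "hor15n09")

print(opened_on_n09)
print(zk.nodes[region])
```

In this model neither server ends up opening the region, and the znode stays in the OFFLINE state naming a server that is no longer going to act on it.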

      The master log below shows both the balancer and SSH trying to assign the region at around the same time:

      2013-09-25 11:55:32,731 INFO  [MASTER_SERVER_OPERATIONS-hor15n05:60000-4] master.RegionStates: Transitioning {6deb1bfefe8cbdb443084efe919fdeb7 state=PENDING_CLOSE, ts=1380110132710, server=hor15n12.gq1.ygridcore.net,60020,1380109596307} will be handled by SSH for hor15n12.gq1.ygridcore.net,60020,1380109596307
      
      ...
      
      2013-09-25 11:55:32,849 INFO  [hor15n05.gq1.ygridcore.net,60000,1380108611483-BalancerChore] master.RegionStates: Transitioned {6deb1bfefe8cbdb443084efe919fdeb7 state=OFFLINE, ts=1380110132768, server=null} to {6deb1bfefe8cbdb443084efe919fdeb7 state=PENDING_OPEN, ts=1380110132849, server=hor15n07.gq1.ygridcore.net,60020,1380109890414}
      
      ...
      
      2013-09-25 11:55:32,898 INFO  [hor15n05.gq1.ygridcore.net,60000,1380108611483-GeneralBulkAssigner-1] master.RegionStates: Transitioned {6deb1bfefe8cbdb443084efe919fdeb7 state=OFFLINE, ts=1380110132861, server=null} to {6deb1bfefe8cbdb443084efe919fdeb7 state=PENDING_OPEN, ts=1380110132898, server=hor15n09.gq1.ygridcore.net,60020,1380109280320}
      

      SSH forces the region assignment but does not recreate the offline znode, so the later region open fails with the error below. I suggest recreating the offline znode whenever a region assignment is forced (forceNewPlan=true); the impact of this change should be low.

      2013-09-25 11:55:32,927 WARN  [RS_OPEN_REGION-hor15n09:60020-2] zookeeper.ZKAssign: regionserver:60020-0x14153d449d30ad0 Attempt to transition the unassigned node for 6deb1bfefe8cbdb443084efe919fdeb7 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the server that tried to transition was hor15n09.gq1.ygridcore.net,60020,1380109280320 not the expected hor15n07.gq1.ygridcore.net,60020,1380109890414
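
The suggestion can be illustrated with a small toy model. This is a sketch under stated assumptions, not a patch: `assign`, `try_open`, `run`, and the `recreate_on_force` switch are hypothetical names, and the znode is reduced to a `(state, expected_server)` tuple in a dict. The only behaviour taken from the report is that an open is rejected when the znode names a different server.

```python
from typing import Dict, Tuple

OFFLINE = "M_ZK_REGION_OFFLINE"
OPENING = "RS_ZK_REGION_OPENING"

Znodes = Dict[str, Tuple[str, str]]  # region -> (state, expected server)


def assign(znodes: Znodes, region: str, destination: str,
           force_new_plan: bool, recreate_on_force: bool) -> None:
    """Toy master-side assign (hypothetical, not the HBase method)."""
    # Create the offline znode if it is missing, or, with the suggested
    # change enabled, recreate it whenever the assignment is forced.
    if region not in znodes or (force_new_plan and recreate_on_force):
        znodes[region] = (OFFLINE, destination)


def try_open(znodes: Znodes, region: str, server: str) -> bool:
    """Toy region-server open: succeeds only for the expected server."""
    state, expected = znodes.get(region, (None, None))
    if state != OFFLINE or expected != server:
        return False
    znodes[region] = (OPENING, server)
    return True


def run(recreate_on_force: bool) -> bool:
    znodes: Znodes = {}
    region = "6deb1bfefe8cbdb443084efe919fdeb7"
    # Balancer picks hor15n07 first.
    assign(znodes, region, "hor15n07",
           force_new_plan=False, recreate_on_force=recreate_on_force)
    # SSH then forces a new plan to hor15n09.
    assign(znodes, region, "hor15n09",
           force_new_plan=True, recreate_on_force=recreate_on_force)
    return try_open(znodes, region, "hor15n09")


print(run(recreate_on_force=False))  # False: znode still names hor15n07
print(run(recreate_on_force=True))   # True: znode was recreated for hor15n09
```

The model ignores everything a real change would have to handle, such as znode versions and a concurrent open still in flight on the first server.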
      

            People

              Assignee: Unassigned
              Reporter: Jeffrey Zhong (jeffreyz)