HBase
HBASE-11282

Load balancer may move a region which is participating in snapshot

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Later
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      The region was tableone,,1394495094967.289ebdee6adf0a3b9c2bbcbe2ff522e7.
      From master log:

      2014-03-10 23:48:09,035 DEBUG [AM.ZK.Worker-pool2-t42] master.AssignmentManager: Found an existing plan for tableone,,1394495094967.289ebdee6adf0a3b9c2bbcbe2ff522e7. destination server is h2-ubuntu12-sec-1394425849-hbase-4.cs1cloud.internal,60020,1394494963812 accepted as a dest server = true
      2014-03-10 23:48:09,035 DEBUG [AM.ZK.Worker-pool2-t42] master.AssignmentManager: Using pre-existing plan for tableone,,1394495094967.289ebdee6adf0a3b9c2bbcbe2ff522e7.; plan=hri=tableone,,1394495094967.289ebdee6adf0a3b9c2bbcbe2ff522e7., src=h2-ubuntu12-sec-1394425849-hbase-9.cs1cloud.internal,60020,1394494962165, dest=h2-ubuntu12-sec-1394425849-hbase-4.cs1cloud.internal,60020,1394494963812
      2014-03-10 23:48:09,035 INFO  [AM.ZK.Worker-pool2-t42] master.RegionStates: Transitioned {289ebdee6adf0a3b9c2bbcbe2ff522e7 state=CLOSED, ts=1394495289035, server=h2-ubuntu12-sec-1394425849-hbase-9.cs1cloud.internal,60020,1394494962165} to {289ebdee6adf0a3b9c2bbcbe2ff522e7 state=OFFLINE, ts=1394495289035, server=h2-ubuntu12-sec-1394425849-hbase-9.cs1cloud.internal,60020,1394494962165}
      2014-03-10 23:48:09,035 DEBUG [AM.ZK.Worker-pool2-t42] zookeeper.ZKAssign: master:60000-0x244aa9920190b04, quorum=h2-ubuntu12-sec-1394425849-hbase-8.cs1cloud.internal:2181,h2-ubuntu12-sec-1394425849-hbase-1.cs1cloud.internal:2181,h2-ubuntu12-sec-1394425849-hbase-4.cs1cloud.internal:2181, baseZNode=/hbase Creating (or updating) unassigned node 289ebdee6adf0a3b9c2bbcbe2ff522e7 with OFFLINE state
      2014-03-10 23:48:09,044 INFO  [AM.ZK.Worker-pool2-t42] master.AssignmentManager: Assigning tableone,,1394495094967.289ebdee6adf0a3b9c2bbcbe2ff522e7. to h2-ubuntu12-sec-1394425849-hbase-4.cs1cloud.internal,60020,1394494963812
      

      From hbase-hbase-regionserver-h2-ubuntu12-sec-1394425849-hbase-9.log:

      2014-03-10 23:48:08,487 WARN  [member: 'h2-ubuntu12-sec-1394425849-hbase-9.cs1cloud.internal,60020,1394494962165' subprocedure-pool1-thread-1] snapshot.RegionServerSnapshotManager: Got Exception in SnapshotSubprocedurePool
      java.util.concurrent.ExecutionException: org.apache.hadoop.hbase.NotServingRegionException: tableone,,1394495094967.289ebdee6adf0a3b9c2bbcbe2ff522e7. is closing
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
        at java.util.concurrent.FutureTask.get(FutureTask.java:83)
        at org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager$SnapshotSubprocedurePool.waitForOutstandingTasks(RegionServerSnapshotManager.java:325)
        at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.flushSnapshot(FlushSnapshotSubprocedure.java:118)
        at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.insideBarrier(FlushSnapshotSubprocedure.java:137)
        at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:181)
        at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
      Caused by: org.apache.hadoop.hbase.NotServingRegionException: tableone,,1394495094967.289ebdee6adf0a3b9c2bbcbe2ff522e7. is closing
        at org.apache.hadoop.hbase.regionserver.HRegion.startRegionOperation(HRegion.java:5699)
        at org.apache.hadoop.hbase.regionserver.HRegion.startRegionOperation(HRegion.java:5663)
        at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:79)
        at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:65)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      

      The load balancer's move of the underlying region caused the FlushSnapshotSubprocedure to fail.

      A mechanism for making the load balancer aware of in-flight region operations is desirable, so that snapshots don't fail in the scenario above.

        Activity

        stack added a comment -

        A mechanism for making the load balancer aware of in-flight region operations is desirable, so that snapshots don't fail in the scenario above.

        Snapshots are allowed to fail for any of a myriad of reasons.

        Tying together two systems we'd like to keep disparate – the balancer and snapshotting – unless it is really necessary seems like a bad direction to me.

        Andrew Purtell added a comment -

        Ted Yu, do you have a concrete proposal for addressing what you described in the description?

        Ted Yu added a comment -

        Will come up with something after Hadoop Summit.

        One approach is to let the load balancer know which regions belong to a table that is under a table lock. For the duration of an active table lock, the regions of that table wouldn't be moved.
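
        A minimal sketch of what that filtering could look like, purely illustrative: TableLockChecker and isTableLocked are hypothetical placeholders for whatever lock-manager lookup the master would expose, not existing HBase API.

        import java.util.ArrayList;
        import java.util.List;

        // Hypothetical lookup: reports whether a table currently holds an active
        // table lock (e.g. one taken by a snapshot).
        interface TableLockChecker {
          boolean isTableLocked(String tableName);
        }

        // A candidate region move the balancer is considering.
        class RegionPlanCandidate {
          final String tableName;
          final String regionName;

          RegionPlanCandidate(String tableName, String regionName) {
            this.tableName = tableName;
            this.regionName = regionName;
          }
        }

        // Drops candidates whose table is locked, so the balancer never plans a
        // move for a region that may be participating in a snapshot.
        public class LockAwareBalancerFilter {
          private final TableLockChecker lockChecker;

          public LockAwareBalancerFilter(TableLockChecker lockChecker) {
            this.lockChecker = lockChecker;
          }

          public List<RegionPlanCandidate> movableRegions(List<RegionPlanCandidate> candidates) {
            List<RegionPlanCandidate> movable = new ArrayList<>();
            for (RegionPlanCandidate c : candidates) {
              if (!lockChecker.isTableLocked(c.tableName)) {
                movable.add(c);
              }
            }
            return movable;
          }
        }

        The balancer would then run its normal logic only over the filtered list, so regions of a locked table stay put until the lock is released.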

        Andrew Purtell added a comment -

        One approach is to let the load balancer know which regions belong to a table that is under a table lock. For the duration of an active table lock, the regions of that table wouldn't be moved.

        Looking forward to the patch. This avoids Stack's concern about directly tying the balancer and snapshotting together. Avoiding moving regions while the table is locked does not do that and sounds plausible.

        stack added a comment -

        Now we tie the balancer to the lock manager instead? Is the lock manager then the arbiter of whether a region should be moved or should there be a higher level service the balancer would ask? And all to address a snapshot failure, when the contract for snapshots has it that they can fail? Can we work on something more pressing?

        Andrew Purtell added a comment -

        Can we work on something more pressing?

        Sure, I'd be +1 for resolving this as Later or Won't Fix.


          People

          • Assignee: Unassigned
          • Reporter: Ted Yu
          • Votes: 0
          • Watchers: 3
