HBase / HBASE-3147

Regions stuck in transition after rolling restart, perpetual timeout handling but nothing happens

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.90.0
    • Component/s: None
    • Labels:
      None

      Description

      The rolling restart script is great for bringing on the weird stuff. On my little loaded cluster if I run it, it horks the cluster and it doesn't recover. I notice two issues that need fixing:

      1. We fail to notice that a server was carrying .META., so .META. never gets reassigned; the shutdown handlers get stuck in a perpetual wait on a .META. assign that will never happen.
      2. Perpetual cycling of this sequence for each region not successfully assigned:

       2010-10-23 21:37:57,404 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  usertable,user510588360,1287547556587.7f2d92497d2d03917afd574ea2aca55b. state=PENDING_OPEN, ts=1287869814294
       2010-10-23 21:37:57,404 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN or OPENING for too long, reassigning region=usertable,user510588360,1287547556587.7f2d92497d2d03917afd574ea2aca55b.
       2010-10-23 21:37:57,404 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x2bd57d1475046a Attempting to transition node 7f2d92497d2d03917afd574ea2aca55b from RS_ZK_REGION_OPENING to M_ZK_REGION_OFFLINE
       2010-10-23 21:37:57,404 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x2bd57d1475046a Attempt to transition the unassigned node for 7f2d92497d2d03917afd574ea2aca55b from RS_ZK_REGION_OPENING to M_ZK_REGION_OFFLINE failed, the node existed but was in the state M_ZK_REGION_OFFLINE
       2010-10-23 21:37:57,404 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region transitioned OPENING to OFFLINE so skipping timeout, region=usertable,user510588360,1287547556587.7f2d92497d2d03917afd574ea2aca55b.
       ...


      The timeout period elapses again and then the same sequence repeats.

      This is what I've been working on.

      1. HBASE-3147-v11.patch
        30 kB
        stack
      2. HBASE-3147-v6.patch
        26 kB
        stack

        Activity

        Jonathan Gray added a comment -

        Seems that our in-memory RIT state is PENDING_OPEN but the node is OFFLINE in ZK. This seems like a potentially common case: the server being assigned to was just not there, so it never began opening the region.

        We should probably differentiate between PENDING_OPEN timeout and OPENING timeout. Let me see what I find in the code.
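
A minimal sketch of the distinction proposed here, assuming simplified stand-ins for the master's in-memory RIT state and the ZooKeeper node state. The types `RitState`, `ZkState`, `TimeoutAction` and `TimeoutPolicy` are hypothetical and are not HBase classes; `RS_ZK_REGION_OPENED` is assumed alongside the two ZK states that appear in the log above, and the choice to leave other combinations alone is also an assumption. The point is only that a PENDING_OPEN timeout with the node still OFFLINE suggests no regionserver ever picked the region up, so skipping the timeout leaves the region stuck.

```java
// Hypothetical, simplified model; these types are not HBase classes.
enum RitState { PENDING_OPEN, OPENING }   // master's in-memory region-in-transition state
enum ZkState { M_ZK_REGION_OFFLINE, RS_ZK_REGION_OPENING, RS_ZK_REGION_OPENED }
enum TimeoutAction { REASSIGN, FORCE_OFFLINE_THEN_REASSIGN, LEAVE_ALONE }

final class TimeoutPolicy {
  private TimeoutPolicy() {}

  /** Picks an action for a timed-out region from both views of its state. */
  static TimeoutAction onTimeout(RitState inMemory, ZkState inZk) {
    switch (inMemory) {
      case PENDING_OPEN:
        // The open was sent, but the node never left OFFLINE: no regionserver
        // started on it, so assign again instead of skipping the timeout.
        return inZk == ZkState.M_ZK_REGION_OFFLINE
            ? TimeoutAction.REASSIGN
            : TimeoutAction.LEAVE_ALONE;
      case OPENING:
        // A regionserver claimed the node but did not finish: take it back first.
        return inZk == ZkState.RS_ZK_REGION_OPENING
            ? TimeoutAction.FORCE_OFFLINE_THEN_REASSIGN
            : TimeoutAction.LEAVE_ALONE;
      default:
        throw new IllegalArgumentException("Unhandled state: " + inMemory);
    }
  }
}
```

With this split, the case from the log (in-memory PENDING_OPEN, ZK OFFLINE) maps to a plain reassign rather than an attempted OPENING-to-OFFLINE transition that can never succeed.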

        (Your paste seems to lack line breaks so this jira is a mile wide)

        HBase Review Board added a comment -

        Message from: "Jonathan Gray" <jgray@apache.org>

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        http://review.cloudera.org/r/1087/
        -----------------------------------------------------------

        Review request for hbase and stack.

        Summary
        -------

        Adds new handling of the timeouts for PENDING_OPEN and PENDING_CLOSE in-memory master RIT states.

        Adds some new broken RIT states into TestMasterFailover.

        Some of these broken states don't seem possible to me but as long as we aren't breaking the existing behaviors and tests I think it's okay if we handle odd cases that can be mocked. Who knows what will happen in the real world.

        The reason TestMasterFailover didn't/doesn't really test for the issue in HBASE-3147 is that this new broken condition happens when an RS dies / goes offline, rather than during a master failover concurrent w/ RS failure.

        This addresses bug HBASE-3147.
        http://issues.apache.org/jira/browse/HBASE-3147

        Diffs


        trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1026911
        trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1026911
        trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java 1026911
        trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java 1026911

        Diff: http://review.cloudera.org/r/1087/diff

        Testing
        -------

        TestMasterFailover passes.

        Thanks,

        Jonathan

        stack added a comment -

        I got this when I tried running patch....

        java.lang.IllegalAccessError: tried to access method org.apache.hadoop.hbase.zookeeper.ZKAssign.getNodeName(Lorg/apache/hadoop/hbase/zookeeper/ZooKeeperWatcher;Ljava/lang/String;)Ljava/lang/String; from class org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor
            at org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor.chore(AssignmentManager.java:1457)
            at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
        2010-10-25 16:07:44,354 INFO org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor: sv2borg180:60000.timeoutMonitor exiting
        

        Let me try to fix it.

        Jonathan Gray added a comment -

        Hmm... you should have:

        public static String getNodeName(ZooKeeperWatcher zkw, String regionName) {

        as part of the diff up on RB

        HBase Review Board added a comment -

        Message from: stack@duboce.net

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        http://review.cloudera.org/r/1087/
        -----------------------------------------------------------

        (Updated 2010-10-25 16:29:36.379908)

        Review request for hbase and stack.

        Changes
        -------

        Added MetaServerShutdownHandler and RootServerShutdownHandler

        Summary (updated)
        -------

        Adds new handling of the timeouts for PENDING_OPEN and PENDING_CLOSE in-memory master RIT states.

        Adds some new broken RIT states into TestMasterFailover.

        Some of these broken states don't seem possible to me but as long as we aren't breaking the existing behaviors and tests I think it's okay if we handle odd cases that can be mocked. Who knows what will happen in the real world.

        The reason TestMasterFailover didn't/doesn't really test for the issue in HBASE-3147 is that this new broken condition happens when an RS dies / goes offline, rather than during a master failover concurrent w/ RS failure.

        v4 of the patch adds to Jon's fixes. It adds a shutdown server handler for root and another for meta so that the processing of servers hosting meta/root does not get frozen out. I've seen this in my testing.
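
To illustrate why dedicated handlers can help, here is a rough sketch using plain `java.util.concurrent` executors rather than the HBase executor service. The class `ShutdownHandlerPools` and its methods are hypothetical. It assumes the freeze-out looks like this: a general shutdown handler blocks waiting for meta to be assigned, while the handler that would assign meta is queued behind it in the same pool.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; not the HBase ExecutorService or its handlers.
final class ShutdownHandlerPools {
  private ShutdownHandlerPools() {}

  /** Returns true if the general handler finished within the timeout. */
  static boolean run(ExecutorService generalPool, ExecutorService metaPool, long timeoutMs) {
    CountDownLatch metaAssigned = new CountDownLatch(1);
    CountDownLatch generalDone = new CountDownLatch(1);
    // General shutdown handler: cannot proceed until meta is assigned.
    generalPool.execute(() -> {
      try {
        metaAssigned.await();
        generalDone.countDown();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    // Meta shutdown handler: the work that gets meta assigned.
    metaPool.execute(metaAssigned::countDown);
    try {
      return generalDone.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  /** Both handlers share one single-threaded pool: the meta handler is starved. */
  static boolean sharedPool(long timeoutMs) {
    ExecutorService shared = Executors.newSingleThreadExecutor();
    try {
      return run(shared, shared, timeoutMs);
    } finally {
      shared.shutdownNow(); // interrupts the stuck general handler
    }
  }

  /** The meta handler gets its own pool, so it runs regardless of the general queue. */
  static boolean dedicatedPools(long timeoutMs) {
    ExecutorService general = Executors.newSingleThreadExecutor();
    ExecutorService meta = Executors.newSingleThreadExecutor();
    try {
      return run(general, meta, timeoutMs);
    } finally {
      general.shutdownNow();
      meta.shutdownNow();
    }
  }
}
```

In this sketch `sharedPool` times out while `dedicatedPools` completes. Real pools have more than one thread, but the same starvation appears once every thread is occupied by a handler waiting on meta.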

        This addresses bug HBASE-3147.
        http://issues.apache.org/jira/browse/HBASE-3147

        Diffs (updated)


        trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1027291
        trunk/src/main/java/org/apache/hadoop/hbase/catalog/MetaReader.java 1027291
        trunk/src/main/java/org/apache/hadoop/hbase/executor/EventHandler.java 1027291
        trunk/src/main/java/org/apache/hadoop/hbase/executor/ExecutorService.java 1027291
        trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1027291
        trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java 1027291
        trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1027291
        trunk/src/main/java/org/apache/hadoop/hbase/master/handler/MetaServerShutdownHandler.java PRE-CREATION
        trunk/src/main/java/org/apache/hadoop/hbase/master/handler/RootServerShutdownHandler.java PRE-CREATION
        trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java 1027292
        trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java 1027291
        trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java 1027291

        Diff: http://review.cloudera.org/r/1087/diff

        Testing
        -------

        TestMasterFailover passes.

        Thanks,

        Jonathan

        HBase Review Board added a comment -

        Message from: "Jonathan Gray" <jgray@apache.org>

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        http://review.cloudera.org/r/1087/#review1662
        -----------------------------------------------------------

        Ship it!

        Looks good. Not sure if I can +1 my patch but I think we should commit

        trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
        <http://review.cloudera.org/r/1087/#comment5542>

        Should we remove this code from inside of ServerShutdownHandler now? Not a big deal but being done twice.

        • Jonathan
        stack added a comment -

        Here is what I'll commit. As Jon suggests, it removes the check for whether the server was carrying root or meta from inside the shutdown handler, since we now do that check on the outside. This patch also includes a missing hookup that testing found.

        There is still work to do on this issue. What seems to be happening is that a watcher is not being triggered. Need to figure out how that is happening. I'll see a regionserver with all of its opener handlers stuck waiting on notification that meta has been deployed.... Other servers will have gotten their watcher triggered, but not one or two in the cluster.... Master is then stuck timing out this regionserver's allocations and then reassigning... calling open over RPC, which adds the region to the queue, but since all openers are stuck waiting on meta, the queues don't get processed.

        HBase Review Board added a comment -

        Message from: stack@duboce.net

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        http://review.cloudera.org/r/1087/
        -----------------------------------------------------------

        (Updated 2010-10-25 23:25:36.390570)

        Review request for hbase and stack.

        Changes
        -------

        So, a few things extra after digging in w/ Jon.

        1. A watch was not being triggered on .META. move because it was not being set; in MetaNodeTracker we were not calling the super inside nodeDeleted to reset the watch. (In a rolling restart, only a few servers would actually experience a moved .META., and it was these that were hanging up. Others, when they came up, would see .META. in its new location.)
        2. We were not assigning out .META. if the master had trouble reaching meta before it saw the server expire. In that case, we'd reset the meta location in the catalog tracker. We were using the catalog tracker to determine which server was hosting meta. We use a different technique now.
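
Item 1 follows from ZooKeeper watches being one-time triggers: once a watch fires, it has to be set again or later changes go unnoticed. The sketch below models that with a toy in-memory stand-in; `FakeZk`, `MetaMoveTracker` and the path `/example/meta-location` are hypothetical and are not the ZooKeeper client API or the real MetaNodeTracker.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for ZooKeeper; hypothetical, not the real client API. */
final class FakeZk {
  interface Watcher { void nodeDeleted(String path); }

  private final Map<String, Watcher> watches = new HashMap<>();

  void watch(String path, Watcher w) { watches.put(path, w); }

  /** Deleting a node fires its watch once and then forgets it. */
  void delete(String path) {
    Watcher w = watches.remove(path);
    if (w != null) w.nodeDeleted(path);
  }
}

/** Counts how many meta moves a tracker notices, with or without resetting its watch. */
final class MetaMoveTracker implements FakeZk.Watcher {
  private final FakeZk zk;
  private final String path;
  private final boolean resetWatch;
  private int moves = 0;

  MetaMoveTracker(FakeZk zk, String path, boolean resetWatch) {
    this.zk = zk;
    this.path = path;
    this.resetWatch = resetWatch;
  }

  void start() { zk.watch(path, this); }

  @Override
  public void nodeDeleted(String deletedPath) {
    if (!path.equals(deletedPath)) return;   // not our node
    moves++;
    // Stands in for calling the superclass, which sets the watch again.
    if (resetWatch) zk.watch(path, this);
  }

  int movesSeen() { return moves; }

  static int movesSeenAfterTwoDeletes(boolean resetWatch) {
    FakeZk zk = new FakeZk();
    MetaMoveTracker tracker = new MetaMoveTracker(zk, "/example/meta-location", resetWatch);
    tracker.start();
    zk.delete("/example/meta-location");
    zk.delete("/example/meta-location");
    return tracker.movesSeen();
  }
}
```

Without the reset, the tracker in this model notices only the first of two moves, which matches the symptom of a few long-running servers never hearing about the relocated .META.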

        Summary
        -------

        Adds new handling of the timeouts for PENDING_OPEN and PENDING_CLOSE in-memory master RIT states.

        Adds some new broken RIT states into TestMasterFailover.

        Some of these broken states don't seem possible to me but as long as we aren't breaking the existing behaviors and tests I think it's okay if we handle odd cases that can be mocked. Who knows what will happen in the real world.

        The reason TestMasterFailover didn't/doesn't really test for the issue in HBASE-3147 is that this new broken condition happens when an RS dies / goes offline, rather than during a master failover concurrent w/ RS failure.

        v4 of the patch adds to Jon's fixes. It adds a shutdown server handler for root and another for meta so that the processing of servers hosting meta/root does not get frozen out. I've seen this in my testing.

        This addresses bug HBASE-3147.
        http://issues.apache.org/jira/browse/HBASE-3147

        Diffs (updated)


        trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1027351
        trunk/src/main/java/org/apache/hadoop/hbase/catalog/MetaReader.java 1027351
        trunk/src/main/java/org/apache/hadoop/hbase/executor/EventHandler.java 1027351
        trunk/src/main/java/org/apache/hadoop/hbase/executor/ExecutorService.java 1027351
        trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1027351
        trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java 1027351
        trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1027351
        trunk/src/main/java/org/apache/hadoop/hbase/master/handler/MetaServerShutdownHandler.java PRE-CREATION
        trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java 1027351
        trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/MetaNodeTracker.java 1027351
        trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java 1027351
        trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java 1027351

        Diff: http://review.cloudera.org/r/1087/diff

        Testing
        -------

        TestMasterFailover passes.

        Thanks,

        Jonathan

        HBase Review Board added a comment -

        Message from: "Jonathan Gray" <jgray@apache.org>

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        http://review.cloudera.org/r/1087/#review1668
        -----------------------------------------------------------

        Ship it!

        There was a little whitespace but I am +1 to commit this. I still have some weird failures on TestRollingRestart, need to test w/ this.

        trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/MetaNodeTracker.java
        <http://review.cloudera.org/r/1087/#comment5544>

        should be after the path check but just an efficiency thing

        • Jonathan
        HBase Review Board added a comment -

        Message from: stack@duboce.net

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        http://review.cloudera.org/r/1087/
        -----------------------------------------------------------

        (Updated 2010-10-26 00:55:02.299835)

        Review request for hbase and stack.

        Changes
        -------

        This patch is almost there. It's much better. Fixed the test for the .META. server: it was looking in the map of servers to regions, which won't work since it's a map of user regions only. Instead, get it from the CatalogTracker.

        Locally TestRegionRebalancing failed. I need to look at that.

        On cluster, we turned up an unexpected state: a server was opening a region while it was also going down. Need to dig in on that too.

        Want to also add tests, at least for a moved .META. region.
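
The first point in these changes can be pictured with a small sketch, assuming a server-to-regions map that holds user regions only and a separately tracked meta location. The class `MetaCarrierCheck`, the server names, and the region-name constant are hypothetical placeholders, not the HBase data structures.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; not the AssignmentManager or CatalogTracker API.
final class MetaCarrierCheck {
  // Assumed name for the catalog region; illustrative only.
  static final String META_REGION = ".META.,,1";

  private MetaCarrierCheck() {}

  /** Looks for meta in a server-to-regions map that only holds user regions. */
  static boolean viaUserRegionMap(Map<String, Set<String>> serverToRegions, String server) {
    Set<String> regions = serverToRegions.get(server);
    return regions != null && regions.contains(META_REGION);
  }

  /** Compares against a separately tracked meta location (null if unknown). */
  static boolean viaTrackedLocation(String trackedMetaServer, String server) {
    return server.equals(trackedMetaServer);
  }

  /** A tiny cluster view: "rs1" really hosts meta, but the map lists user regions only. */
  static Map<String, Set<String>> exampleUserRegionMap() {
    Map<String, Set<String>> map = new HashMap<>();
    map.put("rs1", new HashSet<>(Arrays.asList("usertable,a,1")));
    map.put("rs2", new HashSet<>(Arrays.asList("usertable,b,2")));
    return map;
  }
}
```

In this model the map lookup can never report that a server carries meta, whereas the tracked location can, as long as it has not been reset to unknown.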

        Summary
        -------

        Adds new handling of the timeouts for PENDING_OPEN and PENDING_CLOSE in-memory master RIT states.

        Adds some new broken RIT states into TestMasterFailover.

        Some of these broken states don't seem possible to me but as long as we aren't breaking the existing behaviors and tests I think it's okay if we handle odd cases that can be mocked. Who knows what will happen in the real world.

        The reason TestMasterFailover didn't/doesn't really test for the issue in HBASE-3147 is that this new broken condition happens when an RS dies / goes offline, rather than during a master failover concurrent w/ RS failure.

        v4 of the patch adds to Jon's fixes. It adds a shutdown server handler for root and another for meta so that the processing of servers hosting meta/root does not get frozen out. I've seen this in my testing.

        This addresses bug HBASE-3147.
        http://issues.apache.org/jira/browse/HBASE-3147

        Diffs (updated)


        trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1027351
        trunk/src/main/java/org/apache/hadoop/hbase/catalog/MetaReader.java 1027351
        trunk/src/main/java/org/apache/hadoop/hbase/executor/EventHandler.java 1027351
        trunk/src/main/java/org/apache/hadoop/hbase/executor/ExecutorService.java 1027351
        trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1027351
        trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java 1027351
        trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1027351
        trunk/src/main/java/org/apache/hadoop/hbase/master/handler/MetaServerShutdownHandler.java PRE-CREATION
        trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java 1027351
        trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/MetaNodeTracker.java 1027351
        trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java 1027351
        trunk/src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTracker.java 1027351
        trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java 1027351

        Diff: http://review.cloudera.org/r/1087/diff

        Testing
        -------

        TestMasterFailover passes.

        Thanks,

        Jonathan

        stack added a comment -

        This is the patch I'm applying. It fixes the two items raised at the top of this issue but now I'm seeing other, lesser issues for which I'll open new JIRAs.

        stack added a comment -

        Committed. Thanks for writing half of the patch and for the reviews, Jon.


          People

          • Assignee:
            stack
            Reporter:
            stack