HBase
  1. HBase
  2. HBASE-4792

SplitRegionHandler doesn't care if it deletes the znode or not, leaves the parent region stuck offline

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 0.92.0
    • Fix Version/s: 0.92.0, 0.94.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Saw this on a little test cluster, really easy to trigger.

      First the master log:

      2011-11-15 22:28:57,900 DEBUG org.apache.hadoop.hbase.master.handler.SplitRegionHandler: Handling SPLIT event for e5be6551c8584a6a1065466e520faf4e; deleting node
      2011-11-15 22:28:57,900 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x132f043bbde08c1 Deleting existing unassigned node for e5be6551c8584a6a1065466e520faf4e that is in expected state RS_ZK_REGION_SPLIT
      2011-11-15 22:28:57,975 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x132f043bbde08c1 Attempting to delete unassigned node in RS_ZK_REGION_SPLIT state but after verifying state, we got a version mismatch
      2011-11-15 22:28:57,975 INFO org.apache.hadoop.hbase.master.handler.SplitRegionHandler: Handled SPLIT report); parent=TestTable,0001355346,1321396080924.e5be6551c8584a6a1065466e520faf4e. daughter a=TestTable,0001355346,1321396132414.df9b549eb594a1f8728608a2a431224a.daughter b=TestTable,0001368082,1321396132414.de861596db4337dc341138f26b9c8bc2.
      ...
      2011-11-15 22:28:58,052 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_SPLIT, server=sv4r28s44,62023,1321395865619, region=e5be6551c8584a6a1065466e520faf4e
      2011-11-15 22:28:58,052 WARN org.apache.hadoop.hbase.master.AssignmentManager: Region e5be6551c8584a6a1065466e520faf4e not found on server sv4r28s44,62023,1321395865619; failed processing
      2011-11-15 22:28:58,052 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for region e5be6551c8584a6a1065466e520faf4e from server sv4r28s44,62023,1321395865619 but it doesn't exist anymore, probably already processed its split
      (repeated forever)

      The master processes the split but when it calls ZKAssign.deleteNode it doesn't check the boolean that's returned. In this case it was false. So for the master the split was completed, but for the region server it's another story:

      2011-11-15 22:28:57,661 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:62023-0x132f043bbde08d3 Attempting to transition node e5be6551c8584a6a1065466e520faf4e from RS_ZK_REGION_SPLITTING to RS_ZK_REGION_SPLIT
      2011-11-15 22:28:57,775 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:62023-0x132f043bbde08d3 Successfully transitioned node e5be6551c8584a6a1065466e520faf4e from RS_ZK_REGION_SPLITTING to RS_ZK_REGION_SPLIT
      2011-11-15 22:28:57,775 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting on the master to process the split for e5be6551c8584a6a1065466e520faf4e
      2011-11-15 22:28:57,876 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:62023-0x132f043bbde08d3 Attempting to transition node e5be6551c8584a6a1065466e520faf4e from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT
      2011-11-15 22:28:57,967 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:62023-0x132f043bbde08d3 Successfully transitioned node e5be6551c8584a6a1065466e520faf4e from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT
      2011-11-15 22:28:58,067 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:62023-0x132f043bbde08d3 Attempting to transition node e5be6551c8584a6a1065466e520faf4e from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT
      2011-11-15 22:28:58,108 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:62023-0x132f043bbde08d3 Successfully transitioned node e5be6551c8584a6a1065466e520faf4e from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT
      (printed forever)

      Since the znode isn't really deleted, it thinks the master just haven't got to process its region thus waits which leaves the region unavailable.

      We need to just retry the delete master-side ASAP since the RS will wait 100ms between retries.

      At the same time, it would be nice if ZKAssign.deleteNode always printed out the name of the region in its messages because it took me a while to see that the delete didn't take affect while looking at a grep.

      1. HBASE-4792-0.92.patch
        2 kB
        Jean-Daniel Cryans

        Activity

        Jean-Daniel Cryans created issue -
        Jean-Daniel Cryans made changes -
        Field Original Value New Value
        Assignee Jean-Daniel Cryans [ jdcryans ]
        Hide
        Jean-Daniel Cryans added a comment -

        Patch that does the fix I described and adds more info ZKAssign WARN logs.

        It probably needs a unit test.

        Show
        Jean-Daniel Cryans added a comment - Patch that does the fix I described and adds more info ZKAssign WARN logs. It probably needs a unit test.
        Jean-Daniel Cryans made changes -
        Attachment HBASE-4792-0.92.patch [ 12503818 ]
        Hide
        stack added a comment -

        Looks good to me. Just logging changes and looking at returned boolean. Too small for unit test. +1 on commit for 0.92.

        Show
        stack added a comment - Looks good to me. Just logging changes and looking at returned boolean. Too small for unit test. +1 on commit for 0.92.
        Hide
        Jean-Daniel Cryans added a comment -

        Committed to branch and trunk, thanks for looking at my code Stack.

        Show
        Jean-Daniel Cryans added a comment - Committed to branch and trunk, thanks for looking at my code Stack.
        Jean-Daniel Cryans made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Hadoop Flags Reviewed [ 10343 ]
        Resolution Fixed [ 1 ]
        Hide
        Ted Yu added a comment -

        A loop was introduced which doesn't have sleep().
        Should we exit the loop after certain number of attempts (certain amount of time)?

        Also:

           * @throws KeeperException.NoNodeException if node does not exist
           */
          public static boolean deleteNode(ZooKeeperWatcher zkw, String regionName,
              EventType expectedState)
        

        We should handle KeeperException.NoNodeException differently from other KeeperException's.

        Show
        Ted Yu added a comment - A loop was introduced which doesn't have sleep(). Should we exit the loop after certain number of attempts (certain amount of time)? Also: * @ throws KeeperException.NoNodeException if node does not exist */ public static boolean deleteNode(ZooKeeperWatcher zkw, String regionName, EventType expectedState) We should handle KeeperException.NoNodeException differently from other KeeperException's.
        Hide
        Jean-Daniel Cryans added a comment -

        It mirrors what's done RS-side, there's no limit there either. We could fix that too if we want.

        The reason there's no sleep is because the RS is sleeping and we don't want to hit the same race another time.

        Show
        Jean-Daniel Cryans added a comment - It mirrors what's done RS-side, there's no limit there either. We could fix that too if we want. The reason there's no sleep is because the RS is sleeping and we don't want to hit the same race another time.
        Hide
        Hudson added a comment -

        Integrated in HBase-0.92 #134 (See https://builds.apache.org/job/HBase-0.92/134/)
        HBASE-4792 SplitRegionHandler doesn't care if it deletes the znode or not,
        leaves the parent region stuck offline

        jdcryans :
        Files :

        • /hbase/branches/0.92/CHANGES.txt
        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/handler/SplitRegionHandler.java
        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java
        Show
        Hudson added a comment - Integrated in HBase-0.92 #134 (See https://builds.apache.org/job/HBase-0.92/134/ ) HBASE-4792 SplitRegionHandler doesn't care if it deletes the znode or not, leaves the parent region stuck offline jdcryans : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/handler/SplitRegionHandler.java /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java
        Hide
        Hudson added a comment -

        Integrated in HBase-TRUNK #2445 (See https://builds.apache.org/job/HBase-TRUNK/2445/)
        Adding the 0.92 entry for HBASE-4792
        HBASE-4792 SplitRegionHandler doesn't care if it deletes the znode or not,
        leaves the parent region stuck offline

        jdcryans :
        Files :

        • /hbase/trunk/CHANGES.txt

        jdcryans :
        Files :

        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/SplitRegionHandler.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java
        Show
        Hudson added a comment - Integrated in HBase-TRUNK #2445 (See https://builds.apache.org/job/HBase-TRUNK/2445/ ) Adding the 0.92 entry for HBASE-4792 HBASE-4792 SplitRegionHandler doesn't care if it deletes the znode or not, leaves the parent region stuck offline jdcryans : Files : /hbase/trunk/CHANGES.txt jdcryans : Files : /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/SplitRegionHandler.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java
        Lars Hofhansl made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Jean-Daniel Cryans
            Reporter:
            Jean-Daniel Cryans
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development