HBase / HBASE-5120

Timeout monitor races with table disable handler

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.92.0
    • Fix Version/s: 0.94.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Here is what J-D described in this HBASE-5119 comment:
      https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176

      I think I will retract my statement that it "used to be extremely racy and caused more troubles than it fixed"; on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.

      First the region gets closed:

      2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.

      2 minutes later it times out:

      2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
      2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
      2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)

      100ms later the master finally gets the event:

      2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
      2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
      2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
      2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
      2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED

      At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?

      2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
      2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791

      Now the master is confused: it recreated the RIT znode, but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Here is what happens next.

      The late ZK notification that the znode was deleted (but it got recreated after):

      2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.

      Then it prints this, and much later tries to unassign it again:

      2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
      ...
      2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
      2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
      2012-01-04 00:20:39,864 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
      2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
      2012-01-04 00:20:39,865 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it is not currently assigned anywhere

      And this is still ongoing.
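
      To make the interleaving above easier to follow, here is a minimal sketch of the kind of guard the timeout monitor needs before forcing another unassign. This is an illustration only, not code from the attached patch; the field and accessor names (regionsInTransition, isPendingClose(), isClosing(), unassign()) are assumed to match the AssignmentManager of this era.

      // Hypothetical sketch, not the HBASE-5120 patch.
      void maybeForceUnassign(HRegionInfo region) {
        synchronized (regionsInTransition) {
          RegionState state = regionsInTransition.get(region.getEncodedName());
          // If the ClosedRegionHandler already removed the RIT entry because the
          // table is being disabled, there is nothing left to close; re-issuing the
          // unassign here would recreate the unassigned znode for a region that has
          // already been processed as closed (the situation in the logs above).
          if (state == null || (!state.isPendingClose() && !state.isClosing())) {
            return;
          }
        }
        // Still legitimately stuck in PENDING_CLOSE/CLOSING: retry the CLOSE.
        unassign(region, true);
      }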

      1. HBASE-5120.patch
        4 kB
        ramkrishna.s.vasudevan
      2. HBASE-5120_5.patch
        4 kB
        ramkrishna.s.vasudevan
      3. HBASE-5120_5.patch
        4 kB
        ramkrishna.s.vasudevan
      4. HBASE-5120_4.patch
        4 kB
        ramkrishna.s.vasudevan
      5. HBASE-5120_3.patch
        4 kB
        ramkrishna.s.vasudevan
      6. HBASE-5120_2.patch
        4 kB
        ramkrishna.s.vasudevan
      7. HBASE-5120_1.patch
        4 kB
        ramkrishna.s.vasudevan

        Activity

        stack added a comment -

        Closed. Reopen if I have it wrong Ram.

        Lars Hofhansl added a comment -

        Can we close this issue?

        Hudson added a comment -

        Integrated in HBase-TRUNK #2665 (See https://builds.apache.org/job/HBase-TRUNK/2665/)
        HBASE-5120 Timeout monitor races with table disable handler (Revision 1245731)

        Result = FAILURE
        stack :
        Files :

        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        Hudson added a comment -

        Integrated in HBase-TRUNK-security #114 (See https://builds.apache.org/job/HBase-TRUNK-security/114/)
        HBASE-5120 Timeout monitor races with table disable handler (Revision 1245731)

        Result = FAILURE
        stack :
        Files :

        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        stack added a comment -

        I did not commit to 0.90 either.

        stack added a comment -

        I committed to trunk. Will not commit to 0.92. Not important enough of a bug I'd say.

        ramkrishna.s.vasudevan added a comment -

        @Stack
        Ok. That sounds good. But currently our clusters are running 0.90.x.
        It may take some more time for me to bring out the results. In the meantime, if someone volunteers to check this, I would be thankful to them.
        Anyway, I will also try to check this out. Thanks Stack.

        stack added a comment -

        So, setting down the TM period from 30 minutes to 2 minutes and doing a bunch of online enable/disable works? If so, I'm +1 on this patch. (We'd actually set the timeout down from 30 minutes to 5 minutes over in HBASE-5119?)

        ramkrishna.s.vasudevan added a comment -

        @Stack
        Can we commit this now to trunk?

        ramkrishna.s.vasudevan added a comment -

        I was just browsing through HBASE-4015, where the Timeout monitor was refactored.
        With 5 secs as the timeout period it was tested by balancing, killing and bringing up RSs. Things came out fine.
        But this disable scenario was missed.
        Another change that I could see is that when HBASE-4015 was done, the forceful unassign() checked whether the node was present in CLOSING state and, if so, did not proceed with it.
        In the recent code that check is removed. Maybe that exposed the problem.
        Thanks to J-D for pointing this out. As per J-D, if after reducing the timeout period we don't run into this type of issue, then we can say the TM is really fixed.
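
        For reference, a rough reconstruction of the kind of guard being described (hypothetical; the ZKAssign/RegionTransitionData calls are assumed from that era's API, and this is not the actual pre-HBASE-4015 code): before forcing the unassign, look at the unassigned znode and bail out if a CLOSE is already in flight.

        // Hypothetical reconstruction of the removed check; not actual HBase code.
        RegionTransitionData data = ZKAssign.getData(master.getZooKeeper(),
            region.getEncodedName());
        if (data != null
            && data.getEventType() == EventHandler.EventType.M_ZK_REGION_CLOSING) {
          // A CLOSE is already in progress for this region; skip the forced unassign.
          return;
        }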

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12510473/HBASE-5120_5.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 javadoc. The javadoc tool appears to have generated -147 warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 81 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.replication.TestReplication

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/753//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/753//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/753//console

        This message is automatically generated.

        ramkrishna.s.vasudevan added a comment -

        @J-D and @Lars
        Latest patch uploaded. Sorry for uploading it with the same name.
        Hope this patch addresses your concerns?

        ramkrishna.s.vasudevan added a comment -

        @J-D

        boolean deleteNode = ZKAssign.deleteNode(master.getZooKeeper(),
            region.getEncodedName(), EventHandler.EventType.RS_ZK_REGION_CLOSED);
        // TODO : We don't abort if the delete node returns false. Is there any
        // such corner case?
        LOG.error("The deletion of the CLOSED node for the region "
            + region.getEncodedName() + " returned " + deleteNode);

        This piece of code should have been

        if (!deleteNode) {
          LOG.error("The deletion of the CLOSED node for the region "
              + region.getEncodedName() + " returned " + deleteNode);
        }

        Because, as per the first set of logs, the node is indeed in the expected CLOSED state; that is why it returns true and also deletes the node.
        If we add an 'if' check before the log it won't log an error.

        Jean-Daniel Cryans added a comment -

        Testing the patch with a low timeout, I can answer the question in the code that asks "We don't abort if the delete node returns false. Is there any such corner case?" and yes here it is:

        2012-01-13 00:53:39,053 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. state=PENDING_CLOSE, ts=1326415997208, server=null
        2012-01-13 00:53:39,053 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a.
        2012-01-13 00:53:39,053 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. (offlining)
        2012-01-13 00:53:39,254 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempting to unassign region TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. which is already PENDING_CLOSE but forcing to send a CLOSE RPC again 
        2012-01-13 00:53:39,255 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: update TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. state=PENDING_CLOSE, ts=1326416019254, server=null the timestamp.
        2012-01-13 00:53:39,256 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server sv4r12s38,62023,1326415651391 returned org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException: org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException: Received:CLOSE for the region:TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. ,which we are already trying to CLOSE. for 0784c045e00205949461cb21b8f4cd6a
        2012-01-13 00:54:09,051 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. state=PENDING_CLOSE, ts=1326416019256, server=null
        2012-01-13 00:54:09,051 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a.
        2012-01-13 00:54:09,051 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. (offlining)
        2012-01-13 00:54:09,126 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempting to unassign region TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. which is already PENDING_CLOSE but forcing to send a CLOSE RPC again 
        2012-01-13 00:54:09,127 INFO org.apache.hadoop.hbase.master.AssignmentManager: While trying to recover the table TestTable to DISABLED state the region {NAME => 'TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a.', STARTKEY => '0006605550', ENDKEY => '0006616035', ENCODED => 0784c045e00205949461cb21b8f4cd6a,} was offlined but the table was in DISABLING state
        2012-01-13 00:54:09,127 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db06d62 Deleting existing unassigned node for 0784c045e00205949461cb21b8f4cd6a that is in expected state M_ZK_REGION_CLOSING
        2012-01-13 00:54:09,128 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db06d62 Attempting to delete unassigned node 0784c045e00205949461cb21b8f4cd6a in M_ZK_REGION_CLOSING state but node is in RS_ZK_REGION_CLOSED state
        2012-01-13 00:54:09,128 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db06d62 Deleting existing unassigned node for 0784c045e00205949461cb21b8f4cd6a that is in expected state RS_ZK_REGION_CLOSED
        2012-01-13 00:54:09,140 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r12s38,62023,1326415651391, region=0784c045e00205949461cb21b8f4cd6a
        2012-01-13 00:54:09,140 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 0784c045e00205949461cb21b8f4cd6a from server sv4r12s38,62023,1326415651391 but region was in  the state null and not in expected PENDING_CLOSE or CLOSING states
        2012-01-13 00:54:09,148 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db06d62 Successfully deleted unassigned node for region 0784c045e00205949461cb21b8f4cd6a in expected state RS_ZK_REGION_CLOSED
        2012-01-13 00:54:09,148 ERROR org.apache.hadoop.hbase.master.AssignmentManager: The deletion of the CLOSED node for the region 0784c045e00205949461cb21b8f4cd6a returned true
        2012-01-13 00:54:09,148 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server sv4r12s38,62023,1326415651391 returned org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Received close for TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. but we are not serving it for 0784c045e00205949461cb21b8f4cd6a
        

        It turned out OK even though I had 7 regions that did that.

        I also got the "CLOSING/CLOSED node for the region x already deleted" message, here's the context:

        2012-01-13 00:53:39,144 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. state=PENDING_CLOSE, ts=1326415995575, server=null
        2012-01-13 00:53:39,144 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48.
        2012-01-13 00:53:39,145 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. (offlining)
        2012-01-13 00:53:39,152 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempting to unassign region TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. which is already PENDING_CLOSE but forcing to send a CLOSE RPC again 
        2012-01-13 00:53:39,169 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: update TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. state=PENDING_CLOSE, ts=1326416019152, server=null the timestamp.
        2012-01-13 00:53:39,172 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server sv4r27s44,62023,1326415651133 returned org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException: org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException: Received:CLOSE for the region:TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. ,which we are already trying to CLOSE. for f6c862f9c16b15b9227c2ed46865fb48
        2012-01-13 00:54:09,070 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. state=PENDING_CLOSE, ts=1326416019172, server=null
        2012-01-13 00:54:09,070 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48.
        2012-01-13 00:54:09,070 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. (offlining)
        2012-01-13 00:54:09,076 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempting to unassign region TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. which is already PENDING_CLOSE but forcing to send a CLOSE RPC again 
        2012-01-13 00:54:09,091 INFO org.apache.hadoop.hbase.master.AssignmentManager: While trying to recover the table TestTable to DISABLED state the region {NAME => 'TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48.', STARTKEY => '0004613400', ENDKEY => '0004623885', ENCODED => f6c862f9c16b15b9227c2ed46865fb48,} was offlined but the table was in DISABLING state
        2012-01-13 00:54:09,106 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r27s44,62023,1326415651133, region=f6c862f9c16b15b9227c2ed46865fb48
        2012-01-13 00:54:09,106 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for f6c862f9c16b15b9227c2ed46865fb48
        2012-01-13 00:54:09,106 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48.
        2012-01-13 00:54:09,106 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db06d62 Deleting existing unassigned node for f6c862f9c16b15b9227c2ed46865fb48 that is in expected state RS_ZK_REGION_CLOSED
        2012-01-13 00:54:09,115 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db06d62 Successfully deleted unassigned node for region f6c862f9c16b15b9227c2ed46865fb48 in expected state RS_ZK_REGION_CLOSED
        2012-01-13 00:54:09,127 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db06d62 Deleting existing unassigned node for f6c862f9c16b15b9227c2ed46865fb48 that is in expected state M_ZK_REGION_CLOSING
        2012-01-13 00:54:09,128 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: CLOSING/CLOSED node for the region f6c862f9c16b15b9227c2ed46865fb48 already deleted
        2012-01-13 00:54:09,128 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server sv4r27s44,62023,1326415651133 returned org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Received close for TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. but we are not serving it for f6c862f9c16b15b9227c2ed46865fb48
        

        Again it turned out fine.

        I'm +1 after fixing the comment about the corner case.

        Lars Hofhansl added a comment -

        +1 on v5

        Nit: this is a bit weird

        +      if (!ZKAssign.deleteNode(master.getZooKeeper(), region.getEncodedName(),
        +          EventHandler.EventType.M_ZK_REGION_CLOSING)) {
        +        boolean deleteNode = ZKAssign.deleteNode(master.getZooKeeper(), region
        +            .getEncodedName(), EventHandler.EventType.RS_ZK_REGION_CLOSED);
        +        // TODO : We don't abort if the delete node returns false. Is there any
        +        // such corner case?
        +        LOG.error("The deletion of the CLOSED node for the region "
        +            + region.getEncodedName() + " returned " + deleteNode);
        

        why not

        +      boolean deleteNode = ZKAssign.deleteNode(master.getZooKeeper(), region
        +            .getEncodedName(), EventHandler.EventType.RS_ZK_REGION_CLOSED);
        +      if (!deleteNode) {
        +        // TODO : We don't abort if the delete node returns false. Is there any
        +        // such corner case?
        +        LOG.error("The deletion of the CLOSED node for the region "
        +            + region.getEncodedName() + " returned " + deleteNode);
        
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12510211/HBASE-5120_5.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 javadoc. The javadoc tool appears to have generated -147 warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 79 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.mapreduce.TestImportTsv
        org.apache.hadoop.hbase.mapred.TestTableMapReduce
        org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/731//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/731//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/731//console

        This message is automatically generated.

        Ted Yu added a comment -

        +1 on patch v5.

        ramkrishna.s.vasudevan added a comment -

        Changed debug to error.

        Ted Yu added a comment - edited

        Can you change LOG.debug() to LOG.error() in deleteClosingOrClosedNode()?

        +        LOG.debug("The deletion of the CLOSED node for the region "
        +            + region.getEncodedName() + " returned " + deleteNode);
        
        ramkrishna.s.vasudevan added a comment -

        Latest patch available.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12510165/HBASE-5120_4.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 javadoc. The javadoc tool appears to have generated -147 warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 79 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
        org.apache.hadoop.hbase.io.hfile.TestLruBlockCache
        org.apache.hadoop.hbase.mapred.TestTableMapReduce
        org.apache.hadoop.hbase.mapreduce.TestImportTsv

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/726//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/726//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/726//console

        This message is automatically generated.

        ramkrishna.s.vasudevan added a comment -

        @Ted
        Ok. I will make a patch. Before making the next patch I will wait for others' comments.
        @Stack
        Any comments?

        Ted Yu added a comment - edited
        +        LOG.debug("The deletion of the CLOSED node returned " + deleteNode);
        

        I think the above should be at ERROR level. Also, please log the region name.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12510054/HBASE-5120_3.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 javadoc. The javadoc tool appears to have generated -151 warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 80 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.regionserver.wal.TestLogRolling
        org.apache.hadoop.hbase.mapreduce.TestImportTsv
        org.apache.hadoop.hbase.mapred.TestTableMapReduce
        org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/718//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/718//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/718//console

        This message is automatically generated.

        ramkrishna.s.vasudevan added a comment -

        @Ted
        Thanks for your review. I have updated the patch.

        Ted Yu added a comment -

        I think deleteClosingOrClosedNode() should catch KeeperException and abort, considering that code appears twice in the patch.

        There is no checking in deleteClosingOrClosedNode() on the return value for the second call to ZKAssign.deleteNode(). At least we should log the return value.
        There might be more corner cases that we discover in the future.

        Good job.
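
        Putting these two review points together, a possible shape for the helper is sketched below. This is a reading of the review comments, not the committed HBASE-5120 code; the method signature, the abort wording, and the NoNodeException handling are assumptions.

        // Sketch only: combines the review suggestions above (log the second delete's
        // return value, catch KeeperException and abort); not the committed patch.
        private void deleteClosingOrClosedNode(HRegionInfo region) {
          try {
            if (!ZKAssign.deleteNode(master.getZooKeeper(), region.getEncodedName(),
                EventHandler.EventType.M_ZK_REGION_CLOSING)) {
              boolean deleteNode = ZKAssign.deleteNode(master.getZooKeeper(),
                  region.getEncodedName(), EventHandler.EventType.RS_ZK_REGION_CLOSED);
              if (!deleteNode) {
                LOG.error("The deletion of the CLOSED node for the region "
                    + region.getEncodedName() + " returned " + deleteNode);
              }
            }
          } catch (KeeperException.NoNodeException nne) {
            // Another actor (e.g. the ClosedRegionHandler) already deleted the node.
            LOG.debug("CLOSING/CLOSED node for the region " + region.getEncodedName()
                + " already deleted");
          } catch (KeeperException ke) {
            // Any other ZooKeeper failure leaves the assignment state unknown: abort.
            master.abort("Unexpected ZK exception deleting the CLOSING/CLOSED node for "
                + region.getEncodedName(), ke);
          }
        }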

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12510034/HBASE-5120_2.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 javadoc. The javadoc tool appears to have generated -151 warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 80 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.mapreduce.TestImportTsv
        org.apache.hadoop.hbase.mapred.TestTableMapReduce
        org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/716//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/716//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/716//console

        This message is automatically generated.

        ramkrishna.s.vasudevan added a comment -

        @Ted
        I feel abort is not needed. I cannot see a place where the znode can be in another state. Also, if we do that and the znode is changed in the assign flow, we would still go ahead with the deletion and abort the master if the deletion fails.

        Just my thought.

        Ted Yu added a comment -
        +        ZKAssign.deleteNode(master.getZooKeeper(), region.getEncodedName(),
        +            EventHandler.EventType.RS_ZK_REGION_CLOSED);
        

        If the above call returns false (without exception), I think we should abort.
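
        A literal sketch of that suggestion (hypothetical; the abort message is invented, and the surrounding method is assumed to have the master and region in scope):

        // Hypothetical sketch of the abort-on-false suggestion; not the committed code.
        boolean deleted = ZKAssign.deleteNode(master.getZooKeeper(),
            region.getEncodedName(), EventHandler.EventType.RS_ZK_REGION_CLOSED);
        if (!deleted) {
          master.abort("Could not delete the CLOSED node for region "
              + region.getEncodedName(), null); // no underlying exception to report
        }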

        ramkrishna.s.vasudevan added a comment -

        @Ted

        catch (KeeperException e) {

        On getting a KeeperException we need to abort, but not on NoNodeException. I missed this in the recent patch. But I don't find a use for the return value in this case; that is why it was not used. Maybe if we get false we can just log the message. But ideally, as per the analysis, the node can only be in CLOSING or CLOSED state.

        ramkrishna.s.vasudevan added a comment -

        Regards server == null, how would this arise?

        This will come when the ClosedRegionHandler does

        assignmentManager.offlineDisabledRegion(regionInfo);
        

        inside which we do

        regionOffline(regionInfo);
        
        public void setOffline(HRegionInfo regionInfo) {
            synchronized (this.regions) {
              ServerName sn = this.regions.remove(regionInfo);
        

        This is where we get a null ServerName.
        Now coming to the case where the server is null but we still have an entry in RIT: when the ClosedRegionHandler is not yet done and the TM kicks in and starts another unassign, there is a window where the RIT entry is removed by the ClosedRegionHandler and then the forceful unassign adds it back in PENDING_CLOSE:

        synchronized (regionsInTransition) {
              state = regionsInTransition.get(encodedName);
              if (state == null) {
        .....
                state = new RegionState(region, RegionState.State.PENDING_CLOSE);
                regionsInTransition.put(encodedName, state);
        

        One more condition is when a balance happens and a region is about to get assigned by the ClosedRegionHandler:

            assignmentManager.setOffline(regionInfo);
            assignmentManager.removeClosedRegion(regionInfo);
            assignmentManager.assign(regionInfo, true);
        

        In this case this.regions can have a null server, but as per the patch we don't remove the entry from RIT, because we check for the PENDING_CLOSE or CLOSING state; in the balancer case the RIT state is CLOSED (see the sketch below).

        This patch addresses the scenario where the TM and the ClosedRegionHandler race, coming through both the disable flow and the balancer flow. Please provide your comments.
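
        A minimal sketch of that state check (illustrative only; it assumes RegionState exposes isPendingClose()/isClosing() helpers and uses the deleteClosingOrClosedNode() name discussed below, not the committed patch):

        synchronized (regionsInTransition) {
          RegionState rit = regionsInTransition.get(region.getEncodedName());
          // Only clean up when the timed-out unassign left the region in
          // PENDING_CLOSE or CLOSING; a CLOSED entry (balancer flow) is left
          // for the ClosedRegionHandler to finish.
          if (rit != null && (rit.isPendingClose() || rit.isClosing())) {
            regionsInTransition.remove(region.getEncodedName());
            deleteClosingOrClosedNode(region);
          }
        }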

        stack added a comment -

        Convention is to flip this so server == null rather than null == server:

        +    if (null == server) {
        

        Ditto further down in the code.

        Regards server == null, how would this arise? I'm not clear on how we'd have a null server and still have a region in this.regions. If there is a legit circumstance, then I can see this patch going in – though it does not address the core issue of our not working properly when there are concurrent actors, in this case the TM and a disable (I still think the TM should be turned way down or off altogether).

        Ted Yu added a comment -

        I see different exception handling w.r.t. KeeperException in the changes.

        +      } catch (KeeperException e) {
        +        LOG.debug("CLOSING node for the region " + region.getEncodedName()
        +            + " already deleted");
        

        Can you explain why?

        For the new method, deleteClosingOrClosedNode(), I think it should utilize the return value from ZKAssign.deleteNode().
        Meaning, deleteClosingOrClosedNode() should return a boolean.
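
        One possible shape of that (a sketch assuming the method first tries the CLOSING state and falls back to CLOSED; not the committed code):

        private boolean deleteClosingOrClosedNode(HRegionInfo region) {
          try {
            // Try the CLOSING node first; if it is no longer in that state,
            // fall back to deleting it as CLOSED.
            if (ZKAssign.deleteNode(master.getZooKeeper(), region.getEncodedName(),
                EventHandler.EventType.M_ZK_REGION_CLOSING)) {
              return true;
            }
            return ZKAssign.deleteNode(master.getZooKeeper(), region.getEncodedName(),
                EventHandler.EventType.RS_ZK_REGION_CLOSED);
          } catch (KeeperException.NoNodeException nne) {
            LOG.debug("Node for " + region.getEncodedName() + " already deleted");
            return false;
          } catch (KeeperException e) {
            master.abort("Unexpected ZK exception deleting node for "
                + region.getEncodedName(), e);
            return false;
          }
        }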

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12509921/HBASE-5120_1.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        -1 javadoc. The javadoc tool appears to have generated -151 warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 80 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
        org.apache.hadoop.hbase.mapred.TestTableMapReduce
        org.apache.hadoop.hbase.mapreduce.TestImportTsv

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/708//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/708//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/708//console

        This message is automatically generated.

        ramkrishna.s.vasudevan added a comment -

        The patch addresses the issue, but it does not make the code cleaner the way Stack wanted.
        @Stack
        We may need to make the TM and unassign flow lighter; this patch addresses the problem reported in this JIRA.

        ramkrishna.s.vasudevan added a comment -

        @Stack
        Yes Stack, it is 'impossible'. But we still need to remove the region from RIT and also delete the CLOSING node if one was created.
        By tomorrow I can submit the patch after testing.
        If you have any ideas, please suggest them. Also correct me if I am wrong.

        stack added a comment -

        @Ram It is odd that server is null; this should be 'impossible', right?

        ramkrishna.s.vasudevan added a comment -

        @Stack
        Yes, if the server in this.regions is null we should not go ahead with the unassign at all.
        But there is one more condition where the RIT was getting a null server destination.
        That is what I wanted to address. This patch is not final, as I wanted to test some more conditions with it.

        Rewriting the AM is also what we thought about after the TimeoutMonitor refactor, as adding test cases was difficult.

        stack added a comment -

        On the patch, Ram, should we even try the unassign if the server in this.regions is null; i.e. should we fail way earlier?

        IMO, this AM needs a rewrite, if only to make its various transitions testable. There are also too many methods doing similar things with lots of overlap; it's really hard to figure out what's going on.

        stack added a comment -

        When would it ever make sense for the TimeoutMonitor to ask a regionserver to RE-close a region? It implies that somehow the regionserver 'forgot' to do the close or that it dropped the message. If either happened, we have bigger problems, and having the TM issue a new close will only make it worse (if an RS 'forgot' or 'failed' a close, it is probably on its way out, its ephemeral node is about to vanish, and the master shutdown handler will do the close fixups; if it does not work this way, we should make it so?).

        TM might make some sense for region openings, but even here I'd think the above applies; i.e. if an RS can't open a region, it should kill itself. If the region is bad, it's unlikely any RS will open it unless the error is transient. For this latter case the TM running every 30 minutes, while conservative, is probably fine as a default.

        I think we should leave the TM at its current 30 minute timeout and undo this issue as critical against 0.92.

        There are outstanding race conditions in HBase around the shutdown handler and master operations that deserve more attention than this, IMO.
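
        For reference, a sketch of how that 30 minute default would look in hbase-site.xml (assuming the 0.92-era property name; the value is in milliseconds):

        <property>
          <name>hbase.master.assignment.timeoutmonitor.timeout</name>
          <!-- 30 minutes -->
          <value>1800000</value>
        </property>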

        ramkrishna.s.vasudevan added a comment -

        @Lars and @Ted
        Thanks for your reviews.
        I was also not convinced about adding the instanceof check for the NPE; it does not look good.
        I was trying the other option of not calling sendRegionClose if the server is null, and also removing the RIT entry if one has been added (see the sketch below).
        But I need to verify some more scenarios. I'm not sure I can dedicate time tomorrow, as we have an internal workshop. I will work on this and post my patch soon.
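
        A rough sketch of that alternative (illustrative only; it assumes the this.regions and regionsInTransition fields quoted elsewhere in this issue):

        ServerName server;
        synchronized (this.regions) {
          server = this.regions.get(region);
        }
        if (server == null) {
          // Region is already offline (e.g. handled by the ClosedRegionHandler):
          // skip sendRegionClose and clear any RIT entry the forced unassign added.
          LOG.debug("Server is null for " + region.getEncodedName()
              + "; skipping unassign");
          synchronized (regionsInTransition) {
            regionsInTransition.remove(region.getEncodedName());
          }
          return;
        }
        // ...otherwise fall through to the normal sendRegionClose path.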

        Ted Yu added a comment -

        Thanks for the quick turnaround.

        The patch depends on a NullPointerException to detect the problematic region.
        I think we should use a more protective measure, as Lars commented.
        The patch specifies M_ZK_REGION_CLOSING for ZKAssign.deleteNode(); this narrows the applicability of the NPE handling.

        Lars Hofhansl added a comment - edited

        Hey Ram, I know you said this is not done, yet...

        if (t instanceof NullPointerException) {
          removeRegionInTransition(region);
        ...
        

        Could we instead add a null check at the relevant point and deal with it there (or maybe throw another exception)? (Dealing with this after the NPE occurred leaves a bad taste... What if we introduce another bug later that also causes an NPE? That would go unnoticed for a while.)

        ramkrishna.s.vasudevan added a comment -

        Attaching the patch so that I can access it at home. First-cut version, not fully tested.

        Lars Hofhansl added a comment -

        I see, thanks Ted.
        Ram, you're becoming the assignment/split/race-condition expert...

        ramkrishna.s.vasudevan added a comment -

        I will take a look at this issue.

        Ted Yu added a comment -

        I followed J-D's suggestion of keeping the original description for HBASE-5119.
        The plan is that once this bug is fixed, HBASE-5119 will lower the default timeout.

        Lars Hofhansl added a comment -

        HBASE-5119 is almost identical.

        Ted Yu added a comment -

        J-D ended up issuing 'kill -9' to the master.


          People

          • Assignee:
            ramkrishna.s.vasudevan
          • Reporter:
            Ted Yu
          • Votes:
            0
          • Watchers:
            4
