HBase / HBASE-4052

Enabling a table after master switch does not allow table scan, throwing NotServingRegionException

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.90.3
    • Fix Version/s: 0.90.4
    • Component/s: None
    • Labels:
      None
    • Environment:

      Linux

    • Hadoop Flags:
      Reviewed

      Description

      The scenario is as follows:

      Start the RS and the active and standby masters.
      Create a table and insert data.
      Disable the table.
      Stop the active master and switch to the standby master.
      Enable the table.
      Scan the enabled table.
      A NotServingRegionException is thrown.

      The same sequence works when the master is not switched.

      1. HBASE-4052-1-trunk_1.patch
        5 kB
        ramkrishna.s.vasudevan
      2. HBASE-4052-1-_TestCode.patch
        6 kB
        ramkrishna.s.vasudevan
      3. HBASE-4052-1-0.90_1.patch
        5 kB
        ramkrishna.s.vasudevan
      4. 4052.txt
        13 kB
        Ted Yu
      5. HBASE-4052.patch
        13 kB
        ramkrishna.s.vasudevan
      6. Disabling-2.bmp
        1.61 MB
        ramkrishna.s.vasudevan
      7. Disabling-1.bmp
        1.56 MB
        ramkrishna.s.vasudevan
      8. Disabled.bmp
        1.62 MB
        ramkrishna.s.vasudevan
      9. TestMasterRestartAfterDisablingTable.java
        7 kB
        ramkrishna.s.vasudevan

        Activity

        lars_francke Lars Francke added a comment -

        This issue was closed as part of a bulk closing operation on 2015-11-20. All issues that have been resolved and where all fixVersions have been released have been closed (following discussions on the mailing list).

        hudson Hudson added a comment -

        Integrated in HBase-TRUNK #2037 (See https://builds.apache.org/job/HBase-TRUNK/2037/)
        HBASE-4052 reapply code in unassign() which handles NotServingRegionException

        tedyu :
        Files :

        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        yuzhihong@gmail.com Ted Yu added a comment -

        Handling of NotServingRegionException has been moved to the last catch block.

        ram_krish ramkrishna.s.vasudevan added a comment -

        @Ted, one small correction to the patch as it was applied.

        The patch catches a generic Throwable, unwraps it if it is a RemoteException, and then
        checks whether the result is an instance of NotServingRegionException.

        catch (Throwable t) {
              if (t instanceof RemoteException) {
                t = ((RemoteException)t).unwrapRemoteException();
              }
        

        But the latest AssignmentManager.java in trunk shows that the handling was applied in the typed catch block instead:

        catch (NotServingRegionException nsre) {
              LOG.info("Server " + server + " returned " + nsre + " for " +
                region.getEncodedName());
              // Presume that master has stale data.  Presume remote side just split.
              // Presume that the split message when it comes in will fix up the master's
              // in memory cluster state.
              if (checkIfRegionBelongsToDisabling(region)) {
        

        So the partial-disabling scenario fails to recover to the DISABLED state.
        Could you please check and reapply the patch?
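
The difference between the two catch styles can be shown with a minimal, self-contained sketch. It assumes the region server's exception reaches the master wrapped in a RemoteException, as the comment above describes. The exception classes, the RegionCloser interface, and the returned labels are hypothetical stand-ins written for this illustration; they are not the real Hadoop/HBase classes or the actual AssignmentManager code.

```java
import java.io.IOException;

// Hypothetical stand-ins for the Hadoop/HBase exception types (not the real classes).
class NotServingRegionException extends IOException {
  NotServingRegionException(String msg) { super(msg); }
}

class RemoteException extends IOException {
  private final IOException wrapped;
  RemoteException(IOException wrapped) {
    super(wrapped.toString());
    this.wrapped = wrapped;
  }
  // Simplified for the sketch: just hand back the wrapped exception.
  IOException unwrapRemoteException() { return wrapped; }
}

class UnassignSketch {
  // Hypothetical stand-in for the RPC that asks a region server to close a region.
  interface RegionCloser { void close(String region) throws IOException; }

  // Style described in the comment: catch Throwable, unwrap, then test the type.
  static String unassign(RegionCloser server, String region) {
    try {
      server.close(region);
      return "CLOSE_SENT";
    } catch (Throwable t) {
      if (t instanceof RemoteException) {
        t = ((RemoteException) t).unwrapRemoteException();
      }
      if (t instanceof NotServingRegionException) {
        return "NOT_SERVING_HANDLED";
      }
      return "OTHER_FAILURE";
    }
  }

  // Typed catch only: a wrapped exception never matches the first catch clause.
  static String unassignTypedCatchOnly(RegionCloser server, String region) {
    try {
      server.close(region);
      return "CLOSE_SENT";
    } catch (NotServingRegionException nsre) {
      return "NOT_SERVING_HANDLED";
    } catch (Throwable t) {
      return "OTHER_FAILURE";
    }
  }

  public static void main(String[] args) {
    // Simulated server that reports the region as not served, wrapped remotely.
    RegionCloser remote = r -> {
      throw new RemoteException(new NotServingRegionException(r));
    };
    System.out.println(unassign(remote, "region-1"));               // NOT_SERVING_HANDLED
    System.out.println(unassignTypedCatchOnly(remote, "region-1")); // OTHER_FAILURE
  }
}
```

Under that assumption, only the unwrap-then-check style reaches the NotServingRegionException handling, which is where the check for a region belonging to a disabling table would need to live.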

        ram_krish ramkrishna.s.vasudevan added a comment -

        Thanks for the review Ted and Stack.

        hudson Hudson added a comment -

        Integrated in HBase-TRUNK #2036 (See https://builds.apache.org/job/HBase-TRUNK/2036/)
        HBASE-4052 Use same TestMasterRestartAfterDisablingTable in branch and TRUNK
        HBASE-4052 Enabling a table after master switch does not allow table scan,
        throwing NotServingRegionException. Add unit test
        HBASE-4052 Enabling a table after master switch does not allow table scan,
        throwing NotServingRegionException (ramkrishna via Ted Yu)

        tedyu :
        Files :

        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterRestartAfterDisablingTable.java

        tedyu :
        Files :

        • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterRestartAfterDisablingTable.java

        tedyu :
        Files :

        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        • /hbase/trunk/CHANGES.txt
        yuzhihong@gmail.com Ted Yu added a comment -

        Integrated to branch and TRUNK.

        Thanks for the review Stack.

        @ramkrishna:
        In the future, please concatenate the test file patch to the main patch.

        yuzhihong@gmail.com Ted Yu added a comment -

        I applied an addendum for HBASE-3904 v6 to TRUNK. NPE is fixed.
        However, HBASE-4087 is required for the new unit test to pass.

        stack stack added a comment -

        Patch looks good to me, Ramkrishna. I tried to run your test but got an NPE on trunk that seems unrelated to your patch; I think this commit breaks the tests: 'HEAD is now at 46924a1... HBASE-3904 Addendum that fixes number of retries (Ita Pai)'

        ram_krish ramkrishna.s.vasudevan added a comment -

        Ted, I will look into the failure of org.apache.hadoop.hbase.replication.TestReplication.
        However, my local test bed shows the same test case passing:

        Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.031 sec
        Running org.apache.hadoop.hbase.rest.TestStatusResource
        Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 23.785 sec
        Running org.apache.hadoop.hbase.executor.TestExecutorService
        Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.324 sec
        Running org.apache.hadoop.hbase.client.TestFromClientSide
        Tests run: 42, Failures: 0, Errors: 0, Skipped: 3, Time elapsed: 279.871 sec
        Running org.apache.hadoop.hbase.replication.TestReplication
        Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 213.828 sec
        

        In any case, I will check whether this patch is the root cause.

        yuzhihong@gmail.com Ted Yu added a comment -

        The following test consistently failed on Linux with patch for 0.90:

        queueFailover(org.apache.hadoop.hbase.replication.TestReplication)  Time elapsed: 85.997 sec  <<< FAILURE!
        java.lang.AssertionError: Waited too much time for queueFailover replication
                at org.junit.Assert.fail(Assert.java:91)
                at org.apache.hadoop.hbase.replication.TestReplication.queueFailover(TestReplication.java:572)
                at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        
        ram_krish ramkrishna.s.vasudevan added a comment -

        Hi Ted,
        I found the reason for the following error when executing the test case:

        testForCheckingIfEnableAndDisableWorksFineAfterSwitch(org.apache.hadoop.hbase.master.TestMasterRestartAfterDisablingTable)  Time elapsed: 23.153 sec  <<< ERROR!
        java.lang.reflect.UndeclaredThrowableException
                at $Proxy12.isMasterRunning(Unknown Source)
                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getMaster(HConnectionManager.java:553)
                at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:101)
                at org.apache.hadoop.hbase.HBaseTestingUtility.getHBaseAdmin(HBaseTestingUtility.java:1162)
                at org.apache.hadoop.hbase.master.TestMasterRestartAfterDisablingTab
        

        I think this is also a bug in HBaseAdmin.getConnection().

        The reason:
        In HConnectionManager, the connection object is cached based on the HConnectionKey.
        The equals method compares the values of the CONNECTION PROPERTIES.

        If we restart or switch the master and then try to enable a table, the test code creates a new HBaseAdmin object.
        However, the connection that the Admin uses to reach the master is taken from the cache, even though a new connection is expected.

        Because none of the CONNECTION PROPERTIES values has changed, we get the same connection object that was created while the previous master was active. It still holds the old active master's address even though the master has been restarted, and hence an exception is thrown.
        Workaround:
        ==========
        To make the test case pass, we change the value of one of the CONNECTION PROPERTIES so that the cached connection object is not returned.

        I reverted HBASE-4003, and the test case passed with the above change.
        I am not sure why the RS does not check in.
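
The caching behaviour described in this comment can be modelled with a small, self-contained sketch. Everything in it is a simplified, hypothetical stand-in: ConnectionKey, Connection, the activeMaster field, the addresses, and the property name example.changed.property are invented for illustration and are not the real HConnectionManager or HConnectionKey code. It only shows why a cache keyed on unchanged properties keeps returning a connection that points at the old master, and why changing one key property sidesteps the cache.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of a connection cache keyed by configuration
// properties. None of these classes are the real HBase client classes.
class ConnectionCacheSketch {

  // Key whose equality depends only on the property values, as described above.
  static final class ConnectionKey {
    private final Map<String, String> properties;
    ConnectionKey(Map<String, String> properties) {
      this.properties = new HashMap<>(properties); // defensive copy
    }
    @Override public boolean equals(Object o) {
      return o instanceof ConnectionKey
          && properties.equals(((ConnectionKey) o).properties);
    }
    @Override public int hashCode() { return properties.hashCode(); }
  }

  // A connection remembers the master address it was created against.
  static final class Connection {
    final String masterAddress;
    Connection(String masterAddress) { this.masterAddress = masterAddress; }
  }

  private final Map<ConnectionKey, Connection> cache = new HashMap<>();
  String activeMaster = "master-a:60000"; // illustrative address

  // Returns the cached connection for equal properties, else creates one.
  Connection getConnection(Map<String, String> conf) {
    return cache.computeIfAbsent(new ConnectionKey(conf),
        k -> new Connection(activeMaster));
  }

  public static void main(String[] args) {
    ConnectionCacheSketch manager = new ConnectionCacheSketch();
    Map<String, String> conf = new HashMap<>();
    conf.put("hbase.zookeeper.quorum", "localhost");

    manager.getConnection(conf);             // created while master-a is active
    manager.activeMaster = "master-b:60000"; // simulate the master switch

    // Same properties, so the stale cached connection comes back.
    System.out.println(manager.getConnection(conf).masterAddress); // master-a:60000

    // Workaround from the comment: change one key property (placeholder name).
    conf.put("example.changed.property", "1");
    System.out.println(manager.getConnection(conf).masterAddress); // master-b:60000
  }
}
```

In this model the workaround succeeds only because the changed property produces a different cache key; the stale entry itself is never invalidated.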

        yuzhihong@gmail.com Ted Yu added a comment -

        The patch for 0.90 looks good.
        I think the patch for TRUNK should be applied at around the same time as the patch for 0.90.

        Let's spend some more time on TRUNK, although the cause of the region server check-in problem may lie elsewhere.

        ram_krish ramkrishna.s.vasudevan added a comment -

        Thanks Ted.

        I will look into why the restart or switch is not working in trunk.
        Are there any other issues with the patch or the solution, or can it be committed?

        yuzhihong@gmail.com Ted Yu added a comment -

        TestMasterRestartAfterDisablingTable passed for 0.90 branch.

        I think in TRUNK, the failure may be related to hung TestRestartCluster. See https://builds.apache.org/view/G-L/view/HBase/job/HBase-TRUNK/lastCompletedBuild/artifact/trunk/target/surefire-reports/org.apache.hadoop.hbase.master.TestRestartCluster-output.txt

        ram_krish ramkrishna.s.vasudevan added a comment -

        Hi Ted,
        I think the test case should work fine in the 0.90 branch.
        In trunk, when we do a switch, the RS is not able to connect to the new master and we get:
        2011-07-12 11:01:22,602 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting on regionserver(s) to checkin
        2011-07-12 11:01:25,602 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting on regionserver(s) to checkin
        2011-07-12 11:01:28,602 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting on regionserver(s) to checkin

        So the same problem occurs when this test case executes.

        I will try to find out why this behaviour appears in trunk and not in the 0.90 branch. Can you please tell me whether the
        test case passes in the 0.90 branch? I have verified locally by running all the test cases in the 0.90 branch, and they passed.

        Running org.apache.hadoop.hbase.master.TestMasterRestartAfterDisablingTable
        Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 26.148 sec
        Running org.apache.hadoop.hbase.regionserver.TestFSErrorsExposed
        Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 45.383 sec
        Running org.apache.hadoop.hbase.client.replication.TestReplicationAdmin
        Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.044 sec
        Running org.apache.hadoop.hbase.regionserver.TestScanDeleteTracker
        Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.234 sec
        Running org.apache.hadoop.hbase.client.TestMetaScanner
        
        
        yuzhihong@gmail.com Ted Yu added a comment -

        Or the test hangs:

        "main" prio=5 tid=103000800 nid=0x100601000 in Object.wait() [1005fe000]
           java.lang.Thread.State: WAITING (on object monitor)
                at java.lang.Object.wait(Native Method)
                - waiting on <7a46c9828> (a org.apache.hadoop.hbase.ipc.HBaseClient$Call)
                at java.lang.Object.wait(Object.java:485)
                at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:835)
                - locked <7a46c9828> (a org.apache.hadoop.hbase.ipc.HBaseClient$Call)
                at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:142)
                at $Proxy12.isMasterRunning(Unknown Source)
                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getMaster(HConnectionManager.java:553)
                at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:101)
                at org.apache.hadoop.hbase.HBaseTestingUtility.getHBaseAdmin(HBaseTestingUtility.java:1162)
                at org.apache.hadoop.hbase.master.TestMasterRestartAfterDisablingTable.testForCheckingIfEnableAndDisableWorksFineAfterSwitch(TestMasterRestartAfterDisablingTable.java:100)
        
        yuzhihong@gmail.com Ted Yu added a comment -

        I still got the following error (in TRUNK):

        testForCheckingIfEnableAndDisableWorksFineAfterSwitch(org.apache.hadoop.hbase.master.TestMasterRestartAfterDisablingTable)  Time elapsed: 23.153 sec  <<< ERROR!
        java.lang.reflect.UndeclaredThrowableException
                at $Proxy12.isMasterRunning(Unknown Source)
                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getMaster(HConnectionManager.java:553)
                at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:101)
                at org.apache.hadoop.hbase.HBaseTestingUtility.getHBaseAdmin(HBaseTestingUtility.java:1162)
                at org.apache.hadoop.hbase.master.TestMasterRestartAfterDisablingTable.testForCheckingIfEnableAndDisableWorksFineAfterSwitch(TestMasterRestartAfterDisablingTable.java:100)
        ...
        Caused by: java.io.IOException: Connection reset by peer
                at sun.nio.ch.FileDispatcher.read0(Native Method)
                at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
                at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
                at sun.nio.ch.IOUtil.read(IOUtil.java:175)
        
        ram_krish ramkrishna.s.vasudevan added a comment -

        The patch HBASE-4052-1-0.90_1.patch is for 0.90 branch.
        The patch HBASE-4052-1-trunk_1.patch is for trunk branch.
        The patch HBASE-4052-1-_TestCode.patch is the test code.

        yuzhihong@gmail.com Ted Yu added a comment -

        When I tried to apply your patch on 0.90 branch:

        tyu-mbp:90hbase tyu$ patch -p0 -i HBASE-4052.patch 
        (Stripping trailing CRs from patch.)
        patching file src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
        Hunk #1 succeeded at 59 with fuzz 1 (offset -1 lines).
        Hunk #2 FAILED at 1152.
        Hunk #3 FAILED at 1445.
        Hunk #4 FAILED at 1460.
        Hunk #5 succeeded at 1570 (offset 86 lines).
        3 out of 5 hunks FAILED -- saving rejects to file src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java.rej
        (Stripping trailing CRs from patch.)
        patching file src/test/java/org/apache/hadoop/hbase/master/TestMasterRestartAfterDisablingTable.java
        

        Please come up with a clean patch for the 0.90 branch. There have been 39 bug fixes since the release of 0.90.3.

        Since this is a bug, the fix will go to both 0.90 and TRUNK.

        ram_krish ramkrishna.s.vasudevan added a comment -

        Hi Ted,
        I had taken the 0.90 branch code and applied the patch on that.
        So do I need to apply the patch only on the trunk?
        Also, the test case ran cleanly in my setup; I will check it anyway.
        Sorry for the mistakes. I will rectify them.

        yuzhihong@gmail.com Ted Yu added a comment -

        This test failure is new:

        testRPCException(org.apache.hadoop.hbase.master.TestHMasterRPCException)  Time elapsed: 1.543 sec  <<< FAILURE!
        java.lang.AssertionError: Unexpected throwable: java.net.SocketTimeoutException: Call to us01-ciqps1-grid06.carrieriq.com/10.202.50.106:19386 failed on socket timeout exception: java.net.SocketTimeoutException: 100 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.202.50.106:48181 remote=us01-ciqps1-grid06.carrieriq.com/10.202.50.106:19386]
                at org.junit.Assert.fail(Assert.java:91)
                at org.apache.hadoop.hbase.master.TestHMasterRPCException.testRPCException(TestHMasterRPCException.java:59)
                at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        
        yuzhihong@gmail.com Ted Yu added a comment -

        Patch for TRUNK.
        ramkrishna's patch seems to be based on an old version of 0.90, so it wouldn't apply cleanly on the 0.90 branch.

        Also the indentation was off on several lines.
        Please use the following for new files added:

        + * Copyright 2011 The Apache Software Foundation
        
        yuzhihong@gmail.com Ted Yu added a comment -

        Running your unit test, I got:

        testForCheckingIfEnableAndDisableWorksFineAfterSwitch(org.apache.hadoop.hbase.master.TestMasterRestartAfterDisablingTable)  Time elapsed: 24.594 sec  <<< ERROR!
        java.lang.reflect.UndeclaredThrowableException
                at $Proxy12.isMasterRunning(Unknown Source)
                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getMaster(HConnectionManager.java:553)
                at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:101)
                at org.apache.hadoop.hbase.HBaseTestingUtility.getHBaseAdmin(HBaseTestingUtility.java:1162)
                at org.apache.hadoop.hbase.master.TestMasterRestartAfterDisablingTable.testForCheckingIfEnableAndDisableWorksFineAfterSwitch(TestMasterRestartAfterDisablingTable.java:105)
        

        I think we should wait for backup master to come up before doing:

            log("Enabling table\n");
            TEST_UTIL.getHBaseAdmin().enableTable(table);
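
        A minimal sketch of the kind of wait suggested above, assuming a generic polling helper. The class name MasterWaitSketch and the method waitFor are hypothetical and are not part of HBaseTestingUtility; the condition passed in would be whatever check the test uses to confirm that the backup master has become active.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration; not part of HBaseTestingUtility.
final class MasterWaitSketch {

    /**
     * Polls the condition until it returns true or timeoutMs elapses.
     * Returns true if the condition held in time, false on timeout or interrupt.
     */
    static boolean waitFor(BooleanSupplier condition, long timeoutMs, long pollMs) {
        final long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up: the condition never became true
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                return false;
            }
        }
        return true;
    }
}
```

        The test would call waitFor with its own "new master is active" check and only then call enableTable, failing the test if waitFor returns false.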
        
        ram_krish ramkrishna.s.vasudevan added a comment -

        Yes, that can be renamed. I will change it.
        Are there any scenarios to be verified, Ted?

        yuzhihong@gmail.com Ted Yu added a comment -
        +          disabledTableRegions.add(disablingTableName);
        

        Looks like disabledTableRegions should be renamed disablingTables.

        ram_krish ramkrishna.s.vasudevan added a comment -

        The .bmp files show the sequence of changes that need to be made.

        The other 3 scenarios are difficult to reproduce through a test case.

        Please let me know if the solution is fine.
        Please provide your comments and any scenarios that need to be verified.
        I am working on the patch and testing it. I will upload the patch soon.

        ram_krish ramkrishna.s.vasudevan added a comment -

        This is the test for scenario 1,
        where the table state is DISABLED.

        stack stack added a comment -

        The above looks good to me Ramkrishna.

        ram_krish ramkrishna.s.vasudevan added a comment -

        Hi,
        Sorry for using the wrong terminology. I was aware that only tables are disabled, and that regions are only onlined or offlined.
        I would like to attach the scenarios that need to be addressed as part of this bug.

        Suppose a table T1 has three regions R1, R2 and R3,
        and we issue a disable command for T1.
        Scenario 1:
        ===========
        All the regions R1, R2 and R3 are offlined, the state in ZooKeeper for the table is DISABLED, and then the active master went down.
        R1 - Offlined
        R2 - Offlined		T1 - DISABLED
        R3 - Offlined
        This is a straightforward scenario and the handling is also simple.
        Scenario 2:
        ===========
        All the regions R1, R2 and R3 are offlined, but the active master went down while the ZooKeeper state for the table was DISABLING.
        R1 - Offlined
        R2 - Offlined		T1 - DISABLING
        R3 - Offlined
        Scenario 3:
        ===========
        The regions R1, R2 and R3 are not yet offlined, and the active master went down while the ZooKeeper state was DISABLING.
        R1 - Online
        R2 - Online		T1 - DISABLING
        R3 - Online
        Scenario 4:
        ===========
        The regions R1 and R2 are offlined, R3 is not yet offlined, and the active master went down while the ZooKeeper state was DISABLING.
        R1 - Offlined
        R2 - Offlined		T1 - DISABLING
        R3 - Online
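
        The four scenarios can be pictured with a small model of how a newly active master might rebuild its region map. This is only a sketch under assumptions: the class RegionMapRebuildSketch, the method rebuild and the plain String region and table names are hypothetical stand-ins, and the actual patch is not reproduced here. The set name disablingTables follows the rename suggested in this thread.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model; real HBase uses HRegionInfo and ZooKeeper-backed table state.
final class RegionMapRebuildSketch {

    enum TableState { ENABLED, DISABLING, DISABLED }

    /** Outcome of scanning META on master startup. */
    static final class Rebuilt {
        // region -> table, for regions the master should treat as online
        final Map<String, String> onlineRegions = new TreeMap<>();
        // tables whose disable was interrupted and still has to be completed
        final Set<String> disablingTables = new TreeSet<>();
    }

    static Rebuilt rebuild(Map<String, String> regionToTableInMeta,
                           Map<String, TableState> tableStates) {
        Rebuilt result = new Rebuilt();
        for (Map.Entry<String, String> entry : regionToTableInMeta.entrySet()) {
            String region = entry.getKey();
            String table = entry.getValue();
            TableState state = tableStates.getOrDefault(table, TableState.ENABLED);
            if (state == TableState.DISABLED) {
                continue; // scenario 1: nothing is online, so nothing is recorded
            }
            if (state == TableState.DISABLING) {
                // scenarios 2 to 4: per-region status is unknown, so remember the
                // table and let a later step finish offlining its regions
                result.disablingTables.add(table);
                continue;
            }
            result.onlineRegions.put(region, table);
        }
        return result;
    }
}
```

        With T1 in DISABLING, none of R1, R2 or R3 ends up in onlineRegions and T1 is recorded in disablingTables, so a later enable would not mistake those regions for already assigned ones.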
        
        
        yuzhihong@gmail.com Ted Yu added a comment -

        I think one source of confusion in the above description is that DISABLED and DISABLING states are TableState.
        There is no concept of disabling/disabled region.

        I would opt for Checking for DISABLED and DISABLING TableState.

        I think the BulkAssigner in this case refers to EnableTableHandler.BulkEnabler which depends on assignmentManager.getRegionsOfTable() for the number of online regions.

        You need to write a new method, e.g. assignmentManager.getOnlineRegionsOfTable(), and call it in place of assignmentManager.getRegionsOfTable() in EnableTableHandler.regionsToAssign()
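
        A rough sketch of the distinction between the two lookups, assuming a simplified in-memory model. The class OnlineRegionsSketch, its fields and the String region names are hypothetical; only the method names getRegionsOfTable and getOnlineRegionsOfTable come from the comment above, and the real AssignmentManager is not reproduced here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-in for the region bookkeeping in the assignment manager.
final class OnlineRegionsSketch {
    private final Map<String, String> regionToTable = new TreeMap<>();
    private final Set<String> deployed = new HashSet<>();

    void addRegion(String region, String table, boolean isDeployed) {
        regionToTable.put(region, table);
        if (isDeployed) {
            deployed.add(region);
        }
    }

    /** Every known region of the table, whether or not it is deployed. */
    List<String> getRegionsOfTable(String table) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : regionToTable.entrySet()) {
            if (e.getValue().equals(table)) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** Only the regions of the table that are actually deployed on a server. */
    List<String> getOnlineRegionsOfTable(String table) {
        List<String> result = new ArrayList<>();
        for (String region : getRegionsOfTable(table)) {
            if (deployed.contains(region)) {
                result.add(region);
            }
        }
        return result;
    }
}
```

        If regionsToAssign() subtracted the result of getOnlineRegionsOfTable() rather than getRegionsOfTable(), regions that are known but not deployed would stay in the list to be assigned.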

        ram_krish ramkrishna.s.vasudevan added a comment -

        I have some doubts and would like to get some suggestions before proceeding.

        The following scenarios need to be considered.
        Scenario 1:
        ===========
        All the regions are disabled and the state in ZooKeeper is DISABLED.
        Scenario 2:
        ===========

        The regions are offlined but the AM went down when the ZooKeeper state was DISABLING.

        Scenario 3:
        ===========

        The regions are not yet offlined (or only a few regions are offlined) and the AM went down when the ZooKeeper state was DISABLING.

        Now, when we switch the master or restart it,
        how can we decide which regions were offlined and which were not?

        Though we can get the state of the table as either DISABLED or DISABLING, I am not able to infer the state of each individual region.

        What leads me to need this information is the following.

        The solution should be to check the state of the table while populating the regions map during master startup.

        Checking only for DISABLED state:
        =================================
        Check for the DISABLED state, and add the regions of tables that are not DISABLED to the regions map during master startup.

        If I check only for the DISABLED state and the table is in DISABLING state, then

        after a master restart (or switch), if I try to enable the table, we will not be able to scan it, because none of the regions will be assigned during enable,
        as the regions in the META table and the regions that I have populated in the regions map are the same.
        So I will get the same issue as in the description of the defect.

        Checking for DISABLED and DISABLING state:
        ==========================================
        If I check the ZooKeeper state for DISABLED and DISABLING, then on master restart (switch) only the regions of tables that are not in DISABLED or DISABLING state are populated.
        When I then try to enable the table, if a region was not offlined as part of the disable flow (Scenario 3), waitUntilDone() in BulkAssigner is not aware that the region is
        already online and keeps waiting, because waitUntilDone() compares the number of regions that have come online in the regions map against the actual count it gets from the META table.
        This makes enable go into a loop.

        Am I clear about the problem? So, before enabling any table, do we need to check the state of the table and, if it is DISABLING, take all of its regions
        offline?

        ram_krish ramkrishna.s.vasudevan added a comment -

        Thanks, Stack, for your comments.
        I am working on the patch.
        I have 2 scenarios to consider:
        1. The AM may get killed when the state is DISABLING but the regions are not yet closed.
        2. The AM may get killed when the state is DISABLING and the regions are already closed.

        So can we check for both the DISABLING and DISABLED states?
        I will provide my patch ASAP.

        Thanks

        stack stack added a comment -

        What you say seems plausible Ramkrishna. I traced your reasoning above and yes, because rebuildUserRegion adds all regions without regard to whether table is enabled/disabled, when it comes to run the enable, when it asks what regions to enable, they all look as though they are already online because of what getRegionsOfTable returns (all regions that make up the table pulled from this.regions over in AM).

        Good stuff.

        Do you have a patch Ramkrishna?

        ram_krish ramkrishna.s.vasudevan added a comment -

        As per my analysis the problem is that,

        When we do a remove all the online regions are closed.
        In the Enable table flow

          private List<HRegionInfo> regionsToAssign(final List<HRegionInfo> regionsInMeta)
          throws IOException {
            final List<HRegionInfo> onlineRegions =
              this.assignmentManager.getRegionsOfTable(tableName);
            regionsInMeta.removeAll(onlineRegions);
            return regionsInMeta;
          }
        

        We remove the regions that are already online.

        But as per the bug, enable is called after the master switch.

        So when the standby master becomes active, in the rebuildUserRegion API
        we add all the regions from META and don't check whether the table is already disabled.

        So when the flow reaches enable (regionsToAssign()), we consider the regions to be already online.
        Finally, when we try to scan the table, we get NotServingRegionException.

        Correct me if I am wrong in my analysis.
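
        The effect described above can be reproduced with a small stand-alone model. This is only an illustration: the class EnableAfterFailoverSketch and the String region names are hypothetical, and the method mirrors the removeAll logic of regionsToAssign() quoted earlier in this comment without using any HBase types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the enable flow; regions are plain strings here.
final class EnableAfterFailoverSketch {

    /**
     * Mirrors the quoted regionsToAssign(): every region the master believes
     * to be online is dropped from the list of regions to assign.
     */
    static List<String> regionsToAssign(List<String> regionsInMeta,
                                        List<String> regionsMasterThinksOnline) {
        List<String> toAssign = new ArrayList<>(regionsInMeta);
        toAssign.removeAll(regionsMasterThinksOnline);
        return toAssign;
    }
}
```

        Without a master switch, the master's map holds no regions for the disabled table, so all regions from META are returned for assignment. After a switch, if the rebuilt map contains every region found in META, the result is empty, nothing is assigned, and a later scan fails.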


          People

          • Assignee:
            ram_krish ramkrishna.s.vasudevan
            Reporter:
            ram_krish ramkrishna.s.vasudevan
          • Votes:
            1
            Watchers:
            4


              Development