HBase
  1. HBase
  2. HBASE-4288

"Server not running" exception during meta verification causes RS abort

    Details

    • Type: Bug Bug
    • Status: Reopened
    • Priority: Critical Critical
    • Resolution: Unresolved
    • Affects Version/s: 0.90.4
    • Fix Version/s: 0.90.7, 0.92.0
    • Component/s: master
    • Labels:
      None

      Description

      The master tried to verify the META location just as that server was shutting down due to an abort. This caused the "Server not running" exception to get thrown, which wasn't handled properly in the master, causing it to abort.

      1. 4288-v2.txt
        14 kB
        stack
      2. 4288.txt
        3 kB
        stack

        Issue Links

          Activity

          Hide
          Jonathan Hsieh added a comment -

          First glance seems related to HBASE-5883 – may have been resolved by that.

          Show
          Jonathan Hsieh added a comment - First glance seems related to HBASE-5883 – may have been resolved by that.
          Hide
          ramkrishna.s.vasudevan added a comment -

          Already committed to 0.92 and trunk But not into 0.90. Hence not resolving just updating fix version.

          Show
          ramkrishna.s.vasudevan added a comment - Already committed to 0.92 and trunk But not into 0.90. Hence not resolving just updating fix version.
          Hide
          ramkrishna.s.vasudevan added a comment -

          Moving into 0.90.7

          Show
          ramkrishna.s.vasudevan added a comment - Moving into 0.90.7
          Hide
          Hudson added a comment -

          Integrated in HBase-0.92 #123 (See https://builds.apache.org/job/HBase-0.92/123/)
          HBASE-4288 "Server not running" exception during meta verification causes RS abort

          stack :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTracker.java
          Show
          Hudson added a comment - Integrated in HBase-0.92 #123 (See https://builds.apache.org/job/HBase-0.92/123/ ) HBASE-4288 "Server not running" exception during meta verification causes RS abort stack : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTracker.java
          Hide
          Hudson added a comment -

          Integrated in HBase-TRUNK #2427 (See https://builds.apache.org/job/HBase-TRUNK/2427/)
          HBASE-4288 "Server not running" exception during meta verification causes RS abort

          stack :
          Files :

          • /hbase/trunk/CHANGES.txt
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTracker.java
          Show
          Hudson added a comment - Integrated in HBase-TRUNK #2427 (See https://builds.apache.org/job/HBase-TRUNK/2427/ ) HBASE-4288 "Server not running" exception during meta verification causes RS abort stack : Files : /hbase/trunk/CHANGES.txt /hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java /hbase/trunk/src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTracker.java
          Hide
          stack added a comment -

          Hmm... yeah, still issue in 0.90 branch so leaving open.

          Show
          stack added a comment - Hmm... yeah, still issue in 0.90 branch so leaving open.
          Hide
          stack added a comment -

          Committed the verification test to trunk and 0.92 branch.

          Show
          stack added a comment - Committed the verification test to trunk and 0.92 branch.
          Hide
          stack added a comment -

          Seems odd that adding a test to TestCatalogTracker would call TestLogRolling fail. I tried it local w/ patch applied and TLR passes. Going to commit this test (We can open new issue if you find hole in my reasoning above post-commit)

          Show
          stack added a comment - Seems odd that adding a test to TestCatalogTracker would call TestLogRolling fail. I tried it local w/ patch applied and TLR passes. Going to commit this test (We can open new issue if you find hole in my reasoning above post-commit)
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12503165/4288-v2.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated -164 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 52 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.regionserver.wal.TestLogRolling

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/221//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/221//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/221//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12503165/4288-v2.txt against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 javadoc. The javadoc tool appears to have generated -164 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 52 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.regionserver.wal.TestLogRolling Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/221//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/221//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/221//console This message is automatically generated.
          Hide
          stack added a comment -

          Oops... let me reattach the change description:

          M src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java                       
            Fix javadoc warnings.  Change some logging that could be profuse                        
            from DEBUG to TRACE level (now it works).                                               
            (setMetaLocation) Changed from private to package protected so could                    
            use in tests.                                                                           
                                                                                                    
          M src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTracker.java                   
            (constructAndStartCatalogTracker): Removed.  Has null Connection so NPEs.               
            (testInterruptWaitOnMetaAndRoot): Now we use HTables instead                            
            of server interfaces, we need to do a bit more work getting the right                   
            HConnection into place.  Use new mockConnection method.  Make sure we                   
            cleanup the connection when we are done.                                                
            (testServerNotRunningIOException): Added test to prove hbase-4288                       
            is fixed.  This is the important addition here.                                         
            (testGetMetaServerConnectionFails): Refactored to use new mockConnection                
            utility methods.                                                                        
            (mockConnection): utility mocking up a connection.                                      
            (getMetaTableRowResult): New utility method that fakes a Result.      
          

          Here is the proof. Here's logs from the test:

          2011-11-09 19:50:29,433 INFO  [main] catalog.CatalogTracker(597): Failed verification of .META.,,1 at address=example.org,1234,1320897021185; java.io.IOException: Server not running, aborting
          2011-11-09 19:50:29,433 DEBUG [main] catalog.CatalogTracker(492): Current cached META location, example.org,1234,1320897021185, is not valid, resetting
          2011-11-09 19:50:29,438 DEBUG [main] catalog.CatalogTracker(504): Set new cached META location: example.org,1234,1320897021185
          

          See how now when we get an IOE 'Server not running', it doesn't kill us – instead, we have the .META. location as bad and we clear the bad value (In the third log line, we found a good value and set the new meta location).

          Show
          stack added a comment - Oops... let me reattach the change description: M src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java Fix javadoc warnings. Change some logging that could be profuse from DEBUG to TRACE level (now it works). (setMetaLocation) Changed from private to package protected so could use in tests. M src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTracker.java (constructAndStartCatalogTracker): Removed. Has null Connection so NPEs. (testInterruptWaitOnMetaAndRoot): Now we use HTables instead of server interfaces, we need to do a bit more work getting the right HConnection into place. Use new mockConnection method. Make sure we cleanup the connection when we are done. (testServerNotRunningIOException): Added test to prove hbase-4288 is fixed. This is the important addition here. (testGetMetaServerConnectionFails): Refactored to use new mockConnection utility methods. (mockConnection): utility mocking up a connection. (getMetaTableRowResult): New utility method that fakes a Result. Here is the proof. Here's logs from the test: 2011-11-09 19:50:29,433 INFO [main] catalog.CatalogTracker(597): Failed verification of .META.,,1 at address=example.org,1234,1320897021185; java.io.IOException: Server not running, aborting 2011-11-09 19:50:29,433 DEBUG [main] catalog.CatalogTracker(492): Current cached META location, example.org,1234,1320897021185, is not valid, resetting 2011-11-09 19:50:29,438 DEBUG [main] catalog.CatalogTracker(504): Set new cached META location: example.org,1234,1320897021185 See how now when we get an IOE 'Server not running', it doesn't kill us – instead, we have the .META. location as bad and we clear the bad value (In the third log line, we found a good value and set the new meta location).
          Hide
          stack added a comment -

          Test to prove that this no longer an issue. I think we should commit this test because it does some test refactoring and fixes some tests NPEs.

          First, here is description of the patch:

          M src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
          Fix javadoc warnings. Change some logging that could be profuse
          from DEBUG to TRACE level (now it works).
          (setMetaLocation) Changed from private to package protected so could
          use in tests.

          M src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTracker.java
          (constructAndStartCatalogTracker): Removed. Has null Connection so NPEs.
          (testInterruptWaitOnMetaAndRoot): Now we use HTables instead
          of server interfaces, we need to do a bit more work getting the right
          HConnection into place. Use new mockConnection method. Make sure we
          cleanup the connection when we are done.
          (testServerNotRunningIOException): Added test to prove hbase-4288
          is fixed. This is the important addition here.
          (testGetMetaServerConnectionFails): Refactored to use new mockConnection
          utility methods.
          (mockConnection): utility mocking up a connection.
          (getMetaTableRowResult): New utility method that fakes a Result.

          Show
          stack added a comment - Test to prove that this no longer an issue. I think we should commit this test because it does some test refactoring and fixes some tests NPEs. First, here is description of the patch: M src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java Fix javadoc warnings. Change some logging that could be profuse from DEBUG to TRACE level (now it works). (setMetaLocation) Changed from private to package protected so could use in tests. M src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTracker.java (constructAndStartCatalogTracker): Removed. Has null Connection so NPEs. (testInterruptWaitOnMetaAndRoot): Now we use HTables instead of server interfaces, we need to do a bit more work getting the right HConnection into place. Use new mockConnection method. Make sure we cleanup the connection when we are done. (testServerNotRunningIOException): Added test to prove hbase-4288 is fixed. This is the important addition here. (testGetMetaServerConnectionFails): Refactored to use new mockConnection utility methods. (mockConnection): utility mocking up a connection. (getMetaTableRowResult): New utility method that fakes a Result.
          Hide
          stack added a comment -

          I think this scenario unlikely now we have retries in place. Writing a unit test to prove it (Its not done yet).

          Show
          stack added a comment - I think this scenario unlikely now we have retries in place. Writing a unit test to prove it (Its not done yet).
          Hide
          Todd Lipcon added a comment -

          This is a little tricky. I don't think we can just catch this and return false, since there's no verification that the server is dead yet, just that it's shutting down. If we were to return false, then the users of this code would delete the root location from ZK and start re-assigning even though the old server may have unflushed edits, etc.

          Though, this makes me think: why is it ever safe to delete the root location and reassign it before the old location's logs have split?

          Show
          Todd Lipcon added a comment - This is a little tricky. I don't think we can just catch this and return false, since there's no verification that the server is dead yet , just that it's shutting down. If we were to return false, then the users of this code would delete the root location from ZK and start re-assigning even though the old server may have unflushed edits, etc. Though, this makes me think: why is it ever safe to delete the root location and reassign it before the old location's logs have split?
          Hide
          Todd Lipcon added a comment -

          2011-08-29 16:37:50,485 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
          org.apache.hadoop.ipc.RemoteException: java.io.IOException: Server not running, aborting
          at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2407)
          at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionInfo(HRegionServer.java:1648)
          at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
          at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)

          at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
          at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:258)
          at $Proxy7.getRegionInfo(Unknown Source)
          at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRegionLocation(CatalogTracker.java:426)
          at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:273)
          at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:333)
          at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:366)
          at org.apache.hadoop.hbase.catalog.MetaReader.getRegion(MetaReader.java:355)
          at org.apache.hadoop.hbase.master.AssignmentManager.processRegionInTransition(AssignmentManager.java:312)
          at org.apache.hadoop.hbase.master.AssignmentManager.processRegionsInTransition(AssignmentManager.java:263)
          at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:227)
          at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:421)
          at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:294)

          Show
          Todd Lipcon added a comment - 2011-08-29 16:37:50,485 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown. org.apache.hadoop.ipc.RemoteException: java.io.IOException: Server not running, aborting at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2407) at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionInfo(HRegionServer.java:1648) at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039) at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771) at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:258) at $Proxy7.getRegionInfo(Unknown Source) at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRegionLocation(CatalogTracker.java:426) at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:273) at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:333) at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:366) at org.apache.hadoop.hbase.catalog.MetaReader.getRegion(MetaReader.java:355) at org.apache.hadoop.hbase.master.AssignmentManager.processRegionInTransition(AssignmentManager.java:312) at org.apache.hadoop.hbase.master.AssignmentManager.processRegionsInTransition(AssignmentManager.java:263) at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:227) at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:421) at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:294)

            People

            • Assignee:
              Unassigned
              Reporter:
              Todd Lipcon
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:

                Development