Details
Description
The active master node's network was down for some time (this node hosts the Master, DataNode, ZooKeeper, and RegionServer). The backup node got the notification and started to become active. Immediately, the backup node aborted with the exception below.
2012-04-09 10:42:24,270 INFO org.apache.hadoop.hbase.master.SplitLogManager: finished splitting (more than or equal to) 861248320 bytes in 4 log files in [hdfs://192.168.47.205:9000/hbase/.logs/HOST-192-168-47-202,60020,1333715537172-splitting] in 26374ms
2012-04-09 10:42:24,316 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
2012-04-09 10:42:24,333 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
java.io.IOException: java.net.ConnectException: Connection refused
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:375)
    at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1045)
    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:897)
    at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150)
    at $Proxy13.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:183)
    at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:303)
    at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:280)
    at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:332)
    at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:236)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1276)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1233)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1220)
    at org.apache.hadoop.hbase.catalog.CatalogTracker.getCachedConnection(CatalogTracker.java:569)
    at org.apache.hadoop.hbase.catalog.CatalogTracker.getRootServerConnection(CatalogTracker.java:369)
    at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForRootServerConnection(CatalogTracker.java:353)
    at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRootRegionLocation(CatalogTracker.java:660)
    at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:616)
    at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:540)
    at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:363)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:488)
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupConnection(HBaseClient.java:328)
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:362)
    ... 20 more
2012-04-09 10:42:24,336 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
2012-04-09 10:42:24,336 DEBUG org.apache.hadoop.hbase.master.HMaster: Stopping service threads
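The master aborts because the ConnectException (wrapped in an IOException) escapes the retry path in waitForProxy; the intended behavior during failover is to keep retrying connection-refused errors while the region server comes back. The following is a standalone sketch of that retry idea, not the HBase implementation; the names (RetryOnRefused, callWithRetry) are hypothetical:

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.concurrent.Callable;

public class RetryOnRefused {
    // Retries the call while it fails with ConnectException, up to
    // maxAttempts; any other exception propagates immediately, which
    // is what aborts the master in this issue.
    static <T> T callWithRetry(Callable<T> call, int maxAttempts)
            throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (ConnectException ce) {
                if (attempt >= maxAttempts) {
                    throw ce;
                }
                Thread.sleep(100L * attempt); // simple linear backoff
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate a server that refuses the first two connection attempts.
        final int[] failures = {2};
        String result = callWithRetry(() -> {
            if (failures[0]-- > 0) {
                throw new ConnectException("Connection refused");
            }
            return "connected";
        }, 5);
        System.out.println(result); // connected
    }
}
```

The catch clause only matches ConnectException by type, which is exactly why the IOException wrapping discussed in the comments below defeats the retry.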
Attachments
- 90-addendum.patch (1 kB, Jieshan Bean)
- 92-addendum.patch (1 kB, Jieshan Bean)
- 94-addendum.patch (1 kB, Jieshan Bean)
- HBASE-5883-90.patch (4 kB, Jieshan Bean)
- HBASE-5883-92.patch (4 kB, Jieshan Bean)
- HBASE-5883-94.patch (4 kB, Jieshan Bean)
- HBASE-5883-trunk.patch (4 kB, Jieshan Bean)
- trunk-addendum.patch (1 kB, Jieshan Bean)
Issue Links
- is related to:
  - HBASE-4288 "Server not running" exception during meta verification causes RS abort (Closed)
  - HBASE-4470 ServerNotRunningException coming out of assignRootAndMeta kills the Master (Closed)
  - HBASE-4762 ROOT and META region never be assigned if IOE throws in verifyRootRegionLocation (Closed)
Activity
@Jieshan since this was committed a long time ago (5/3/12) I'd suggest creating a new issue to clean it up. I'll close this after that is done.
@Jieshan So, what do we need to do to close this issue out? What do we need to apply? Thanks.
Adding 0.92.2 and 0.90.7 to the fix versions, as this was originally checked in under those versions. I'm also unclear what needs to be done to get this resolved, but whatever it is should be done for 0.90.7 and 0.92.2 as well.
Yes, nothing was actually fixed. It seems there was a misunderstanding of that exception, so the patch amounts to a code restructuring. I will add an addendum patch to remove the unnecessary code.
From the comments here I find it hard to determine whether this is fixed or not. Is it?
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12525851/94-addendum.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 hadoop23. The patch compiles against the hadoop 0.23.x profile.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1786//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1786//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1786//console
This message is automatically generated.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12525833/trunk-addendum.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 hadoop23. The patch compiles against the hadoop 0.23.x profile.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:
org.apache.hadoop.hbase.TestDrainingServer
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1785//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1785//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1785//console
This message is automatically generated.
Sorry, it should be:
} catch (IOException ioex) {
  // We only handle the ConnectException.
  ConnectException ce = null;
  if (ioex instanceof ConnectException) {
    ce = (ConnectException) ioex;
    ioe = ce;
  } else {
    // This is the exception we can't handle.
    throw ioex;
  }
  handleConnectionException(++reconnectAttempts, maxAttempts, protocol, addr, ce);
}
Please share your comment, Thank you.
After careful consideration, I think we should just handle the ConnectException; otherwise, just throw it (this is similar to the logic before this rewriting).
} catch (IOException ioex) {
  // We only handle the ConnectException.
  ConnectException ce = null;
  if (ioex instanceof ConnectException) {
    ce = (ConnectException) ioex;
    ioe = ce;
  } else {
    // This is the exception we can't handle.
    throw ioex;
  }
  if (ce != null) {
    handleConnectionException(++reconnectAttempts, maxAttempts, protocol, addr, ce);
  }
}
handleConnectionException() is called under the condition of:
+ if (ce != null) {
+   handleConnectionException(++reconnectAttempts, maxAttempts, protocol,
+       addr, ce);
If ioe is passed, would the above condition change ?
Thank you, stack. It makes sense to me.
Anyway, a ConnectException should not be wrapped into an IOException, and we should pass ioe into handleConnectionException.
I just thought we should handle all kinds of ConnectExceptions (either a direct ConnectException or a wrapped one).
I will update the addendum patches today.
Thank you for your review, Ted & Stack.
Integrated in HBase-0.92-security #106 (See https://builds.apache.org/job/HBase-0.92-security/106/)
HBASE-5883 Backup master is going down due to connection refused exception (Jieshan) (Revision 1333537)
Result = SUCCESS
tedyu :
Files :
- /hbase/branches/0.92/CHANGES.txt
- /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
- /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/ipc/HBaseRPC.java
Integrated in HBase-0.94-security #26 (See https://builds.apache.org/job/HBase-0.94-security/26/)
HBASE-5883 Backup master is going down due to connection refused exception (Jieshan) (Revision 1333533)
Result = SUCCESS
tedyu :
Files :
- /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
- /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/ipc/HBaseRPC.java
Should we remove this too?
+ } else if (ioex.getCause() != null
+     && ioex.getCause() instanceof ConnectException) {
+   ce = (ConnectException) ioex.getCause();
+   ioe = ce;
If the above happens, we'll get a stack trace that is missing the last few frames, i.e. the difference between here, where it's handled, and wherever the ConnectException was originally thrown. Could it confuse debugging later?
Also, should we pass the ioe into handleConnectionException? I'd think we'd do this for the case that ce is null (could that happen)?
Good stuff.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12525595/HBASE-5883-94-addendum.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 hadoop23. The patch compiles against the hadoop 0.23.x profile.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:
org.apache.hadoop.hbase.client.TestFromClientSide
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1763//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1763//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1763//console
This message is automatically generated.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12525583/HBASE-5883-trunk-addendum.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 hadoop23. The patch compiles against the hadoop 0.23.x profile.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in .
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1762//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1762//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1762//console
This message is automatically generated.
Integrated in HBase-TRUNK-security #191 (See https://builds.apache.org/job/HBase-TRUNK-security/191/)
HBASE-5883 Backup master is going down due to connection refused exception (Jieshan) (Revision 1333530)
Result = SUCCESS
tedyu :
Files :
- /hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
- /hbase/trunk/src/main/java/org/apache/hadoop/hbase/ipc/HBaseRPC.java
I agree with adding an addendum patch. I think we should unify the behavior for handling these exceptions (I remember we have found many issues regarding this kind of exception conversion; everyone fixes it with his own version).
In the method HBaseClient#setupConnection, we catch 2 types of exceptions: SocketTimeoutException and IOException.
Before this patch, we also caught 2 types of exceptions, but they were SocketTimeoutException and ConnectException.
I think the previous catching was better.
I will prepare the addendum patch today if no objection.
Thank you all.
The scenario described in this JIRA may not be limited to clusters that have the HBASE-5673 patch.
I think an addendum removing the .contains("connection refused") check is needed.
Sorry for the confusion.
When I was testing this scenario, the HBASE-5673 patch was present in my cluster. That is what caused the above issue.
Your argument is that it is ok to compound a flaw because we have the flaw elsewhere?
HBASE-5877 is something else altogether; it passes a message in an exception message because the dumb RPC gives no other option. It is for sure not a recast of thrown exceptions.
As I mentioned here: https://issues.apache.org/jira/browse/HBASE-5883?focusedCommentId=13266335&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13266335
there're places where the original exception is hidden. e.g. In WritableRpcEngine.call():
IOException ioe = new IOException(e.toString());
ioe.setStackTrace(e.getStackTrace());
throw ioe;
The above makes identifying the source of the connection refused exception difficult.
A similar technique is used in HBASE-5877 as well:
2) hadoop.ipc serialization of exception is based on the #getMessage. So we have to parse it internally.
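The type erasure described above can be demonstrated directly. This is a standalone sketch (not HBase code) showing why an instanceof check fails once the exception has been re-created from its string form, leaving message parsing as the only remaining signal:

```java
import java.io.IOException;
import java.net.ConnectException;

public class WrappingDemo {
    // Mimics the wrapping style quoted from WritableRpcEngine.call():
    // the original type survives only inside the message string.
    static IOException wrapByString(Exception e) {
        IOException ioe = new IOException(e.toString());
        ioe.setStackTrace(e.getStackTrace());
        return ioe;
    }

    public static void main(String[] args) {
        IOException wrapped =
            wrapByString(new ConnectException("Connection refused"));
        // The concrete type is gone and no cause is chained...
        System.out.println(wrapped instanceof ConnectException); // false
        System.out.println(wrapped.getCause());                  // null
        // ...so only the message text remains to detect the refusal.
        System.out.println(wrapped.getMessage().toLowerCase()
            .contains("connection refused"));                    // true
    }
}
```

This is exactly the situation that forces the fragile .contains("connection refused") check debated in this thread.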
Can't we at least check the message to ensure it's what we expect? (See the second catch below where we look for "connection reset".) Can we be sure what comes up here is the ConnectException we set down in HBaseRPC?
+ if (ioe instanceof ConnectException) {
+   // Catch. Connect refused.
This redoing of an exception seems problematic. Its really necessary?
+ } else if (ioex.getMessage().toLowerCase()
+     .contains("connection refused")) {
+   ce = new ConnectException(ioex.getMessage());
+   ioe = ce;
I'd feel better about this fix if we could figure out where the exception came from. (Is it not from the RPC stringifying of exceptions to pass them from server to client?)
Integrated in HBase-0.92 #396 (See https://builds.apache.org/job/HBase-0.92/396/)
HBASE-5883 Backup master is going down due to connection refused exception (Jieshan) (Revision 1333537)
Result = FAILURE
tedyu :
Files :
- /hbase/branches/0.92/CHANGES.txt
- /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
- /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/ipc/HBaseRPC.java
Integrated in HBase-0.94 #175 (See https://builds.apache.org/job/HBase-0.94/175/)
HBASE-5883 Backup master is going down due to connection refused exception (Jieshan) (Revision 1333533)
Result = FAILURE
tedyu :
Files :
- /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
- /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/ipc/HBaseRPC.java
Integrated in HBase-TRUNK #2843 (See https://builds.apache.org/job/HBase-TRUNK/2843/)
HBASE-5883 Backup master is going down due to connection refused exception (Jieshan) (Revision 1333530)
Result = SUCCESS
tedyu :
Files :
- /hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
- /hbase/trunk/src/main/java/org/apache/hadoop/hbase/ipc/HBaseRPC.java
Integrated to 0.92 and 0.90 as well.
Thanks for the patch Jieshan.
Thanks for the review, Lars.
Will integrate the patch to trunk and 0.94 branch tomorrow @ 10 AM PST if there is no objection.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12525272/HBASE-5883-trunk.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 hadoop23. The patch compiles against the hadoop 0.23.x profile.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in .
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1721//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1721//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1721//console
This message is automatically generated.
A brief search in 0.94 branch revealed the following:
throw new IOException(ite.toString(), ite);
throw new IOException(t.toString(), t);
  (src/main/java/org/apache/hadoop/hbase/client/coprocessor/Batch.java)
IOException ioe = new IOException(target.toString());
IOException ioe = new IOException(e.toString());
  (src/main/java/org/apache/hadoop/hbase/ipc/WritableRpcEngine.java)
IOException ioe = new IOException(target.toString());
IOException ioe = new IOException(e.toString());
  (src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java)
@Jieshan:
Testing in a real cluster is needed. Take your time.
I didn't find code like that in HBase, and I can't figure out where this exception comes from.
That's just one possibility based on my analysis.
Just in case the exception looks like: new IOException(connectionException.toString())
If we have code like that, I would consider it a bug.
I have asked our tester to test this patch in a real cluster. It can be finished today, and then I will attach the patch for trunk.
@Jieshan:
Do you want to attach patch for trunk so that Hadoop QA can run tests ?
Just in case the exception looks like:
new IOException(connectionException.toString())
Why do we need the following code ?
+ } else if (ioex.getMessage().toLowerCase()
+     .contains("connection refused")) {
+   ce = new ConnectException(ioex.getMessage());
Patch for 0.94. All tests passed. We are still testing it in a real cluster.
Your comments before I post the results are welcome.
Thank you.
From the below log:
2012-04-09 10:42:24,333 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown. java.io.IOException: java.net.ConnectException: Connection refused
We can deduce that the ConnectException was wrapped in an IOException, like below:
new IOException(new ConnectException("Connection refused"));
or something like:
new IOException(connectException.toString());
If so, this exception is not handled by the code.
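The two wrapping forms deduced above behave very differently when the caller tries to recover the original type. A small standalone illustration (not HBase code; the unwrap helper is hypothetical):

```java
import java.io.IOException;
import java.net.ConnectException;

public class UnwrapDemo {
    // Returns the ConnectException hidden inside ioe, or null if it
    // cannot be recovered as a typed cause.
    static ConnectException unwrap(IOException ioe) {
        if (ioe instanceof ConnectException) {
            return (ConnectException) ioe;
        }
        if (ioe.getCause() instanceof ConnectException) {
            return (ConnectException) ioe.getCause();
        }
        return null; // string-wrapped: the type information is lost
    }

    public static void main(String[] args) {
        ConnectException ce = new ConnectException("Connection refused");
        // Form 1: cause chain preserved -- unwrap succeeds.
        System.out.println(unwrap(new IOException(ce)) == ce);      // true
        // Form 2: only the string survives -- unwrap fails.
        System.out.println(unwrap(new IOException(ce.toString()))); // null
    }
}
```

Only the first form can be handled cleanly with a getCause() check; the second forces the message-parsing workaround discussed earlier in this thread.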
I'm resolving this. I believe Greg tested both the original and the addendum version and found that the pre-addendum version fixed the problem more effectively.
The patch has already been committed on 0.94.1 (it is in 0.94.1rc0) and in the other branches.
Please file a new issue if the addendum needs to be addressed.