HBASE-9268

Client doesn't recover from a stalled region server

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.95.2
    • Fix Version/s: 0.98.0, 0.96.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Got this testing the 0.95.2 RC.

      I kill -STOPped a region server and left it like that while running PE. The clients didn't find the new region locations and, per the jstack, were stuck doing RPC. Eventually I kill -CONTed it and the client printed this:

      Exception in thread "TestClient-6" java.lang.RuntimeException: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 128 actions: IOException: 90 times, SocketTimeoutException: 38 times,

      Attachments

      1. 9268.v1.patch
        2 kB
        Nicolas Liochon
      2. 9268-hack.patch
        0.8 kB
        Nicolas Liochon

        Activity

        Hudson added a comment -

        SUCCESS: Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #697 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/697/)
        HBASE-9268 Client doesn't recover from a stalled region server (nkeywal: rev 1517108)

        • /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/ipc/RpcClient.java
        Hudson added a comment -

        FAILURE: Integrated in hbase-0.95-on-hadoop2 #272 (See https://builds.apache.org/job/hbase-0.95-on-hadoop2/272/)
        HBASE-9268 Client doesn't recover from a stalled region server (nkeywal: rev 1517109)

        • /hbase/branches/0.95/hbase-client/src/main/java/org/apache/hadoop/hbase/ipc/RpcClient.java
        Hudson added a comment -

        FAILURE: Integrated in hbase-0.95 #493 (See https://builds.apache.org/job/hbase-0.95/493/)
        HBASE-9268 Client doesn't recover from a stalled region server (nkeywal: rev 1517109)

        • /hbase/branches/0.95/hbase-client/src/main/java/org/apache/hadoop/hbase/ipc/RpcClient.java
        Hudson added a comment -

        SUCCESS: Integrated in HBase-TRUNK #4434 (See https://builds.apache.org/job/HBase-TRUNK/4434/)
        HBASE-9268 Client doesn't recover from a stalled region server (nkeywal: rev 1517108)

        • /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/ipc/RpcClient.java
        Nicolas Liochon added a comment -

        Committed on 0.95 & trunk. I don't know why it doesn't show up on 0.94.
        Thanks for the finding, the tests, and the review, JD! Thanks for the review, Stack.

        stack added a comment -

        +1

        I put in an attempted fix for the above unrelated TestNamespaceUpgrade failure.

        Jean-Daniel Cryans added a comment -

        +1 on v1, tried it and it was seamless.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12599598/9268.v1.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 hadoop1.0. The patch compiles against the hadoop 1.0 profile.

        +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 lineLengths. The patch does not introduce lines longer than 100

        +1 site. The mvn site goal succeeds with this patch.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.hbase.migration.TestNamespaceUpgrade

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/6856//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6856//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6856//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6856//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6856//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6856//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6856//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6856//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6856//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/6856//console

        This message is automatically generated.

        Nicolas Liochon added a comment -

        I tried 0.94 on a pseudo cluster. It seems to work well 90% of the time (that is, I had a failure).
        A possible explanation is that the writes won't block until the server-side buffer is full (a side effect of kill -STOP: the socket handling is done by the OS, not the process), and that the 0.95 message size is bigger than the 0.94 one (why would it be?). It's not very satisfying. The patch seems to work, however, so there is a solution that makes sense even if I don't fully understand the 0.94 scenario. I will spend more time on this.

        The stack when it works on 0.94 is

        Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:42395 remote=sd-box/127.0.0.1:60020]
                at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
                at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
                at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
                at java.io.FilterInputStream.read(FilterInputStream.java:116)
                at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:373)
                at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
                at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
                at java.io.DataInputStream.readInt(DataInputStream.java:370)
                at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:646)
                at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:580)
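        The stack above shows the read path failing fast, because setSoTimeout bounds blocking reads. A minimal, self-contained sketch of that behavior (illustrative only, not HBase code; the class name and timeout value are invented for the demo):

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {
    // setSoTimeout bounds blocking reads only; a blocking write to a stalled
    // peer (e.g. one stopped with kill -STOP) has no equivalent deadline.
    static boolean readTimesOut() throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort());
             Socket peer = server.accept()) {
            client.setSoTimeout(200); // read timeout, in milliseconds
            try {
                client.getInputStream().read(); // peer never writes, so this blocks
                return false;
            } catch (SocketTimeoutException expected) {
                return true; // the read fails fast, as in the stack above
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut());
    }
}
```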
        
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12599207/9268-hack.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 hadoop1.0. The patch compiles against the hadoop 1.0 profile.

        +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

        -1 javadoc. The javadoc tool appears to have generated 2 warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 lineLengths. The patch does not introduce lines longer than 100

        +1 site. The mvn site goal succeeds with this patch.

        +1 core tests. The patch passed unit tests in .

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/6837//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6837//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6837//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6837//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6837//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6837//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6837//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6837//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/6837//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/6837//console

        This message is automatically generated.

        Nicolas Liochon added a comment -

        How do you define "new" here? I haven't tested it but I'm pretty sure this wasn't an issue in 0.94.

        Yeah, that's the 0.94 I was thinking about. The code has changed a lot in this area, but 0.94's HBaseClient.java seems to be like 0.95: no timeouts on writes. So I wonder what extra logic makes it work on 0.94 (it's not purely theoretical: we could have this issue somewhere else in 0.95). I'm going to try 0.94 to be sure.

        It worked fine with HBASE-7590 once I fixed the class name in the release note

        Thanks for the test and the fix, JD.

        Jean-Daniel Cryans added a comment -

        Looking at the code, I don't think it's a new issue. JD, what do you think?

        How do you define "new" here? I haven't tested it but I'm pretty sure this wasn't an issue in 0.94.

        btw, I'm interested to know if you have the same issue when you activate HBASE-7590 (it should work well).

        It worked fine with HBASE-7590 once I fixed the class name in the release note

        Nicolas Liochon added a comment -

        btw, I'm interested to know if you have the same issue when you activate HBASE-7590 (it should work well).

        Nicolas Liochon added a comment -

        Hum. Different points:

        • 38 is the number of puts that failed with a SocketTimeout. As it's a multi put, it's likely a single message. It does not mean that the client retried 38 times.
        • We do a socket#setSoTimeout, but this applies only to reads, not to writes.
        • It's not possible to set a write timeout in Java without using the NIO API.
        • HDFS added SocketOutputStream back in HADOOP-2346, but HBase does not use it.
        • The API to use is NetUtils.getOutputStream(socket, timeout); tested, it works.
        • We can use it, but the API does not allow changing the timeout on the fly as we do.
        • I'm not sure of the time needed by ZooKeeper to decide that the server was dead. The tests were strange.

        So, synthesis is:

        • Looking at the code, I don't think it's a new issue. JD, what do you think?
        • It seems we can fix or improve things here. I will give it a try.
        • I need to double check the zookeeper stuff.
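        The NIO approach mentioned above can be sketched as follows. This is a simplified illustration of the selector-based technique that NetUtils.getOutputStream relies on, not Hadoop's actual SocketOutputStream implementation; class names, buffer sizes, and timeouts here are invented for the demo:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class WriteTimeoutDemo {
    // Write buf fully, waiting at most timeoutMs for the channel to become
    // writable; the channel must be in non-blocking mode.
    static void writeWithTimeout(SocketChannel ch, ByteBuffer buf, long timeoutMs)
            throws IOException {
        try (Selector sel = Selector.open()) {
            ch.register(sel, SelectionKey.OP_WRITE);
            while (buf.hasRemaining()) {
                if (sel.select(timeoutMs) == 0) {
                    throw new SocketTimeoutException(timeoutMs + " ms write timeout");
                }
                sel.selectedKeys().clear();
                ch.write(buf); // may write 0 bytes if the send buffer is full
            }
        }
    }

    // Simulates a stalled peer: the server accepts but never reads, so the
    // client's writes eventually fill the OS buffers and the write times out
    // instead of blocking forever.
    static boolean writeTimesOut() throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0));
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress());
                 SocketChannel stalledPeer = server.accept()) {
                client.configureBlocking(false);
                ByteBuffer chunk = ByteBuffer.allocate(64 * 1024);
                try {
                    for (int i = 0; i < 1024; i++) { // up to 64 MB, enough to fill buffers
                        chunk.clear();
                        writeWithTimeout(client, chunk, 300);
                    }
                    return false;
                } catch (SocketTimeoutException expected) {
                    return true;
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("write timed out: " + writeTimesOut());
    }
}
```

        Once the send and receive buffers fill, OP_WRITE stops firing and select() returns 0 after the deadline, which is exactly the knob a plain blocking OutputStream lacks.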
        Jean-Daniel Cryans added a comment -

        Yup, but if you look at the log message I posted in this jira's description, it says it got a SocketTimeout 38 times!

        Nicolas Liochon added a comment -

        I've played with a pseudo-distributed cluster + YCSB and got this when I kill -STOP the region server:

           java.lang.Thread.State: BLOCKED (on object monitor)
                at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
                - waiting to lock <0x00000007de6af410> (a java.io.BufferedOutputStream)
                at java.io.DataOutputStream.flush(DataOutputStream.java:106)
                at java.io.FilterOutputStream.close(FilterOutputStream.java:140)
                at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:232)
                at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:248)
                at org.apache.hadoop.hbase.ipc.RpcClient$Connection.close(RpcClient.java:963)
                - locked <0x00000007de6ab808> (a org.apache.hadoop.hbase.ipc.RpcClient$Connection)
                at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:718)
        
        "hbase-table-pool-1-thread-6" daemon prio=10 tid=0x00007f93000ce800 nid=0x649c runnable [0x00007f932aa85000]
           java.lang.Thread.State: RUNNABLE
                at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
                at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
                at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
                at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
                - locked <0x00000007deb3ff60> (a sun.nio.ch.Util$2)
                - locked <0x00000007deb3ff50> (a java.util.Collections$UnmodifiableSet)
                - locked <0x00000007deb3fd48> (a sun.nio.ch.EPollSelectorImpl)
                at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
                at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
                at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
                at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
                at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
                at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
                - locked <0x00000007de6af410> (a java.io.BufferedOutputStream)
                at java.io.DataOutputStream.write(DataOutputStream.java:90)
                - locked <0x00000007de6af3f0> (a java.io.DataOutputStream)
                at org.apache.hadoop.hbase.ipc.IPCUtil.write(IPCUtil.java:230)
                at org.apache.hadoop.hbase.ipc.IPCUtil.write(IPCUtil.java:220)
                at org.apache.hadoop.hbase.ipc.RpcClient$Connection.writeRequest(RpcClient.java:1039)
                - locked <0x00000007de6af3f0> (a java.io.DataOutputStream)
                at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1407)
                at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1635)
        

        It's exactly as if the timeout on the socket was not set. Strange.

        Nicolas Liochon added a comment -

        Jean-Daniel Cryans
        I won't have access to a real cluster this week, but I would like to have a look. Could you please attach or send me the client logs?

        Jean-Daniel Cryans added a comment -

        Oh, and when it fails after I kill -CONT, I missed that I was getting this:

        2013-08-19 23:37:03,468 DEBUG [TestClient-11] client.ClientScanner: Finished region={ENCODED => 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}
        Exception in thread "TestClient-8" java.lang.NullPointerException
        	at org.apache.hadoop.hbase.client.AsyncProcess.findDestLocation(AsyncProcess.java:294)
        	at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:239)
        	at org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:894)
        	at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1275)
        	at org.apache.hadoop.hbase.PerformanceEvaluation$Test.testTakedown(PerformanceEvaluation.java:853)
        	at org.apache.hadoop.hbase.PerformanceEvaluation$Test.test(PerformanceEvaluation.java:870)
        	at org.apache.hadoop.hbase.PerformanceEvaluation.runOneClient(PerformanceEvaluation.java:1209)
        	at org.apache.hadoop.hbase.PerformanceEvaluation$1.run(PerformanceEvaluation.java:585)
        
        Jean-Daniel Cryans added a comment -

        It's actually MultiServerCallable that gets stuck; I don't get this issue while reading. I see all my clients stuck on:

        "hbase-table-pool-16-thread-1" daemon prio=10 tid=0x00007f2e8487e000 nid=0x5952 waiting for monitor entry [0x00007f2e5da05000]
           java.lang.Thread.State: BLOCKED (on object monitor)
        	at org.apache.hadoop.hbase.ipc.RpcClient$Connection.writeRequest(RpcClient.java:1036)
        	- waiting to lock <0x00000000c40526a0> (a java.io.DataOutputStream)
        	at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1403)
        	at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1630)
        	at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1687)
        	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.multi(ClientProtos.java:21274)
        	at org.apache.hadoop.hbase.client.MultiServerCallable.call(MultiServerCallable.java:105)
        	at org.apache.hadoop.hbase.client.MultiServerCallable.call(MultiServerCallable.java:43)
        	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:183)
        	at org.apache.hadoop.hbase.client.AsyncProcess$1.run(AsyncProcess.java:420)
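        The monitor pile-up in this jstack can be reproduced in miniature: one thread stalls while holding the connection's shared stream lock, and every subsequent caller goes BLOCKED behind it. A hedged, self-contained sketch (illustrative only, not HBase code; the names are invented for the demo):

```java
public class SharedStreamStallDemo {
    // Stands in for the per-connection DataOutputStream that writeRequest locks.
    static final Object sharedStream = new Object();

    // Returns the state of a second caller while the first one is stalled
    // inside the lock; expected to be BLOCKED, matching the jstack above.
    static Thread.State secondCallerState() throws InterruptedException {
        Thread stalledWriter = new Thread(() -> {
            synchronized (sharedStream) { // emulates a write that never returns
                try { Thread.sleep(1000); } catch (InterruptedException ignored) {}
            }
        });
        Thread nextCaller = new Thread(() -> { synchronized (sharedStream) { } });
        stalledWriter.start();
        Thread.sleep(200);          // let the stalled writer take the monitor
        nextCaller.start();
        Thread.sleep(200);          // let the next caller reach the monitor
        Thread.State s = nextCaller.getState();
        stalledWriter.join();
        nextCaller.join();
        return s;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(secondCallerState()); // expected: BLOCKED
    }
}
```

        This is why a single stalled region server takes down every client thread sharing the same connection: the lock is held for the duration of a write that has no deadline.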
        

          People

          • Assignee: Nicolas Liochon
          • Reporter: Jean-Daniel Cryans
          • Votes: 0
          • Watchers: 9