Hadoop HDFS
  HDFS-3364

TestFileAppend4.testRecoverFinalizedBlock occasionally times out

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0-alpha
    • Fix Version/s: None
    • Component/s: test
    • Labels: None

      Description

      I've seen TestFileAppend4.testRecoverFinalizedBlock occasionally time out during shutdown in Jenkins. It doesn't fail for me locally.

      test timed out after 60000 milliseconds
      Stacktrace
      
      java.lang.Exception: test timed out after 60000 milliseconds
      	at java.lang.Object.wait(Native Method)
      	at java.lang.Thread.join(Thread.java:1186)
      	at java.lang.Thread.join(Thread.java:1239)
      	at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.join(BPServiceActor.java:473)
      	at org.apache.hadoop.hdfs.server.datanode.BPOfferService.join(BPOfferService.java:259)
      	at org.apache.hadoop.hdfs.server.datanode.BlockPoolManager.shutDownAll(BlockPoolManager.java:117)
      	at org.apache.hadoop.hdfs.server.datanode.DataNode.shutdown(DataNode.java:1098)
      	at org.apache.hadoop.hdfs.MiniDFSCluster.shutdownDataNodes(MiniDFSCluster.java:1280)
      	at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:1260)
      	at org.apache.hadoop.hdfs.TestFileAppend4.testRecoverFinalizedBlock(TestFileAppend4.java:208)
      
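      For context, the stack trace shows the shutdown path blocking in an unbounded Thread.join(): DataNode.shutdown() waits for each BPServiceActor thread to exit, so if one actor thread never terminates, the test thread hangs until the 60-second test timeout fires. Below is a minimal standalone sketch of that pattern (illustration only, not the actual HDFS code; the worker thread here is hypothetical):

      // Illustration only: an unbounded join() hangs forever if the worker
      // thread never exits, which is what the stack trace above shows for
      // BPServiceActor.join(). A bounded join(timeout) at least returns
      // control to the caller, though the stuck thread is still alive.
      public class JoinHangSketch {
        public static void main(String[] args) throws InterruptedException {
          Thread worker = new Thread(() -> {
            while (true) {
              try {
                Thread.sleep(1000);      // pretend to do periodic work (e.g. heartbeats)
              } catch (InterruptedException e) {
                // Interrupt swallowed: the loop keeps going, so an unbounded join() never returns.
              }
            }
          }, "worker");
          worker.setDaemon(true);        // let the JVM exit even though the worker never stops
          worker.start();
          worker.interrupt();

          // worker.join();              // unbounded: would block here forever
          worker.join(5000);             // bounded: returns after 5s even if the worker is stuck
          System.out.println("worker still alive? " + worker.isAlive());
        }
      }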


          Activity

          Junping Du added a comment -

          I saw this error in the pre-commit test for HADOOP-8472 and also cannot reproduce it locally. The full log is https://builds.apache.org/job/PreCommit-HADOOP-Build/1155//testReport/org.apache.hadoop.hdfs/TestFileAppend4/testCompleteOtherLeaseHoldersFile/ and the relevant portion is as follows:

          2012-06-29 16:26:22,187 INFO  ipc.Server (Server.java:stop(1991)) - Stopping server on 46047
          2012-06-29 16:26:22,424 DEBUG datanode.DataNode (BPServiceActor.java:sendHeartBeat(431)) - Sending heartbeat from service actor: Block pool BP-166940465-67.195.138.20-1340987178877 (storage id DS-851929818-67.195.138.20-36687-1340987179189) service to localhost/127.0.0.1:49801
          2012-06-29 16:26:22,424 DEBUG datanode.DataNode (BPServiceActor.java:sendHeartBeat(431)) - Sending heartbeat from service actor: Block pool BP-166940465-67.195.138.20-1340987178877 (storage id DS-1404243958-67.195.138.20-36424-1340987179356) service to localhost/127.0.0.1:49801
          2012-06-29 16:26:22,424 DEBUG datanode.DataNode (BPServiceActor.java:sendHeartBeat(431)) - Sending heartbeat from service actor: Block pool BP-166940465-67.195.138.20-1340987178877 (storage id DS-1672148922-67.195.138.20-53642-1340987179269) service to localhost/127.0.0.1:49801
          2012-06-29 16:26:22,425 INFO  ipc.Server (Server.java:run(638)) - Stopping IPC Server listener on 46047
          2012-06-29 16:26:22,425 INFO  ipc.Server (Server.java:run(780)) - Stopping IPC Server Responder
          2012-06-29 16:26:22,425 INFO  datanode.DataNode (DataNode.java:shutdown(1068)) - Waiting for threadgroup to exit, active threads is 1
          2012-06-29 16:26:22,428 WARN  ipc.Server (Server.java:processResponse(979)) - IPC Server Responder, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 127.0.0.1:51257: output error
          2012-06-29 16:26:22,429 INFO  ipc.Server (Server.java:run(1745)) - IPC Server handler 9 on 49801 caught an exception
          java.nio.channels.ClosedChannelException
          	at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
          	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
          	at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2131)
          	at org.apache.hadoop.ipc.Server.access$2000(Server.java:107)
          	at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:930)
          	at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:994)
          	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1738)
          

          From this log, the problem happens while shutting down the 3rd datanode; the previous two datanodes are shut down successfully. It looks like, when the IPC server on the 3rd datanode is being shut down, the BPServiceActor happens to be sending a heartbeat in offerService(), so some heartbeat exceptions are thrown.
          Also, the following code, which is meant to let an interrupt break out of offerService(), is not taking effect, which suggests offerService() is blocked somewhere else, e.g. on the heartbeat response or on pendingIncrementalBR. Would catching InterruptedException around the whole of offerService() help here? (A sketch of that idea follows the snippet below.)

            synchronized(pendingIncrementalBR) {
              if (waitTime > 0 && pendingReceivedRequests == 0) {
                try {
                  pendingIncrementalBR.wait(waitTime);
                } catch (InterruptedException ie) {
                  LOG.warn("BPOfferService for " + this + " interrupted");
                }
              }
            }
          
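          A hedged sketch of that idea, written as a standalone toy class rather than the real BPServiceActor (the class, method, and field names below are simplified stand-ins, not the actual HDFS code): the service loop catches InterruptedException once for the whole body and exits, instead of swallowing it inside the inner wait().

          import java.io.IOException;

          public class ToyServiceActor implements Runnable {
            private final Object pendingIncrementalBR = new Object();
            private volatile boolean shouldRun = true;
            private volatile int pendingReceivedRequests = 0;

            private void sendHeartbeat() throws IOException {
              // Placeholder for the NameNode RPC; in the real actor this can block or throw.
            }

            @Override
            public void run() {
              while (shouldRun && !Thread.currentThread().isInterrupted()) {
                try {
                  sendHeartbeat();
                  synchronized (pendingIncrementalBR) {
                    long waitTime = 1000;                 // stand-in for the heartbeat interval
                    if (waitTime > 0 && pendingReceivedRequests == 0) {
                      pendingIncrementalBR.wait(waitTime);
                    }
                  }
                } catch (InterruptedException ie) {
                  // One catch for the whole loop body: restore the flag and exit, so a
                  // shutdown-time interrupt actually terminates the actor thread.
                  Thread.currentThread().interrupt();
                  return;
                } catch (IOException e) {
                  // The real actor logs and retries; here we simply continue the loop.
                }
              }
            }

            public static void main(String[] args) throws InterruptedException {
              Thread actor = new Thread(new ToyServiceActor(), "toy-actor");
              actor.start();
              Thread.sleep(2000);
              actor.interrupt();                          // simulate what DataNode.shutdown() does
              actor.join(5000);
              System.out.println("actor exited? " + !actor.isAlive());
            }
          }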

            People

            • Assignee: Unassigned
            • Reporter: Eli Collins
            • Votes: 0
            • Watchers: 4
