Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-19061

Capture exception in rpcRequestSender.start() in IPC.Connection.run()

VotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.5.0
    • 3.5.0
    • ipc

    Description

      rpcRequestThread.start() can fail due to OOM. This will immediately crash the Connection thread, without removing itself from the connections pool. Then for all following getConnection(remoteid), we will get this bad connection object and all rpc requests will be hanging, because this is a bad connection object, without threads being properly running (Neither Connection or Connection.rpcRequestSender thread is running due to OOM.).

      In this PR, we moved the rpcRequestThread.start() to be within the try{}-catch{} block, to capture OOM from rpcRequestThread.start() and proper cleaning is followed if we hit OOM.

      IPC.Connection.run()
      
        @Override
          public void run() {
            // Don't start the ipc parameter sending thread until we start this
            // thread, because the shutdown logic only gets triggered if this
            // thread is started.
            rpcRequestThread.start();
            if (LOG.isDebugEnabled())
              LOG.debug(getName() + ": starting, having connections " 
                  + connections.size());      
      
            try {
              while (waitForWork()) {//wait here for work - read or close connection
                receiveRpcResponse();
              }
            } catch (Throwable t) {
              // This truly is unexpected, since we catch IOException in receiveResponse
              // -- this is only to be really sure that we don't leave a client hanging
              // forever.
              LOG.warn("Unexpected error reading responses on connection " + this, t);
              markClosed(new IOException("Error reading responses", t));
            }

      Because there is no rpcRequestSender thread consuming the rpcRequestQueue, all rpc request enqueue operations for this connection will be blocked and will be hanging at this while loop forever during sendRpcRequest().

      while (!shouldCloseConnection.get()) {
        if (rpcRequestQueue.offer(Pair.of(call, buf), 1, TimeUnit.SECONDS)) {
          break;
        }
      }

      OOM exception in starting the rpcRequestSender thread.

      Exception in thread "IPC Client (1664093259) connection to nn01.grid.linkedin.com/IP-Address:portNum from kafkaetl" java.lang.OutOfMemoryError: unable to create new native thread
      	at java.lang.Thread.start0(Native Method)
      	at java.lang.Thread.start(Thread.java:717)
      	at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1034)
      

      Multiple threads blocked by queue.offer(). and we don't found any "IPC Client" or "IPC Parameter Sending Thread" in thread dump.

      Thread 2156123: (state = BLOCKED)
       - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
       - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=215 (Compiled frame)
       - java.util.concurrent.SynchronousQueue$TransferQueue.awaitFulfill(java.util.concurrent.SynchronousQueue$TransferQueue$QNode, java.lang.Object, boolean, long) @bci=156, line=764 (Compiled frame)
       - java.util.concurrent.SynchronousQueue$TransferQueue.transfer(java.lang.Object, boolean, long) @bci=148, line=695 (Compiled frame)
       - java.util.concurrent.SynchronousQueue.offer(java.lang.Object, long, java.util.concurrent.TimeUnit) @bci=24, line=895 (Compiled frame)
       - org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(org.apache.hadoop.ipc.Client$Call) @bci=88, line=1134 (Compiled frame)
       - org.apache.hadoop.ipc.Client.call(org.apache.hadoop.ipc.RPC$RpcKind, org.apache.hadoop.io.Writable, org.apache.hadoop.ipc.Client$ConnectionId, int, java.util.concurrent.atomic.AtomicBoolean, org.apache.hadoop.ipc.AlignmentContext) @bci=36, line=1402 (Interpreted frame)
       - org.apache.hadoop.ipc.Client.call(org.apache.hadoop.ipc.RPC$RpcKind, org.apache.hadoop.io.Writable, org.apache.hadoop.ipc.Client$ConnectionId, java.util.concurrent.atomic.AtomicBoolean, org.apache.hadoop.ipc.AlignmentContext) @bci=9, line=1349 (Compiled frame)
       - org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(java.lang.Object, java.lang.reflect.Method, java.lang.Object[]) @bci=248, line=230 (Compiled frame)
       - org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(java.lang.Object, java.lang.reflect.Method, java.lang.Object[]) @bci=4, line=118 (Compiled frame)
       - com.sun.proxy.$Proxy11.getBlockLocations(

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            xinglin Xing Lin
            xinglin Xing Lin
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment