Hadoop Common
  1. Hadoop Common
  2. HADOOP-9654

IPC timeout doesn't seem to be kicking in

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 2.1.0-beta
    • Fix Version/s: None
    • Component/s: ipc
    • Labels:
      None

      Description

      During my Bigtop testing I made the NN OOM. This, in turn, made all of the clients stuck in the IPC call (even the new clients that I run after the NN went OOM). Here's an example of a jstack output on the client that was running:

      $ hadoop fs -lsr /
      

      Stacktrace:

      /usr/java/jdk1.6.0_21/bin/jstack 19078
      2013-06-19 23:14:00
      Full thread dump Java HotSpot(TM) 64-Bit Server VM (17.0-b16 mixed mode):
      
      "Attach Listener" daemon prio=10 tid=0x00007fcd8c8c1800 nid=0x5105 waiting on condition [0x0000000000000000]
         java.lang.Thread.State: RUNNABLE
      
      "IPC Client (1223039541) connection to ip-10-144-82-213.ec2.internal/10.144.82.213:17020 from root" daemon prio=10 tid=0x00007fcd8c7ea000 nid=0x4aa0 runnable [0x00007fcd443e2000]
         java.lang.Thread.State: RUNNABLE
      	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
      	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
      	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
      	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
      	- locked <0x00007fcd7529de18> (a sun.nio.ch.Util$1)
      	- locked <0x00007fcd7529de00> (a java.util.Collections$UnmodifiableSet)
      	- locked <0x00007fcd7529da80> (a sun.nio.ch.EPollSelectorImpl)
      	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
      	at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
      	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
      	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
      	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
      	at java.io.FilterInputStream.read(FilterInputStream.java:116)
      	at java.io.FilterInputStream.read(FilterInputStream.java:116)
      	at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:421)
      	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
      	at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
      	- locked <0x00007fcd752aaf18> (a java.io.BufferedInputStream)
      	at java.io.DataInputStream.readInt(DataInputStream.java:370)
      	at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:943)
      	at org.apache.hadoop.ipc.Client$Connection.run(Client.java:840)
      
      "Low Memory Detector" daemon prio=10 tid=0x00007fcd8c090000 nid=0x4a9b runnable [0x0000000000000000]
         java.lang.Thread.State: RUNNABLE
      
      "CompilerThread1" daemon prio=10 tid=0x00007fcd8c08d800 nid=0x4a9a waiting on condition [0x0000000000000000]
         java.lang.Thread.State: RUNNABLE
      
      "CompilerThread0" daemon prio=10 tid=0x00007fcd8c08a800 nid=0x4a99 waiting on condition [0x0000000000000000]
         java.lang.Thread.State: RUNNABLE
      
      "Signal Dispatcher" daemon prio=10 tid=0x00007fcd8c088800 nid=0x4a98 runnable [0x0000000000000000]
         java.lang.Thread.State: RUNNABLE
      
      "Finalizer" daemon prio=10 tid=0x00007fcd8c06a000 nid=0x4a97 in Object.wait() [0x00007fcd902e9000]
         java.lang.Thread.State: WAITING (on object monitor)
      	at java.lang.Object.wait(Native Method)
      	- waiting on <0x00007fcd75fc0470> (a java.lang.ref.ReferenceQueue$Lock)
      	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
      	- locked <0x00007fcd75fc0470> (a java.lang.ref.ReferenceQueue$Lock)
      	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
      	at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
      
      "Reference Handler" daemon prio=10 tid=0x00007fcd8c068000 nid=0x4a96 in Object.wait() [0x00007fcd903ea000]
         java.lang.Thread.State: WAITING (on object monitor)
      	at java.lang.Object.wait(Native Method)
      	- waiting on <0x00007fcd75fc0550> (a java.lang.ref.Reference$Lock)
      	at java.lang.Object.wait(Object.java:485)
      	at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
      	- locked <0x00007fcd75fc0550> (a java.lang.ref.Reference$Lock)
      
      "main" prio=10 tid=0x00007fcd8c00a800 nid=0x4a92 in Object.wait() [0x00007fcd91b06000]
         java.lang.Thread.State: WAITING (on object monitor)
      	at java.lang.Object.wait(Native Method)
      	- waiting on <0x00007fcd752528e8> (a org.apache.hadoop.ipc.Client$Call)
      	at java.lang.Object.wait(Object.java:485)
      	at org.apache.hadoop.ipc.Client.call(Client.java:1284)
      	- locked <0x00007fcd752528e8> (a org.apache.hadoop.ipc.Client$Call)
      	at org.apache.hadoop.ipc.Client.call(Client.java:1250)
      	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:204)
      	at $Proxy9.getFileInfo(Unknown Source)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      	at java.lang.reflect.Method.invoke(Method.java:597)
      	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
      	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
      	at $Proxy9.getFileInfo(Unknown Source)
      	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:649)
      	at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1599)
      	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:838)
      	at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1684)
      	at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1630)
      	at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1605)
      	at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:326)
      	at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:224)
      	at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:207)
      	at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190)
      	at org.apache.hadoop.fs.shell.Command.run(Command.java:154)
      	at org.apache.hadoop.fs.FsShell.run(FsShell.java:255)
      	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
      	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
      	at org.apache.hadoop.fs.FsShell.main(FsShell.java:305)
      
      "VM Thread" prio=10 tid=0x00007fcd8c064000 nid=0x4a95 runnable 
      
      "GC task thread#0 (ParallelGC)" prio=10 tid=0x00007fcd8c01d800 nid=0x4a93 runnable 
      
      "GC task thread#1 (ParallelGC)" prio=10 tid=0x00007fcd8c01f800 nid=0x4a94 runnable 
      
      "VM Periodic Task Thread" prio=10 tid=0x00007fcd8c09a800 nid=0x4a9c waiting on condition 
      
      JNI global references: 1086
      

        Activity

        Hide
        Jagane Sundar added a comment -

        Roman - pardon me if you already know this and are configuring your BigTop test correctly. If you take a look at HDFS-4646 and HDFS-4858, I have observed similar failure to timeout issues with both the HDFS Client to NameNode ipc (HDFS-4646) and the Datanode to NameNode ipc (HDFS-4858).

        By default ipc.client.ping is true. The meaning of this is that the IPC layer is to send out a periodic ping but to never timeout.

        In order to timeout, ipc.client.ping needs to be configured false and ipc.ping.interval needs to be set to some value e.g. 14000. This configuration means that the IPC Client should timeout in 14000. Is BigTop configuring hadoop so?

        Show
        Jagane Sundar added a comment - Roman - pardon me if you already know this and are configuring your BigTop test correctly. If you take a look at HDFS-4646 and HDFS-4858 , I have observed similar failure to timeout issues with both the HDFS Client to NameNode ipc ( HDFS-4646 ) and the Datanode to NameNode ipc ( HDFS-4858 ). By default ipc.client.ping is true. The meaning of this is that the IPC layer is to send out a periodic ping but to never timeout. In order to timeout, ipc.client.ping needs to be configured false and ipc.ping.interval needs to be set to some value e.g. 14000. This configuration means that the IPC Client should timeout in 14000. Is BigTop configuring hadoop so?
        Hide
        Roman Shaposhnik added a comment -

        Jagane Sundar as a matter of fact I didn't know that – thanks a million for bringing this up! I can definitely give your suggestion a try (the NN keeps OOMing – which gives me a perfect testbed for this).

        I do have a question for the rest of the folks here though – a client that never times out doesn't strike me as a great default. Am I missing something? Should we change the default for the client to actually timeout?

        Show
        Roman Shaposhnik added a comment - Jagane Sundar as a matter of fact I didn't know that – thanks a million for bringing this up! I can definitely give your suggestion a try (the NN keeps OOMing – which gives me a perfect testbed for this). I do have a question for the rest of the folks here though – a client that never times out doesn't strike me as a great default. Am I missing something? Should we change the default for the client to actually timeout?

          People

          • Assignee:
            Unassigned
            Reporter:
            Roman Shaposhnik
          • Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

            • Created:
              Updated:

              Development