Apache Tez / TEZ-3187

Pig on Tez hangs with java.io.IOException: Connection reset by peer


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.8.2
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment: Hadoop 2.5.0
      Pig 0.15.0
      Tez 0.8.2

    Description

      We are experiencing occasional application hangs when testing an existing Pig MapReduce script executing on Tez. When this occurs, we find the following in the syslog for the executing DAG:

      2016-03-21 16:39:01,643 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000822, containerExpiryTime=1458603541415, idleTimeout=5000, taskRequestsCount=0, heldContainers=112, delayedContainers=27, isNew=false
      2016-03-21 16:39:01,825 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000824, containerExpiryTime=1458603541692, idleTimeout=5000, taskRequestsCount=0, heldContainers=111, delayedContainers=26, isNew=false
      2016-03-21 16:39:01,990 [INFO] Socket Reader #1 for port 53324 |ipc.Server|: Socket Reader #1 for port 53324: readAndProcess from client 10.102.173.86 threw exception [java.io.IOException: Connection reset by peer]
      java.io.IOException: Connection reset by peer
      at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
      at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
      at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
      at sun.nio.ch.IOUtil.read(IOUtil.java:197)
      at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
      at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593)
      at org.apache.hadoop.ipc.Server.access$2800(Server.java:135)
      at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471)
      at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762)
      at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636)
      at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607)
      2016-03-21 16:39:02,032 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000811, containerExpiryTime=1458603541828, idleTimeout=5000, taskRequestsCount=0, heldContainers=110, delayedContainers=25, isNew=false

      In every case I have been able to analyze so far, this also correlates with a warning on the node identified in the IOException (10.102.173.86):

      2016-03-21 16:36:13,641 [WARN] [I/O Setup 2 Initialize: {scope-178}] |retry.RetryInvocationHandler|: A failover has occurred since the start of this method invocation attempt.

      However, it does not appear that any namenode failover has actually occurred (the most recent failover we see in logs is from 2015).
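A minimal sketch for repeating this correlation on other hung DAGs: collect the client IPs that reset their IPC connection in the DAG syslog, then pull the timestamps of "failover has occurred" warnings from a host's aggregated log. It assumes the log lines look exactly like the excerpts quoted above and that the logs are read as plain text; the function names are hypothetical and not part of Tez or Hadoop.

```python
import re
from typing import Iterable

# Matches Hadoop IPC server lines such as:
#   ... readAndProcess from client 10.102.173.86 threw exception [java.io.IOException: Connection reset by peer]
RESET_RE = re.compile(
    r"readAndProcess from client (?P<ip>\d{1,3}(?:\.\d{1,3}){3}) threw exception "
    r"\[java\.io\.IOException: Connection reset by peer\]"
)

# Matches the RetryInvocationHandler warning seen on the client host, capturing its timestamp.
FAILOVER_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \[WARN\].*"
    r"\|retry\.RetryInvocationHandler\|: A failover has occurred"
)


def reset_client_ips(dag_syslog: Iterable[str]) -> list[str]:
    """Hypothetical helper: client IPs whose IPC connection was reset, in order of first appearance."""
    seen: dict[str, None] = {}  # dict preserves insertion order and de-duplicates
    for line in dag_syslog:
        m = RESET_RE.search(line)
        if m:
            seen.setdefault(m.group("ip"), None)
    return list(seen)


def failover_warnings(host_log: Iterable[str]) -> list[str]:
    """Hypothetical helper: timestamps of 'failover has occurred' warnings in one host's log."""
    return [m.group("ts") for line in host_log if (m := FAILOVER_RE.search(line))]
```

The attached .gz files can be passed in directly via `gzip.open(path, "rt")`, which yields text lines; `failover_warnings` would then be run against the aggregated log of each IP returned by `reset_client_ips`.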

      Attached:
      syslog_dag_1437886552023_169758_3.gz: syslog for the dag which hangs
      10.102.173.86.logs.gz: aggregated logs from the host identified in the IOException

      Attachments

        1. task_attempts.tar.gz
          193 kB
          Kurt Muehlner
        2. stack.application_1437886552023_171131.out
          124 kB
          Kurt Muehlner
        3. dag_1437886552023_169758_3.dot
          15 kB
          Kurt Muehlner
        4. TEZ-3187.incomplete-tasks.txt
          10 kB
          Hitesh Shah
        5. 10.102.173.86.logs.gz
          276 kB
          Kurt Muehlner
        6. syslog_dag_1437886552023_169758_3.gz
          507 kB
          Kurt Muehlner


      People

        Assignee: Unassigned
        Reporter: Kurt Muehlner (kmuehlner)
        Votes: 0
        Watchers: 5
