[TEZ-3187] Pig on tez hang with java.io.IOException: Connection reset by peer - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.8.2
Fix Version/s: None
Component/s: None
Labels: None
Environment:

Hadoop 2.5.0
Pig 0.15.0
Tez 0.8.2

Description

We are experiencing occasional application hangs when testing an existing Pig MapReduce script executing on Tez. When a hang occurs, we find the following in the syslog for the executing DAG:

2016-03-21 16:39:01,643 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000822, containerExpiryTime=1458603541415, idleTimeout=5000, taskRequestsCount=0, heldContainers=112, delayedContainers=27, isNew=false
2016-03-21 16:39:01,825 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000824, containerExpiryTime=1458603541692, idleTimeout=5000, taskRequestsCount=0, heldContainers=111, delayedContainers=26, isNew=false
2016-03-21 16:39:01,990 [INFO] Socket Reader #1 for port 53324 |ipc.Server|: Socket Reader #1 for port 53324: readAndProcess from client 10.102.173.86 threw exception [java.io.IOException: Connection reset by peer]
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593)
at org.apache.hadoop.ipc.Server.access$2800(Server.java:135)
at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471)
at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762)
at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636)
at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607)
2016-03-21 16:39:02,032 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000811, containerExpiryTime=1458603541828, idleTimeout=5000, taskRequestsCount=0, heldContainers=110, delayedContainers=25, isNew=false

In every case I have been able to analyze so far, the hang also correlates with a warning logged on the node identified in the IOException:

2016-03-21 16:36:13,641 [WARN] [I/O Setup 2 Initialize: {scope-178}] |retry.RetryInvocationHandler|: A failover has occurred since the start of this method invocation attempt.

However, it does not appear that any namenode failover has actually occurred (the most recent failover we see in logs is from 2015).
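
The correlation described above can be checked mechanically across many runs. The following is a minimal sketch, assuming the DAG syslog and the per-host logs use exactly the line formats quoted in this report and that each event is written on a single line. The function names (`find_resets`, `find_failover_warnings`) and the regular expressions are illustrative, not part of Tez or Hadoop.

```python
import re
from datetime import datetime

# Both patterns are modelled on the two log lines quoted in this report.
# Other Tez/Hadoop versions may format these messages differently.
TS = r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"
RESET_RE = re.compile(
    r"^" + TS + r" .*readAndProcess from client (?P<host>\d+\.\d+\.\d+\.\d+) "
    r"threw exception \[java\.io\.IOException: Connection reset by peer\]"
)
FAILOVER_RE = re.compile(
    r"^" + TS + r" \[WARN\].*"
    r"A failover has occurred since the start of this method invocation attempt"
)


def parse_ts(text):
    """Parse a log timestamp such as '2016-03-21 16:39:01,990'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


def find_resets(lines):
    """Return (timestamp, client_ip) for each 'Connection reset by peer' line."""
    found = []
    for line in lines:
        match = RESET_RE.match(line)
        if match:
            found.append((parse_ts(match.group("ts")), match.group("host")))
    return found


def find_failover_warnings(lines):
    """Return the timestamp of each RetryInvocationHandler failover warning."""
    found = []
    for line in lines:
        match = FAILOVER_RE.match(line)
        if match:
            found.append(parse_ts(match.group("ts")))
    return found
```

To use it, run `find_resets` over the lines of the DAG syslog to obtain the client address, then run `find_failover_warnings` over the aggregated logs from that host and compare timestamps. For the excerpts quoted here, the warning (16:36:13,641) precedes the connection reset (16:39:01,990) by roughly 168 seconds.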

Attached:
syslog_dag_1437886552023_169758_3.gz: syslog for the dag which hangs
10.102.173.86.logs.gz: aggregated logs from the host identified in the IOException

Attachments


- task_attempts.tar.gz, 31/Mar/16 17:39, 193 kB, Kurt Muehlner
- stack.application_1437886552023_171131.out, 25/Mar/16 18:26, 124 kB, Kurt Muehlner
- dag_1437886552023_169758_3.dot, 24/Mar/16 18:55, 15 kB, Kurt Muehlner
- TEZ-3187.incomplete-tasks.txt, 24/Mar/16 18:37, 10 kB, Hitesh Shah
- 10.102.173.86.logs.gz, 24/Mar/16 16:48, 276 kB, Kurt Muehlner
- syslog_dag_1437886552023_169758_3.gz, 24/Mar/16 16:48, 507 kB, Kurt Muehlner

Issue Links

is related to: PIG-4869 Removing unwanted configuration in Tez broke ConfiguredFailoverProxyProvider (Closed)

People

Assignee: Unassigned

Reporter: Kurt Muehlner

Votes: 0

Watchers: 5

Dates

Created: 24/Mar/16 16:47

Updated: 12/Apr/16 19:10