[YARN-3944] Connection refused to nodemanagers are retried at multiple levels - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Won't Fix
Affects Version/s: 2.6.0
Fix Version/s: None
Component/s: None
Labels: None

Description

This is related to YARN-3238. When the NodeManager (NM) is down, the IPC client gets a ConnectException:

Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
at org.apache.hadoop.ipc.Client.call(Client.java:1438)

However, retries happen at two layers (the IPC client retries 40 times and serverProxy retries 91 times), so the retries for a single call could go on for about 1 hour in total.
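The effect of stacking the two retry layers can be estimated with a little arithmetic. The sketch below is only an illustration, not code from Hadoop or from the attached patch: the class and method names are hypothetical, and the 1-second pause per IPC attempt is an assumption chosen because it is consistent with the roughly 1 hour figure reported here. Any sleep between serverProxy retries is ignored.

```java
import java.time.Duration;

// Hypothetical helper for estimating the cost of nested retry loops.
class NestedRetryEstimate {

    // Total connection attempts when an outer retry loop wraps an inner one:
    // every outer retry runs the full set of inner retries again.
    static long totalAttempts(int innerAttempts, int outerAttempts) {
        return (long) innerAttempts * outerAttempts;
    }

    // Worst-case elapsed time, assuming a fixed pause per inner attempt
    // and no additional pause between outer attempts.
    static Duration worstCase(int innerAttempts, int outerAttempts, Duration pausePerAttempt) {
        return pausePerAttempt.multipliedBy(totalAttempts(innerAttempts, outerAttempts));
    }

    public static void main(String[] args) {
        int ipcRetries = 40;          // from the issue description
        int serverProxyRetries = 91;  // from the issue description
        Duration assumedPause = Duration.ofSeconds(1); // assumption, not from the issue

        long attempts = totalAttempts(ipcRetries, serverProxyRetries);
        Duration total = worstCase(ipcRetries, serverProxyRetries, assumedPause);

        // prints "3640 attempts, about 60 minutes"
        System.out.println(attempts + " attempts, about " + total.toMinutes() + " minutes");
    }
}
```

Under these assumptions the two layers multiply rather than add: 40 x 91 = 3640 attempts, which at one second each is just over an hour.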

Attachments

YARN-3944.v1.patch, 20/Jul/15 22:02, 1 kB, Siqi Li

Issue Links

relates to: YARN-3238 Connection timeouts to nodemanagers are retried at multiple levels (Closed)

People

Assignee: Siqi Li
Reporter: Siqi Li
Votes: 0
Watchers: 13

Dates

Created: 20/Jul/15 21:42
Updated: 28/Sep/15 17:37
Resolved: 28/Sep/15 17:33