Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Duplicate
-
None
-
None
-
None
-
None
Description
The connection failure log during NM restart is as following:
014-11-12 03:31:20,728 WARN [fetcher#23] org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to ip-172-31-37-212.ec2.internal:13562 with 4 map outputs java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.<init>(HttpClient.java:211) at sun.net.www.http.HttpClient.New(HttpClient.java:308) at sun.net.www.http.HttpClient.New(HttpClient.java:326) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.connect(Fetcher.java:685) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.setupConnectionsWithRetry(Fetcher.java:386) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:292) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193) 2014-11-12 03:31:20,743 INFO [fetcher#22] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=13562/mapOutput?job=job_1415762969065_0001&reduce=3&map=attempt_1415762969065_0001_m_000021_0,attempt_1415762969065_0001_m_000004_0,attempt_1415762969065_0001_m_000018_0,attempt_1415762969065_0001_m_000015_0,attempt_1415762969065_0001_m_000001_0,attempt_1415762969065_0001_m_000009_0,attempt_1415762969065_0001_m_000012_0,attempt_1415762969065_0001_m_000006_0 sent hash and received reply
We have some code to handle the retry logic for connection with a timeout (as below). But if connection get refused quickly, we only try very limited times and it get failed also quickly.
while (true) { try { connection.connect(); break; } catch (IOException ioe) { // update the total remaining connect-timeout connectionTimeout -= unit; // throw an exception if we have waited for timeout amount of time // note that the updated value if timeout is used here if (connectionTimeout == 0) { throw ioe; } // reset the connect timeout for the last try if (connectionTimeout < unit) { unit = connectionTimeout; // reset the connect time out for the final connect connection.setConnectTimeout(unit); } }
We should fix this to make retry can continue until timeout.
Attachments
Attachments
Issue Links
- duplicates
-
MAPREDUCE-6156 Fetcher - connect() doesn't handle connection refused correctly
- Closed