  1. Hadoop Map/Reduce
  2. MAPREDUCE-6157

Connection failure in shuffle (due to NM down) could break the current retry logic for tolerating NM restart.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Duplicate

    Description

      The connection failure log during an NM restart is as follows:

      2014-11-12 03:31:20,728 WARN [fetcher#23] org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to ip-172-31-37-212.ec2.internal:13562 with 4 map outputs
      java.net.ConnectException: Connection refused
              at java.net.PlainSocketImpl.socketConnect(Native Method)
              at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
              at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
              at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
              at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
              at java.net.Socket.connect(Socket.java:579)
              at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
              at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
              at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
              at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
              at sun.net.www.http.HttpClient.New(HttpClient.java:308)
              at sun.net.www.http.HttpClient.New(HttpClient.java:326)
              at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
              at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
              at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
              at org.apache.hadoop.mapreduce.task.reduce.Fetcher.connect(Fetcher.java:685)
              at org.apache.hadoop.mapreduce.task.reduce.Fetcher.setupConnectionsWithRetry(Fetcher.java:386)
              at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:292)
              at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
      2014-11-12 03:31:20,743 INFO [fetcher#22] org.apache.hadoop.mapreduce.task.reduce.Fetcher: for url=13562/mapOutput?job=job_1415762969065_0001&reduce=3&map=attempt_1415762969065_0001_m_000021_0,attempt_1415762969065_0001_m_000004_0,attempt_1415762969065_0001_m_000018_0,attempt_1415762969065_0001_m_000015_0,attempt_1415762969065_0001_m_000001_0,attempt_1415762969065_0001_m_000009_0,attempt_1415762969065_0001_m_000012_0,attempt_1415762969065_0001_m_000006_0 sent hash and received reply
      

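      We have code that retries the connection until a timeout budget is used up (see below). However, the budget is reduced by a full unit on every failed attempt, no matter how long that attempt actually took. If the connection is refused immediately, we retry only a few times and then fail almost at once.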

      while (true) {
        try {
          connection.connect();
          break;
        } catch (IOException ioe) {
          // update the total remaining connect-timeout
          connectionTimeout -= unit;

          // throw an exception if we have waited for timeout amount of time
          // note that the updated value if timeout is used here
          if (connectionTimeout == 0) {
            throw ioe;
          }

          // reset the connect timeout for the last try
          if (connectionTimeout < unit) {
            unit = connectionTimeout;
            // reset the connect time out for the final connect
            connection.setConnectTimeout(unit);
          }
        }
      }
      

      We should fix this so that retries continue until the timeout has actually elapsed.
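
      To illustrate the idea, here is a minimal sketch of a retry loop that measures real elapsed time instead of charging a full unit per failure. It is not the attached patch and not Hadoop's Fetcher API: the class ConnectRetrySketch, the Connector interface, and the backoffMs parameter are all hypothetical names introduced for this example. For comparison, under the loop quoted above, an assumed budget of 180 s with a 60 s unit would give up after three refused attempts, even if each refusal returned within milliseconds.

```java
import java.io.IOException;
import java.net.ConnectException;

/** Sketch only: every name here is hypothetical, not part of Hadoop. */
final class ConnectRetrySketch {

  /** One connection attempt with a per-attempt timeout; throws on failure. */
  interface Connector {
    void connect(int timeoutMs) throws IOException;
  }

  /**
   * Retries until the wall-clock deadline passes, pausing briefly between
   * attempts so a fast "connection refused" does not burn through the budget.
   * Returns the number of attempts made before success.
   */
  static int connectWithDeadline(Connector connector, long totalTimeoutMs,
                                 int unitMs, long backoffMs)
      throws IOException, InterruptedException {
    if (totalTimeoutMs < 0) {
      throw new IOException("Invalid timeout [timeout = " + totalTimeoutMs + " ms]");
    }
    long deadline = System.nanoTime() + totalTimeoutMs * 1_000_000L;
    int attempts = 0;
    while (true) {
      long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
      // Never pass 0 as the per-attempt timeout: for URLConnection, 0 means "wait forever".
      int perAttemptMs = (int) Math.max(1, Math.min(unitMs, remainingMs));
      attempts++;
      try {
        connector.connect(perAttemptMs);
        return attempts;
      } catch (IOException ioe) {
        // Charge only the time that really passed, not a full unit.
        remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
        if (remainingMs <= 0) {
          throw ioe;
        }
        Thread.sleep(Math.min(backoffMs, remainingMs));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Simulated NM restart: the first five attempts are refused immediately.
    final int[] calls = {0};
    int attempts = connectWithDeadline(timeoutMs -> {
      if (calls[0]++ < 5) {
        throw new ConnectException("Connection refused");
      }
    }, 2000, 1000, 10);
    System.out.println("connected after " + attempts + " attempts");
  }
}
```

      In real code the Connector would wrap connection.setConnectTimeout(timeoutMs) followed by connection.connect(). The pause between attempts is a design choice of this sketch: without it, a host that refuses instantly would be hammered in a tight loop until the deadline.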

      Attachments

        1. MAPREDUCE-6157.patch
          2 kB
          Junping Du

            People

              junping_du Junping Du
              junping_du Junping Du