  1. Hadoop Map/Reduce
  2. MAPREDUCE-6437

Add retry on connection exceptions during the job commit phase

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Job commit failed: java.net.ConnectException: Call From TS-DN-167/172.22.5.167 to SHYF-H11-BH03:52310 failed on connection exception: java.net.ConnectException: Connection timed out; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
      at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
      at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
      at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
      at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
      at org.apache.hadoop.ipc.Client.call(Client.java:1415)
      at org.apache.hadoop.ipc.Client.call(Client.java:1364)
      at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
      at com.sun.proxy.$Proxy14.create(Unknown Source)
      at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:287)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:606)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
      at com.sun.proxy.$Proxy15.create(Unknown Source)
      at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1645)
      at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1627)
      at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1552)
      at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:396)
      at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:392)
      at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
      at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:392)
      at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:336)
      at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
      at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
      at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
      at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.touchz(CommitterEventHandler.java:244)
      at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobCommit(CommitterEventHandler.java:250)
      at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:216)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      at java.lang.Thread.run(Thread.java:745)
      Caused by: java.net.ConnectException: Connection timed out
      at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
      at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
      at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
      at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
      at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
      at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
      at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
      at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
      at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
      at org.apache.hadoop.ipc.Client.call(Client.java:1382)
      ... 28 more
      

      Looking at the code, when the job commit hits a connection problem like the one above, the commit simply fails and there is no chance of another application master (AM) attempt. Could we identify this kind of exception and either retry the call or kick off another AM attempt?
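To make the request concrete, here is a minimal sketch of the "identify the exception and retry" option. It is not Hadoop code and not a proposed patch: the class `CommitRetrySketch`, the method `retryOnConnectException`, and the `maxAttempts` and `sleepMillis` parameters are all hypothetical names chosen for illustration. It assumes the wrapped operation is safe to repeat.

```java
import java.net.ConnectException;
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration only; not part of Hadoop. */
final class CommitRetrySketch {

    private CommitRetrySketch() {}

    /**
     * Runs the action and retries it only when the failure is (or is caused by)
     * a java.net.ConnectException. Any other failure is rethrown immediately.
     * At most maxAttempts calls are made in total.
     */
    static <T> T retryOnConnectException(Callable<T> action, int maxAttempts, long sleepMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                // Give up on non-connection errors or once the attempt budget is spent.
                if (!isConnectFailure(e) || attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(sleepMillis); // fixed pause before the next attempt
            }
        }
    }

    /** Returns true if t or any exception in its cause chain is a ConnectException. */
    static boolean isConnectFailure(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (cur instanceof ConnectException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated commit step that times out twice before succeeding.
        String result = retryOnConnectException(() -> {
            if (++calls[0] < 3) {
                throw new ConnectException("Connection timed out");
            }
            return "committed";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Walking the cause chain matters here because the trace above shows the original `ConnectException` wrapped in a second one with a longer message. Whether a retry like this or a fresh AM attempt is the better fix is left open, as in the question above.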

              People

              • Assignee: Unassigned
              • Reporter: Bing Jiang (jiangbinglover)
              • Votes: 0
              • Watchers: 8
