Hadoop Common: HADOOP-7488

When the NameNode network is unplugged, DFSClient operations wait forever

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ipc
    • Labels: None

      Description

      When the NN/DN is shut down gracefully, DFSClient operations that are waiting for a response from the NN/DN will throw an exception and return quickly.

      But when the NN/DN network is unplugged, DFSClient operations that are waiting for a response from the NN/DN wait forever.
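      (For illustration only; this is plain Java, not Hadoop code. A graceful close is visible to a blocked reader almost immediately, while an unplugged peer sends no FIN/RST, so a blocking read with no SO_TIMEOUT simply never returns. The address below is hypothetical.)

          import java.io.InputStream;
          import java.net.InetSocketAddress;
          import java.net.Socket;

          public class BlockingReadDemo {
            public static void main(String[] args) throws Exception {
              Socket s = new Socket();
              s.connect(new InetSocketAddress("10.18.52.181", 9000)); // hypothetical NN address
              // s.setSoTimeout(60000); // with a read timeout, a dead link would
              //                        // surface as SocketTimeoutException in 60 s
              InputStream in = s.getInputStream();
              // Graceful peer shutdown: read() returns -1 promptly.
              // Unplugged peer: no packet ever arrives, so read() blocks forever.
              int b = in.read();
              System.out.println("read returned " + b);
            }
          }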

      Attachments

    1. HADOOP-7488.patch
       6 kB
       Uma Maheswara Rao G

        Issue Links

          Activity

          Uma Maheswara Rao G added a comment -

          We can make use of HADOOP-6889; marking this as a duplicate of it.

          Uma Maheswara Rao G added a comment -

          Hi Konstantin,

          I would like your opinion on this. Can you take a look?

          Thanks
          Uma

          Uma Maheswara Rao G added a comment -

          Hi Konstantin,

          Thanks a lot for taking a look at this issue.

          > If rpcTimeout > 0 then handleTimeout() will throw SocketTimeoutException instead of going into the ping loop. Can you control the required behavior by setting rpcTimeout > 0 rather than introducing the # of pings limit?

          Yes, we can control it with this parameter as well.

          I am planning to add the code below in the DataNode when getting the proxy:

              // get NN proxy
              DatanodeProtocol dnp =
                  (DatanodeProtocol) RPC.waitForProxy(DatanodeProtocol.class,
                      DatanodeProtocol.versionID, nnAddr, conf, socketTimeout,
                      Long.MAX_VALUE);

          Here socketTimeout is the rpcTimeout. This property is already used for createInterDataNodeProtocolProxy as the rpcTimeout:

              this.socketTimeout = conf.getInt(DFS_CLIENT_SOCKET_TIMEOUT_KEY,
                  HdfsConstants.READ_TIMEOUT);

          But my question is: if I use socketTimeout (default 60*1000 ms) as the rpcTimeout, the default behaviour will change, and I don't want to change the default behaviour here.
          Any suggestion for this?
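          (For illustration: one way to opt in without changing the default would be a dedicated configuration key that defaults to 0, so the existing infinite-ping behaviour stays unless it is set. The key name below is hypothetical; the waitForProxy overload is the one quoted above.)

              // Hypothetical key; the default of 0 keeps today's behaviour (ping forever).
              int nnRpcTimeout = conf.getInt("dfs.datanode.namenode.rpc.timeout", 0);

              DatanodeProtocol dnp =
                  (DatanodeProtocol) RPC.waitForProxy(DatanodeProtocol.class,
                      DatanodeProtocol.versionID, nnAddr, conf, nnRpcTimeout,
                      Long.MAX_VALUE);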

          > DataNodes and TaskTrackers are designed to ping the NN and JT infinitely, because during startup you cannot predict when the NN will come online, as it depends on the size of the image and edits. Also, when the NN becomes busy it is important for DNs to keep retrying rather than assuming the NN is dead.

          Yes. But there are scenarios, like a network unplug, that can throw timeouts, and because of the timeout handling the system will be blocked unnecessarily for a long time.
          As far as I know, even if we throw that timeout exception out to the JT or DN, they will handle it and retry again in their offerService methods, except in the condition below:

              catch (RemoteException re) {
                String reClass = re.getClassName();
                if (UnregisteredNodeException.class.getName().equals(reClass) ||
                    DisallowedDatanodeException.class.getName().equals(reClass) ||
                    IncorrectVersionException.class.getName().equals(reClass)) {
                  LOG.warn("blockpool " + blockPoolId + " is shutting down", re);
                  shouldServiceRun = false;
                  return;
                }

          > And even if they don't, this should be an HDFS change, not a generic IPC change, which affects many Hadoop components.

          What I felt is that this particular issue applies to all components that use Hadoop IPC. I also planned to retain the default behaviour as it is, so as not to affect the other components; if a user really requires it, he can tune the configuration parameter to his requirement.

          Anyway, we decided to use rpcTimeout, right? Only the IPC user code should pass this value. In that case this becomes an HDFS-specific change. We also need to check MapReduce (the same situation applies to the JT).

          > As for HA, I don't know what you did for HA and therefore cannot understand what problem you are trying to solve here. I can guess that you want DNs to switch to another NN when they time out rather than retrying. In this case you should be able to use rpcTimeout.

          Yes, your guess is correct.
          In our HA solution we are using the BackupNode, and the switching framework is a ZooKeeper-based LeaderElection. DNs are configured with both the active and standby node addresses. On any failure, DNs will try to switch to the other NN.
          The scenario here is: we unplugged the active NN's network card, and then all DNs were blocked for a long time.

          --Thanks

          Konstantin Shvachko added a comment -

          If rpcTimeout > 0 then handleTimeout() will throw SocketTimeoutException instead of going into the ping loop. Can you control the required behavior by setting rpcTimeout > 0 rather than introducing the # of pings limit?

          DataNodes and TaskTrackers are designed to ping the NN and JT infinitely, because during startup you cannot predict when the NN will come online, as it depends on the size of the image and edits. Also, when the NN becomes busy it is important for DNs to keep retrying rather than assuming the NN is dead.

          For DFSClient this may make sense, but I think they already time out. At least DFSShell ls does. And even if they don't, this should be an HDFS change, not a generic IPC change, which affects many Hadoop components.

          As for HA, I don't know what you did for HA and therefore cannot understand what problem you are trying to solve here. I can guess that you want DNs to switch to another NN when they time out rather than retrying. In this case you should be able to use rpcTimeout.
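          (For context, the ping/timeout decision being discussed lives in ipc.Client's PingInputStream; paraphrased and simplified, the post-HADOOP-6889 logic is roughly:)

              // Paraphrased from o.a.h.ipc.Client, simplified for context.
              private void handleTimeout(SocketTimeoutException e) throws IOException {
                if (shouldCloseConnection.get() || !running.get() || rpcTimeout > 0) {
                  throw e;      // connection closing, client stopped, or rpcTimeout set
                } else {
                  sendPing();   // keep the connection alive and retry the read
                }
              }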

          Uma Maheswara Rao G added a comment -


          Updated a patch for review.

          Uma Maheswara Rao G added a comment -

          Thanks John for taking a look at this issue.
          Updated a patch for review.

          This patch introduces a property (max.ping.retries.on.socket.timeout). The default value is -1, which means the property is disabled.

          In this scenario, if we unplug the network cable between the nodes, the ping reads get timeouts continuously. The SocketTimeoutExceptions were handled and retried infinitely, so the client was waiting for a long time.

          Now, to avoid this problem, we can configure the number of ping retries.
          Since continuous timeouts mean something is wrong in the network/cluster, we can restrict the retries by configuring the above property.

          By default this property is disabled.

          --Thanks
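          (A sketch of the idea, not the attached patch itself; the field and method names here are illustrative. The property name and its -1 = disabled default follow the description above.)

              // -1 (the default) disables the limit and keeps today's behaviour.
              private final int maxPingRetries =
                  conf.getInt("max.ping.retries.on.socket.timeout", -1);

              public int read(byte[] buf, int off, int len) throws IOException {
                int timeouts = 0;
                do {
                  try {
                    return super.read(buf, off, len);
                  } catch (SocketTimeoutException e) {
                    timeouts++;
                    if (maxPingRetries >= 0 && timeouts > maxPingRetries) {
                      throw e;          // give up after the configured number of pings
                    }
                    handleTimeout(e);   // otherwise ping and keep waiting, as before
                  }
                } while (true);
              }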

          John George added a comment -

          > Uma Maheswara Rao G commented on HADOOP-6889:
          > ---------------------------------------------
          >
          > Hi John,
          >
          > I have seen waitForProxy is passing 0 as rpcTimeOut. It is hardcoded value.
          >
          >

          > return waitForProtocolProxy(protocol, clientVersion, addr, conf, 0,
          > connTimeout);
          > 

          If you want to control this value, you could use the waitForProtocolProxy() that accepts "rpcTimeout" as an argument. You could pass in any value (e.g. "DFS_CLIENT_SOCKET_TIMEOUT_KEY") as rpcTimeout (though that means it will time out within that time instead of retrying).

            public static <T> ProtocolProxy<T> waitForProtocolProxy(Class<T> protocol,
                                         long clientVersion,
                                         InetSocketAddress addr, Configuration conf,
                                         int rpcTimeout,
                                         long timeout) throws IOException {
          

          > If the user wants to control this value, then how can he configure it?

          HADOOP-6889 ensures that any communication to/from a DN (DFSClient->DN and DN->DN) times out within rpcTimeout. If a user wants to control this value from configuration, it can be done the way it is done today; for example, both of these use the "DFS_CLIENT_SOCKET_TIMEOUT_KEY" configuration value for the timeout. Like you said, this change does not change any timeout mechanisms for NN communication.
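          (An illustrative call site for the overload quoted above, wiring the conf-driven value in as rpcTimeout instead of the hardcoded 0; the constant names are the ones already used in this thread.)

              int rpcTimeout = conf.getInt(DFS_CLIENT_SOCKET_TIMEOUT_KEY,
                  HdfsConstants.READ_TIMEOUT);

              ProtocolProxy<DatanodeProtocol> proxy = RPC.waitForProtocolProxy(
                  DatanodeProtocol.class, DatanodeProtocol.versionID,
                  nnAddr, conf, rpcTimeout, Long.MAX_VALUE);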

          >
          > Here we have a situation where clients are waiting for a long time. HDFS-1880.

          Based on the attached trace, I can see that the DN is trying to reconnect to the NN because it wants to send heartbeats to the NN. When you say client, do you mean the DFSClient is also waiting while doing the same thing, trying to communicate with the NN? For "connection timeouts", the maximum time a client waits per attempt cycle is close to 15 minutes (45 retries, with each connect() taking 20 seconds). For IOExceptions, it should not try for more than 4 minutes or so.
          In the trace attached here, you can see that it is an "IOException" and not a "SocketTimeoutException". Whenever an IOException is encountered, the client tries "ipc.client.connect.max.retries" times before it gives up, which can be controlled by conf. As you can see, it does give up after 10 retries, but since the DN keeps trying to send heartbeats, it keeps doing so even after each failure.

          conf.getInt("ipc.client.connect.max.retries", 10)
          

          >
          > I thought, HADOOP-6889 can solve that problem. But how this can be controlled
          > by the user in Hadoop (looks no configuration parameters available).
          >

          > I plan to add a new configuration ipc.client.max.pings that specifies the max
          > number of pings that a client could try. If a response can not be received
          > after the specified max number of pings, a SocketTimeoutException is thrown.
          > If this configuration property is not set, a client maintains the current
          > semantics, waiting forever.

          >
          > We have chosen this implementation for our cluster.
          >
          > I am just checking , whether i can use rpcTimeOut itself to control. ( since
          > this change already committed).
          >
          > Can you please clarify more?

          If you just want to fail the call after a certain number of pings, introducing this new value "max.pings" might be a good idea. Using rpcTimeout just sets the socket timeout to "rpcTimeout"; no pings are sent at all.

          >
          > Can you just check HDFS-1880.
          >
          >
          > @Hairong
          > I thought about introducing a configuration parameter. But clients or
          > DataNodes want to have timeout for RPCs to DataNodes but no timeout for RPCs
          > to NameNodes. Adding a rpcTimeout parameter makes this easy.
          > I think considering HA, clients and NameNode also requires some timeout.
          > If Active goes down, then clients should not wait in timeouts right?

          I do not know enough about HA to comment about this.

          Uma Maheswara Rao G added a comment -

          Hi Aaron,
          I have seen the similar issue HADOOP-6889.

          That issue introduces rpcTimeout, but the waitForProxy APIs pass that value as 0 (a hardcoded value).

          I am just checking whether I can use rpcTimeout itself for control (since that change is already committed). But in the current code I could not see a way to configure rpcTimeout.

          See https://issues.apache.org/jira/browse/HADOOP-6889

          Aaron T. Myers added a comment -

          > Yes, we may need to control these retries, so that it can break this loop after some number of retries, because continuously getting timeout exceptions can also be considered a sign of some problem in the cluster environment.

          There's a fair amount of code to support automatic retries with configurable policies for IPC calls. Perhaps this could be adapted slightly and reused in the case of failures during data transfer. The relevant code is all in "common/src/java/org/apache/hadoop/(io|ipc)".
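          (A sketch of what reusing that framework could look like; the policy choice and the proxy being wrapped are illustrative, not part of any patch here.)

              import java.util.concurrent.TimeUnit;
              import org.apache.hadoop.io.retry.RetryPolicies;
              import org.apache.hadoop.io.retry.RetryPolicy;
              import org.apache.hadoop.io.retry.RetryProxy;

              // Retry each failed call up to 10 times, sleeping 1 s between tries,
              // instead of looping indefinitely.
              RetryPolicy policy =
                  RetryPolicies.retryUpToMaximumCountWithFixedSleep(10, 1, TimeUnit.SECONDS);

              DatanodeProtocol retrying = (DatanodeProtocol)
                  RetryProxy.create(DatanodeProtocol.class, rawProxy, policy);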

          Uma Maheswara Rao G added a comment -

          Since we are moving towards HA implementations, this issue will create many problems.
          We were observing the same in our HA clusters.

          The actual problem is here:

              public int read(byte[] buf, int off, int len) throws IOException {
                do {
                  try {
                    return super.read(buf, off, len);
                  } catch (SocketTimeoutException e) {
                    handleTimeout(e);
                  }
                } while (true);
              }

          When we unplug the network cable, super.read throws SocketTimeoutException.
          The SocketTimeoutException is handled, and the client tries to send the ping request again.
          So this loop keeps repeating.

          > So I feel we can add some configuration to retry for a specific, configurable interval, as per the need.

          Yes, we may need to control these retries, so that it can break this loop after some number of retries, because continuously getting timeout exceptions can also be considered a sign of some problem in the cluster environment.

          ramkrishna.s.vasudevan added a comment -

          Thanks for checking the defect.

          Yes, these are in the DataNode.

          The same problem will exist for any client that uses ipc.Client.

          So I feel we can add some configuration to retry for a specific, configurable interval, as per the need.

          Steve Loughran added a comment -

          These are all in the DataNode? It is designed to spin forever waiting for the NameNode to come back up. Are you also seeing the problem in other clients?

          ramkrishna.s.vasudevan added a comment -

          The problem that we are facing is:

          If we have to switch the NameNode using some OM, then because the DN does not go down after a few retries, the NameNode switch may happen but this DN still keeps trying to connect to the old NN.

          So we suggest adding a retry mechanism for a specified (configurable) interval, after which an exception is thrown so that the DN goes down.

          ramkrishna.s.vasudevan added a comment -

          Hi, please find the logs below:

          2010-06-06 19:56:45,406 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=DataNode, sessionId=null
          2010-06-06 19:56:45,426 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=DataNode, port=50020
          2010-06-06 19:56:45,428 INFO org.apache.hadoop.ipc.metrics.RpcDetailedMetrics: Initializing RPC Metrics with hostName=DataNode, port=50020
          2010-06-06 19:56:45,433 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: dnRegistration = DatanodeRegistration(linux112:50010, storageID=, infoPort=50075, ipcPort=50020)
          2010-06-06 19:56:45,437 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 50020
          2010-06-06 19:56:47,804 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: New storage id DS-1238806821-10.18.52.112-50010-1275834407685 is assigned to data-node 10.18.52.112:50010
          2010-06-06 19:56:47,805 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.18.52.112:50010, storageID=DS-1238806821-10.18.52.112-50010-1275834407685, infoPort=50075, ipcPort=50020)In DataNode.run, data = FSDataset{dirpath='/home/ramkrishna/opensrchadoop/hadoop-common-0.23.0-SNAPSHOT/hadoop-root/dfs/data/current/finalized'}

          2010-06-06 19:56:47,806 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
          2010-06-06 19:56:47,808 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50020: starting
          2010-06-06 19:56:47,808 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50020: starting
          2010-06-06 19:56:47,809 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50020: starting
          2010-06-06 19:56:47,809 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: using BLOCKREPORT_INTERVAL of 60000msec Initial delay: 0msec
          2010-06-06 19:56:47,810 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 50020: starting
          2010-06-06 19:56:47,839 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 0 blocks took 2 msec to generate and 17 msecs for RPC and NN processing
          2010-06-06 19:56:47,840 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Starting Periodic block scanner.
          2010-06-06 19:57:32,878 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 0 blocks took 0 msec to generate and 4 msecs for RPC and NN processing
          2010-06-06 19:58:32,953 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 0 blocks took 0 msec to generate and 3 msecs for RPC and NN processing
          2010-06-06 20:14:40,742 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Call to /10.18.52.181:9000 failed on local exception: java.io.IOException: No route to host
          at org.apache.hadoop.ipc.Client.wrapException(Client.java:1087)
          at org.apache.hadoop.ipc.Client.call(Client.java:1055)
          at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:251)
          at $Proxy4.sendHeartbeat(Unknown Source)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:933)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1489)
          at java.lang.Thread.run(Thread.java:619)
          Caused by: java.io.IOException: No route to host
          at sun.nio.ch.FileDispatcher.read0(Native Method)
          at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
          at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
          at sun.nio.ch.IOUtil.read(IOUtil.java:206)
          at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
          at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:59)
          at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
          at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:159)
          at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:132)
          at java.io.FilterInputStream.read(FilterInputStream.java:116)
          at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:371)
          at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
          at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
          at java.io.DataInputStream.readInt(DataInputStream.java:370)
          at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:784)
          at org.apache.hadoop.ipc.Client$Connection.run(Client.java:722)

          2010-06-06 20:14:44,748 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 0 time(s).
          2010-06-06 20:14:48,756 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 1 time(s).
          2010-06-06 20:14:52,765 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 2 time(s).
          2010-06-06 20:14:56,773 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 3 time(s).
          2010-06-06 20:15:00,781 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 4 time(s).
          2010-06-06 20:15:04,789 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 5 time(s).
          2010-06-06 20:15:08,798 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 6 time(s).
          2010-06-06 20:15:12,806 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 7 time(s).
          2010-06-06 20:15:16,814 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 8 time(s).
          2010-06-06 20:15:20,822 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 9 time(s).
          2010-06-06 20:15:23,827 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Call to /10.18.52.181:9000 failed on local exception: java.net.NoRouteToHostException: No route to host
          at org.apache.hadoop.ipc.Client.wrapException(Client.java:1087)
          at org.apache.hadoop.ipc.Client.call(Client.java:1055)
          at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:251)
          at $Proxy4.sendHeartbeat(Unknown Source)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:933)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1489)
          at java.lang.Thread.run(Thread.java:619)
          Caused by: java.net.NoRouteToHostException: No route to host
          at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
          at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
          at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
          at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:375)
          at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:440)
          at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:528)
          at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:209)
          at org.apache.hadoop.ipc.Client.getConnection(Client.java:1188)
          at org.apache.hadoop.ipc.Client.call(Client.java:1032)
          ... 5 more

          2010-06-06 20:15:27,835 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 0 time(s).
          2010-06-06 20:15:31,843 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 1 time(s).
          2010-06-06 20:15:35,851 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 2 time(s).
          2010-06-06 20:15:39,860 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 3 time(s).
          2010-06-06 20:15:43,868 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 4 time(s).
          2010-06-06 20:15:47,876 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 5 time(s).
          2010-06-06 20:15:51,884 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 6 time(s).
          2010-06-06 20:15:55,893 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 7 time(s).
          2010-06-06 20:15:59,901 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 8 time(s).
          2010-06-06 20:16:03,909 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 9 time(s).
          2010-06-06 20:16:06,914 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Call to /10.18.52.181:9000 failed on local exception: java.net.NoRouteToHostException: No route to host
          at org.apache.hadoop.ipc.Client.wrapException(Client.java:1087)
          at org.apache.hadoop.ipc.Client.call(Client.java:1055)
          at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:251)
          at $Proxy4.sendHeartbeat(Unknown Source)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:933)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1489)
          at java.lang.Thread.run(Thread.java:619)
          Caused by: java.net.NoRouteToHostException: No route to host
          at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
          at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
          at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
          at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:375)
          at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:440)
          at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:528)
          at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:209)
          at org.apache.hadoop.ipc.Client.getConnection(Client.java:1188)
          at org.apache.hadoop.ipc.Client.call(Client.java:1032)
          ... 5 more

          2010-06-06 20:16:10,922 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 0 time(s).
          2010-06-06 20:16:14,930 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 1 time(s).
          2010-06-06 20:16:18,938 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 2 time(s).
          2010-06-06 20:16:22,946 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 3 time(s).
          2010-06-06 20:16:26,955 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 4 time(s).
          2010-06-06 20:16:30,963 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 5 time(s).
          2010-06-06 20:16:34,971 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 6 time(s).
          2010-06-06 20:16:38,979 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 7 time(s).
          2010-06-06 20:16:42,988 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 8 time(s).
          2010-06-06 20:16:46,996 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 9 time(s).
          2010-06-06 20:16:50,001 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Call to /10.18.52.181:9000 failed on local exception: java.net.NoRouteToHostException: No route to host
          at org.apache.hadoop.ipc.Client.wrapException(Client.java:1087)
          at org.apache.hadoop.ipc.Client.call(Client.java:1055)
          at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:251)
          at $Proxy4.sendHeartbeat(Unknown Source)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:933)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1489)
          at java.lang.Thread.run(Thread.java:619)
          Caused by: java.net.NoRouteToHostException: No route to host
          at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
          at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
          at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
          at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:375)
          at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:440)
          at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:528)
          at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:209)
          at org.apache.hadoop.ipc.Client.getConnection(Client.java:1188)
          at org.apache.hadoop.ipc.Client.call(Client.java:1032)
          ... 5 more

          2010-06-06 20:16:54,008 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 0 time(s).
          2010-06-06 20:16:58,017 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 1 time(s).
          2010-06-06 20:17:02,025 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 2 time(s).
          2010-06-06 20:17:06,033 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 3 time(s).
          2010-06-06 20:17:10,041 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 4 time(s).
          2010-06-06 20:17:14,050 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 5 time(s).
          2010-06-06 20:17:18,058 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 6 time(s).
          2010-06-06 20:17:22,066 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 7 time(s).
          2010-06-06 20:17:26,074 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 8 time(s).
          2010-06-06 20:17:30,083 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 9 time(s).
          2010-06-06 20:17:33,088 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Call to /10.18.52.181:9000 failed on local exception: java.net.NoRouteToHostException: No route to host
          at org.apache.hadoop.ipc.Client.wrapException(Client.java:1087)
          at org.apache.hadoop.ipc.Client.call(Client.java:1055)
          at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:251)
          at $Proxy4.sendHeartbeat(Unknown Source)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:933)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1489)
          at java.lang.Thread.run(Thread.java:619)
          Caused by: java.net.NoRouteToHostException: No route to host
          at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
          at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
          at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
          at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:375)
          at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:440)
          at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:528)
          at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:209)
          at org.apache.hadoop.ipc.Client.getConnection(Client.java:1188)
          at org.apache.hadoop.ipc.Client.call(Client.java:1032)
          ... 5 more

          2010-06-06 20:17:37,095 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 0 time(s).
          2010-06-06 20:17:41,103 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 1 time(s).
          2010-06-06 20:17:45,111 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 2 time(s).
          2010-06-06 20:17:49,120 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 3 time(s).
          2010-06-06 20:17:53,128 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 4 time(s).
          2010-06-06 20:17:57,136 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 5 time(s).
          2010-06-06 20:18:01,144 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.18.52.181:9000. Already tried 6 time(s).
          2010-06-06 20:18:04,163 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_REGISTER
          2010-06-06 20:18:04,178 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 0 blocks took 0 msec to generate and 8 msecs for RPC and NN processing

          Steve Loughran added a comment -
          1. Which version are you seeing this on?
          2. When you say unplugged, do you mean the ethernet port of your local machine came unplugged, or the connection to the remote server failed?
          3. Can you add the stack trace you see in the exceptions, to show where the problem is?

          This is an HDFS problem, so re-assigning it there.

          ramkrishna.s.vasudevan added a comment -

          The problem was: whenever the network was unplugged, the read operation got a timed-out exception and tried again. This continued for almost 15 iterations, and only then did a connection-loss exception come out and let the caller return.
          By that time around 45 minutes had passed.
          Hence we have done something like this: the parameter "max.ping.retries.on.socket.timeout" can be configured to a value after which the client gives up on getting a socket timeout. While retrying, we check against this configured value, and once it is reached we come out.

          This problem comes only in unplug scenarios. So, based on the scenario, this value can be configured according to how long the client should keep trying to get a connection.


            People

            • Assignee: Uma Maheswara Rao G
            • Reporter: Uma Maheswara Rao G
            • Votes: 0
            • Watchers: 10