  1. Hadoop HDFS
  2. HDFS-7539

Namenode can't leave safemode because of Datanodes' IPC socket timeout


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.5.1
    • Fix Version/s: None
    • Component/s: datanode, namenode
    • Labels: None
    • Environment: 1 master, 1 secondary and 128 slaves; each node has 24 cores and 48 GB memory. The fsimage is 4 GB.

    Description

      During NameNode startup, the DataNodes appear to wait for the NameNode's IPC response in order to register their block pools.

      Here is the DN's log:

       
      2014-12-16 20:28:09,064 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Acknowledging ACTIVE Namenode Block pool BP-877672386-10.114.130.143-1412666752827 (Datanode Uuid 2117395f-e034-4b4a-adec-8a28464f4796) service to NN.x.com/10.x.x143:9000 
      

      But the NameNode is too busy to respond, so the DataNodes hit a socket timeout (the default is 1 minute).

      2014-12-16 20:29:09,857 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService
      java.net.SocketTimeoutException: Call From DN1.x.com/10.x.x.84 to NN.x.com:9000 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.x.x.84:57924 remote=NN.x.com/10.x.x.143:9000]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout 
      

      The same events repeat, and eventually the NN drops most connection attempts from the DNs, so the NN can't leave safemode.

      DN's log:

      2014-12-16 20:32:25,895 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService
      java.io.IOException: failed on local exception java.io.ioexception connection reset by peer
      

      There are no problems with the network, configuration, or servers. I think the NN is simply too busy to respond to the DNs within a minute.

      I set "ipc.ping.interval" to 15 minutes in core-site.xml, and that helped my cluster.

      <property>
        <name>ipc.ping.interval</name>
        <value>900000</value>
      </property>
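
As a quick sanity check on the units, the value above can be read back and converted to minutes. This is only a sketch: the enclosing `<configuration>` element is assumed to match the usual core-site.xml layout, and `read_property_ms` is a hypothetical helper, not part of Hadoop.

```python
import xml.etree.ElementTree as ET

# Assumption: the property sits inside a <configuration> root, as in a
# typical core-site.xml. Only the <property> element comes from the report.
FRAGMENT = """
<configuration>
  <property>
    <name>ipc.ping.interval</name>
    <value>900000</value>
  </property>
</configuration>
"""

def read_property_ms(xml_text, name):
    """Return the integer value of the named property, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return int(prop.findtext("value"))
    return None

interval_ms = read_property_ms(FRAGMENT, "ipc.ping.interval")
interval_min = interval_ms / 60_000  # milliseconds to minutes
print(f"ipc.ping.interval = {interval_ms} ms = {interval_min:g} min")
```

The value is in milliseconds, so 900000 corresponds to the 15 minutes mentioned above.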
      

      In my cluster, the NameNode took 1 to 5 minutes to respond to the DNs' requests.
      It would be helpful if there were a more elegant solution.

      2014-12-16 23:28:16,598 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Acknowledging ACTIVE Namenode Block pool BP-877672386-10.x.x.143-1412666752827 (Datanode Uuid c4f7beea-b8e9-404f-bc81-6e87e37263d2) service to NN/10.x.x.143:9000
      2014-12-16 23:31:32,026 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Sent 1 blockreports 2090961 blocks total. Took 1690 msec to generate and 193738 msecs for RPC and NN processing.  Got back commands org.apache.hadoop.hdfs.server.protocol.FinalizeCommand@20e68e11
      2014-12-16 23:31:32,026 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Got finalize command for block pool BP-877672386-10.x.x.143-1412666752827
      2014-12-16 23:31:32,032 INFO org.apache.hadoop.util.GSet: Computing capacity for map BlockMap
      2014-12-16 23:31:32,032 INFO org.apache.hadoop.util.GSet: VM type       = 64-bit
      2014-12-16 23:31:32,044 INFO org.apache.hadoop.util.GSet: 0.5% max memory 3.6 GB = 18.2 MB
      2014-12-16 23:31:32,045 INFO org.apache.hadoop.util.GSet: capacity      = 2^21 = 2097152 entries
      2014-12-16 23:31:32,046 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Periodic Block Verification Scanner initialized with interval 504 hours for block pool BP-877672386-10.114.130.143-1412666752827
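
To see how far the block report above exceeds the default timeout, the timing fields can be pulled out of the "Sent N blockreports" line. This is a rough sketch: the regular expression is written against the single log line quoted in this report, `parse_blockreport` is a hypothetical helper, and the 60000 ms figure is taken from the SocketTimeoutException message quoted earlier.

```python
import re

# Assumption: 60000 ms is the read timeout reported in the
# SocketTimeoutException quoted earlier in this issue.
IPC_TIMEOUT_MS = 60_000

# Matches the DataNode block report INFO line as it appears in this report.
BLOCKREPORT_RE = re.compile(
    r"Sent (?P<reports>\d+) blockreports (?P<blocks>\d+) blocks total\. "
    r"Took (?P<generate_ms>\d+) msec to generate and "
    r"(?P<rpc_ms>\d+) msecs for RPC and NN processing"
)

def parse_blockreport(line):
    """Return the numeric fields of a block report line, or None if no match."""
    m = BLOCKREPORT_RE.search(line)
    if m is None:
        return None
    return {key: int(value) for key, value in m.groupdict().items()}

line = (
    "2014-12-16 23:31:32,026 INFO "
    "org.apache.hadoop.hdfs.server.datanode.DataNode: "
    "Sent 1 blockreports 2090961 blocks total. "
    "Took 1690 msec to generate and 193738 msecs for RPC and NN processing."
)

report = parse_blockreport(line)
exceeds = report["rpc_ms"] > IPC_TIMEOUT_MS
print(f"RPC and NN processing: {report['rpc_ms'] / 1000:.1f} s, "
      f"exceeds {IPC_TIMEOUT_MS} ms timeout: {exceeds}")
```

For this line the RPC and NameNode processing time is 193738 ms, a little over three minutes, which falls inside the 1 to 5 minute range reported above and is well past a one-minute timeout.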
      

          People

            Assignee: Unassigned
            Reporter: hoelog