[HDFS-7539] Namenode can't leave safemode because of Datanodes' IPC socket timeout - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.5.1
Fix Version/s: None
Component/s: datanode, namenode
Labels: None
Environment:

1 master, 1 secondary, and 128 slaves; each node has 24 cores and 48 GB of memory. The fsimage is 4 GB.

Description

While the namenode is starting, the datanodes appear to wait for the namenode's IPC response in order to register their block pools.

Here is the DN's log:

 
2014-12-16 20:28:09,064 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Acknowledging ACTIVE Namenode Block pool BP-877672386-10.114.130.143-1412666752827 (Datanode Uuid 2117395f-e034-4b4a-adec-8a28464f4796) service to NN.x.com/10.x.x143:9000

But the namenode is too busy to respond, so the datanodes hit a socket timeout (the default is 1 minute).

2014-12-16 20:29:09,857 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService
java.net.SocketTimeoutException: Call From DN1.x.com/10.x.x.84 to NN.x.com:9000 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.x.x.84:57924 remote=NN.x.com/10.x.x.143:9000]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout

The same events repeat, and eventually the NN drops most connection attempts from the DNs, so the NN can't leave safemode.

DN's log:

2014-12-16 20:32:25,895 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService
java.io.IOException: failed on local exception java.io.ioexception connection reset by peer

There is no trouble with the network, configuration, or servers. I think the NN is simply too busy to respond to the DNs within a minute.

I set "ipc.ping.interval" to 15 minutes in core-site.xml, and that helped on my cluster.

<property>
  <name>ipc.ping.interval</name>
  <value>900000</value>
</property>
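
The value is given in milliseconds, so 15 minutes is 15 x 60 x 1000 = 900000. As a rough sketch only (the helper name `ping_interval_property` is made up for illustration and is not part of Hadoop), the fragment above can be generated and the arithmetic checked like this:

```python
import xml.etree.ElementTree as ET


def ping_interval_property(minutes):
    """Build a core-site.xml <property> element for ipc.ping.interval.

    Hypothetical helper for illustration; it only converts minutes to
    milliseconds and wraps the result in the usual name/value pair.
    """
    millis = int(minutes * 60 * 1000)
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "ipc.ping.interval"
    ET.SubElement(prop, "value").text = str(millis)
    return prop


prop = ping_interval_property(15)
# Serialize the element so it can be pasted into core-site.xml.
print(ET.tostring(prop, encoding="unicode"))
```

The printed element carries the same name and value as the hand-written property; a different interval only needs a different argument.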

In my cluster, the namenode took between 1 and 5 minutes to respond to the DNs' requests.
It would be helpful if there were a more elegant solution.

2014-12-16 23:28:16,598 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Acknowledging ACTIVE Namenode Block pool BP-877672386-10.x.x.143-1412666752827 (Datanode Uuid c4f7beea-b8e9-404f-bc81-6e87e37263d2) service to NN/10.x.x.143:9000
2014-12-16 23:31:32,026 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Sent 1 blockreports 2090961 blocks total. Took 1690 msec to generate and 193738 msecs for RPC and NN processing.  Got back commands org.apache.hadoop.hdfs.server.protocol.FinalizeCommand@20e68e11
2014-12-16 23:31:32,026 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Got finalize command for block pool BP-877672386-10.x.x.143-1412666752827
2014-12-16 23:31:32,032 INFO org.apache.hadoop.util.GSet: Computing capacity for map BlockMap
2014-12-16 23:31:32,032 INFO org.apache.hadoop.util.GSet: VM type       = 64-bit
2014-12-16 23:31:32,044 INFO org.apache.hadoop.util.GSet: 0.5% max memory 3.6 GB = 18.2 MB
2014-12-16 23:31:32,045 INFO org.apache.hadoop.util.GSet: capacity      = 2^21 = 2097152 entries
2014-12-16 23:31:32,046 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Periodic Block Verification Scanner initialized with interval 504 hours for block pool BP-877672386-10.114.130.143-1412666752827
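
The block report line in the log above reports 193738 msecs for RPC and NN processing, which is a little over 3 minutes and well past the 60000 ms timeout from the SocketTimeoutException. A minimal sketch for pulling that number out of a DN log line and comparing it with the timeout follows; the regular expression is written against the message text shown above, and the names `REPORT_RE`, `rpc_time_ms`, and `DEFAULT_TIMEOUT_MS` are assumptions for illustration only.

```python
import re

# 60000 ms is the timeout reported in the SocketTimeoutException above.
DEFAULT_TIMEOUT_MS = 60000

# Matches the timing part of the DataNode blockreport message.
REPORT_RE = re.compile(
    r"Took (\d+) msec to generate and (\d+) msecs for RPC and NN processing"
)


def rpc_time_ms(log_line):
    """Return the RPC + NN processing time in ms, or None if the line does not match."""
    match = REPORT_RE.search(log_line)
    return int(match.group(2)) if match else None


line = (
    "2014-12-16 23:31:32,026 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: "
    "Sent 1 blockreports 2090961 blocks total. Took 1690 msec to generate and "
    "193738 msecs for RPC and NN processing."
)
elapsed = rpc_time_ms(line)
# Report whether this call would have outlived the default timeout.
print(elapsed, elapsed > DEFAULT_TIMEOUT_MS)
```

Running the same check over all DN logs would show how many block reports exceed the default interval during NN startup.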

People

Assignee: Unassigned

Reporter: hoelog

Votes: 0

Watchers: 4

Dates

Created: 17/Dec/14 06:48

Updated: 22/Dec/14 22:12