Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
0.7.2
None
None
Description
This bug contributed to the crash discussed in HADOOP-572.
ipc.Client tries to establish a connection with its server using an infinite timeout.
For a reason unknown to me, "infinity" amounts to 3 minutes in this case.
I suspect it is configured somewhere in the native socket implementation.
With this timeout, data-nodes had only 3 chances to send heartbeats during the 10-minute
expiration interval, and with a very busy name-node their chances of being accepted
were close to 0.
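A back-of-the-envelope sketch of that arithmetic, assuming (as a simplification, not something stated in this issue) that every failed connection attempt blocks for the full timeout and the data-node retries immediately afterwards. The number of attempts that fit in the expiry window is then the window divided by the timeout, rounded down. The class and method names are hypothetical.

```java
// Hypothetical model: assumes each failed connect attempt blocks for the
// full timeout and the next heartbeat attempt starts immediately after.
public class HeartbeatWindow {

    /** Number of complete connect attempts that fit in the expiry window. */
    static long attemptsWithinWindow(long expiryMillis, long connectTimeoutMillis) {
        if (connectTimeoutMillis <= 0) {
            throw new IllegalArgumentException("timeout must be positive");
        }
        return expiryMillis / connectTimeoutMillis; // integer division rounds down
    }

    public static void main(String[] args) {
        long expiry = 10 * 60 * 1000L;   // 10-minute data-node expiration interval
        long implicit = 3 * 60 * 1000L;  // observed "infinite" connect timeout (~3 min)
        long explicit = 1 * 60 * 1000L;  // explicit 1-minute timeout

        System.out.println(attemptsWithinWindow(expiry, implicit)); // 3
        System.out.println(attemptsWithinWindow(expiry, explicit)); // 10
    }
}
```

Under the same assumptions, a 1-minute timeout raises the number of attempts per expiry window from 3 to 10.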
I added an explicit call to Socket.connect() with a timeout of 1 minute, which is
our default for all connections.
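The actual ipc.Client patch is not reproduced here. This is a minimal standalone sketch of the same idea using only java.net: create an unconnected Socket and call connect(SocketAddress, int) with a bounded timeout. For that method, a timeout of 0 means "infinite", so the call blocks until the OS gives up; a positive value makes it throw SocketTimeoutException instead. The class name ConnectWithTimeout and the constant CONNECT_TIMEOUT_MS are hypothetical.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ConnectWithTimeout {

    // Hypothetical constant: 1-minute connect timeout, in milliseconds.
    static final int CONNECT_TIMEOUT_MS = 60 * 1000;

    /** Connects with a bounded timeout instead of relying on the OS default. */
    static Socket connect(InetSocketAddress address, int timeoutMs) throws IOException {
        Socket socket = new Socket();            // unconnected socket
        try {
            socket.connect(address, timeoutMs);  // throws SocketTimeoutException on expiry
            return socket;
        } catch (IOException e) {
            try {
                socket.close();                  // do not leak the descriptor on failure
            } catch (IOException closeFailure) {
                e.addSuppressed(closeFailure);
            }
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo against a local listener on an ephemeral port.
        try (ServerSocket server = new ServerSocket(0)) {
            InetSocketAddress addr = new InetSocketAddress("127.0.0.1", server.getLocalPort());
            try (Socket s = connect(addr, CONNECT_TIMEOUT_MS)) {
                System.out.println("connected: " + s.isConnected());
            } catch (SocketTimeoutException e) {
                System.out.println("connect timed out after " + CONNECT_TIMEOUT_MS + " ms");
            }
        }
    }
}
```

Bounding the connect call means a caller such as a heartbeat loop regains control after at most the configured timeout and can retry, rather than waiting for whatever limit the native stack imposes.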
Modified a log message to include information that turned out to be useful for debugging.
Removed unnecessary imports.
Attachments
Issue Links
- is related to: HADOOP-572 Chain reaction in a big cluster caused by simultaneous failure of only a few data-nodes. (Closed)