[HADOOP-572] Chain reaction in a big cluster caused by simultaneous failure of only a few data-nodes. - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.6.2
Fix Version/s: 0.8.0
Component/s: None
Labels: None
Environment: Large DFS cluster

Description

I've observed a cluster crash caused by the simultaneous failure of only 3 data-nodes.
The crash is reproducible, but reproducing it requires a rather large cluster.
To simplify the calculations I'll consider a 600-node cluster as an example.
The cluster should also contain a substantial amount of data:
we need at least 3 data-nodes containing 10,000+ blocks each.
Now suppose that these 3 data-nodes fail at the same time, and the name-node
starts replicating all the missing blocks that belonged to those nodes.
Based on experimental data, the name-node can replicate 50 blocks per second on average.
This means that replicating all 30,000+ blocks will take more than 10 minutes,
which is the heartbeat expiration interval.
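
The estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the variable names are illustrative, and the block count uses the 10,000-per-node lower bound, so the real replication time would be longer.

```python
# Back-of-the-envelope check of the replication time estimate.
# All names are illustrative; the numbers are the ones quoted in the description.
FAILED_NODES = 3
BLOCKS_PER_NODE = 10_000          # lower bound ("10,000+ blocks each")
REPLICATION_RATE = 50             # blocks per second, experimental average
HEARTBEAT_EXPIRY_S = 10 * 60      # heartbeat expiration interval, in seconds

blocks_to_replicate = FAILED_NODES * BLOCKS_PER_NODE
replication_time_s = blocks_to_replicate / REPLICATION_RATE

print(f"blocks to replicate: {blocks_to_replicate}")
print(f"replication time: {replication_time_s / 60:.1f} min")
print(f"reaches the heartbeat expiry window: {replication_time_s >= HEARTBEAT_EXPIRY_S}")
```

Even at the lower bound, replication keeps the name-node busy for the whole expiration window.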

With the 3 second heartbeat interval there are 600 / 3 = 200 heartbeats hitting the name-node every second.
Under heavy replication load the name-node accepts about 50 heartbeats per second.
So up to 150 of every 200 heartbeats, that is 3/4 of them, remain unserved.

Each node SHOULD send 200 heartbeats during the 10 minute interval (600 s / 3 s), and each time
the probability of the heartbeat being unserved is 3/4 or less.
So the probability that all 200 heartbeats fail is (3/4) ** 200, which is 0 from a practical standpoint.
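
As a quick check of the ideal case, here is a small sketch of the same calculation. It assumes, as the estimate above implicitly does, that each heartbeat fails independently with the same probability; the variable names are illustrative.

```python
# Ideal case: every heartbeat attempt, served or not, takes one 3 second slot.
# Assumption: heartbeat outcomes are independent with a fixed failure probability.
NODES = 600
HEARTBEAT_INTERVAL_S = 3
SERVED_PER_S = 50                 # heartbeats accepted per second under load
EXPIRY_S = 10 * 60                # heartbeat expiration interval, in seconds

arriving_per_s = NODES / HEARTBEAT_INTERVAL_S            # 200 heartbeats/s
unserved_fraction = 1 - SERVED_PER_S / arriving_per_s    # 3/4
heartbeats_per_node = EXPIRY_S // HEARTBEAT_INTERVAL_S   # 200 attempts per node

p_all_fail = unserved_fraction ** heartbeats_per_node

print(f"unserved fraction: {unserved_fraction}")
print(f"P(all {heartbeats_per_node} heartbeats fail): {p_all_fail:.3g}")
```

The result is on the order of 1e-25, so in the ideal case no live node should ever be declared dead.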

IN FACT, since the current implementation sets the rpc timeout to 1 minute, a failed heartbeat takes
1 minute and 8 seconds to complete, and under these circumstances each data-node can send only
9 heartbeats during the 10 minute interval. Thus the probability that all 9 of them fail is
(3/4) ** 9 = 0.075, which means that we will lose 45 nodes out of 600 by the end of the 10 minute interval.
From this point on the name-node will be constantly replicating blocks and losing more nodes, and
it becomes effectively dysfunctional.
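
The numbers for the actual behaviour can be reproduced the same way. This is a sketch with illustrative names: it rounds 600 / 68 up to 9 attempts, as the estimate above does, and assumes independent failures with probability 3/4.

```python
import math

# Actual case: a failed heartbeat blocks the data-node for 68 seconds.
# Assumption: each attempt fails independently with probability 3/4.
NODES = 600
EXPIRY_S = 10 * 60                # heartbeat expiration interval, in seconds
FAILED_HEARTBEAT_S = 68           # 1 minute rpc timeout plus 8 seconds
P_UNSERVED = 0.75

# 600 / 68 is about 8.8, so a node can start at most 9 attempts in the window.
attempts = math.ceil(EXPIRY_S / FAILED_HEARTBEAT_S)
p_all_fail = P_UNSERVED ** attempts
expected_lost = NODES * p_all_fail

print(f"attempts per node: {attempts}")
print(f"P(all attempts fail): {p_all_fail:.3f}")
print(f"expected nodes lost: {expected_lost:.0f}")
```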

A map-reduce framework running on top of it makes things deteriorate even faster, because failing
tasks and jobs try to remove files and re-create them, increasing the overall load on
the name-node.

I see at least 2 problems that contribute to the chain reaction described above.
1. A heartbeat failure takes too long (1 minute 8 seconds).
2. Name-node synchronized operations are too coarse; they should be fine-grained.
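
To illustrate why the first problem matters, here is a rough sketch of how the expected number of lost nodes depends on how long a failed heartbeat takes. The function `expected_lost_nodes` and the sample durations are hypothetical. The model keeps the 3/4 failure probability fixed, which is a simplification: shorter timeouts also mean more retries hitting the name-node.

```python
import math

def expected_lost_nodes(failed_heartbeat_s, nodes=600, expiry_s=600, p_unserved=0.75):
    """Expected number of live nodes declared dead within one expiry window.

    Hypothetical model: every attempt fails independently with p_unserved,
    and a failed attempt occupies the node for failed_heartbeat_s seconds.
    """
    attempts = math.ceil(expiry_s / failed_heartbeat_s)
    return nodes * p_unserved ** attempts

# 68 s is the figure from this report; the other durations are made-up samples.
for duration_s in (68, 30, 10, 3):
    lost = expected_lost_nodes(duration_s)
    print(f"{duration_s:>3} s per failed heartbeat -> {lost:.3g} nodes expected lost")
```

Under these assumptions the expected loss falls off quickly as a failed heartbeat gets shorter, since every extra attempt multiplies the failure probability by 3/4.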

Issue Links

is related to

HADOOP-255 Client Calls are not cancelled after a call timeout (Closed)

HADOOP-725 chooseTargets method in FSNamesystem is very inefficient (Closed)

relates to

HADOOP-641 Name-node should demand a block report from resurrected data-nodes. (Closed)

HADOOP-642 Explicit timeout for ipc.Client (Closed)

People

Assignee: Sameer Paranjpye

Reporter: Konstantin Shvachko

Dates

Created: 03/Oct/06 01:55

Updated: 08/Jul/09 16:42

Resolved: 03/Jan/07 21:40