Details

    • Type: Test
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      The fact that we do not have any large-scale reliability tests bothers me. I'll be the first to admit that it isn't the easiest of tasks, but I'd like to start a discussion around this... especially given that the code-base has grown to the point where the interactions caused by small changes are very hard to predict.

      One of the simple scripts I run for every patch I work on does something very simple: it runs sort500 (or greater), then randomly picks n tasktrackers from ${HADOOP_CONF_DIR}/conf/slaves and kills them; a similar script kills and restarts the tasktrackers.
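      A minimal sketch of that kind of script, assuming passwordless ssh to the slaves, GNU coreutils (shuf) on the client, and that the TaskTracker JVM can be found by its main class; the kill count is illustrative:

        #!/usr/bin/env bash
        # Kill a few randomly chosen tasktrackers while a sort job is running.
        # N_TO_KILL and the slaves path mirror the description above.
        N_TO_KILL=${N_TO_KILL:-3}
        SLAVES_FILE="${HADOOP_CONF_DIR}/conf/slaves"

        # Pick N_TO_KILL random hosts from the slaves file.
        for host in $(shuf -n "$N_TO_KILL" "$SLAVES_FILE"); do
          echo "Killing TaskTracker on $host"
          ssh "$host" 'pkill -9 -f org.apache.hadoop.mapred.TaskTracker'
        done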

      This helps check a fair number of reliability stories: lost tasktrackers, task failures, etc. Clearly this isn't enough to cover everything, but it's a start.

      Let's discuss: what do we do for HDFS? We need more for Map-Reduce!

        Activity

        dhruba borthakur added a comment -

        I would like an error-injection package for HDFS. It should be possible to inject errors into the disk subsystem and the network subsystem. This would simulate disk I/O errors and network partitions, thereby triggering hard-to-test code paths in HDFS.

        Devaraj Das added a comment -

        I have started looking at this. Some thoughts:

        1) Have a script that would launch a couple of large randomwriter/sort/sortvalidator jobs (via command line using Java). Some of these jobs would set speculative execution to true.
        1.1) Some randomwriter jobs would generate 5x the amount of data per map. "Sort" for such data might use a very high value for mapred.min.split.size, leading to large reduce partitions.

        2) Have another script that would query the JobTracker (via another Java program) to get the list of TaskTrackers. It would randomly issue SIGSTOP (via ssh) to a bunch of trackers. After a certain period, the JobTracker would mark these trackers as lost. The script would then send SIGCONT to the same processes, allowing these trackers to rejoin the cluster (see the sketch after this list).

        3) Have a script that gets the task reports from the JobTracker and kills/fails a bunch of random tasks.
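        A rough sketch of step (2), taking the tracker hosts from the slaves file rather than from a JobTracker query; the counts, the wait time, and the pkill pattern are illustrative assumptions:

          #!/usr/bin/env bash
          # Freeze a random subset of tasktrackers with SIGSTOP, wait long enough
          # for the JobTracker to declare them lost, then resume them with SIGCONT.
          N_TRACKERS=${N_TRACKERS:-5}
          LOST_WAIT_SECS=${LOST_WAIT_SECS:-700}   # comfortably past the tracker-expiry interval

          victims=$(shuf -n "$N_TRACKERS" "${HADOOP_CONF_DIR}/conf/slaves")

          for host in $victims; do
            ssh "$host" 'pkill -STOP -f org.apache.hadoop.mapred.TaskTracker'
          done

          sleep "$LOST_WAIT_SECS"   # the JobTracker should now have marked these trackers as lost

          for host in $victims; do
            ssh "$host" 'pkill -CONT -f org.apache.hadoop.mapred.TaskTracker'
          done

        Step (1) could be driven the same way with the stock example jobs (e.g. hadoop jar hadoop-*-examples.jar randomwriter / sort), and step (3) with hadoop job -kill-task / -fail-task on randomly chosen task attempts.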

        (2) and (3) could be done multiple times. The test is to see whether the jobs launched by the first script all complete successfully in the face of such random failures. It would also test the JobTracker's reliability in dealing with a couple of large jobs. Similarly, the map and reduce tasks would be tested for reliability in handling big inputs, and the shuffle would also be stressed (especially due to 1.1).

        Going one step further, a script could grep for exceptions in the generated log files (JobTracker, TTs, and tasks) and archive them on the client machine for someone to look at (some exceptions could be indicators of bugs).
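        A sketch of that harvesting step, assuming the daemon and task logs live under HADOOP_LOG_DIR on each slave; the fallback path and archive layout are illustrative:

          #!/usr/bin/env bash
          # Collect exception traces from each slave's Hadoop logs and archive
          # them on the client machine for later inspection.
          ARCHIVE_DIR="exceptions-$(date +%Y%m%d-%H%M%S)"
          mkdir -p "$ARCHIVE_DIR"

          while read -r host; do
            # -n keeps ssh from swallowing the rest of the slaves list on stdin.
            ssh -n "$host" 'grep -rn "Exception" "${HADOOP_LOG_DIR:-/var/log/hadoop}"' \
              > "$ARCHIVE_DIR/$host.txt" 2>/dev/null
          done < "${HADOOP_CONF_DIR}/conf/slaves"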

        These are some early thoughts I had. Please chime in with suggestions here.

        steve_l added a comment -

        I'm assuming that the goal here is to see how the system handles network, host, and disk failures. A good first step would be to ask: which problems happen most often? And which are traumatic enough to set everyone's pagers off? Those are the ones to care about.

        • disk failures could be mocked with a filesystem which simulates problems: bad data, missing data, even hanging reads and writes.
        • network failures are harder to simulate, as there are so many kinds. DNS failures and exceptions at every single point of an IO operation are all candidates. Perhaps we could have a special mock IPC client that raises these exceptions in test runs.
        • this is the kind of thing that virtualized clusters are good for, but they have odd timing quirks that make you worry about what is going on.
        Devaraj Das added a comment -

        Steve, this issue is more about how the various distributed parts of the system work together in the event of failures and under stress. So yes, while you are right that disk failures can be a part of this test, they can probably be covered fairly easily in unit tests as well. As part of this issue, we could inject a fault that simply deletes a bunch of map output files from some trackers, which would exercise the case where the corresponding maps are killed and re-executed elsewhere. Network failures are partly handled in my proposal, where we STOP/CONT trackers at will: STOP can be seen as faking a network failure, while CONT can be seen as a network recovery.
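        A sketch of that fault, assuming map outputs sit under each tracker's mapred.local.dir as file.out files; the local-dir path and the node count are assumptions:

          #!/usr/bin/env bash
          # Delete map output files on a few random tasktrackers so the
          # corresponding maps have to be re-executed elsewhere.
          N_NODES=${N_NODES:-2}
          MAPRED_LOCAL_DIR=${MAPRED_LOCAL_DIR:-/tmp/hadoop/mapred/local}

          for host in $(shuf -n "$N_NODES" "${HADOOP_CONF_DIR}/conf/slaves"); do
            echo "Deleting map outputs on $host"
            ssh "$host" "find '$MAPRED_LOCAL_DIR' -name 'file.out*' -delete"
          done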

        Sharad Agarwal added a comment -

        Perhaps we can install the error-injection code with each daemon (Datanode/TaskTracker). The code would be triggered by a random function driven by, say, a cluster error-injection ratio = number of nodes to inject errors into / total nodes in the cluster. The default package would inject system-level errors; if required, each daemon could extend it to inject its own, more granular errors.
        This way error generation could be decentralized and controlled via config params, avoiding the need to fetch the slaves list for a cluster and inject everything from a single client.
        The question is: do we need that kind of extensibility, or are we fine with a few error types? Thoughts?
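        A script-level prototype of that trigger, run on each node before its daemon starts; the INJECTION_RATIO variable and the system property it sets are illustrative, and Sharad's proposal would put this logic (and the fault code itself) inside the daemon, driven by a config parameter:

          #!/usr/bin/env bash
          # Each node decides independently, with probability INJECTION_RATIO
          # (= nodes to inject errors into / total nodes), whether to enable injection.
          INJECTION_RATIO=${INJECTION_RATIO:-0.1}

          # $RANDOM is uniform over 0..32767; compare it against ratio * 32768.
          threshold=$(awk -v r="$INJECTION_RATIO" 'BEGIN { printf "%d", r * 32768 }')
          if (( RANDOM < threshold )); then
            # Hypothetical flag: the daemon's injection hooks would check this property.
            export HADOOP_OPTS="$HADOOP_OPTS -Dtest.error.injection=on"
            echo "Error injection enabled on $(hostname)"
          fi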

        steve_l added a comment -

        This is interesting to me; I'm effectively doing some of this in our codebase as we try out some of the lifecycle work, but my tests are still trying to bring up and stress a functional cluster and not yet test how that cluster copes with various failure modes, such as:

        • transient loss of the namenode
        • loss of 10%, 20%, 30%, 50%, 50%+ of the workers, through either outages or network partitioning
        • DNS playing up. Because it will, you know
        • JT and TT failures
        • MR job progress when namenodes start failing
          There is also performance testing.

        Paper to read: http://googletesting.blogspot.com/2008/05/performance-testing-of-distributed-file.html


          People

          • Assignee:
            Devaraj Das
            Reporter:
            Arun C Murthy
          • Votes:
            1
            Watchers:
            11
