
Key: MAPREDUCE-382
Type: Sub-task
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Devaraj Das
Reporter: Devaraj Das
Votes: 0
Watchers: 2
Hadoop Map/Reduce
MAPREDUCE-378

Create a test that would inject random failures for tasks in large jobs and would also inject TaskTracker failures

Created: 24/Dec/08 06:47 AM   Updated: 20/Jun/09 07:51 AM
Component/s: None
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments (text files, licensed for inclusion in ASF works):
  File          Date                 Author       Size
  4939.1.patch  2009-01-02 07:19 PM  Devaraj Das  27 kB
  4939.2.patch  2009-01-19 07:18 PM  Devaraj Das  29 kB
  4939.patch    2008-12-29 02:39 PM  Devaraj Das  23 kB

Resolution Date: 22/Jan/09 09:33 PM


Description:
Create a test that would inject random failures for tasks in large jobs and would also inject TaskTracker failures

Devaraj Das added a comment - 29/Dec/08 02:39 PM
Attaching a patch for review. I am still tuning some parameters in the tests but the patch can be reviewed as such.

Arun C Murthy added a comment - 31/Dec/08 03:21 AM
I have only started looking through this. A nit: we shouldn't use 'ps | grep | ...' etc. for suspend/resume. Rather, we should use the pid file to get the daemon's pid.

Devaraj Das added a comment - 31/Dec/08 05:35 AM
Arun, the PID file is used if it exists. The 'ps | grep | etc' path is there to handle the case where the test is run using HOD (as is normally the case at Yahoo!). In that case, the PID file is not written. Also, note that stopping/resuming daemons through 'ps | grep | etc' will stop/resume only those processes that were launched by the user running the test.

Vinod K V added a comment - 31/Dec/08 02:19 PM
At a minimum, we should do our best to ensure that we are sending signals to the right process. For this, we might want to grep the process list for "java" AND the full class name, instead of just searching for the daemons' names. We are limiting the search to one user anyway, and we are only stopping and continuing the processes, not destroying them.
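
For illustration only, a minimal shell sketch of the lookup discussed in this thread: prefer the pid file when it is readable, otherwise fall back to the current user's process list and require both a `java` executable and the full class name. The function name `find_daemon_pid` and its two arguments are hypothetical and are not taken from the patch; the sketch also assumes a POSIX-style `ps` (`-u`, `-o pid=`, `-o args=`), which may not hold on cygwin.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the patch): print the PID(s) of a daemon.
#   $1 = path to the daemon's pid file (may not exist, e.g. under HOD)
#   $2 = fully qualified main class of the daemon
find_daemon_pid() {
  local pid_file="$1" class_name="$2"

  # Preferred path: trust the pid file when it is readable.
  if [ -r "$pid_file" ]; then
    cat "$pid_file"
    return 0
  fi

  # Fallback: list only the current user's processes, then keep lines whose
  # executable is "java" (bare or with a path) and whose command line
  # contains the full class name. Matching on the executable field keeps the
  # awk process itself out of the result.
  ps -u "$(id -un)" -o pid= -o args= |
    awk -v cls="$class_name" '$2 ~ /(^|\/)java$/ && index($0, cls) { print $1 }'
}
```

A caller would pass the pid file location and a class name such as org.apache.hadoop.mapred.TaskTracker; an empty result means no matching process was found and no signal should be sent.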

I went through the patch; a few code comments:

  • runSleepJobTest, runRandomWriterTest and runSortTest share much code and can be refactored.
  • If we can somehow get TASKTRACKER_EXPIRY_INTERVAL from the JT, either via a public API or maybe via clusterStatus, it would make the tests better than just relying on user input.
  • In KillTrackerThread.{startTaskTrackers|stopTaskTrackers}, the output of the shellCommand is currently discarded. That output, along with the return code, would give more information about whether the signal was sent successfully.
  • If the configuration is not set up (mapred-site.xml), the local job runner is used and the test fails with little error reporting. I think we should at least check for the JT configuration.
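
As a rough sketch of the point about keeping the command output and return code: the patch does this from Java through shellCommand, so the shell below is a stand-in technique, not the patch's code. The function name `send_signal` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the patch): send a signal to a process and
# report both the exit status and any output of the kill command.
#   $1 = signal name without the SIG prefix (e.g. STOP, CONT)
#   $2 = target pid
send_signal() {
  local sig="$1" pid="$2" out rc

  # Capture stdout and stderr; assign separately from "local" so that $?
  # reflects kill's exit status rather than the status of "local".
  out=$(kill -s "$sig" "$pid" 2>&1)
  rc=$?

  if [ "$rc" -ne 0 ]; then
    echo "kill -s $sig $pid failed (exit $rc): $out" >&2
  else
    echo "sent SIG$sig to $pid"
  fi
  return "$rc"
}
```

Returning the status lets the caller distinguish a tracker that was actually suspended from one that was never found, instead of assuming the signal was delivered.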

Devaraj Das added a comment - 02/Jan/09 07:19 PM
Attached patch with the concerns addressed

Vinod K V added a comment - 07/Jan/09 09:42 AM
A few comments:
  • The patch breaks compilation because of the change to the ClusterStatus constructor:
    • src/mapred/org/apache/hadoop/mapred/LocalJobRunner.java +389
    • src/test/org/apache/hadoop/mapred/TestJobQueueTaskScheduler.java:138
  • When SleepJob (or rather, the examples) is not on the path, the test fails, but with the following output:
    JOB org.apache.hadoop.examples.SleepJob failed to run
                Waiting for the job org.apache.hadoop.examples.SleepJob to start

    We should avoid printing the last line, if we can.

  • We could report job progress periodically while the tests run. Currently the test stays silent until the progress reaches the threshold values.
  • I think writing statements to a LOG is better than printing to standard output.
  • With a HOD allocation, the test case that simulates lost TaskTrackers fails even though keys are set up. This is because hadoop-daemons.sh tries to change to the nonexistent directory HADOOP_HOME on the remote nodes:
    exec "$bin/slaves.sh" --config $HADOOP_CONF_DIR cd "$HADOOP_HOME" \; "$bin/hadoop-daemon.sh" --config $HADOOP_CONF_DIR "$@"

    A simple solution would be to discard all changes to hadoop-daemon.sh and hadoop-daemons.sh and simply use slaves.sh as follows:

    HOSTLIST=conf/_reliability_test_slaves_file_ ./bin/slaves.sh ls
  • The -ww flag to ps (ps auxw -ww) is not available on cygwin. It only modifies screen output and can be avoided. A side nit: SIGCONT doesn't seem to work on cygwin, which would make the lost-TaskTracker simulation test completely useless there.
  • The randomness of the failures in the tests is rather peculiar, though admittedly it can be changed later if need be.

Vinod K V added a comment - 07/Jan/09 12:15 PM

We can report progress of the jobs every once in a while when running the tests. Now it just stays dumb till the progress reaches the threshold values.

Sorry about that; the log4j jar was missing from my classpath. The test actually DOES print the job progress.


Devaraj Das added a comment - 19/Jan/09 07:18 PM
Attached is the updated patch.

Vinod K V added a comment - 21/Jan/09 08:36 AM
+1 for the patch.

Devaraj Das added a comment - 22/Jan/09 09:33 PM
I just committed this.

Robert Chansler added a comment - 03/Mar/09 07:19 PM
Edit release note for publication; tests not user-facing.