Vinod K V added a comment - 19/May/09 06:50 AM
Attaching patch to fix this issue.
Changes to JobInProgress look fine.
Could you add an assertion to the test case to verify the presence of empty output directories as well? It should fail without the patch and pass with it.
ant test-patch results:
[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 6 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
[exec]
[exec]
[exec]
[exec]
[exec] ======================================================================
[exec] ======================================================================
[exec] Finished build.
[exec] ======================================================================
[exec] ======================================================================
Should the output directories be empty, or should they contain zero-length part files?
Attaching final patch. The previous patch had some problems that caused TestSetupAndCleanupFailure to fail.
This patch fixes those problems. It passed ant test-patch and the core and contrib tests, except the following, which failed or timed out even without this patch and are unrelated to its changes.
Failed:
Timed out:
Forgot to summarize. With this patch,
Setting map progress and reduce progress to 1.0f should not be done in the updateTaskStatus() method; it can be done when setup completes, in completedTask().
Catching NullPointerException in the test case doesn't seem correct; just let it propagate if one occurs.
The patch seems to introduce a new error. The taskdetails.jsp page for job setup and cleanup tasks throws a NullPointerException. This happens for the setup and cleanup tasks of every job. Below is the exception seen on the UI:
java.lang.NullPointerException
at org.apache.hadoop.mapred.TaskInProgress.getSplitNodes(TaskInProgress.java:1034)
at org.apache.hadoop.mapred.taskdetails_jsp._jspService(taskdetails_jsp.java:288)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:4
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12408651/HADOOP-5850-20090520-svn-branch-20.v2.txt against trunk revision 777330.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 11 new or modified tests.
-1 patch. The patch command could not apply the patch.
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/374/console
This message is automatically generated.
changes look fine to me
I just committed this. Thanks, Vinod!
With the above fix, when a job (writing to DFS) with 0 maps and >0 reduces is submitted, the cluster hangs completely.
The TT logs show "INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to <jt>' with reponseId 'ID" indefinitely, and the JT keeps throwing java.io.IOException: java.lang.ArithmeticException. Below is the stack trace:
2009-05-25 08:13:00,124 INFO org.apache.hadoop.ipc.Server: IPC Server handler 37 on <portno>, call heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@14d128c, false, false, true, 3231) from <ip>:<port> error: java.io.IOException:
java.lang.ArithmeticException: / by zero
java.io.IOException: java.lang.ArithmeticException: / by zero
at org.apache.hadoop.mapred.ResourceEstimator.getEstimatedMapOutputSize(ResourceEstimator.java:85)
at org.apache.hadoop.mapred.JobInProgress.findNewMapTask(JobInProgress.java:1729)
at org.apache.hadoop.mapred.JobInProgress.obtainNewMapTask(JobInProgress.java:978)
at org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr.obtainNewTask(CapacityTaskScheduler.java:572)
at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.getTaskFromQueue(CapacityTaskScheduler.java:418)
at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.assignTasks(CapacityTaskScheduler.java:498)
at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.access$500(CapacityTaskScheduler.java:277)
at org.apache.hadoop.mapred.CapacityTaskScheduler.assignTasks(CapacityTaskScheduler.java:977)
at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2605)
at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
In such a case, all jobs hang indefinitely without progressing, and the cluster is completely down. This situation occurs if the scheduler is invoked and calls job.obtainNewMapTask() while the job-cleanup task of this job is still running. Discussed this with Devaraj, who concurs that obtainNewMapTask()/obtainNewReduceTask() should return immediately, doing nothing, when job-cleanup is running. Will address this in a new JIRA.
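The early-return idea discussed above might look roughly like the following sketch. It is not Hadoop code: JobSketch, launchCleanup(), and estimatedOutputPerMap() are hypothetical names. The trace only shows that a division by zero occurs inside getEstimatedMapOutputSize; that the divisor is the number of map tasks is an assumption made here for illustration.

```java
import java.util.OptionalInt;

/** Hypothetical, simplified stand-in for a job; NOT Hadoop's JobInProgress. */
final class JobSketch {
    private final int numMapTasks;
    private int launchedMaps = 0;
    private boolean cleanupLaunched = false;

    JobSketch(int numMapTasks) {
        this.numMapTasks = numMapTasks;
    }

    /** Marks the job-cleanup task as started. */
    void launchCleanup() {
        cleanupLaunched = true;
    }

    /**
     * Assumed estimate: expected total map output spread evenly over the maps.
     * With zero maps an unguarded division would throw ArithmeticException.
     */
    long estimatedOutputPerMap(long totalEstimatedOutput) {
        if (numMapTasks == 0) {
            return 0L; // nothing to estimate for a job without maps
        }
        return totalEstimatedOutput / numMapTasks;
    }

    /**
     * Returns the index of the next map task to run, or empty if none should
     * be scheduled. Returns immediately once job-cleanup has been launched.
     */
    OptionalInt obtainNewMapTask() {
        if (cleanupLaunched || launchedMaps >= numMapTasks) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(launchedMaps++);
    }
}
```

With this guard, a scheduler that polls a 0-map job during its cleanup gets an empty result instead of triggering the estimator, so the heartbeat call does not fail.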