Issue Details (XML | Word | Printable)

Key: HADOOP-5850
Type: Bug Bug
Status: Resolved Resolved
Resolution: Fixed
Priority: Critical Critical
Assignee: Vinod K V
Reporter: Owen O'Malley
Votes: 1
Watchers: 8
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

map/reduce doesn't run jobs with 0 maps

Created: 15/May/09 04:59 PM   Updated: 08/Jul/09 04:53 PM
Return to search
Component/s: None
Affects Version/s: 0.20.0
Fix Version/s: 0.20.1

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works HADOOP-5850-20090519.1.txt 2009-05-19 10:38 AM Vinod K V 14 kB
Text File Licensed for inclusion in ASF works HADOOP-5850-20090519.txt 2009-05-19 06:50 AM Vinod K V 14 kB
Text File Licensed for inclusion in ASF works HADOOP-5850-20090520-svn-branch-20.v2.txt 2009-05-21 03:27 AM Vinod K V 18 kB
Text File Licensed for inclusion in ASF works HADOOP-5850-20090520-svn.1.txt 2009-05-20 08:14 PM Vinod K V 18 kB
Text File Licensed for inclusion in ASF works HADOOP-5850-20090522-branch-20-final.txt 2009-05-22 03:09 PM Vinod K V 20 kB
Text File Licensed for inclusion in ASF works HADOOP-5850-20090522.txt 2009-05-22 11:27 AM Vinod K V 21 kB
Image Attachments:

1. screenshot-1.jpg
(174 kB)

Hadoop Flags: Reviewed
Resolution Date: 22/May/09 03:29 PM


 Description  « Hide
Currently, the framework ignores jobs that have 0 maps. This is incorrect. Many pipelines need the job to run (if nothing else, to create the output directory!) so that subsequent jobs don't fail. Effectively, there will be no map tasks and the reduce tasks should immediately set up the Reducer and RecordWriter and then call close on both since there are no inputs to the reduce. I believe it should just work if we remove the check...

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Vinod K V added a comment - 19/May/09 06:50 AM
Attaching patch to fix this issue.
  • With this patch, jobs will 0 maps or no input still run JobSetUp, any number of reduces( which do nothing), and the JobCleanUp task.
  • Removed the 0-splits check in JobInProgress.initTasks() and added checks so that cleanup task doesn't launch before setup tasks when number of splits is zero.
  • Renamed TestEmptyJobWithDFS to TestEmptyJob, removed HDFS dependence to quicken the test, added checks for verifying the number of map and reduce tasks run for an empty-job.

Jothi Padmanabhan added a comment - 19/May/09 09:37 AM
Changes to JobInProgress looks fine.
Could you add an assertion to the test case to verify presence of empty output directories as well? That should fail without the patch and should pass with the patch.

Vinod K V added a comment - 19/May/09 10:38 AM
Attaching new patch with the suggested changes. The test now fails without the core changes and succeeds with.

Vinod K V added a comment - 19/May/09 11:03 AM
ant test-patch results:
[exec] +1 overall.
     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
     [exec]
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
     [exec]
     [exec]
     [exec]
     [exec]
     [exec] ======================================================================
     [exec] ======================================================================
     [exec]     Finished build.
     [exec] ======================================================================
     [exec] ======================================================================

Milind Bhandarkar added a comment - 19/May/09 10:43 PM
Should the output directories be empty or should they produce zero-length part files ?

Vinod K V added a comment - 20/May/09 07:35 AM

Should the output directories be empty or should they produce zero-length part files ?

The job will run N number of reduces that the user intends and they should produce zero-length part files. The patch does the same.


Tom White added a comment - 20/May/09 09:00 AM
You should be able to use LazyOutputFormat if you wanted no part files to be produced.

Vinod K V added a comment - 20/May/09 08:14 PM
Attaching final patch. The previous patch had some problems because of which TestSetupAndCleanupFailure failed.

This patch fixes those problems. It passed ant test-patch, core and contrib tests except the following which failed/timeout even without this patch and are unrelated to the changes in this patch.

Failed:

  • org.apache.hadoop.streaming.TestMultipleCachefiles
  • org.apache.hadoop.streaming.TestStreamingBadRecords
  • org.apache.hadoop.streaming.TestSymLink

Timedout:

  • org.apache.hadoop.mapred.TestJobInProgressListener FAILED (timeout)
  • org.apache.hadoop.mapred.TestQueueCapacities FAILED (timeout)

Vinod K V added a comment - 20/May/09 08:18 PM
Forgot to summarize. With this patch,
  • Jobs with zero maps will still run job setup and cleanup tasks.
  • If number of reduces is non-zero, reduces are run leaving behind the corresponding number of empty part-files in the output directory.
  • If the number of reduces is also zero, an empty output directory is left behind.
  • The map progress(and reduce progress if number of reduces is zero) is set to 1.0 once the job cleanup task finishes.

Vinod K V added a comment - 21/May/09 03:27 AM
Patch for branch 0.20.

Amareshwari Sriramadasu added a comment - 21/May/09 08:57 AM
Setting Map Progress and reduce progress to 1.0f should not be done in updateTaskStatus() method, it can be done when setup completes in completedTask().
Catching NullPointerException in testcase doesn't seem correct, just throw it out if there is any.

Ramya R added a comment - 21/May/09 10:57 AM
The patch seems to introduce a new error. The taskdetails.jsp page of job setup and cleanup tasks throws a NullPointerException. This is the case with all the jobs' setup and cleanup tasks. Below is the Exception seen on the UI:
java.lang.NullPointerException
at org.apache.hadoop.mapred.TaskInProgress.getSplitNodes(TaskInProgress.java:1034)
at org.apache.hadoop.mapred.taskdetails_jsp._jspService(taskdetails_jsp.java:288)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:4

Hadoop QA added a comment - 22/May/09 12:52 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12408651/HADOOP-5850-20090520-svn-branch-20.v2.txt
against trunk revision 777330.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 11 new or modified tests.

-1 patch. The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/374/console

This message is automatically generated.


Ramya R added a comment - 22/May/09 03:52 AM
Had offline discussion with Nigel. Attaching a screenshot for the Exception thrown in taskdetails.jsp page.

Vinod K V added a comment - 22/May/09 04:11 AM
Thank you Ramya for finding out a problem with the patch! I am working on fixing this and uploading a new patch.

Vinod K V added a comment - 22/May/09 11:27 AM
Attaching patch incorporating the review comments and fixing the issues.

Amareshwari Sriramadasu added a comment - 22/May/09 11:40 AM
changes look fine to me

Vinod K V added a comment - 22/May/09 02:50 PM
ant test-patch and run-test-mapred targets passed with the patch.

Vinod K V added a comment - 22/May/09 03:09 PM
Patch for branch-20.

Devaraj Das added a comment - 22/May/09 03:29 PM
I just committed this. Thanks, Vinod!

Ramya R added a comment - 25/May/09 10:43 AM
With the above fix, when a job (writing to DFS) with 0 maps and >0 reduces is submitted, the cluster hangs completely.
The TT logs show "INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to <jt>' with reponseId 'ID" infinitely and the JT throws java.io.IOException: java.lang.ArithmeticException forever.
Below is the stacktrace:
 
2009-05-25 08:13:00,124 INFO org.apache.hadoop.ipc.Server: IPC Server handler 37 on <portno>, call heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@14d128c, false, false, true, 3231) from <ip>:<port> error: java.io.IOException:
 java.lang.ArithmeticException: / by zero
java.io.IOException: java.lang.ArithmeticException: / by zero
        at org.apache.hadoop.mapred.ResourceEstimator.getEstimatedMapOutputSize(ResourceEstimator.java:85)
        at org.apache.hadoop.mapred.JobInProgress.findNewMapTask(JobInProgress.java:1729)
        at org.apache.hadoop.mapred.JobInProgress.obtainNewMapTask(JobInProgress.java:978)
        at org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr.obtainNewTask(CapacityTaskScheduler.java:572)
        at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.getTaskFromQueue(CapacityTaskScheduler.java:418)
        at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.assignTasks(CapacityTaskScheduler.java:498)
        at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.access$500(CapacityTaskScheduler.java:277)
        at org.apache.hadoop.mapred.CapacityTaskScheduler.assignTasks(CapacityTaskScheduler.java:977)
        at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2605)
        at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

In such a case, all the jobs hang infinitely without progressing and the cluster is completely down.
This problem is solved only when the no-map job is killed. Once the job is killed the cluster is back running and the other jobs proceed smoothly.


Vinod K V added a comment - 25/May/09 11:02 AM
This situation occurs if scheduler is invoked and it calls job.obtainNewMapTask() while the job-clean-up task of this job is still running. Discussed with this Devaraj who concurs that obtainNewMapTask()/obtainNewReduceTask() should return immediately, doing nothing, when job-cleanup is running. Will address this in a new JIRA.

Vinod K V added a comment - 25/May/09 12:05 PM

[...] Will address this in a new JIRA.

HADOOP-5908