
Key: HADOOP-2778
Type: New Feature
Status: Open
Priority: Major
Assignee: Srikanth Kakani
Reporter: Srikanth Kakani
Votes: 0
Watchers: 1
Hadoop Common

Hadoop job submission via ant using HMRExec

Created: 04/Feb/08 09:23 PM   Updated: 22/May/09 11:26 AM
Component/s: None
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments (text files, licensed for inclusion in ASF works):
  2778-1.patch          2008-07-29 11:08 PM  Chris Douglas    36 kB
  hadoop-hmrexec.patch  2008-02-04 09:25 PM  Srikanth Kakani  37 kB
Environment: Submit/monitor hadoop map-reduce jobs via ant
Issue Links:
Reference
 


Description
Patch attached; please check.

Srikanth Kakani added a comment - 04/Feb/08 09:25 PM
Initial patch for HMRExec

Srikanth Kakani made changes - 04/Feb/08 09:25 PM
Field Original Value New Value
Attachment hadoop-hmrexec.patch [ 12374706 ]
Nigel Daley added a comment - 04/Feb/08 09:31 PM
Is this a duplicate (or very similar) to HADOOP-435?

Mahadev konar added a comment - 04/Feb/08 09:40 PM
HADOOP-435 has more to do with installing and running hadoop daemons with just a single jar file installed. This one is different and has more to do with monitoring of map reduce jobs.

Chris Douglas added a comment - 29/Jul/08 11:08 PM
Unfortunately, this no longer compiles against trunk. HADOOP-2818 removed the deprecated Counter.Group::getDisplayName(String) and Counter.Group::getCounterNames() and HADOOP-3162 changed the way input and output paths are specified in 0.17. Also, RunningJob::getJobID(), JobClient::getJob(String), and FileSystem::delete(Path) have been deprecated in 0.18.

The attached patch updates the counter code, removes the deprecation warnings, and adjusts some of the spacing to be closer to the coding standards. Anything bringing it closer to those standards (spacing around operators, etc.) would be appreciated.

Other comments:

  • The javadoc explaining how to use this should probably be attached to the class, rather than the private HMRExec::getPropertiesFile. Surrounding the xml in {@code ... } will, with luck, avoid the requirement to escape all the reserved chars.
  • Some of the code is commented out, other parts disabled (e.g. line 884). If any feature is only partially supported/implemented, it should probably be removed (e.g. getTaskLogs)
  • Speaking of the retry code, what are the rules for this? It looks like a job will be re-executed if a parent was re-executed, but with different rules for the status files. With the retry failure logic disabled, is this distinction necessary?
  • Any specification or documentation on this would be invaluable. A testcase would be difficult to write, but is there an example or some other way this can be validated to ensure it's kept up to date? A reference job would also be very helpful to prospective users (and reviewers).
  • doPostSubmitStuff could use RunningJob::waitForCompletion() instead of reimplementing it, but I couldn't find any callers in the framework so I don't know its status. Neither should be swallowing the InterruptedException...
  • wasParentRerun returns true if any parent completed after this job began, but false if it's the first time this job was run. Is that really the only case where DateFormat::parse can throw? Could the check for exceptions from the parent tasks be separated from this logic, so the +100 years tweak isn't necessary?
  • loadProperties looks like it's being used for a number of checks incidental to its purpose, and each use of it appears to rely on a subset of what it does. For example, registerJob ignores its return value completely, though the call is still relevant because it could generate a log message; similarly, checks for null return values are inconsistently enforced, and exceptions are used in its control logic. It's difficult to tease out exactly what role it plays. Is it possible to refactor this section a bit?
  • The number of maps/reduces failed or killed is set to 0% if the job is successful, which is probably too optimistic.
  • Instead of:
           throw new IOException(e.getMessage());
    

    It's usually more helpful to preserve the cause:

           throw (IOException)new IOException().initCause(e);
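
    To make the difference concrete, here is a minimal, self-contained sketch. The class and method names (CauseDemo, wrapMessageOnly, wrapWithCause) are hypothetical and do not come from the patch. It assumes that keeping the original message alongside the cause is also wanted; on Java 6 and later, the IOException(String, Throwable) constructor should achieve the same thing in one step.

```java
import java.io.IOException;

class CauseDemo {
    /** Copies only the message; the original exception and its stack trace are lost. */
    static IOException wrapMessageOnly(Exception e) {
        return new IOException(e.getMessage());
    }

    /** Keeps the message and chains the original exception as the cause. */
    static IOException wrapWithCause(Exception e) {
        IOException ioe = new IOException(e.getMessage());
        ioe.initCause(e); // legal here because no cause has been set yet
        return ioe;
    }

    public static void main(String[] args) {
        // Hypothetical failure, standing in for whatever the wrapped code threw.
        Exception original = new IllegalStateException("status file unreadable");

        System.out.println(wrapMessageOnly(original).getCause());           // null
        System.out.println(wrapWithCause(original).getCause() == original); // true
    }
}
```

    With the cause chained, the stack trace printed for the IOException includes a "Caused by:" section pointing at the original failure, which is usually what someone debugging a failed submission needs.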
    

Chris Douglas made changes - 29/Jul/08 11:08 PM
Attachment 2778-1.patch [ 12387147 ]
Chris Douglas made changes - 29/Jul/08 11:09 PM
Assignee Srikanth Kakani [ srikantk ]
Chris Douglas made changes - 26/Jan/09 09:15 PM
Link This issue relates to HADOOP-5123 [ HADOOP-5123 ]
Steve Loughran added a comment - 22/May/09 11:26 AM
Putting my Ant team hat on here: we don't think people should run builds designed to take days. Ant isn't that kind of workflow engine; we deliberately don't have all the failure-handling logic you need in a big workflow. Something to submit jobs and return the URL as a string and an Ant property is fine. Blocking your build for 4 days while you wait for a job to complete is not so good.