
Key: HADOOP-5123
Type: New Feature
Status: Open
Priority: Minor
Assignee: Steve Loughran
Reporter: Steve Loughran
Votes: 1
Watchers: 6
Hadoop Common

Ant tasks for job submission

Created: 26/Jan/09 03:30 PM   Updated: 25/Jun/09 04:36 PM
Component/s: None
Affects Version/s: 0.21.0
Fix Version/s: None

Time Tracking:
Original Estimate: 168h
Remaining Estimate: 168h
Time Spent: Not Specified

File Attachments:
  JobSubmitTask.java (Java source file, licensed for inclusion in ASF works), 2009-01-27 02:23 PM, Steve Loughran, 7 kB
Environment: Both platforms, Linux and Windows


Description
Ant tasks to make it easy to work with hadoop filesystem and submit jobs.

<submit> : uploads JAR, submits job as user, with various settings

filesystem operations: mkdir, copyin, copyout, delete
- We could perhaps use Ant 1.7 "resources" here, so that HDFS can act as a source or destination in Ant's own tasks.

  1. Security: need to specify the user; pick up user.name from the JVM as the default?
  2. Cluster binding: are the namenode/job tracker (hostname, port) or a URL all that is needed?
     - Job conf: how to configure the job that is submitted? Support a list of <property name="name" value="something"> children.
  3. Testing: use AntUnit to generate <junitreport>-compatible XML files.
  4. Documentation: with an example using Ivy to fetch the JARs for the tasks and the Hadoop client.
  5. Polling: an Ant task that blocks until a job has finished?
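
Points 1 and 2 could be handled along the following lines. This is only a sketch under stated assumptions: the class name ClusterBinding and its resolve() method are hypothetical, not part of any attached code, and it assumes the endpoint is given either as "host:port" or as a URL carrying both a host and a port.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper: works out which user and endpoint a task should bind to. */
final class ClusterBinding {
    final String user;
    final String host;
    final int port;

    private ClusterBinding(String user, String host, int port) {
        this.user = user;
        this.host = host;
        this.port = port;
    }

    /**
     * @param user     explicit user, or null/blank to fall back to the JVM's user.name
     * @param endpoint either "host:port" or a URL such as "http://host:port"
     */
    static ClusterBinding resolve(String user, String endpoint) {
        String effectiveUser = (user == null || user.trim().isEmpty())
                ? System.getProperty("user.name")
                : user.trim();
        if (endpoint == null || endpoint.trim().isEmpty()) {
            throw new IllegalArgumentException("No cluster endpoint specified");
        }
        String text = endpoint.trim();
        // Give a bare "host:port" a placeholder scheme so java.net.URI can parse it.
        String candidate = text.contains("://") ? text : "hostport://" + text;
        URI uri;
        try {
            uri = new URI(candidate);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Bad endpoint: " + endpoint, e);
        }
        if (uri.getHost() == null || uri.getPort() < 0) {
            throw new IllegalArgumentException(
                    "Endpoint needs both a host and a port: " + endpoint);
        }
        return new ClusterBinding(effectiveUser, uri.getHost(), uri.getPort());
    }
}
```

Whether the JVM's user.name is an acceptable default is exactly the open question in point 1; the sketch only shows where such a fallback would sit.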


Comments
Steve Loughran added a comment - 26/Jan/09 03:36 PM
AntUnit probably doesn't integrate well with tests that need to set up a mini cluster for the test run; use the "legacy" JUnit test case integration JARs instead.

Steve Loughran added a comment - 26/Jan/09 08:41 PM
The use case for the <submit> Ant task is to submit a job as part of a build: upload
the JAR file, then print out enough information for you to track its progress.
<hadoop:submit tracker="http://jobtracker:50030"
    in="hdfs://host:port/tmp/in/something"
    out="hdfs://host:port/tmp/out/something"
    jobProperty="myJob"
    jar="dist/myapp.jar">
  <property name="dfs.replication.factor" value="4" />
  <mapper classname="org.example.identity" />
  <reducer classname="org.example.count" />
</hadoop:submit>
  1. No attempt to block until the submitted job completes. The task will print out
    the job ID.
  2. jobProperty names an Ant property to set to the job ID.
  3. List zero or more JAR files. No attempt to do sanity checks like loading classes; the far end can do that.
  4. No separate configuration files for the map/reduce/combine.
  5. Maybe a configuration file attribute, conf, that defines a conf file to use. If set, no other properties can be set (otherwise the Ant task would be forced to parse the XML, edit it, save it, etc.).
  6. The JAR file is optional, but if listed, it had better be there.

Tests without cluster

  • fail to submit if the JAR is missing
  • fail to submit if there is no tracker
  • error if the mapper or reducer is not defined
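
The checks that can run without a cluster might look like the sketch below. It is an illustration only: the class SubmitSettings, its fields, and the error strings are hypothetical and are not taken from the attached JobSubmitTask.java. It assumes that the proposed conf attribute and nested properties are mutually exclusive, as suggested in point 5 above.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical holder for the attributes of the proposed submit task. */
final class SubmitSettings {
    String tracker;
    String mapper;
    String reducer;
    File confFile;                                   // optional "conf" attribute
    final List<File> jars = new ArrayList<>();       // zero or more JAR files
    final Map<String, String> properties = new LinkedHashMap<>();

    /** Returns every problem found; an empty list means the job may be submitted. */
    List<String> validate() {
        List<String> errors = new ArrayList<>();
        if (isBlank(tracker)) {
            errors.add("no tracker specified");
        }
        if (isBlank(mapper)) {
            errors.add("no mapper defined");
        }
        if (isBlank(reducer)) {
            errors.add("no reducer defined");
        }
        // A JAR is optional, but any JAR that is listed must exist locally.
        for (File jar : jars) {
            if (!jar.isFile()) {
                errors.add("JAR not found: " + jar);
            }
        }
        // Mixing a conf file with nested properties would mean editing the XML.
        if (confFile != null && !properties.isEmpty()) {
            errors.add("conf and nested properties are mutually exclusive");
        }
        return errors;
    }

    private static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }
}
```

In a real Ant task each of these errors would probably surface as a build failure before anything is sent to the cluster; no class loading or other remote checks are attempted here.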

Tests with MiniMR up

  • submit a job

Steve Loughran added a comment - 26/Jan/09 09:10 PM
File operations

  • Touch, copy in, copy out. Not using distcp, so for small data.
  • Rename.
  • Add a condition for a file existing, maybe with a minimum size.
  • DfsMkDir: create a directory.

A first pass would use resources (http://ant.apache.org/manual/CoreTypes/resources.html#resource). These extend the Resource class
(https://svn.apache.org/viewvc/ant/core/trunk/src/main/org/apache/tools/ant/types/Resource.java?view=markup)
and can be used in existing Ant tasks such as <copy> and <touch>.
The resource would need to implement the getOutputStream() and getInputStream() operations and, ideally, Touchable for the touch() operation.
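
The contract described above can be sketched without either Ant or Hadoop on the classpath. The class below is a hypothetical in-memory stand-in: a real implementation would extend Ant's Resource class and delegate to the Hadoop filesystem, neither of which is shown here. The method names mirror the operations named in this comment; everything else is an assumption made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical in-memory stand-in for a filesystem-backed resource. */
final class InMemoryResource {
    private byte[] data;          // null until the resource is written or touched
    private long lastModified;

    boolean exists() {
        return data != null;
    }

    long getLastModified() {
        return lastModified;
    }

    long getSize() {
        return data == null ? 0 : data.length;
    }

    /** Opens the content for reading; fails if nothing has been stored yet. */
    InputStream getInputStream() throws IOException {
        if (data == null) {
            throw new IOException("resource does not exist");
        }
        return new ByteArrayInputStream(data);
    }

    /** Returns a stream whose content replaces the resource when it is closed. */
    OutputStream getOutputStream() {
        return new ByteArrayOutputStream() {
            @Override
            public void close() throws IOException {
                super.close();
                InMemoryResource.this.data = toByteArray();
                InMemoryResource.this.lastModified = System.currentTimeMillis();
            }
        };
    }

    /** Sets the modification time, creating an empty resource if needed. */
    void touch(long modTime) {
        if (data == null) {
            data = new byte[0];
        }
        lastModified = modTime;
    }
}
```

A stand-in like this could also back the "without a cluster" tests, since it exercises the same read, write, and touch paths that existing tasks would call.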

Tests without a cluster

  • Some meaningful failure if the hdfs:// URLs don't work

Tests with a cluster

  • Copy in, copy-out, copy inside
  • touch
  • delete
  • test for a resource existing
  • some of the resource selection operations

Tests against other file systems

  • s3:// URLs? Test that it works, but then assume that it stays working.
  • Test that s3:// URLs fail gracefully if the target is missing/forbidden

Chris Douglas added a comment - 26/Jan/09 09:15 PM
This sounds like a more thorough integration than the ant tasks added/proposed in HADOOP-1508 and HADOOP-2778. Would either of those be a reasonable base for some of the work you're considering?

Steve Loughran added a comment - 27/Jan/09 09:12 AM
I'll take a look at the codebase in both of these. I'd initially expect to start with the minimal set of operations needed to get work into a cluster from a developer's desktop; let it evolve from there. While I know less about Hadoop than the other contributions, I do know more about Ant and how to test build files under JUnit, so what's really going to be new here are the regression tests. I have some job submit code of my own that I was going to start with, but HADOOP-2788 could be a good starting point.

What worries me is the whole configuration problem; I think the client settings are minimal enough now that the JT URL should be enough.

The other problem is versioning; I will handle that by requiring tasks and cluster to be in sync, at least for now.


Steve Loughran added a comment - 27/Jan/09 02:18 PM
Looking at HADOOP-2788, it's actually more advanced than what I was thinking, as it
  1. tries to block the Ant run until the job is finished; extracts counters afterwards
  2. does some classloader tricks to work out the JAR to include

I'm against the latter; more reliable to let the build file author point to the right place.

Blocking on the job is also something I'm doubtful about, at least initially. Why? Because people will end up trying to use Ant as a long-lived workflow tool, and it isn't optimised for that, either in availability or even memory management. People do try this; GridAnt (http://www.globus.org/cog/projects/gridant/) is a case in point, but we don't encourage it. Better to move the workflow into the cluster and have some HA scheduler manage the sequence.


Steve Loughran added a comment - 27/Jan/09 02:23 PM
This is a first draft of a JobSubmit client.

1. no declaration/setting up of the inputs and outputs
2. no setup, yet, of the configuration above the default values.

For #2 there's a choice.
(a) refer to an Ant resource (including, once it's in there, a resource in an HDFS filesystem)
(b) let you declare the various properties in Ant itself.

(b) is more Ant-like, but less compatible with the rest of the Hadoop configuration design, and may still need to support reading in XML files just to get the base configuration together. But a mixed configuration is hardest to get right.

Thoughts?


Steve Loughran added a comment - 22/May/09 11:24 AM
Looking at this and the configuration options, assuming everything is left to the XML files themselves:
  1. Set it up on the classpath with which you declare the task. Easiest to do, and what I would start with.
  2. Use a confdir attribute that points to a configuration directory.

The second option is more flexible.
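
The two options could be combined in one lookup, sketched below. The class ConfLocator and its locate() method are hypothetical, and the three file names are an assumption made for illustration; the thread does not say which files the task would read.

```java
import java.io.File;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds configuration files in a confdir or on the classpath. */
final class ConfLocator {
    /** File names assumed for illustration only. */
    static final String[] CONF_FILES = {"core-site.xml", "hdfs-site.xml", "mapred-site.xml"};

    /**
     * @param confDir directory from a confdir attribute, or null to use the classpath
     * @param loader  classloader the task was declared with
     * @return locations of the configuration files that were found
     */
    static List<String> locate(File confDir, ClassLoader loader) {
        List<String> found = new ArrayList<>();
        for (String name : CONF_FILES) {
            if (confDir != null) {
                // Option 2: an explicit directory wins when it is set.
                File candidate = new File(confDir, name);
                if (candidate.isFile()) {
                    found.add(candidate.toURI().toString());
                }
            } else {
                // Option 1: fall back to whatever is on the task's classpath.
                URL url = loader.getResource(name);
                if (url != null) {
                    found.add(url.toString());
                }
            }
        }
        return found;
    }
}
```

Letting confdir override the classpath is a design choice in this sketch, not something decided in the issue.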

If I were to do this (and one of my colleagues is pestering me for it), I'd do it as a contrib. It depends on both core and mapred, so once they get split up, it should be downstream of them. It is nicely self-contained and just needs a cluster for testing. This could be done, incidentally, if the MiniMR cluster classes were moved from test/mapred to mapred, so I could add a <minimrcluster> task too.


Steve Loughran added a comment - 25/Jun/09 04:36 PM
Good Ant tasks should be independent of Hadoop versions, submit to remote clusters, etc. Hence they should be clients of a RESTy job submission API.