Steve Loughran added a comment - 26/Jan/09 03:36 PM
AntUnit probably doesn't integrate well with tests that need to set up a mini cluster for the test run; use the "legacy" JUnit test case integration JARs instead.
The use case for the <submit> Ant task is to submit a job as part of a build and print out enough information for you to track its progress. Upload the JAR file.
<hadoop:submit tracker="http://jobtracker:50030"
    in="hdfs://host:port/tmp/in/something"
    out="hdfs://host:port/tmp/out/something"
    jobProperty="myJob"
    jar="dist/myapp.jar" >
  <property name="dfs.replication.factor" value="4" />
  <mapper classname="org.example.identity" />
  <reducer classname="org.example.count" />
</hadoop:submit>
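To make the attribute mapping above concrete, here is a minimal sketch of how such a task class might collect its settings. It relies on Ant's standard convention of calling a `setX` method for each attribute and an `addConfiguredX` method for each nested element. Everything else is an assumption for illustration: the class name `SubmitTaskSketch`, the meaning given to `jobProperty`, and the `describe()` helper are hypothetical, and the sketch deliberately does not contact a JobTracker or depend on the Ant or Hadoop JARs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch of the attribute handling behind <hadoop:submit>.
 * It only validates and reports its settings; no job is submitted.
 * Make the class public in its own file before wiring it into a taskdef.
 */
class SubmitTaskSketch {

    /** Models a nested <property name="..." value="..."/> element. */
    static class Property {
        private String name;
        private String value;
        public void setName(String name) { this.name = name; }
        public void setValue(String value) { this.value = value; }
    }

    private String tracker;
    private String in;
    private String out;
    private String jar;
    // Assumption: names the Ant property that would receive the job ID.
    private String jobProperty;
    private final Map<String, String> properties = new LinkedHashMap<>();

    // Ant maps each XML attribute to the matching setter by name.
    public void setTracker(String tracker) { this.tracker = tracker; }
    public void setIn(String in) { this.in = in; }
    public void setOut(String out) { this.out = out; }
    public void setJar(String jar) { this.jar = jar; }
    public void setJobProperty(String jobProperty) { this.jobProperty = jobProperty; }

    // Called once per nested <property> element, after its attributes are set.
    public void addConfiguredProperty(Property p) {
        if (p.name == null || p.value == null) {
            throw new IllegalStateException("<property> needs both name and value");
        }
        properties.put(p.name, p.value);
    }

    private static void require(String value, String attribute) {
        if (value == null || value.isEmpty()) {
            // A real Ant task would throw BuildException here.
            throw new IllegalStateException(
                "<hadoop:submit> needs the '" + attribute + "' attribute");
        }
    }

    /** Checks the required attributes and returns a one-line summary. */
    String describe() {
        require(tracker, "tracker");
        require(in, "in");
        require(out, "out");
        require(jar, "jar");
        return "Submitting " + jar + " to " + tracker
            + " (in=" + in + ", out=" + out + ", properties=" + properties + ")";
    }

    /** Entry point Ant would invoke; here it only prints the summary. */
    public void execute() {
        System.out.println(describe());
    }
}
```

Failing fast on a missing attribute keeps configuration errors in the build log rather than on the cluster, which fits the goal of printing enough information to track the job from the build.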
Tests without cluster
Tests with MiniMR up
File operations
* Touch, copy in, copy out. Not using distcp, so for small data.
A first pass would use resources: http://ant.apache.org/manual/CoreTypes/resources.html#resource
Tests without a cluster
Tests with a cluster
Tests against other file systems
Chris Douglas made changes - 26/Jan/09 09:15 PM
This sounds like a more thorough integration than the ant tasks added/proposed in
Chris Douglas made changes - 26/Jan/09 09:15 PM
I'll take a look at the codebase in both of these. I'd initially expect to start with the minimal set of operations needed to get work into a cluster from a developer's desktop, and let it evolve from there. While I know less about Hadoop than the other contributors, I do know more about Ant and how to test build files under JUnit, so what's really going to be new here are the regression tests. I have some job submit code of my own that I was going to start with, but
What worries me is the whole configuration problem; I think the client settings are minimal enough now that the JT URL should be enough. The other problem is versioning; I will handle that by requiring tasks and cluster to be in sync, at least for now. Looking at
I'm against the latter; it is more reliable to let the build file author point to the right place. The blocking-the-job thing is also something I'm doubtful about, at least initially. Why? Because people will end up trying to use Ant as a long-lived workflow tool and it isn't optimised for that, either in availability or even memory management. People do try this; GridAnt is a case in point: http://www.globus.org/cog/projects/gridant/
This is a first draft of a JobSubmit client.
1. no declaration/setting up of the inputs and outputs
For #2 there's a choice. (b) is more Ant-like, but less compatible with the rest of the Hadoop configuration design, and may still need to support reading in XML files just to get the base configuration together. But a mixed configuration is hardest to get right. Thoughts?
Steve Loughran made changes - 27/Jan/09 02:23 PM
Looking at this and the configuration options, assuming everything is left to the XML files themselves.
The second option is more flexible. If I were to do this (and one of my colleagues is pestering me for it), I'd do it as a contrib: it depends on both core and mapred, so once they get split up, it should be downstream of them. It is nicely self-contained and just needs a cluster for testing. This could be done, incidentally, if the MiniMR cluster classes were moved from test/mapred to mapred, so I could add a <minimrcluster> task too.
Steve Loughran made changes - 25/Jun/09 04:36 PM
Good Ant tasks should be independent of Hadoop versions, submit to remote clusters, etc. Hence they should be clients to a RESTy job submission API.