Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 0.22.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment: Both platforms, Linux and Windows

Description

      Ant tasks to make it easy to work with the Hadoop filesystem and submit jobs.

      <submit>: uploads the JAR and submits the job as a given user, with various settings.

      Filesystem operations: mkdir, copyin, copyout, delete.
      We could maybe use Ant 1.7 "resources" here, and so use HDFS as a source or destination in Ant's own tasks.

      1. Security: need to specify a user; pick up user.name from the JVM as the default?
      2. Cluster binding: is a namenode/job tracker (hostname, port) or URL all that is needed?
      3. Job conf: how to configure the job that is submitted? Support a list of <property name="name" value="something"> children.
      4. Testing: AntUnit to generate <junitreport>-compatible XML files.
      5. Documentation, with an example using Ivy to fetch the JARs for the tasks and the Hadoop client (see the sketch after this list).
      6. Polling: an Ant task to block until a job has finished?
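      As a rough illustration of the Ivy point above, a build file could resolve the task and Hadoop client JARs and then declare the tasks from the resulting path. Everything below is an assumption rather than an existing artifact: the hadoop-ant module coordinates, the antlib resource name, and the <hadoop:*> task names are placeholders.

          <project xmlns:ivy="antlib:org.apache.ivy.ant"
                   xmlns:hadoop="antlib:org.apache.hadoop.ant">

            <!-- resolve the (hypothetical) hadoop-ant tasks plus the Hadoop client JARs -->
            <ivy:cachepath pathid="hadoop.tasks.path" inline="true"
                           organisation="org.apache.hadoop" module="hadoop-ant" revision="0.22.0"/>

            <!-- declare the tasks; the antlib resource name is a placeholder -->
            <taskdef uri="antlib:org.apache.hadoop.ant"
                     resource="org/apache/hadoop/ant/antlib.xml"
                     classpathref="hadoop.tasks.path"/>

            <target name="stage-data">
              <!-- hypothetical filesystem operations from the list above -->
              <hadoop:mkdir path="hdfs://namenode:8020/tmp/in"/>
              <hadoop:copyin src="data/input.txt" dest="hdfs://namenode:8020/tmp/in/input.txt"/>
            </target>
          </project>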
Attachments

    • JobSubmitTask.java (7 kB), Steve Loughran

Activity

          Steve Loughran added a comment -

          Closing as WONTFIX unless anyone wants to do it.

          Better to submit a workflow to something like Oozie, or use Cascading.

          Karthik K added a comment -

          On a related note, I have put up a Maven Hadoop mojo at http://github.com/akkumar/maven-hadoop.

          It is pretty rudimentary at this point, with a target for packing jar files for submission. Hopefully it will be polished more down the road.

          Interested people can follow more at http://maven-hadoop.blogspot.com/.

          Karthik K added a comment -

          So, what is the current status / interest in this? A non-blocking Ant task would be very useful.

          Steve Loughran added a comment -

          Good Ant tasks should be independent of Hadoop versions, submit to remote clusters, etc. Hence: clients to a RESTy job-submission API.

          Steve Loughran added a comment -

          Looking at this and the configuration options, assuming everything is left to the XML files themselves:

          1. Set it up on the classpath with which you declare the task. Easiest to do, and what I would start with.
          2. Use a confdir attribute that points you at a configuration directory.

          The second option is more flexible.
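          A minimal sketch of the two approaches, purely illustrative: the antlib resource name and the confdir attribute are assumptions, not an implemented interface.

              <!-- Option 1: the conf/ directory sits on the classpath used to declare the tasks -->
              <taskdef uri="antlib:org.apache.hadoop.ant"
                       resource="org/apache/hadoop/ant/antlib.xml">
                <classpath>
                  <fileset dir="${hadoop.home}/lib" includes="*.jar"/>
                  <pathelement location="${hadoop.home}/conf"/>
                </classpath>
              </taskdef>

              <!-- Option 2: each task takes a confdir attribute pointing at a configuration directory -->
              <hadoop:submit confdir="${hadoop.home}/conf" jar="dist/myapp.jar"/>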

          If I were to do this (and one of my colleagues is pestering me for it), I'd do it as a contrib: it depends on both core and mapred, so once they get split up, it should be downstream of them. Nicely self-contained; it just needs a cluster for testing. This could be done, incidentally, if the MiniMR cluster classes were moved from test/mapred to mapred, so I could add a <minimrcluster> task too.

          Steve Loughran added a comment -

          This is a first draft of a JobSubmit client.

          1. No declaration/setting up of the inputs and outputs.
          2. No setup, yet, of the configuration above the default values.

          For #2 there's a choice:
          (a) refer to an Ant resource (including, once it's in there, a resource in an HDFS filesystem)
          (b) let you declare the various properties in Ant itself.

          (b) is more Ant-like, but less compatible with the rest of the Hadoop configuration design, and may still need to support reading in XML files just to get the base configuration together. But a mixed configuration is the hardest to get right.
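          To make the choice concrete, a sketch of what the two options might look like in a build file; the conf nested element, its attribute, and the task names are hypothetical.

              <!-- (a) point the task at a configuration resource (eventually an hdfs:// one) -->
              <hadoop:submit jar="dist/myapp.jar">
                <conf file="conf/job-site.xml"/>
              </hadoop:submit>

              <!-- (b) declare the properties in Ant itself -->
              <hadoop:submit jar="dist/myapp.jar">
                <property name="mapred.reduce.tasks" value="4"/>
              </hadoop:submit>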

          Thoughts?

          Steve Loughran added a comment -

          Looking at HADOOP-2788, it's actually more advanced than what I was thinking, as it:

          1. tries to block the Ant run until the job is finished; extracts counters afterwards
          2. does some classloader tricks to work out the JAR to include

          I'm against the latter; it's more reliable to let the build-file author point to the right place.

          The blocking-on-the-job thing is also something I'm doubtful about, at least initially. Why? Because people will end up trying to use Ant as a long-lived workflow tool, and it isn't optimised for that, either in availability or even memory management. People do try this (GridAnt is a case in point: http://www.globus.org/cog/projects/gridant/), but we don't encourage it. Better to move the workflow into the cluster and have some HA scheduler manage the sequence.

          Steve Loughran added a comment -

          I'll take a look at the codebase in both of these. I'd initially expect to start with the minimal set of operations needed to get work into a cluster from a developer's desktop, and let it evolve from there. While I know less about Hadoop than the authors of the other contributions, I do know more about Ant and how to test build files under JUnit, so what's really going to be new here are the regression tests. I have some job-submit code of my own that I was going to start with, but HADOOP-2788 could be a good starting point.

          What worries me is the whole configuration problem; I think the client settings are minimal enough now that the JT URL should be enough.

          The other problem is versioning; I will handle that by requiring tasks and cluster to be in sync, at least for now.

          Chris Douglas added a comment -

          This sounds like a more thorough integration than the ant tasks added/proposed in HADOOP-1508 and HADOOP-2778. Would either of those be a reasonable base for some of the work you're considering?

          Steve Loughran added a comment -

          File operations

          • Touch, copy in, copy out. Not using distcp, so this is for small data.
          • Rename.
          • Add a condition for a file existing, maybe with a minimum size.
          • DfsMkDir: create a directory.

          A first pass would use resources [http://ant.apache.org/manual/CoreTypes/resources.html#resource], which can be used in existing Ant tasks; they extend the Resource class
          [https://svn.apache.org/viewvc/ant/core/trunk/src/main/org/apache/tools/ant/types/Resource.java?view=markup]
          and can be used in the existing <copy>, <touch> tasks, and the like.
          The resource would need to implement the getOutputStream() and getInputStream() operations and, ideally, Touchable for the touch() operation.
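          For illustration, this is roughly how such a resource might plug into the existing tasks; the <hadoop:hdfsresource> type and its url attribute are assumptions, not existing Ant or Hadoop types.

              <!-- copy a result file out of HDFS using the (hypothetical) resource as the source -->
              <copy todir="build/results">
                <hadoop:hdfsresource url="hdfs://namenode:8020/tmp/out/part-00000"/>
              </copy>

              <!-- touch requires the resource to implement Touchable -->
              <touch>
                <hadoop:hdfsresource url="hdfs://namenode:8020/tmp/in/_started"/>
              </touch>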

          Tests without a cluster

          • Some meaningful failure if the hdfs:// URLs don't work

          Tests with a cluster

          • Copy in, copy-out, copy inside
          • touch
          • delete
          • test for a resource existing
          • some of the resource selection operations

          Tests against other file systems

          • s3:// URLs? Test that they work, but then assume that they stay working.
          • Test that s3 URLs fail gracefully if the URL is missing/forbidden
          Steve Loughran added a comment -

          The use case for the <submit> Ant task is to submit a job as part of a build, and print
          out enough information for you to track its progress. It also uploads the JAR file.

          <hadoop:submit tracker="http://jobtracker:50030" 
              in="hdfs://host:port/tmp/in/something"
              out="hdfs://host:port/tmp/out/something"
              jobProperty="myJob"
              jar="dist/myapp.jar"
          >
            <property name="dfs.replication.factor" value="4" />
            <mapper classname="org.example.identity" /> 
            <reducer classname="org.example.count" />
           </hadoop:submit>
          
          1. No attempt to block on the job submission. The task will print out
            the job ID.
          2. jobProperty names an Ant property that will be set to the job ID.
          3. List zero or more JAR files. No attempt to do sanity checks like loading classes; the far end can do that.
          4. No separate configuration files for the map/reduce/combine.
          5. Maybe, a configuration-file attribute conf that defines a conf file to use. If set, no other properties can be set (it would force the Ant task to parse the XML, edit it, save it, etc.).
          6. The JAR file is optional, but if listed, it had better be there.

          Tests without a cluster

          • fail to submit if the JAR is missing
          • fail to submit if there is no tracker
          • error if the mapper or reducer is not defined

          Tests with MiniMR up

          • submit a job
          Steve Loughran added a comment -

          AntUnit probably doesn't integrate well with tests that need to set up a mini cluster for the test run; use the "legacy" JUnit test-case integration JARs instead.
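          For instance, the build-file tests could run under the normal <junit> task and feed <junitreport>; the test class name and report directories here are placeholders.

              <junit fork="true" printsummary="yes">
                <classpath refid="test.classpath"/>
                <formatter type="xml"/>
                <!-- a JUnit test case that starts a mini cluster and then drives the Ant tasks -->
                <test name="org.apache.hadoop.ant.TestJobSubmitTask" todir="build/test-reports"/>
              </junit>

              <junitreport todir="build/test-reports">
                <fileset dir="build/test-reports" includes="TEST-*.xml"/>
                <report format="frames" todir="build/test-reports/html"/>
              </junitreport>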


People

    • Assignee: Steve Loughran
    • Reporter: Steve Loughran
    • Votes: 1
    • Watchers: 6

Dates

    • Created:
    • Updated:
    • Resolved:

Time Tracking

    • Original Estimate: 168h
    • Remaining Estimate: 168h
    • Time Spent: Not Specified
