Hadoop Common / HADOOP-1622

Hadoop should provide a way to allow the user to specify jar file(s) the user job depends on

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.17.0
    • Component/s: None
    • Labels: None
    • Release Note:
      This patch adds new command line options for

      hadoop jar

      which are:

      hadoop jar -files <comma-separated list of files> -libjars <comma-separated list of jars> -archives <comma-separated list of archives>

      The -files option lets you specify a comma-separated list of paths that will be present in the current working directory of your task.
      The -libjars option lets you add jars to the classpaths of the maps and reduces.
      The -archives option lets you pass archives as arguments; these are unzipped/unjarred, and a link with the name of the jar/zip is created in the current working directory of the tasks.
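      For illustration, a submission using these options might look like the following; the jar, class, and file names are hypothetical, and the placement of the options relative to the jar name follows the note above:

      hadoop jar -files stopwords.txt -libjars json.jar,guava.jar -archives dictionaries.zip wordcount.jar org.myorg.WordCount /input /output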

      Description

      More likely than not, a user's job may depend on multiple jars.
      Right now, when submitting a job through bin/hadoop, there is no way for the user to specify that.
      A workaround is to re-package all the dependent jars into a new jar, or to put the dependent jar files in the lib dir of the new jar.
      This workaround causes unnecessary inconvenience to the user. Furthermore, if the user does not own the main function
      (as is the case when the user uses Aggregate, datajoin, or streaming), the user has to re-package those system jar files too.
      It is much desired that hadoop provide a clean and simple way for the user to specify a list of dependent jar files at the time
      of job submission. Something like:

      bin/hadoop .... --depending_jars j1.jar:j2.jar

      1. HADOOP-1622_6.patch
        30 kB
        Mahadev konar
      2. HADOOP-1622_5.patch
        30 kB
        Mahadev konar
      3. HADOOP-1622_4.patch
        28 kB
        Mahadev konar
      4. HADOOP-1622_3.patch
        29 kB
        Mahadev konar
      5. HADOOP-1622_2.patch
        28 kB
        Mahadev konar
      6. HADOOP-1622_1.patch
        20 kB
        Mahadev konar
      7. HADOOP-1622-9.patch
        46 kB
        Dennis Kubes
      8. HADOOP-1622-8.patch
        45 kB
        Dennis Kubes
      9. HADOOP-1622-7.patch
        44 kB
        Doug Cutting
      10. HADOOP-1622-6.patch
        46 kB
        Doug Cutting
      11. HADOOP-1622-5.patch
        46 kB
        Doug Cutting
      12. hadoop-1622-4-20071008.patch
        48 kB
        Dennis Kubes
      13. multipleJobResources2.patch
        44 kB
        Dennis Kubes
      14. multipleJobResources.patch
        43 kB
        Dennis Kubes
      15. multipleJobJars.patch
        8 kB
        Dennis Kubes

        Issue Links

          Activity

          Hudson added a comment -

          Integrated in Hadoop-trunk #443 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/443/ )

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12378649/HADOOP-1622_6.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 16 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2067/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2067/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2067/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2067/console

          This message is automatically generated.

          dhruba borthakur added a comment -

          I just committed this. Thanks Mahadev!

          Mahadev konar added a comment -

          Looks like the previous patch went stale with some commits yesterday; attaching a new patch.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12378580/HADOOP-1622_5.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 16 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2053/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2053/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2053/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2053/console

          This message is automatically generated.

          Mahadev konar added a comment -

          This is the patch implementing Devaraj's comment about host resolution. I will add another JIRA for this feature to be used by Pipes.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12378527/HADOOP-1622_4.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 16 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2042/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2042/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2042/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2042/console

          This message is automatically generated.

          Mahadev konar added a comment -

          Got rid of the Findbugs warning.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12378510/HADOOP-1622_3.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 16 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs -1. The patch appears to introduce 1 new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2040/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2040/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2040/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2040/console

          This message is automatically generated.

          Mahadev konar added a comment -

          Fixed the Findbugs warnings.

          Mahadev konar added a comment -
          • I think it might not be a big overhead... I just wanted to avoid it since it would be a common utility and should be filed as a separate JIRA (finding out whether two filesystems are the same seems like a nice thing to have). I wanted to keep this patch simple.
          • I don't think Pipes can make use of it. I'll create another JIRA for that as well.
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12378428/HADOOP-1622_2.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included +1. The patch appears to include 16 new or modified tests.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs -1. The patch appears to introduce 3 new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2030/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2030/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2030/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2030/console

          This message is automatically generated.

          Devaraj Das added a comment -

          Do you think the DNS resolution is going to be a big hit? I don't think so, with DNS caching in place, etc.
          Can Pipes make use of this feature (this patch doesn't support Pipes)? I am OK with having a separate issue to address Pipes if required.
          Otherwise, the patch looks fine.

          Mahadev konar added a comment -

          Attaching a patch with the unit test. It passes the tests on my machine.

          Mahadev konar added a comment -

          Attaching a patch for this feature. It does not have unit tests included. I am still writing unit tests and will upload a patch by the end of the day.

          This patch enhances the hadoop command line for job submission, so you can say:

          • bin/hadoop jar -files <comma-separated files> -libjars <comma-separated libs> -archives <comma-separated archives>
          • these options are all optional and the command line is backwards compatible
          • the patch uses CLI for command line parsing
          • it uses DistributedCache for copying files locally for the tasks
          • it supports URIs in the command line arguments
          • if the files are already uploaded to the HDFS instance used by the JobTracker, then it does not recopy the files – there is a tiny catch here: since the URIs are compared as strings between the remote file system and the one the JobTracker uses, the files may be copied even though it is the same DFS (e.g. hdfs://hostname1:port != hdfs://hostname1.fullyqualifiedname:port)
          • the command line files, archives, and libjars are stored temporarily in the HDFS job directory, from where they are copied locally
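
          As a rough sketch of how these options map onto the DistributedCache API described in this comment (the paths, host name, and call sites here are illustrative assumptions, not the patch's actual code):

          import java.net.URI;
          import org.apache.hadoop.filecache.DistributedCache;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.mapred.JobConf;

          public class CacheSetupSketch {
            public static void main(String[] args) throws Exception {
              JobConf conf = new JobConf();
              // -files: copied to each task node; the #fragment names the symlink
              // created in the task's working directory.
              DistributedCache.addCacheFile(new URI("hdfs://namenode:9000/jobdir/files/dict.txt#dict.txt"), conf);
              // -libjars: added to the classpath of the map and reduce tasks.
              DistributedCache.addFileToClassPath(new Path("/jobdir/libjars/parser.jar"), conf);
              // -archives: unpacked on the task node, with a link in the task's working directory.
              DistributedCache.addCacheArchive(new URI("hdfs://namenode:9000/jobdir/archives/models.zip"), conf);
              // Ask the framework to create the symlinks in the task's working directory.
              DistributedCache.createSymlink(conf);
            }
          }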
          Mahadev konar added a comment -

          How about:

          hadoop jar -file <..> -libjar <comma-separated jars> -archives <comma-separated archives>

          Mahadev konar added a comment -

          I'll leave the jar option to keep it backwards compatible. I don't want to break backwards compatibility for users.

          • As for the job directory changes: this is the directory structure in HDFS; the local job directory structure would not change.
          Runping Qi added a comment -

          Sounds good.

          A couple comments:

          It seems weird to have both jar and -jar as an argument and an option
          in the command line "hadoop jar -file <comma separated files> -jar <comma separated jars>".
          Would it be better to use "-classpath" instead?

          When the job dir changes to

          jobdir/jars/urischeme/<jarfiles>
          jobdir/archives/urischeme/<archivefiles>
          jobdir/file/urischeme/<files>

          will that break current applications that assume their files are loaded into the jobdir via the -file and -archive options?

          Mahadev konar added a comment -

          I like Owen's idea. It's simple and gives the users the flexibility they need.

          Here is how I am implementing this –

          the hadoop command line will have the following options:

          hadoop jar -file <comma-separated files> -jar <comma-separated jars> -archive <comma-separated archives>

          All of these can be comma-separated URIs, defaulting to the local file system if no scheme is specified.

          The JobClient uploads the files / jars / archives onto HDFS (or whichever filesystem MapReduce uses) under the job directory.

          Given that these files/jars/archives might have the same name but different URIs –
          example: hadoop jar -file file:///file1,hdfs://somehost:port/file1
          – we would store these files as
          jobdir/file/file/file1
          jobdir/hdfs_somehost_port/file1

          Keeping these files in different directories, with the directory name derived from the URI, gives us the ability to just use DistributedCache as it is.

          So we could say DistributedCache.addFiles(jobdir/file/file/file1, jobdir/hdfs_somehost_port/file1);
          something like this ...

          So the job directory would look like

          jobdir/jars/urischeme/<jarfiles>
          jobdir/archives/urischeme/<archivefiles>
          jobdir/file/urischeme/<files>

          The ones in jars will be added to the classpath of all the tasks in the order they were mentioned.
          The others will be copied once per job and symlinked from the current working directory of the task.

          Comments?
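
          For illustration, the URI-to-directory-name mangling sketched above might look like this (the helper name and exact format are assumptions drawn from the examples in this comment, not committed code):

          import java.net.URI;

          public class UriDirNames {
            // e.g. hdfs://somehost:9000/file1 -> "hdfs_somehost_9000", file:///file1 -> "file"
            static String uriToDirName(URI uri) {
              StringBuilder sb = new StringBuilder(uri.getScheme());
              if (uri.getHost() != null) sb.append('_').append(uri.getHost());
              if (uri.getPort() != -1) sb.append('_').append(uri.getPort());
              return sb.toString();
            }

            public static void main(String[] args) throws Exception {
              System.out.println(uriToDirName(new URI("hdfs://somehost:9000/file1"))); // hdfs_somehost_9000
              System.out.println(uriToDirName(new URI("file:///file1")));              // file
            }
          }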

          Mahadev konar added a comment -

          Also, Owen, what would the command line look like with your suggestions?

          hadoop jar -file <files> -jar <jars> -archive <archives> ?

          Also, if that is the case, then we could make it generic for streaming, which uses its own options for -file, -archives and others ... though we do not need to do that in this patch...

          Dennis Kubes added a comment -

          I have not resumed working on this as of yet. I am currently neck deep in reworking NIO for hadoop RPC. I was planning on finishing this as soon as I had completed the NIO code, in the next 2-3 days. I would like to continue working on this if possible. When is 0.17 scheduled for release?

          Owen, the first pass at this didn't distinguish between jar and regular files on the command line. Instead there was detection code that identified files as such. The first pass also supported directories as well as files (I don't know if you are including those under file). I think the ability to include directories for job input is extremely important. What were the special cases that you were seeing?

          The idea behind this code is that, much like streaming, you could upload and cache files from any type of resource (file, directory, jar, etc.) from any file system. So, for instance, people could store common jars or file resources on S3 and pull them down into a job.

          Mahadev konar added a comment -

          I am starting work on this ... Dennis, if you are already working on this, please let me know.

          Mahadev konar added a comment -

          Marking this for the 0.17 release.

          Owen O'Malley added a comment -

          Dennis,
          Upon looking at this, I'm getting worried. This looks like a lot of special cases. What we really need is to support 3 kinds of files:

          • simple files
          • archives
          • jar files

          For each of these, we would like them to be able to come from a URI, with the most convenient default being a local file. So, I propose something like:

          -file foo,bar,hdfs:baz
          

          will upload foo and bar to an upload area and download foo, bar, and baz to the slave nodes as the tasks are run on them.

          -archive foo.zip,hdfs:baz.zip
          

          will download foo.zip and baz.zip and expand them.

          Finally, the -jar option would download them and put them on the class path. So,

          -jar myjar.jar,hadoop-0.16.1-streaming.jar
          

          would upload the files in the job client, download them to the slaves, and add them to the class path in the given order.

          I think I'd leave the rsync functionality out and just use hdfs:_upload/$jobid/... as transient storage and delete it when the job is done. If the user wants to save the bandwidth they can upload the files to hdfs themselves, in which case they don't need to be uploaded.
          Thoughts?

          Mahadev konar added a comment -

          Great... we also need this feature to get into 0.17. Let me know if you need any help getting this into 0.17...

          Dennis Kubes added a comment -

          No updates yet, but I should have time to start working on this again in the next couple of days, right after I finish some work on converting hadoop RPC to NIO.

          Mahadev konar added a comment -

          Dennis, any updates on this bug?

          Dennis Kubes added a comment - edited

          I have only gotten a chance to design this, not to develop it, as I have been launching the Search Wikia site. Here is what I have come up with, in terms of a more generalized design, after talking with both Doug and Owen about this enhancement:

          1. A runjob utility. runjar is not affected, as it is made to run only a single jar.

          2. The options parser will be extended to support resources, upload, classpath, noclasspath, compress, decompress, and cache.

          • Items that are cached are added to the distributed cache.
          • Items uploaded are by default not added to the classpath.
          • Items cached are by default added to the classpath.
          • Resources are by default added to the classpath.
          • Compress will choose resources to compress before adding to the job.jar file.
          • Decompress will choose resources to be decompressed before adding to the job.jar file.
          • Compress and decompress will only act on resources being added to the job. This will include non-local resources, and will need to be handled in slave-local job resources.
          • Classpath is ignored for any resource that is being uploaded, as it will already be added to the classpath due to it being in resources.
          • All options support multiple elements in comma-separated format.
          • Noclasspath will remove cached and non-cached resources from the classpath. For example, a jar can be added to resources and included in the local job.jar resources, but not included in its local classpath. (I don't know if this functionality is useful?)

          3. Resources

          • Resources are one or more items that are jarred up into the single job.jar file
          • Resources can be files (compressed or uncompressed) or directories
          • Resources can be from any file system.
          • Resource paths support relative and absolute paths
          • Resources support URL-type paths to support multiple file systems
          • If the path is not in a URL format, then it is assumed to be on the local file system, as either an absolute or relative path.
          • Only resources that exist will be included. This is true for any file system. The resource must exist at the beginning of the job to be uploaded. If the resource exists at the beginning of the job but not when the local job starts its processing, an error will be thrown and that task will cease operation.
          • A global configuration variable exists to choose to decompress any compressed file that is added as a resource.
          • Non-local resources will be pulled down into the local job resources from the resource's given file system. This can include DFS and S3 resources added dynamically.
          • Local resources that are added to the job.jar will be resources from the resources configuration variable passed to the local jobs. Remaining resources will be the non-local resources that need to be added to local job resources.

          4. Uploads

          • Uploads by default are put into the user's home directory on the jobtracker file system.
          • Upload directories can be set either through a configuration variable for a global default upload folder, or through a colon path structure in the upload, something like path:uploadto.
          • Upload resources can be added to the classpath by the classpath option
          • If upload resources are added to the classpath, they will be pulled down into the resources for each job and added to the local job classpath.
          • Uploads are independent of resources. An upload doesn't have to be a resource. A resource can be an uploaded element. In this case it would be uploaded (not included in local job.jar) and then pulled down from the job tracker file system as a resource.
          • Uploads will check modified date/time and size before uploading elements. If the upload is a directory, the upload will recursively check all files in that directory before upload and only upload modified files. This should give an rsync type functionality to uploading resources and should decrease bandwidth consumption.
          • Upload will support URL-type paths as well. This will allow transferring resources from one type of file system (e.g. S3) to the job tracker's file system. Again, resources without a URL-type structure will be considered local file system and will support relative and absolute paths. Only absolute paths will be supported on non-local file systems.
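
          Purely for illustration, an invocation under this proposed design might have looked something like the following. The runjob utility and every switch shown are taken from the design sketch above; none of this was committed in this form, and the paths are made up:

          bin/hadoop runjob -resources myjob.jar,conf/,s3://bucket/common-libs.jar -upload models/:shared/models -cache shared/models -noclasspath conf/ org.myorg.MyJob /input /output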
          Milind Bhandarkar added a comment -

          Dennis, did you get a chance to work on this after your last comment? We would love to have this available to our users in 0.16.

          Dennis Kubes added a comment -

          I was thinking about this last night. Right now the command line only takes a single jar, even though multiple resources are supported through the patch. We could add checking a system property for multiple comma-separated resources when the job is submitted. The command line could, behind the scenes, set this variable. We could have a -resources or a -jobResources a,b,c switch. The current patch handles finding resources from multiple locations, including full paths and jars/classes on the classpath. We could add a relative path structure. If this seems reasonable I will work up a patch for it ASAP.

          Doug Cutting added a comment -

          Owen & I talked a bit about this last week. We determined three commonly useful types of job resources:

          • archives already present in the cluster that will be unpacked in the task dir
          • archives already present in the cluster that will remain intact in the task dir
          • resources in the local filesystem that will be added to the task's classpath

          This issue primarily concerns the last, but we should attempt to have a somewhat uniform mechanism. The primary differences between the first and the third are (a) that unqualified paths are resolved relative to different filesystems; and (b) resources may or may not be visible on the classpath.

          All of these should be available from the command line, with -archive, -file and -jar respectively.

          Owen, does that capture our discussion? What would need to change in the current patch to be consistent with that proposal? Should we file another issue to improve command-line support for these, or should this be done as a part of this issue?

          Hudson added a comment -

          Integrated in Hadoop-Nightly #286 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/286/ )

          Dennis Kubes added a comment -

          Updated to most recent trunk and added requested changes.

          Dennis Kubes added a comment -

          1. Could you please remove the mention of 'final' and 'default' config resources from the javadoc for JobConf.{get|set}JobResources? They are no longer relevant vis-a-vis hadoop Configuration.

          I have removed the mention of final and default resources.

          2. Should we also have a JobConf.setJobResource along with JobConf.addJobResource, a la the DistributedCache APIs?

          I had debated set vs. add for resources. The current behavior is that when you add a resource, you are appending it to a list of resources, as opposed to setting a resource, which would clear anything previously added and add only that resource. Since jar resources are often added by including the jar file which contains a given class, I thought it better to NOT allow clearing and resetting of job resources.

          3. Should we move the private JobClient.createJobJar method to JarUtils to make it available as a useful utility?

          I debated this too. JarUtils held generic jarring and unjarring utilities, but I don't see harm in putting createJobJar in, and I think you are right that we may need it somewhere else in the future. I have removed it from JobClient and added it to JarUtils.

          Unrelated: Does it make sense to rename Configuration.addResource to Configuration.addConfigResource? I wonder how confusing these unrelated API names are, given JobConf is a Configuration too.

          Yeah, I debated this one too. In the end we weren't just adding jars but multiple things, such as classes, executables, and files, and I couldn't find a better name for that than resource. I made it jobResource to be a little less confusing. Changing Configuration over to configResource would be good, I think, although we should probably deprecate the old method first because a lot of things rely on it.

          I am currently testing patch 9, will have it posted shortly.
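
          A tiny sketch of the add-only semantics being described, for illustration only. The class and method names follow this discussion, not a shipped API; this line of patches was later reverted in favor of the -files/-libjars/-archives approach above:

          import java.util.ArrayList;
          import java.util.List;

          public class JobResourcesSketch {
            private final List<String> resources = new ArrayList<String>();

            // Add-only: each call appends to the running list. There is deliberately
            // no setter that would clear resources added earlier (for example, jars
            // registered implicitly via a class they contain).
            public void addJobResource(String resource) {
              resources.add(resource);
            }

            public List<String> getJobResources() {
              return resources;
            }
          }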

          Hudson added a comment -

          Integrated in Hadoop-Nightly #284 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/284/ )

          Arun C Murthy added a comment -

          Dennis, I'm sorry to come in late on this... a couple of comments:

          1. Could you please remove the mention of 'final' and 'default' config resources from the javadoc for JobConf.{get|set}JobResources? They are no longer relevant vis-a-vis hadoop Configuration.
          2. Should we also have a JobConf.setJobResource along with JobConf.addJobResource, a la the DistributedCache APIs?
          3. Should we move the private JobClient.createJobJar method to JarUtils to make it available as a useful utility?

          Unrelated: Does it make sense to rename Configuration.addResource to Configuration.addConfigResource? I wonder how confusing these unrelated api names are, given JobConf is a Configuration too ...

          Dennis Kubes added a comment -

          This patch brings the code up to the current trunk. It also fixes a bug in createJobJar in which we needed to use the context classloader to search for classes. This patch passes all unit tests and successfully runs the RandomWriter example.

          Owen O'Malley added a comment -

          I had to revert the patch, because it broke HADOOP-2107.

          Hudson added a comment -

          Integrated in Hadoop-Nightly #283 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/283/ )

          Doug Cutting added a comment -

          I just committed this. Thanks, Dennis!

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12368389/HADOOP-1622-7.patch
          against trunk revision r588300.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/982/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/982/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/982/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/982/console

          This message is automatically generated.

          Doug Cutting added a comment -

          I've updated this to trunk and will commit it later today barring objections.

          Devaraj Das added a comment -

          Sorry, this patch doesn't apply anymore. Dennis, could you please regenerate the patch against the current trunk?

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12368037/HADOOP-1622-6.patch
          against trunk revision r586264.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/972/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/972/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/972/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/972/console

          This message is automatically generated.

          Doug Cutting added a comment -

          Fixes for findbugs problems.

          Doug Cutting added a comment -

          Findbugs issues need to be addressed.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12367974/HADOOP-1622-5.patch
          against trunk revision r586003.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs -1. The patch appears to introduce 3 new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/969/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/969/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/969/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/969/console

          This message is automatically generated.

          Doug Cutting added a comment -

          Here's a new version with setJar/getJar deprecated.

          Dennis Kubes added a comment -

          Any updates on this? Do we need to do anything to get this into trunk? It has been running successfully for us in production (a 35-node Hadoop cluster) for over a week now.

          Dennis Kubes added a comment -

          I am good with deprecating them; I just didn't want to break anything with the current patch. Let me know if you want me to build it into the patch; it's a small change.

          Doug Cutting added a comment -

          This patch makes it so there are two ways to specify resources, either with job.setJar(), or job.addResource(). This seems overly complicated. I wonder if we shouldn't rather deprecate getJar/setJar. Instead of removing them later, we might simply make these package-private, since they'll still be the means of passing the compound jar name to tasktrackers, but users should only have to know about addResource(), no?
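
          For illustration, a minimal sketch of the user-facing side under that proposal (the method name comes from this discussion; the paths and class are hypothetical, and none of this is committed API):

            JobConf job = new JobConf(conf, MyJob.class);
            job.addResource("lib/thirdparty.jar");    // an extra jar the job depends on
            job.addResource("conf/lookup-table.txt"); // a plain file resource
            // setJar()/getJar() would become package-private plumbing that hands
            // the merged job.jar to the tasktrackers; users never call them.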

          Dennis Kubes added a comment -

          The updated patch now passes all unit tests and is updated for the current 0.15 source. Build.xml has also been changed to allow tests for JarUtils. Licenses have been added where applicable.

          Dennis Kubes added a comment -

          Sorry, I have been caught up in other development for the past couple of months. I will fix any problem with the current code, make sure it passes current unit tests, and upload a revised patch in the next couple of days.

          Doug Cutting added a comment -

          This does not currently pass its own unit tests.

          stack added a comment -

          I tried running the latest patch and it failed with the below:

          durruti:~/Documents/checkouts/hadoop-commit stack$ more build/test/TEST-org.apache.hadoop.util.TestJarUtils.txt 
          Testsuite: org.apache.hadoop.util.TestJarUtils
          Tests run: 5, Failures: 4, Errors: 0, Time elapsed: 0.258 sec
          ------------- Standard Output ---------------
          2007-08-08 08:00:01,272 INFO  util.JarUtils (JarUtils.java:jarAll(311)) - Adding file2:102 to test-out2.jar.
          2007-08-08 08:00:01,276 INFO  util.JarUtils (JarUtils.java:jarAll(229)) - Adding dir1/dir1-1/dir1-2/file1:0 to test-out2.jar.
          2007-08-08 08:00:01,279 INFO  util.JarUtils (JarUtils.java:jarAll(229)) - Adding dir2/dir2-1/file2:0 to test-out2.jar.
          2007-08-08 08:00:01,280 INFO  util.JarUtils (JarUtils.java:jarAll(229)) - Adding dir3/file3:0 to test-out2.jar.
          2007-08-08 08:00:01,281 INFO  util.JarUtils (JarUtils.java:jarAll(257)) - Adding file4.txt:0 to test-out2.jar.
          ------------- ---------------- ---------------
          
          Testcase: testGetJarPath took 0.175 sec
                  FAILED
          expected:<.../dir1-2/...> but was:<...\dir1-2\...>
          junit.framework.ComparisonFailure: expected:<.../dir1-2/...> but was:<...\dir1-2\...>
                  at org.apache.hadoop.util.TestJarUtils.testGetJarPath(TestJarUtils.java:88)
          
          Testcase: testJar took 0.022 sec
                  FAILED
          null
          junit.framework.AssertionFailedError
                  at org.apache.hadoop.util.TestJarUtils.testJar(TestJarUtils.java:111)
          
          Testcase: testIsJarOrZip took 0.015 sec
          Testcase: testJarAll took 0.024 sec
                  FAILED
          null
          junit.framework.AssertionFailedError
                  at org.apache.hadoop.util.TestJarUtils.testJarAll(TestJarUtils.java:164)
          
          Testcase: testCopyJarContents took 0.018 sec
                  FAILED
          null
          junit.framework.AssertionFailedError
                  at org.apache.hadoop.util.TestJarUtils.testCopyJarContents(TestJarUtils.java:218)
          

          If you are going to make a new patch, here are a couple of other things you could fix:

          License missing on unit test.

          When creating files in tests, other folks seem to do something like the following to keep them under the configured test directory: System.getProperty("test.build.data", "."). You might want to do the same.
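
          For example (the subdirectory name here is just illustrative):

            File testDir = new File(System.getProperty("test.build.data", "."), "test-jarutils");
            testDir.mkdirs();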

          Next time, to save yourself a bit of typing, you could use the local FileSystem to do a recursive delete of the directory (I'm guessing that's why you individually remove each of the items in the teardown):

            FileSystem fs = FileSystem.getLocal(new Configuration());
            fs.delete(new Path(f.toString()));
          

          In isJarOrZipFile, you could wrap your read of 5 bytes in a try/finally so the close always happens. In fact there are a bunch of places that could do with try/finally blocks. (It's not critical in the usual case; the job jar will just error out without leaving open files hanging.)
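
          For example, something along these lines (a sketch only; the variable names are illustrative):

            InputStream in = new FileInputStream(file);
            try {
              byte[] magic = new byte[5];
              in.read(magic);
              // ... compare against the jar/zip magic header ...
            } finally {
              in.close(); // runs even if the read throws
            }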

          Raghu Angadi added a comment -

          Which Hadoop release is this meant for? I would like to use it to create an archive file for HADOOP-1629.

          Dennis Kubes added a comment -

          This patch includes the multiple job resources patch and fixes some issues we were seeing in a production environment with JarUtils NOT correctly copying some jar resources.

          Dennis Kubes added a comment -

          Is there anything else I need to do with this? If this patch works for the community I would love to get it included, as it would help with some tutorials I am writing on doing custom Nutch development (plugins, etc.). If I need to make any changes, please let me know.

          Enis Soztutar added a comment -

          +1 for the patch. I have not tested it but the code seems good.

          Dennis Kubes added a comment -

          The multipleJobResources.patch has all of the jar code refactored and tested. It adds functionality to allow multiple types of job resources, including jar files, classpath jars and classes, files, and, most importantly, directories. All resources are merged into a single job.jar before the job file is submitted. Options were added to ToolBase to support multiple job resources. Unit tests for the jar utilities are included. The current patch passed all unit tests and worked successfully on a job run.

          Dennis Kubes added a comment -

          I got to thinking, always a dangerous thing, and I thought: if we are extending this for multiple jar files, why not other resources like jars on the classpath, jars that contain a given class, and directories? Let's say we could specify one or more directories as a resource to be included in the job jar; when we do the merge we would copy all resources from those directories into the job jar. This would allow us to do things like deploy executables, resource files, or multiple jar files across the cluster to be used in the jobs. So say you have a custom executable you need to call in your MR job: you just drop it in a directory, include the directory as a job resource, and the executable gets deployed onto the cluster and is available for that single job.

          I went back and refactored the code to allow job resources as opposed to just jar files. A resource would be either an absolute path to a jar file, a jar file on the classpath, a directory, or the name of a class that is contained in a jar on the classpath. As an added bonus, getJars and addJar now become getJobResources and addJobResource (we may need to come up with a different name, as this might be too easily confused with default and final resources in configuration), and we can keep getJar and setJar as they now apply only to the final job jar file.
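
          As a usage sketch with those names (hypothetical; the paths and class name are illustrative, and none of this is committed):

            JobConf job = new JobConf(conf, MyJob.class);
            job.addJobResource("myextras.jar");           // a jar file by path
            job.addJobResource("resources/");             // a directory merged into the job jar
            job.addJobResource("org.example.SomeMapper"); // a class resolved to its containing classpath jar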

          I am doing final testing of this code right now and will have a patch up in just a little while.

          Doug Cutting added a comment -

          > So if we are going to deprecate getJar and setJar, then we will need to change how we set and get the final job jar

          Good point. Maybe we can make these (or equivalent new methods) package-private instead of removing them, only used by the core?

          Dennis Kubes added a comment -

          Remember that multiple jars is really just a front end merge before a single merged job jar file is submitted to the MR system. We still have to set a single job jar (currently through setJar) and the various job runners and task trackers need to get that single jar (currently through getJar). So if we are going to deprecate getJar and setJar, then we will need to change how we set and get the final job jar throughout the system.

          The other changes and recommendations have been made and I am doing final testing. If we can get a consensus on the get and set jar methods I can have a new candidate patch uploaded today.

          Doug Cutting added a comment -

          > Could we just deprecate getJar(), and return an array of length one in getJars()?

          +1 We should deprecate both setJar() and getJar(), replacing them with addJar() and getJars().

          Enis Soztutar added a comment -

          For the patch, IMO having two functions JobConf#getJar() and JobConf#getJars() is a bit confusing. Could we just deprecate getJar() and return an array of length one in getJars()? And it would be good to make a new util class JarUtils, and refactor RunJar#jar(), RunJar#unjar() and RunJar#getToBeJared().
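
          A sketch of that backward-compatible shape (assuming the existing mapred.jar property; untested):

            /** @deprecated Use {@link #getJars()} instead. */
            @Deprecated
            public String getJar() { return get("mapred.jar"); }

            public String[] getJars() {
              String jar = getJar();
              return jar == null ? new String[0] : new String[] { jar };
            }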

          Enis Soztutar added a comment -

          > Just wondering where to put the command line parsing and how it would affect applications like Injector.
          At the risk of repeating my previous comment, I should say: adding such generic options can easily be achieved with the proposed implementation in HADOOP-1436. We would parse the argument(s) in HadoopCommandLineParser.

          Dennis, for this issue, I think you could just implement command line parsing in ToolBase#processGeneralOptions and ToolBase#buildGeneralOptions. You can add a method to get the additional jars.

          >then it affects any child of ToolBase
          Since it is generic functionality, that is perfectly fine.

          Hadoop QA added a comment -

          +0, new Findbugs warnings. http://issues.apache.org/jira/secure/attachment/12362196/multipleJobJars.patch applied and successfully tested against trunk revision r558150, but there appear to be new Findbugs warnings introduced by this patch.

          New Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/445/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/445/testReport/
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/445/console

          Dennis Kubes added a comment -

          Just confirm this for me. Each application (i.e. Injector, Generator, etc.) would be responsible for dealing with the command line options in their own way, even though they extend ToolBase. If I modify JobClient#run(String[]) then that only sets additional jars for the JobClient. I could add the command line parsing to ToolBase#processGeneralOptions, and then it affects any child of ToolBase. Just wondering where to put the command line parsing and how it would affect applications like Injector.

          Doug Cutting added a comment -

          My intuition is that addJar() should add things onto the front of the classpath, overriding what's there.

          Dennis Kubes added a comment -

          I hadn't thought about it that way, and yes, it does go against Java classloader semantics. I will make the changes to make it consistent with normal classpath behavior. I will also make the other changes mentioned above.

          My intention here was really to allow custom development without changing standard jars, but yes, that can be handled by semantics that we define. What would be a good way to do this? Something like an addPriorityJar() method in JobConf? Currently setJar takes priority over addJar and addJars.

          Doug Cutting added a comment -

          > it is more useful if classes in later jars overwrite classes in earlier jars

          That's inconsistent with normal CLASSPATH behavior, no? And, in any case, shouldn't an application (like Nutch) be able to easily order jars according to whatever convention we implement?

          > command line options

          Yes, it would be good to add command-line options in this patch. No changes should be required to the scripts, but rather just to JobClient#run(String[]) and/or RunJar#main(String[]).

          Dennis Kubes added a comment -

          Would it be helpful to make changes to JobClient to support command line options, such as -additionalJars, at the same time we introduce the new functionality? Also, what do we need to do to support these options within the command line scripts?

          Doug Cutting added a comment -

          This is looking good. A few comments:

          1. Just because most of the public methods in JobConf don't have javadoc is no excuse not to add javadoc to new methods there.

          2. The logic added to JobClient is big enough that it should go in a new private method.

          3. The temporary directory should probably be created under hadoop.tmp.dir and should have a unique name. It should be removed either in a 'finally' clause or in a shutdown hook, preferably a 'finally' clause.
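
          A minimal sketch of point 3, assuming a JobConf in scope as conf, the hadoop.tmp.dir property, and FileUtil.fullyDelete (the naming scheme is illustrative, not a uniqueness guarantee):

            File tmpDir = new File(conf.get("hadoop.tmp.dir"),
                "jobjar-" + Long.toHexString(System.currentTimeMillis()));
            tmpDir.mkdirs();
            try {
              // ... merge the job resources into a single jar under tmpDir ...
            } finally {
              FileUtil.fullyDelete(tmpDir); // clean up even if the merge fails
            }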

          Dennis Kubes added a comment -

          For Nutch development at least (I don't know about others), it is more useful if classes in later jars overwrite classes in earlier jars. This will enable someone to do Nutch development, overriding or reworking core classes, without touching the main Nutch source code base. For many of the Nutch programs, a NutchJob is created that automatically sets the job jar file, which would now be the first jar file. We wanted to be able to override that when necessary.

          Runping Qi added a comment -

          One enhancement would be to unjar the jars in the reverse order of the option spec.
          That way, if there is any class collision, the one at the front of the original list will win.
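
          A sketch of that ordering, assuming RunJar.unJar is reused (later extractions overwrite earlier ones, so the first jar listed ends up winning):

            for (int i = jars.length - 1; i >= 0; i--) {
              RunJar.unJar(new File(jars[i]), workDir);
            }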

          Dennis Kubes added a comment -

          patch available

          Dennis Kubes added a comment -

          The multipleJobJars.patch file is available for review. While this patch does add the ability to have multiple jars it does not change any script options.

          Dennis Kubes added a comment -

          Adds the ability to load multiple jar files for a single job. All jar files are merged into a single master job jar file that is then submitted as the job.jar to hadoop.

          Enis Soztutar added a comment -

          > I'd also suggest changing bin/hadoop or the underlying Java code to accept options like --additionalJars
          Adding such generic options can easily be achieved with the proposed implementation in HADOOP-1436. We would parse the argument(s) in HadoopCommandLineParser, set some parameter to be the name of the additional jar(s), then change JobConf(Configuration) to respect that.

          Runping Qi added a comment -

          I think your proposal of letting JobClient jar multiple jars into a single one is reasonable.
          I'd also suggest changing bin/hadoop or the underlying Java code to accept options like --additionalJars.

          Doug Cutting added a comment -

          I don't disagree with any of your statements in the previous message: currently we encourage the main to be in the top-level jar specified, which can be awkward; and, yes, it would be more convenient to let users list multiple jars when submitting jobs.

          I'm suggesting that JobClient should jar things together. This would change the way that the job jar is determined, and thus the relationship between the main() and user jar files can be altered at the same time.

          Users should be able to submit jobs specifying a set of jars. That's the crux of this issue, and I agree we ought to support it. But I suggest that the way we ought to implement this is to change JobClient to pack together the user's jars into a single jar, and submit this. Few if any changes should be required to the JobTracker or TaskTracker. Does that make sense? Do you have an alternate implementation proposal?

          Runping Qi added a comment -

          If the user's jar does not have the main function (when the user uses Aggregate, Streaming, DataJoin, etc.), his/her jar file(s) will not be picked up.

          Plus, in many cases, the user's job may also depend on third-party jars.
          It is much more convenient for the user to just list the jars at job submission time than to jar them together.

          Doug Cutting added a comment -

          This can easily be implemented by packing the user jars together into a single jar, then submitting the job with that jar, right? I don't think the mapred kernel needs to deal with more than a single jar per job, but making the user API accept multiple jars is fine.
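
          A minimal client-side sketch of that packing step (an assumed approach, not committed code; the first jar listed wins on duplicate entries):

            import java.io.*;
            import java.util.*;
            import java.util.jar.*;

            public static void mergeJars(String[] userJars, File jobJar) throws IOException {
              JarOutputStream out = new JarOutputStream(new FileOutputStream(jobJar));
              Set<String> seen = new HashSet<String>();
              try {
                for (int i = 0; i < userJars.length; i++) {
                  JarInputStream in = new JarInputStream(new FileInputStream(userJars[i]));
                  try {
                    JarEntry entry;
                    while ((entry = in.getNextJarEntry()) != null) {
                      if (!seen.add(entry.getName())) continue; // first jar wins on collisions
                      out.putNextEntry(new JarEntry(entry.getName()));
                      byte[] buf = new byte[4096];
                      int n;
                      while ((n = in.read(buf)) != -1) {
                        out.write(buf, 0, n);
                      }
                      out.closeEntry();
                    }
                  } finally {
                    in.close();
                  }
                }
              } finally {
                out.close();
              }
              // NOTE: JarInputStream consumes each jar's manifest, so a real
              // implementation would also write a merged META-INF/MANIFEST.MF.
            }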

            People

            • Assignee:
              Mahadev konar
            • Reporter:
              Runping Qi
            • Votes:
              0
            • Watchers:
              9
