Key: HADOOP-1622
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Mahadev konar
Reporter: Runping Qi
Votes: 0
Watchers: 9
Hadoop Common

Hadoop should provide a way to allow the user to specify jar file(s) the user job depends on

Created: 17/Jul/07 06:51 PM   Updated: 08/Jul/09 04:52 PM
Component/s: None
Affects Version/s: None
Fix Version/s: 0.17.0

Time Tracking:
Not Specified

File Attachments (all are text files licensed for inclusion in ASF works):
Name | Date | Author | Size
hadoop-1622-4-20071008.patch | 2007-10-09 07:11 PM | Dennis Kubes | 48 kB
HADOOP-1622-5.patch | 2007-10-18 08:40 PM | Doug Cutting | 46 kB
HADOOP-1622-6.patch | 2007-10-19 06:58 PM | Doug Cutting | 46 kB
HADOOP-1622-7.patch | 2007-10-25 04:31 PM | Doug Cutting | 44 kB
HADOOP-1622-8.patch | 2007-10-27 06:06 AM | Dennis Kubes | 45 kB
HADOOP-1622-9.patch | 2007-10-27 09:06 PM | Dennis Kubes | 46 kB
HADOOP-1622_1.patch | 2008-03-20 11:31 PM | Mahadev konar | 20 kB
HADOOP-1622_2.patch | 2008-03-22 02:21 AM | Mahadev konar | 28 kB
HADOOP-1622_3.patch | 2008-03-24 09:30 PM | Mahadev konar | 29 kB
HADOOP-1622_4.patch | 2008-03-25 12:00 AM | Mahadev konar | 28 kB
HADOOP-1622_5.patch | 2008-03-25 04:55 PM | Mahadev konar | 30 kB
HADOOP-1622_6.patch | 2008-03-26 05:53 PM | Mahadev konar | 30 kB
multipleJobJars.patch | 2007-07-20 04:13 AM | Dennis Kubes | 8 kB
multipleJobResources.patch | 2007-07-25 07:30 AM | Dennis Kubes | 43 kB
multipleJobResources2.patch | 2007-07-30 09:25 PM | Dennis Kubes | 44 kB

Release Note:
This patch adds new command-line options to

hadoop jar

namely:

hadoop jar -files <comma separated list of files> -libjars <comma separated list of jars> -archives <comma separated list of archives>

The -files option allows you to specify a comma separated list of paths that will be present in the current working directory of your task.
The -libjars option allows you to add jars to the classpaths of the maps and reduces.
The -archives option allows you to pass archives as arguments; they are unzipped/unjarred, and a link with the name of the jar/zip is created in the current working directory of the tasks.
Resolution Date: 26/Mar/08 09:08 PM


Description
More likely than not, a user's job may depend on multiple jars.
Right now, when submitting a job through bin/hadoop, there is no way for the user to specify that.
A workaround is to re-package all the dependent jars into a new jar, or to put the dependent jar files in the lib dir of the new jar.
This workaround causes unnecessary inconvenience to the user. Furthermore, if the user does not own the main function
(as is the case when the user uses Aggregate, datajoin, or streaming), the user has to re-package those system jar files too.
It is much desired that hadoop provide a clean and simple way for the user to specify a list of dependent jar files at the time
of job submission. Something like:

bin/hadoop .... --depending_jars j1.jar:j2.jar



Doug Cutting added a comment - 17/Jul/07 07:30 PM
This can easily be implemented by packing the user jars together into a single jar, then submitting the job with that jar, right? I don't think the mapred kernel needs to deal with more than a single jar per job, but making the user API accept multiple jars is fine.

Runping Qi added a comment - 18/Jul/07 03:35 PM

If the user's jar does not have the main function (when the user uses Aggregate, Streaming, DataJoin, etc.), his/her jar file(s) will not be picked up.

Plus, in many cases, the user's job may also depend on some third party jars.
It is much more convenient for the user to just list the jars at job submission time than to jar them together.


Doug Cutting added a comment - 18/Jul/07 04:17 PM
I don't disagree with any of your statements in the previous message: currently we encourage the main to be in the top-level jar specified, which can be awkward; and, yes, it would be more convenient to let users list multiple jars when submitting jobs.

I'm suggesting that JobClient should jar things together. This would change the way that the job jar is determined, and thus the relationship between the main() and user jar files can be altered at the same time.

Users should be able to submit jobs specifying a set of jars. That's the crux of this issue, and I agree we ought to support it. But I suggest that the way we ought to implement this is to change JobClient to pack together the user's jars into a single jar, and submit this. Few if any changes should be required to the JobTracker or TaskTracker. Does that make sense? Do you have an alternate implementation proposal?


Runping Qi added a comment - 18/Jul/07 05:26 PM

I think your proposal of letting JobClient jar multiple jars into a single one is reasonable.
I'd also suggest changing bin/hadoop or the underlying java code to accept options like --additionalJars


Enis Soztutar added a comment - 19/Jul/07 10:10 AM
> I'd also suggest changing bin/hadoop or the underlying java code to accept options like --additionalJars
Adding such generic options can easily be achieved with the proposed implementation in HADOOP-1436. We would parse the argument(s) in HadoopCommandLineParser, set some parameter to be the name of the additional jar(s), then change JobConf(Configuration) to respect that.

Dennis Kubes added a comment - 20/Jul/07 04:13 AM
Adds the ability to load multiple jar files for a single job. All jar files are merged into a single master job jar file that is then submitted as the job.jar to hadoop.

Dennis Kubes added a comment - 20/Jul/07 04:15 AM
The multipleJobJars.patch file is available for review. While this patch does add the ability to have multiple jars it does not change any script options.

Dennis Kubes added a comment - 20/Jul/07 04:18 AM
patch available

Runping Qi added a comment - 20/Jul/07 05:30 PM

One enhancement would be to unjar the jars in the reverse of the order in which they appear in the option spec.
That way, if there is any class collision, the one nearer the front of the original list will win.
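
For illustration, the reverse-order idea can be sketched as below. This is only a model, not code from any of the attached patches: each jar is represented as a map from entry name to contents, and the class and method names (ReverseOrderMerge, merge) are hypothetical. It shows that unpacking last-to-first with overwrite lets the first jar in the list win a collision.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: each "jar" is modeled as a map of
// entry name -> contents rather than as a real archive.
final class ReverseOrderMerge {

    // Unpack the jars last-to-first, overwriting on collision, so that
    // entries from jars nearer the front of the list survive.
    static Map<String, String> merge(List<Map<String, String>> jars) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (int i = jars.size() - 1; i >= 0; i--) {
            merged.putAll(jars.get(i)); // a later put overwrites an earlier one
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> j1 = Collections.singletonMap("a/Foo.class", "from j1");
        Map<String, String> j2 = new LinkedHashMap<>();
        j2.put("a/Foo.class", "from j2");
        j2.put("a/Bar.class", "from j2");
        // a/Foo.class is taken from j1, a/Bar.class from j2.
        System.out.println(merge(Arrays.asList(j1, j2)));
    }
}
```

The same first-wins result could be had by unpacking in list order and skipping entries that already exist; the reverse-order variant simply avoids the existence check.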


Dennis Kubes added a comment - 20/Jul/07 05:54 PM
For Nutch development at least (I don't know about others), it is more useful if classes in later jars overwrite classes in earlier jars. This will enable someone to do Nutch development, overriding or reworking core classes, without touching the main Nutch source code base. For many of the Nutch programs, a NutchJob is created that automatically sets the job jar file, which would now be the first jar file. We wanted to be able to override that when necessary.

Doug Cutting added a comment - 20/Jul/07 06:16 PM
This is looking good. A few comments:

1. Just because most of the public methods in JobConf don't have javadoc is no excuse not to add javadoc to new methods there.

2. The logic added to JobClient is big enough that it should go in a new private method.

3. The temporary directory should probably be created under hadoop.tmp.dir and should have a unique name. It should be removed either in a 'finally' clause or in a shutdown hook, preferably a 'finally' clause.
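
A minimal sketch of point 3, assuming plain java.nio rather than Hadoop's own file utilities: the directory gets a unique name under a base directory and is removed in a 'finally' clause. The class name TempDirSketch, the "jobjar-" prefix, and the use of java.io.tmpdir as a stand-in for hadoop.tmp.dir are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of the patch.
final class TempDirSketch {

    // Work to be done inside the temporary directory.
    interface Work {
        void run(Path dir) throws IOException;
    }

    // Create a uniquely named directory under 'base', run the work, and
    // always remove the directory afterwards, even if the work throws.
    static Path withTempDir(Path base, Work work) throws IOException {
        Files.createDirectories(base);
        Path dir = Files.createTempDirectory(base, "jobjar-");
        try {
            work.run(dir);
        } finally {
            deleteRecursively(dir);
        }
        return dir; // returned only so callers can confirm it is gone
    }

    static void deleteRecursively(Path root) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order so that children are deleted before parents.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        // java.io.tmpdir stands in for hadoop.tmp.dir in this sketch.
        Path base = Paths.get(System.getProperty("java.io.tmpdir"));
        Path gone = withTempDir(base, dir -> Files.write(dir.resolve("part.txt"), new byte[] {1}));
        System.out.println("still exists? " + Files.exists(gone));
    }
}
```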


Dennis Kubes added a comment - 20/Jul/07 06:17 PM
Would it be helpful to change JobClient to support command line options, such as -additionalJars, at the same time we introduce the new functionality? Also, what do we need to do to support these options within the command line scripts?

Doug Cutting added a comment - 20/Jul/07 06:51 PM
> it is more useful if classes in later jars overwrite classes in earlier jars

That's inconsistent with normal CLASSPATH behavior, no? And, in any case, shouldn't an application (like Nutch) be able to easily order jars according to whatever convention we implement?

> command line options

Yes, it would be good to add command-line options in this patch. No changes should be required to the scripts, but rather just to JobClient#run(String[]) and/or RunJar#main(String[]).


Dennis Kubes added a comment - 20/Jul/07 08:10 PM
I hadn't thought about it that way, and yes, it does go against java classloader semantics. I will make the changes to make it consistent with normal classpath behavior. I will also make the other changes mentioned above.

My intention here was really to allow custom development without changing standard jars, but yes, that can be handled by semantics that we define. What would be a good way to do this, something like an addPriorityJar() method in JobConf? Currently setJar will take priority over addJar and addJars.


Doug Cutting added a comment - 20/Jul/07 08:17 PM
My intuition is that addJar() should add things onto the front of the classpath, overriding what's there.

Dennis Kubes added a comment - 20/Jul/07 09:54 PM
Just confirm this for me. Each application (i.e. Injector, Generator, etc.) would be responsible for dealing with the command line options in their own way, even though they extend ToolBase. If I modify JobClient#run(String[]) then that only sets additional jars for the JobClient. I could add the command line parsing to ToolBase#processGeneralOptions, and then it affects any child of ToolBase. Just wondering where to put the command line parsing and how it would affect applications like Injector.

Hadoop QA added a comment - 21/Jul/07 04:36 AM

Enis Soztutar added a comment - 23/Jul/07 06:23 AM
>Just wondering where to put the command line parsing and how it would affect applications like Injector.
At the risk of repeating my previous comment, I should say: adding such generic options can easily be achieved with the proposed implementation in HADOOP-1436. We would parse the argument(s) in HadoopCommandLineParser.

Dennis, for this issue, I think you could just implement command line parsing in ToolBase#processGeneralOptions and ToolBase#buildGeneralOptions. You can add a method to get the additional jars.

>then it affects any child of ToolBase
Since it is a generic functionality it is perfectly fine.


Enis Soztutar added a comment - 23/Jul/07 06:47 AM
For the patch, IMO having two functions JobConf#getJar() and JobConf#getJars() is a bit confusing. Could we just deprecate getJar(), and return an array of length one in getJars()? It would also be good to make a new util class, JarUtils, and refactor RunJar#jar(), RunJar#unjar and RunJar#getToBeJared().

Doug Cutting added a comment - 23/Jul/07 05:37 PM
> Could we just deprecate getJar(), and return an array of length one in getJars()?

+1 We should deprecate both setJar() and getJar(), replacing them with addJar() and getJars().


Dennis Kubes added a comment - 24/Jul/07 03:12 PM
Remember that multiple jars is really just a front end merge before a single merged job jar file is submitted to the MR system. We still have to set a single job jar (currently through setJar) and the various job runners and task trackers need to get that single jar (currently through getJar). So if we are going to deprecate getJar and setJar, then we will need to change how we set and get the final job jar throughout the system.

The other changes and recommendations have been made and I am doing final testing. If we can get a consensus on the get and set jar methods I can have a new candidate patch uploaded today.


Doug Cutting added a comment - 24/Jul/07 07:14 PM
> So if we are going to deprecate getJar and setJar, then we will need to change how we set and get the final job jar

Good point. Maybe we can make these (or equivalent new methods) package-private instead of removing them, only used by the core?


Dennis Kubes added a comment - 24/Jul/07 09:24 PM
I got to thinking, always a dangerous thing, and I thought: if we are extending this for multiple jar files, why not other resources like jars on the classpath, jars that contain a given class, and directories? Let's say that we could specify one or more directories as a resource to be included in the job jar; then when we do the merge we would copy all resources from that directory into the job jar. This would allow us to do things like deploy executables, resource files, or multiple jar files across the cluster to be used in the jobs. So say you have a custom executable you need to call in your MR job: you just drop it in a directory, include the directory as a job resource, and that executable would get deployed out onto the cluster and would be available for that single job.

I went back and refactored the code to allow job resources as opposed to just jar files. A resource would be either an absolute path to a jar file, a jar file on the classpath, a directory, or the name of a class that is contained in a jar on the classpath. As an added bonus, getJars and addJar now become getJobResources and addJobResource (we may need to come up with a different name, as this might be too easily confused with default and final resources in configuration), and we can keep getJar and setJar as they now apply only to the final job jar file.

I am doing final testing of this code right now and will have a patch up in just a little while.


Dennis Kubes added a comment - 25/Jul/07 07:30 AM
The multipleJobResources.patch has all of the jar code refactored and tested. It has functionality to allow multiple types of job resources, including jar files, classpath jars and classes, files, and most importantly directories. All resources are merged into a single job.jar before the job file is submitted. Options were added to ToolBase to support multiple job resources. Unit tests for the jar utilities are included. The current patch passed all unit tests and worked successfully with a job run.

Enis Soztutar added a comment - 25/Jul/07 01:11 PM
+1 for the patch. I have not tested it but the code seems good.

Dennis Kubes added a comment - 28/Jul/07 01:45 PM
Is there anything else I need to do with this? If this patch works for the community I would love to get it included as it would help with some tutorials I am writing on doing custom nutch development (plugins, etc.). If I need to make any changes please suggest.

Dennis Kubes added a comment - 30/Jul/07 09:25 PM
This patch includes the multiple job resources patch and fixes some issues we were seeing in a production environment with JarUtils NOT correctly copying some jar resources.

Raghu Angadi added a comment - 01/Aug/07 06:26 PM
Which Hadoop release is this meant for? I would like to use it to create an archive file for HADOOP-1629.

stack added a comment - 08/Aug/07 03:28 PM
I tried running latest patch and failed with below:
durruti:~/Documents/checkouts/hadoop-commit stack$ more build/test/TEST-org.apache.hadoop.util.TestJarUtils.txt 
Testsuite: org.apache.hadoop.util.TestJarUtils
Tests run: 5, Failures: 4, Errors: 0, Time elapsed: 0.258 sec
------------- Standard Output ---------------
2007-08-08 08:00:01,272 INFO  util.JarUtils (JarUtils.java:jarAll(311)) - Adding file2:102 to test-out2.jar.
2007-08-08 08:00:01,276 INFO  util.JarUtils (JarUtils.java:jarAll(229)) - Adding dir1/dir1-1/dir1-2/file1:0 to test-out2.jar.
2007-08-08 08:00:01,279 INFO  util.JarUtils (JarUtils.java:jarAll(229)) - Adding dir2/dir2-1/file2:0 to test-out2.jar.
2007-08-08 08:00:01,280 INFO  util.JarUtils (JarUtils.java:jarAll(229)) - Adding dir3/file3:0 to test-out2.jar.
2007-08-08 08:00:01,281 INFO  util.JarUtils (JarUtils.java:jarAll(257)) - Adding file4.txt:0 to test-out2.jar.
------------- ---------------- ---------------

Testcase: testGetJarPath took 0.175 sec
        FAILED
expected:<.../dir1-2/...> but was:<...\dir1-2\...>
junit.framework.ComparisonFailure: expected:<.../dir1-2/...> but was:<...\dir1-2\...>
        at org.apache.hadoop.util.TestJarUtils.testGetJarPath(TestJarUtils.java:88)

Testcase: testJar took 0.022 sec
        FAILED
null
junit.framework.AssertionFailedError
        at org.apache.hadoop.util.TestJarUtils.testJar(TestJarUtils.java:111)

Testcase: testIsJarOrZip took 0.015 sec
Testcase: testJarAll took 0.024 sec
        FAILED
null
junit.framework.AssertionFailedError
        at org.apache.hadoop.util.TestJarUtils.testJarAll(TestJarUtils.java:164)

Testcase: testCopyJarContents took 0.018 sec
        FAILED
null
junit.framework.AssertionFailedError
        at org.apache.hadoop.util.TestJarUtils.testCopyJarContents(TestJarUtils.java:218)

If you are going to make a new patch, here's a couple of other things you could fix:

License missing on unit test.

Creating files in tests, other folks seem to do something like following to keep them under configured test directory: System.getProperty("test.build.data","."). You might want to do the same.

Next time, to save yourself a bit of typing, you could use the local file system in hdfs to do the recursive delete of a directory (I'm guessing that's why you individually remove each of the items in the teardown):

FileSystem fs = FileSystem.getLocal(new Configuration());
  fs.delete(new Path(f.toString()));

In isJarOrZipFile, you could wrap your read of 5 bytes in a try/finally so the close always happens. In fact there are a bunch of places that could do with try/finally blocks (it's not critical in the usual case; the job jar will just error out without leaving hanging open files).
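
As a rough sketch of the always-close pattern (not the patch's actual isJarOrZipFile, whose exact check is not shown in this thread), the following reads a file header with try-with-resources, the later-Java equivalent of try/finally. It assumes the check is against the 4-byte ZIP local file header signature; the class and method names are hypothetical.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical stand-in for the patch's jar/zip detection.
final class ZipMagicSketch {

    // ZIP local file header signature: 'P' 'K' 0x03 0x04.
    // Assumption: an empty archive (which starts with a different
    // signature) is not treated as a jar/zip by this sketch.
    private static final byte[] MAGIC = {0x50, 0x4B, 0x03, 0x04};

    static boolean looksLikeJarOrZip(File f) throws IOException {
        byte[] head = new byte[MAGIC.length];
        // try-with-resources closes the stream even if the read throws.
        try (DataInputStream in = new DataInputStream(new FileInputStream(f))) {
            in.readFully(head);
        } catch (EOFException e) {
            return false; // file is shorter than the signature
        }
        return Arrays.equals(head, MAGIC);
    }
}
```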


Doug Cutting added a comment - 16/Aug/07 05:01 PM
This does not currently pass its own unit tests.

Dennis Kubes added a comment - 08/Oct/07 03:35 PM
Sorry, I have been caught up in other development for the past couple of months. I will fix any problem with the current code, make sure it passes current unit tests, and upload a revised patch in the next couple of days.

Dennis Kubes added a comment - 09/Oct/07 07:11 PM
Updated patch now passes all unit tests and is updated for the current 0.15 source. Build.xml has also been changed to allow tests for jarutils. Licenses have been added where applicable.

Doug Cutting added a comment - 10/Oct/07 09:18 PM
This patch makes it so there are two ways to specify resources, either with job.setJar(), or job.addResource(). This seems overly complicated. I wonder if we shouldn't rather deprecate getJar/setJar. Instead of removing them later, we might simply make these package-private, since they'll still be the means of passing the compound jar name to tasktrackers, but users should only have to know about addResource(), no?

Dennis Kubes added a comment - 10/Oct/07 09:52 PM
I am good with deprecating them, just didn't want to break anything with the current patch. Let me know if you want me to build it into the patch, small change.

Dennis Kubes added a comment - 18/Oct/07 05:02 PM
Any updates on this? Do we need to do anything to get this into trunk? It has been running successfully for us in production (35 node hadoop cluster) for over a week now.

Doug Cutting added a comment - 18/Oct/07 08:40 PM
Here's a new version with setJar/getJar deprecated.

Hadoop QA added a comment - 18/Oct/07 11:22 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12367974/HADOOP-1622-5.patch
against trunk revision r586003.

@author +1. The patch does not contain any @author tags.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new compiler warnings.

findbugs -1. The patch appears to introduce 3 new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/969/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/969/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/969/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/969/console

This message is automatically generated.


Doug Cutting added a comment - 19/Oct/07 06:20 PM
Findbugs issues need to be addressed.

Doug Cutting added a comment - 19/Oct/07 06:58 PM
Fixes for findbugs problems.

Hadoop QA added a comment - 20/Oct/07 03:53 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12368037/HADOOP-1622-6.patch
against trunk revision r586264.

@author +1. The patch does not contain any @author tags.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new compiler warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/972/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/972/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/972/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/972/console

This message is automatically generated.


Devaraj Das added a comment - 25/Oct/07 08:33 AM
Sorry, this patch doesn't apply anymore. Dennis, could you please regenerate the patch with the current trunk.

Doug Cutting added a comment - 25/Oct/07 04:31 PM
I've updated this to trunk and will commit it later today barring objections.

Hadoop QA added a comment - 25/Oct/07 07:34 PM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12368389/HADOOP-1622-7.patch
against trunk revision r588300.

@author +1. The patch does not contain any @author tags.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new compiler warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/982/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/982/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/982/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/982/console

This message is automatically generated.


Doug Cutting added a comment - 25/Oct/07 08:31 PM
I just committed this. Thanks, Dennis!

Hudson added a comment - 26/Oct/07 07:43 PM

Owen O'Malley added a comment - 26/Oct/07 08:41 PM
I had to revert the patch, because it broke HADOOP-2107

Dennis Kubes added a comment - 27/Oct/07 06:06 AM
This patch brings the code up to the current trunk. It also fixes a bug in createJobJar in which we needed to use the context classloader to search for classes. This patch passes all unit tests and successfully runs the RandomWriter example.

Arun C Murthy added a comment - 27/Oct/07 12:22 PM
Dennis, I'm sorry to come in late on this... a couple of comments:

1. Could you please remove the mention of 'final' and 'default' config resources from the javadoc for JobConf.{get|set}JobResources? They are no longer relevant vis-a-vis hadoop Configuration.
2. Should we also have a JobConf.setJobResource along with JobConf.addJobResource, à la the DistributedCache APIs?
3. Should we move the private JobClient.createJobJar method to JarUtils to make it available as a useful utility?

Unrelated: Does it make sense to rename Configuration.addResource to Configuration.addConfigResource? I wonder how confusing these unrelated api names are, given JobConf is a Configuration too ...


Hudson added a comment - 27/Oct/07 04:03 PM

Dennis Kubes added a comment - 27/Oct/07 07:21 PM
> 1. Could you please remove the mention of 'final' and 'default' config resources from the javadoc for JobConf.{get|set}JobResources? They are no longer relevant vis-a-vis hadoop Configuration.

I have removed the mention of final and default resources.

> 2. Should we also have a JobConf.setJobResource along with JobConf.addJobResource, à la the DistributedCache APIs?

I had debated about set vs add resources. The current behavior is that when you add a resource you are appending it to a list of resources, as opposed to setting a resource, which would clear anything previously added and add only that resource. Since jar resources are often added by including the jar file which contains a given class, I thought it better to NOT allow clearing and resetting of job resources.

> 3. Should we move the private JobClient.createJobJar method to JarUtils to make it available as a useful utility?

I debated about this too. JarUtils was meant for generic jarring and unjarring utilities. But I don't see harm in putting createJobJar in, and I think you are right that we may need it somewhere else in the future. I have removed it from JobClient and added it to JarUtils.

> Unrelated: Does it make sense to rename Configuration.addResource to Configuration.addConfigResource? I wonder how confusing these unrelated api names are, given JobConf is a Configuration too ...

Yeah, debated about this one too. In the end we weren't just adding jars but multiple things such as classes, exes, and files. I couldn't find a better name for that than resource. I put it as jobResource to be a little less confusing. Changing Configuration over to configResource would be good I think, although we should probably deprecate, because a lot of things rely on that method.

I am currently testing patch 9, will have it posted shortly.


Dennis Kubes added a comment - 27/Oct/07 09:06 PM
Updated to most recent trunk and added requested changes.

Hudson added a comment - 29/Oct/07 12:37 PM

Doug Cutting added a comment - 29/Oct/07 10:25 PM
Owen & I talked a bit about this last week. We determined three commonly useful types of job resources:
  • archives already present in the cluster that will be unpacked in the task dir
  • archives already present in the cluster that will be intact in the task dir
  • resources in the local filesystem that will be added to the task's classpath

This issue primarily concerns the last, but we should attempt to have a somewhat uniform mechanism. The primary differences between the first and the third are (a) that unqualified paths are resolved relative to different filesystems; and (b) resources may or may not be visible on the classpath.

All of these should be available from the command line, with -archive, -file and -jar respectively.

Owen, does that capture our discussion? What would need to change in the current patch to be consistent with that proposal? Should we file another issue to improve command-line support for these, or should this be done as a part of this issue?


Dennis Kubes added a comment - 01/Nov/07 07:03 PM
I was thinking about this last night. Right now the command line only takes a single jar even though multiple resources are supported through the patch. We could add a check of a system property for multiple comma separated resources when the job is submitted. The command line could set this variable behind the scenes. We could have a -resources or a -jobResources a,b,c switch. The current patch handles finding resources from multiple locations, including full paths and jars/classes in the classpath. We could add in a relative path structure. If this seems reasonable I will work up a patch for it ASAP.

Milind Bhandarkar added a comment - 27/Dec/07 12:19 AM
Dennis, did you get a chance to work on this after your last comment? We would love to have this available to our users in 0.16.

Dennis Kubes added a comment - 27/Dec/07 04:10 PM - edited
I have only had a chance to design this, not to develop it, as I have been launching the Search Wikia site. Here is what I have come up with in terms of a more generalized design after talking with both Doug and Owen about this enhancement:

1. A runjob utility. runjar is not affected as it is made to only run a single jar.

2. The options parser will be extended to support resources, upload, classpath, noclasspath, compress, decompress, and cache.

  • Items that are cached are added to the distributed cache.
  • Items uploaded are by default not added to the classpath
  • Items cached are by default added to the classpath
  • Resources are by default added to the classpath
  • Compress will choose resources to compress before adding to job.jar file
  • Decompress will choose resources to be decompressed before adding to job.jar file.
  • Compress and decompress will only take action on resources being added to job. This will include non-local resources and will need to be handled in slave local job resources.
  • Classpath is ignored for any resource that is being uploaded as it will already be added to the classpath due to it being in resources.
  • All options support multiple elements in comma separated format.
  • Noclasspath will remove cached and non-cached resources from the classpath. For example, a jar can be added to resources and included in the local job.jar resources but not included in its local classpath. (I don't know if this functionality is useful?)

3. Resources

  • Resources are one or more items that are jarred up into the single job.jar file
  • Resources can be files (compressed or uncompressed) or directories
  • Resources can be from any file system.
  • Resources paths support relative and absolute paths
  • Resources support URL type paths to support multiple file systems
  • If the path is not in a URL format then it is assumed to be on the local file system as either an absolute or relative path.
  • Only resources that exist will be included. This is true for any file system. The resource must exist at the beginning of the job to be uploaded. If the resource exists at the beginning of the job but not when the local job starts its processing, an error will be thrown and that task will cease operation.
  • A global configuration variable exists to choose to decompress any compressed file that is added as a resource.
  • Non-local resources will be pulled down into the local job resources from the resource's given file system. This can include DFS and S3 resources added dynamically.
  • Local resources that are added to the job.jar will be removed from the resources configuration variable passed to the local jobs. Remaining resources will be the non-local resources that need to be added to local job resources.

4. Uploads

  • Uploads by default are put into the user's home directory on the jobtracker file system.
  • Upload directories can be set either through a configuration variable for a global default upload folder or through a colon path structure in the upload. Something like path:uploadto.
  • Upload resources can be added to the classpath by the classpath option
  • If upload resources are added to the classpath, they will be pulled down into the resources for each job and added to the local job classpath.
  • Uploads are independent of resources. An upload doesn't have to be a resource. A resource can be an uploaded element. In this case it would be uploaded (not included in local job.jar) and then pulled down from the job tracker file system as a resource.
  • Uploads will check modified date/time and size before uploading elements. If the upload is a directory, the upload will recursively check all files in that directory before upload and only upload modified files. This should give an rsync type functionality to uploading resources and should decrease bandwidth consumption.
  • Uploads will support URL type paths as well. This will allow transferring resources from one type of file system (e.g. S3) to the job tracker's file system. Again, resources without a URL type structure will be considered local file system and will support relative and absolute paths. Only absolute paths will be supported on non-local file systems.
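
A minimal sketch of the modified-time-and-size check described in the upload list above. The class name, method name, and the exact policy (re-upload when the remote copy is missing, the sizes differ, or the local copy is newer) are assumptions made for illustration; they are not taken from any of the attached patches.

```java
// Hypothetical helper: decides whether a local file should be re-uploaded.
final class UploadCheck {

    /**
     * Returns true when the remote copy is missing, has a different size,
     * or is older than the local copy. Times are epoch milliseconds.
     */
    static boolean needsUpload(long localSize, long localModified,
                               boolean remoteExists, long remoteSize, long remoteModified) {
        if (!remoteExists) {
            return true;                       // nothing uploaded yet
        }
        if (localSize != remoteSize) {
            return true;                       // content length changed
        }
        return localModified > remoteModified; // local copy is newer
    }

    public static void main(String[] args) {
        // Same size and timestamp: the upload can be skipped.
        System.out.println(needsUpload(10, 100, true, 10, 100));
        // Local file modified after the last upload: upload again.
        System.out.println(needsUpload(10, 200, true, 10, 100));
    }
}
```

For a directory, the same check would be applied to each file found while walking the tree, so that only changed files are sent.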

Mahadev konar added a comment - 07/Mar/08 09:51 PM
dennis, any updates on this bug?

Dennis Kubes added a comment - 09/Mar/08 01:06 AM
No updates yet, but I should have time to start working on this again in the next couple of days, right after I finish some work on converting hadoop RPC to NIO.

Mahadev konar added a comment - 09/Mar/08 01:14 AM
great.... we also need this feature to get into 0.17. let me know if you need any help getting this into 0.17...

Owen O'Malley added a comment - 12/Mar/08 05:00 AM
Dennis,
Upon looking at this, I'm getting worried. This looks like a lot of special cases. What we really need is to support 3 kinds of files:
  • simple files
  • archives
  • jar files

for each of these things, we would like them to be able to come from a URI and most convenient would be a default of a local file. So, I propose something like:

-file foo,bar,hdfs:baz

will upload foo and bar to an upload area and download foo, bar, and baz to the slave nodes as the tasks are run on them.

-archive foo.zip,hdfs:baz.zip

will download foo.zip and baz.zip and expand them.

Finally, the -jar option would download them and put them on the class path. So,

-jar myjar.jar,hadoop-0.16.1-streaming.jar

would upload the files in the job client, download them to the slaves, and add them to the class path in the given order.

I think I'd leave the rsync functionality out and just use hdfs:_upload/$jobid/... as transient storage and delete it when the job is done. If the user wants to save the bandwidth they can upload the files to hdfs themselves, in which case they don't need to be uploaded.
Thoughts?
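
A small sketch of how a comma-separated option value such as foo,bar,hdfs:baz could be split into entries that need uploading and entries that are already remote. The class and method names are hypothetical, and treating "no scheme" and "file" as local is an assumption based on the "default of a local file" wording above.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for the proposed -file / -archive / -jar values.
final class ResourceArgs {

    /** Returns the entries that have no URI scheme or use the "file" scheme. */
    static List<String> localEntries(String commaSeparated) {
        List<String> local = new ArrayList<>();
        for (String entry : commaSeparated.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;                          // ignore empty items such as "a,,b"
            }
            String scheme = URI.create(trimmed).getScheme();
            if (scheme == null || scheme.equals("file")) {
                local.add(trimmed);                // must be uploaded by the job client
            }
        }
        return local;
    }

    public static void main(String[] args) {
        // hdfs:baz already lives on a remote file system, so only foo and bar are listed.
        System.out.println(localEntries("foo,bar,hdfs:baz"));
    }
}
```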


Mahadev konar added a comment - 13/Mar/08 06:20 PM
marking this for 0.17 release.

Mahadev konar added a comment - 13/Mar/08 06:23 PM
am starting working on this ... dennis if you are already working on this please let me know..

Dennis Kubes added a comment - 13/Mar/08 07:15 PM
I have not resumed working on this as of yet. Am currently neck deep in reworking NIO for hadoop RPC. I was planning on finishing on this as soon as I had completed the NIO code in the next 2-3 days. I would like to continue working on this if possible. When is 0.17 scheduled for release?

Owen, the first pass at this didn't distinguish between jar or regular files on the command line. Instead there was detection code that identified files as such. Also the first pass supported directories as well as files (don't know if you are including that in file). I think the ability to include directories for job input is extremely important. What were the special cases that you were seeing?

The idea behind this code is that, much like streaming, you could upload and cache files from any type of resource (file, directory, jar, etc.) from any file system. So, for instance, people could store common jars or file resources on S3 and pull them down into a job.


Mahadev konar added a comment - 13/Mar/08 08:51 PM
also owen, what would the command line look like with your suggestions?

hadoop jar -file <files> -jar <jars> -archive <archives> ?

Also, if that is the case then we could make it generic for streaming which uses its own options for -file , -archives and others .... though we do not need to do that in this patch...


Mahadev konar added a comment - 19/Mar/08 04:45 AM
i like owen's idea. it's simple and gives the users the flexibility they need.

here is how I am implementing this –

the hadoop command line will have the following options

hadoop jar -file <comma separated files> -jar <comma separated jars> -archive <comma separated archives>

all of these can be comma separated uri's – defaulting to local file system if not specified.

jobclient uploads the files / jars / archives onto HDFS ..... or the filesystem used by mapreduce. ... under the job directory

given that these files/jars/archives might have the same name and different uris....
example : hadoop jar -file file:///file1,hdfs://somehost:port/file1
we would store these files as
jobdir/file/file/file1
jobdir/hdfs_somehost_port/file1

To keep these files in different directories with the directory name as the uri would give us the ability to just use DistributedCache as it is.

so we could say DistributedCache.addFiles(jobdir/file/file/file1, jobdir/hdfs_somehost_port/file1);
something like this ...

so the job directory would look like

jobdir/jars/urischeme/<jarfiles>
jobdir/archives/urischeme/<archivefiles>
jobdir/file/urischeme/<files>

the ones in jars will be added to the classpath of all the tasks in the order they were mentioned.
the others will be copied once per job and symlinked from the current working directory of the task..

comments?
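
A rough sketch of the staging layout proposed above, mapping a command line URI to jobdir/<kind>/<urischeme>/<name>. The class and method names are hypothetical. Joining scheme, host and port with underscores follows the hdfs_somehost_port example, and placing that directory under the kind directory follows the three-line layout listing; the actual patch may differ.

```java
import java.net.URI;

// Hypothetical helper that computes where a resource would be staged.
final class StagingLayout {

    /** kind is "file", "jars" or "archives"; resource is a path or URI. */
    static String stagingPath(String jobDir, String kind, String resource) {
        URI uri = URI.create(resource);
        // No scheme means the local file system, as described in the comment above.
        String scheme = uri.getScheme() == null ? "file" : uri.getScheme();

        StringBuilder fsDir = new StringBuilder(scheme);
        if (uri.getHost() != null) {
            fsDir.append('_').append(uri.getHost());
            if (uri.getPort() != -1) {
                fsDir.append('_').append(uri.getPort());
            }
        }

        // Opaque URIs such as hdfs:baz have no path component.
        String path = uri.getPath() != null ? uri.getPath() : uri.getSchemeSpecificPart();
        String name = path.substring(path.lastIndexOf('/') + 1);

        return jobDir + "/" + kind + "/" + fsDir + "/" + name;
    }

    public static void main(String[] args) {
        // Two files with the same name but different source file systems
        // end up in different staging directories.
        System.out.println(stagingPath("jobdir", "file", "file:///file1"));
        System.out.println(stagingPath("jobdir", "file", "hdfs://somehost:9000/file1"));
    }
}
```

Keeping the source file system in the directory name is what lets two files called file1 coexist without renaming either of them.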


Runping Qi added a comment - 19/Mar/08 01:33 PM
Sounds good.

A couple comments:

It seems weird to have jar and -jar as arguments/option
in the command line "hadoop jar -file <comma separated files> -jar <comma separated jars>"
Will it be better to use "-classpath" instead?

When the job dir changes to

jobdir/jars/urischeme/<jarfiles>
jobdir/archives/urischeme/<archivefiles>
jobdir/file/urischeme/<files>

will that break the current applications that assume their files loaded using -file and -archive options in the jobdir?


Mahadev konar added a comment - 19/Mar/08 06:10 PM
I'll leave the jar option to keep it backwards compatible. I don't want to break backwards compatibility for users.
  • as for the job directory changes this is the directory structure in HDFS... the local job directory structure would not change.

Mahadev konar added a comment - 19/Mar/08 06:11 PM
how about

hadoop jar -file <..> -libjar <comma separated jars> -archives <comma separated archives>


Mahadev konar added a comment - 20/Mar/08 11:31 PM
attaching a patch for this feature. It does not have unit tests included. I am still writing unit tests and will upload a patch by the end of the day.

this patch enhances the hadoop command line for job submission:

so you can say:

  • bin/hadoop jar -files <comma separated files> -libjars <comma separated libs> -archives <comma separated archives>
  • these options are all optional and the command line is backwards compatible
  • the patch uses cli for command line parsing
  • it uses DistributedCache for copying files locally to the tasks
  • it supports uri's in the command line arguments
  • if the files are already uploaded to the hdfs used by jobtracker then it does not recopy the files – there is a tiny catch here ... since the uri's are matched as strings for the remote file system and the one jt uses, it might be possible that the files are copied even though it's the same dfs (ex: hdfs://hostname1:port != hdfs://hostname1.fullyqualifiedname:port)
  • the command line files, archives, libjars are stored temporarily in the hdfs job directory from where they are copied locally.
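
To illustrate the string-matching catch in the list above, here is a small sketch. Both methods and the class name are hypothetical and not taken from the patch. The second method shows one possible way around the catch, by comparing canonicalized host names through a caller-supplied function; a real client could back that function with InetAddress.getByName(host).getCanonicalHostName(), at the cost of a DNS lookup.

```java
import java.net.URI;
import java.util.function.UnaryOperator;

// Hypothetical comparison helpers for "is this the jobtracker's file system?".
final class FsCompare {

    /** Plain string comparison of scheme and authority (host:port). */
    static boolean sameByString(URI a, URI b) {
        return a.getScheme().equalsIgnoreCase(b.getScheme())
            && String.valueOf(a.getAuthority()).equals(String.valueOf(b.getAuthority()));
    }

    /** Compares scheme and port, then host names after canonicalization. */
    static boolean sameByHost(URI a, URI b, UnaryOperator<String> canonicalHost) {
        if (!a.getScheme().equalsIgnoreCase(b.getScheme()) || a.getPort() != b.getPort()) {
            return false;
        }
        String hostA = a.getHost();
        String hostB = b.getHost();
        if (hostA == null || hostB == null) {
            return hostA == null && hostB == null;
        }
        return canonicalHost.apply(hostA).equalsIgnoreCase(canonicalHost.apply(hostB));
    }

    public static void main(String[] args) {
        URI shortName = URI.create("hdfs://hostname1:9000");
        URI fullName = URI.create("hdfs://hostname1.example.com:9000");
        // Stand-in for DNS: append an assumed domain to unqualified names.
        UnaryOperator<String> fakeDns = h -> h.contains(".") ? h : h + ".example.com";

        System.out.println(sameByString(shortName, fullName));        // strings differ
        System.out.println(sameByHost(shortName, fullName, fakeDns)); // same host once resolved
    }
}
```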

Mahadev konar added a comment - 22/Mar/08 02:21 AM
attaching a patch with the unit test. passes the tests on my machine.

Devaraj Das added a comment - 23/Mar/08 12:50 PM
Do you think the dns resolution is going to be a big hit? I don't think so, with dns caching in place, etc.
Can Pipes make use of this feature (this patch doesn't support pipes)? I am ok with having a separate issue to address pipes if required.
Otherwise, the patch looks fine.

Hadoop QA added a comment - 23/Mar/08 01:52 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12378428/HADOOP-1622_2.patch
against trunk revision 619744.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 16 new or modified tests.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs -1. The patch appears to introduce 3 new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2030/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2030/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2030/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2030/console

This message is automatically generated.


Mahadev konar added a comment - 24/Mar/08 09:30 PM
  • I think it might not be a big overhead... I just wanted to avoid it since it would be a common utility and should be filed as a separate jira ... (since finding out if two filesystems are the same seems like a nice thing to have). I wanted to keep this patch simple ..
  • I don't think pipes can make use of it ... I'll create another jira for that as well.

Mahadev konar added a comment - 24/Mar/08 09:30 PM
fixed findbugs warnings.

Hadoop QA added a comment - 24/Mar/08 11:00 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12378510/HADOOP-1622_3.patch
against trunk revision 619744.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 16 new or modified tests.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs -1. The patch appears to introduce 1 new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2040/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2040/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2040/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2040/console

This message is automatically generated.


Mahadev konar added a comment - 25/Mar/08 12:00 AM
got rid of the findbugs warning.

Hadoop QA added a comment - 25/Mar/08 02:34 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12378527/HADOOP-1622_4.patch
against trunk revision 619744.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 16 new or modified tests.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2042/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2042/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2042/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2042/console

This message is automatically generated.


Mahadev konar added a comment - 25/Mar/08 04:55 PM
this is the patch implementing devaraj's comment about host resolution. I will add another jira for this feature to be used by pipes.

Hadoop QA added a comment - 25/Mar/08 10:46 PM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12378580/HADOOP-1622_5.patch
against trunk revision 619744.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 16 new or modified tests.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2053/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2053/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2053/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2053/console

This message is automatically generated.


Mahadev konar added a comment - 26/Mar/08 05:53 PM
looks like the previous patch got stale with some commits yesterday. attaching a new patch.

dhruba borthakur added a comment - 26/Mar/08 09:08 PM
I just committed this. Thanks Mahadev!

Hadoop QA added a comment - 26/Mar/08 09:49 PM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12378649/HADOOP-1622_6.patch
against trunk revision 619744.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 16 new or modified tests.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2067/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2067/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2067/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2067/console

This message is automatically generated.


Hudson added a comment - 27/Mar/08 12:18 PM