Issue Details

Key: HADOOP-3722
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Minor
Assignee: Enis Soztutar
Reporter: Matei Zaharia
Votes: 0
Watchers: 11

Hadoop Common

Provide a unified way to pass jobconf options from bin/hadoop

Created: 09/Jul/08 01:03 AM   Updated: 20/Nov/08 11:38 PM
Component/s: conf
Affects Version/s: 0.19.0
Fix Version/s: 0.19.0

Time Tracking:
Not Specified

File Attachments (all licensed for inclusion in ASF works):
  HADOOP-3722.patch (2 kB), 2008-07-10 12:01 AM, Matei Zaharia
  jobconfoptions_v1.patch (47 kB), 2008-08-29 03:20 PM, Enis Soztutar
  jobconfoptions_v2.patch (48 kB), 2008-09-03 03:39 PM, Enis Soztutar

Hadoop Flags: Reviewed, Incompatible change
Release Note:
Changed StreamJob (streaming) and Submitter (pipes) to implement Tool and Configurable, and to use the GenericOptionsParser arguments -fs, -jt, -conf, -D, -libjars, -files, and -archives. Deprecated -jobconf, -cacheArchive, -dfs, and -additionalconfspec from streaming and pipes in favor of the generic options. Removed -config, -mapred.job.tracker, and -cluster from streaming.
Resolution Date: 18/Sep/08 03:04 AM


Description
Often when running a job it is useful to override some jobconf parameters from jobconf.xml for that particular job, for example setting the job priority, the number of reduce tasks, or the HDFS replication level. Currently the Hadoop examples, streaming, pipes, etc. take these extra jobconf parameters in different ways: the examples in hadoop-examples.jar use -Dkey=value, streaming uses -jobconf key=value, and pipes uses -jobconf key1=value1,key2=value2,etc. Things would be simpler if bin/hadoop could take the jobconf parameters itself, so that you could run, for example, bin/hadoop -Dkey=value jar [whatever] as well as bin/hadoop -Dkey=value pipes [whatever]. This is especially useful when an organization needs to require users to set a particular property, e.g. the name of a queue to use for scheduling in HADOOP-3445. Otherwise, users may confuse one way of passing parameters with another and may not notice that they forgot to include certain properties.

I propose adding support in bin/hadoop for jobconf options to be specified with -C key=value. This would have the effect of setting hadoop.jobconf.key=value in Java's system properties. The Configuration class would then be modified to read any system properties that begin with hadoop.jobconf and override the values in hadoop-site.xml.
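
The override step in this proposal might look roughly like the sketch below. It is a stand-alone illustration using only the JDK, not the attached patch: the class JobConfOverrides, its apply method, the trailing dot in the prefix, and the use of a plain Map in place of Hadoop's Configuration are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical helper for illustration only; not part of Hadoop. */
class JobConfOverrides {
    // Assumed prefix: the proposal says "hadoop.jobconf.key=value".
    static final String PREFIX = "hadoop.jobconf.";

    /**
     * Returns a copy of siteConf in which every property from props whose
     * name starts with PREFIX has been applied with the prefix stripped.
     */
    static Map<String, String> apply(Map<String, String> siteConf, Properties props) {
        Map<String, String> merged = new HashMap<>(siteConf);
        for (String name : props.stringPropertyNames()) {
            if (name.startsWith(PREFIX) && name.length() > PREFIX.length()) {
                merged.put(name.substring(PREFIX.length()), props.getProperty(name));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Stand-in for values loaded from hadoop-site.xml.
        Map<String, String> site = new HashMap<>();
        site.put("mapred.reduce.tasks", "1");

        // Stand-in for what bin/hadoop -C mapred.reduce.tasks=10 would set.
        Properties props = new Properties();
        props.setProperty(PREFIX + "mapred.reduce.tasks", "10");

        System.out.println(apply(site, props).get("mapred.reduce.tasks"));
    }
}
```

In a real launcher the second argument would presumably be System.getProperties(); passing it explicitly keeps the sketch easy to test.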

I can write a patch for this pretty quickly if the design is sound. If there's a better way of specifying jobconf parameters uniformly across Hadoop commands, let me know.



Comments
Matei Zaharia added a comment - 10/Jul/08 12:01 AM
Here's a patch that lets you use bin/hadoop -C property=value [command].

Chris Douglas added a comment - 10/Jul/08 02:11 AM
Good idea. Since the -D key=value syntax is managed by the Tool/ToolRunner, er, toolchain (see HADOOP-1425 and HADOOP-1436), it might make more sense to make streaming, pipes, etc. use that instead of pushing this functionality into the bash script and Java properties. Similarly, replacing the bash script with a Java launcher (per work in/related to HADOOP-3281, HADOOP-435) and using the aforementioned classes would also solve this issue, no?
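
To illustrate the idea of handling -D key=value once, before a tool sees its own arguments, here is a minimal JDK-only sketch. It is not Hadoop's GenericOptionsParser or ToolRunner: the class GenericArgsSketch, the Parsed holder, and the choice to accept both "-D key=value" and "-Dkey=value" are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a generic option parser; not Hadoop's API. */
class GenericArgsSketch {

    /** Configuration overrides plus the arguments left for the tool itself. */
    static final class Parsed {
        final Map<String, String> conf = new LinkedHashMap<>();
        final List<String> remaining = new ArrayList<>();
    }

    /** Splits args into -D overrides and everything else, preserving order. */
    static Parsed parse(String[] args) {
        Parsed out = new Parsed();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            String pair = null;
            if (arg.equals("-D") && i + 1 < args.length) {
                pair = args[++i];                 // form: -D key=value
            } else if (arg.startsWith("-D") && arg.length() > 2) {
                pair = arg.substring(2);          // form: -Dkey=value
            }
            if (pair == null) {
                out.remaining.add(arg);           // not a -D option; leave for the tool
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected key=value, got: " + pair);
            }
            out.conf.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        Parsed p = parse(new String[] {"-D", "mapred.reduce.tasks=10", "-input", "in"});
        System.out.println(p.conf + " " + p.remaining);
    }
}
```

With parsing centralized like this, streaming, pipes, and the examples would all see the same override syntax and receive only their own options in the remaining list.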

Matei Zaharia added a comment - 10/Jul/08 11:23 PM
Regarding making streaming, pipes, etc use ToolRunner - I think that could be more complicated than it seems because you'd need to change the existing argument parsing in those libraries. People who have modified their streaming or pipes implementations would also have trouble (for example, we have a modified streaming at Facebook). Any new tool implementers can choose to use ToolRunner if they want, but this method lets you just write a simple Java class that calls submitJob and still be able to send parameters from bin/hadoop.

Arun C Murthy added a comment - 11/Jul/08 07:32 AM
I'm with Chris on this one, I don't think we need yet another way to pass config options along with -Dkey=value and -jobconf. Rather we need to standardize. So, it does make sense to pick one (-D or -jobconf) and stick with it. Yes, it means we will need to fix streaming/pipes or ToolRunner - we should.

Enis Soztutar added a comment - 11/Jul/08 12:30 PM
Ideally we should:
  1. change StreamJob (of streaming) and Submitter (of pipes) to implement Tool and Configurable.
  2. keep the configuration-modifying code in StreamJob and Submitter, but change it to display a deprecation warning about its use, in favor of -D name=value pairs.
  3. remove compatible -jt, -fs configurations from StreamJob/Submitter, and deprecate incompatible ones (for example -dfs).
  4. add a call to GenericOptionsParser#printGenericCommandUsage() in the StreamJob and Submitter's printUsage() methods.
  5. remove the -jobconf parameters at a later stage.
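
Step 2 of this plan, keeping -jobconf working while warning and steering users to -D name=value, could be sketched as below. This is a JDK-only illustration, not code from the committed patch: the class DeprecatedOptionRewriter, the warning text, and the comma splitting (which follows the pipes form key1=value1,key2=value2 and would mishandle a value that itself contains a comma) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Hadoop. */
class DeprecatedOptionRewriter {

    /**
     * Rewrites each "-jobconf pairs" occurrence into one "-D pair" per
     * comma-separated entry and records a deprecation warning for it.
     * All other arguments pass through unchanged and in order.
     */
    static List<String> rewrite(String[] args, List<String> warnings) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-jobconf") && i + 1 < args.length) {
                warnings.add("-jobconf is deprecated; use -D name=value instead");
                for (String pair : args[++i].split(",")) {
                    if (!pair.isEmpty()) {
                        out.add("-D");
                        out.add(pair);
                    }
                }
            } else {
                out.add(args[i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> warnings = new ArrayList<>();
        List<String> rewritten =
            rewrite(new String[] {"-jobconf", "a=1,b=2", "-input", "in"}, warnings);
        warnings.forEach(System.err::println);
        System.out.println(rewritten);
    }
}
```

Translating the old option up front means the rest of the tool only has to understand the -D form, which makes the later removal in step 5 a small change.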

Chris Douglas added a comment - 11/Jul/08 09:40 PM
+1 for Enis's solution.

That said, a solution like the one in Matei's patch might be a tolerable, short-term bridge between 0.17 and 0.18 for user code affected by HADOOP-3417 (discussion in HADOOP-3743).


Hadoop QA added a comment - 12/Jul/08 12:53 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12385687/HADOOP-3722.patch
against trunk revision 676069.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2837/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2837/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2837/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2837/console

This message is automatically generated.


Enis Soztutar added a comment - 29/Aug/08 03:20 PM
This patch
  1. changes StreamJob (of streaming) and Submitter (of pipes) to implement Tool and Configurable.
  2. deprecates -jobconf, -cacheArchive, and -dfs.
  3. removes some never-used parameters from streaming.
  4. adds a call to GenericOptionsParser#printGenericCommandUsage() in the StreamJob and Submitter's printUsage() methods.
  5. updates the pipes and streaming docs.

I would really appreciate it if someone with real streaming / pipes usage could test this out.


Hadoop QA added a comment - 01/Sep/08 05:31 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12389176/jobconfoptions_v1.patch
against trunk revision 690641.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 6 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3149/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3149/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3149/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3149/console

This message is automatically generated.


Enis Soztutar added a comment - 03/Sep/08 03:39 PM
Fixed findbugs warning.

Hadoop QA added a comment - 05/Sep/08 04:28 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12389435/jobconfoptions_v2.patch
against trunk revision 692409.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 6 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3189/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3189/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3189/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3189/console

This message is automatically generated.


Enis Soztutar added a comment - 08/Sep/08 01:43 PM
Failing test is not related to this patch.

Arun C Murthy added a comment - 18/Sep/08 02:19 AM
+1, this is looking great!

I'll get some 'expert' Streaming users to take a brief look and then go ahead and commit this.


Arun C Murthy added a comment - 18/Sep/08 03:03 AM
OTOH, I've changed my mind - I believe it's fine to commit this as-is and deal with the consequences later since this is an important cleanup.

Arun C Murthy added a comment - 18/Sep/08 03:04 AM
I just committed this. Thanks, Enis!

dhruba borthakur added a comment - 18/Sep/08 12:23 PM
This appears to be an incompatible change. I am wondering whether the older methods of submitting job parameters were deprecated (but still work with 0.19) or have been removed completely.

Enis Soztutar added a comment - 18/Sep/08 12:56 PM
The patch only deprecates parameters, issuing a warning, and introduces new ones. However, in streaming there were some parameters, like -cluster, that were not working, so I just removed them.

Arun C Murthy added a comment - 18/Sep/08 04:37 PM
Enis, could you please add a detailed 'Release Note' for this jira? Thanks!

Enis Soztutar added a comment - 19/Sep/08 10:59 AM
Added a release note.

Hudson added a comment - 22/Sep/08 03:18 PM

Robert Chansler added a comment - 21/Oct/08 10:54 PM
This issue
1. Changed StreamJob (of streaming) and Submitter (of pipes) to implement Tool and Configurable. Streaming and Submitter now accept the GenericOptionsParser arguments:
-fs, -jt, -conf, -D, -libjars, -files, -archives

2. Deprecated -jobconf, -cacheArchive, -dfs, and -additionalconfspec from streaming and pipes (where applicable) in favor of the generic options. The options still work, issuing a warning as a side effect, but they may be removed in later releases.

3. Removed from streaming:
-config : since it is not documented anywhere.
-mapred.job.tracker : it sets the wrong property, so it is not used currently.
-cluster : setting -cluster gives an "Unexpected -cluster while processing" error, so it is not used currently.