Hi Jason Lowe, Mithun Radhakrishnan
We have also been hit by this recently. I spent some time investigating this.
Distcp has two modes of execution: from the command line, and programmatically.
The patch works correctly for programmatic usage, provided the settings from mapred-site.xml have already been applied to the input Configuration parameter. In that case the properties set by distcp-default.xml will not be overridden again, since mapred-site (along with mapred-default, yarn-default and yarn-site) is loaded as a default resource before job submission.
For command-line usage, Distcp adds distcp-default.xml as a resource (not as a default resource), so it takes higher precedence than the default/site files mentioned above, which are loaded as default resources. Even if Distcp added distcp-default.xml as a default resource, the code would be brittle, because the outcome would depend on which default resources happen to be loaded first: mapred-site/mapred-default/yarn-site/yarn-default are all loaded in static blocks in classes org.apache.hadoop.mapreduce.
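To make the ordering argument concrete, here is a minimal sketch of the layering as I understand it. It is a toy model in plain Java, not Hadoop's Configuration class: it assumes default resources are applied first in registration order, explicitly added resources are applied afterwards, and the last writer wins. It ignores final properties. The class name, the key and the values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of layered configuration; NOT Hadoop's Configuration class. */
class ConfPrecedenceSketch {
    // Default resources are applied first, in the order they were registered.
    private final List<Map<String, String>> defaultResources = new ArrayList<>();
    // Explicitly added resources are applied after all default resources.
    private final List<Map<String, String>> resources = new ArrayList<>();

    void addDefaultResource(Map<String, String> r) { defaultResources.add(r); }

    void addResource(Map<String, String> r) { resources.add(r); }

    /** Later layers overwrite earlier ones, so the last writer wins. */
    String get(String key) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> r : defaultResources) merged.putAll(r);
        for (Map<String, String> r : resources) merged.putAll(r);
        return merged.get(key);
    }

    public static void main(String[] args) {
        String key = "example.memory.mb"; // hypothetical property name
        Map<String, String> site = Collections.singletonMap(key, "site-value");
        Map<String, String> distcp = Collections.singletonMap(key, "distcp-value");

        // Case 1: distcp defaults added as a plain resource (command-line path).
        ConfPrecedenceSketch asResource = new ConfPrecedenceSketch();
        asResource.addDefaultResource(site);
        asResource.addResource(distcp);
        System.out.println("as resource: " + asResource.get(key));

        // Case 2: distcp defaults as a default resource, registered before site.
        ConfPrecedenceSketch distcpFirst = new ConfPrecedenceSketch();
        distcpFirst.addDefaultResource(distcp);
        distcpFirst.addDefaultResource(site);
        System.out.println("default, distcp first: " + distcpFirst.get(key));

        // Case 3: same as case 2, but site happened to be registered first.
        ConfPrecedenceSketch siteFirst = new ConfPrecedenceSketch();
        siteFirst.addDefaultResource(site);
        siteFirst.addDefaultResource(distcp);
        System.out.println("default, site first: " + siteFirst.get(key));
    }
}
```

In this model case 1 always lets the distcp value win over the site value, while cases 2 and 3 give different answers depending only on registration order, which is the brittleness described above.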
Since distcp is just like any other MR job, I think the best approach would be to remove the unneeded configuration from distcp-default.xml.
Below are the properties mentioned in distcp-default.xml
Seems like getting rid of
is all we need as the rest are required by distcp.
Any other configuration the user wants for distcp can simply be specified as JVM opts for command-line usage, or set as plain parameters on the Configuration object for programmatic usage.
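As a rough illustration of the command-line side, the sketch below collects overrides written in the -Dkey=value form and applies them on top of values that came from files. This is a simplified stand-in written in plain Java, not Hadoop's own option handling; the class and method names are hypothetical, and it assumes user-supplied values are applied last and therefore win.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy illustration of user overrides; NOT Hadoop's option parser. */
class DashDOverridesSketch {
    /** Collects tokens of the form -Dkey=value; every other argument is skipped. */
    static Map<String, String> parseOverrides(String[] args) {
        Map<String, String> overrides = new LinkedHashMap<>();
        for (String arg : args) {
            if (!arg.startsWith("-D")) continue;
            int eq = arg.indexOf('=');
            if (eq <= 2) continue; // no '=' at all, or an empty key
            overrides.put(arg.substring(2, eq), arg.substring(eq + 1));
        }
        return overrides;
    }

    /** Applies overrides after the file-derived values, so overrides win. */
    static Map<String, String> effective(Map<String, String> fromFiles,
                                         Map<String, String> overrides) {
        Map<String, String> result = new LinkedHashMap<>(fromFiles);
        result.putAll(overrides);
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> fromFiles = new LinkedHashMap<>();
        fromFiles.put("example.memory.mb", "1024"); // hypothetical property
        String[] cmdline = {"-Dexample.memory.mb=2048", "src", "dst"};
        System.out.println(effective(fromFiles, parseOverrides(cmdline)));
    }
}
```

For programmatic usage the equivalent is simply setting the same key on the Configuration object before handing it to distcp.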
Please update with your thoughts/concerns.