Request for Comments
Separating Server and Client Configuration
The current mechanism for configuring Hadoop daemons and specifying job-specific
details through a single Configuration object is confusing and error-prone. The
overall goal of this proposal is to make configuration more intuitive and, thus,
less prone to errors.
1. Separate configuration variables according to the contexts in which they are
used. There are two contexts in which configuration variables are used today:
the server context, where they are read by the Hadoop daemons (Namenode,
Datanodes, Jobtracker and TaskTrackers), and the client context, where they are
read by running jobs (either jobs that use the MapReduce framework or
standalone jobs that act as DFSClients).
2. Allow job-specific configuration as a way to pass job-wide parameters from
JobClient to individual tasks that belong to the job. This also includes
frameworks built on top of the MapReduce framework, such as Hadoop Streaming.
3. Provide documentation for all parameters used in both server and client
contexts in default configuration resources.
4. Examine the need for each configuration parameter used in the Hadoop code,
and eliminate unnecessary parameters whose default values never need to be
overridden.
5. Provide mechanisms to detect configuration errors as early as possible.
Configuration Parameters Used In Hadoop
Configuration parameters used in the Hadoop codebase are used either in the
server context (dfs.name.dir, mapred.local.dir), in the client context
(dfs.replication, mapred.map.tasks), or in both (fs.default.name). All
configuration parameters should have default values specified in the default
configuration files. In addition, we need to enforce that server-context
parameters cannot be overridden from the client context, and vice versa.
Client configurations take effect during the lifetime of the client and apply
to the artifacts that it creates. For example, the replication factor
configured in the HDFS client would remain the default for only that client
and for the files that the client creates during its lifetime. Similarly, the
configuration of the JobClient would remain effective for the jobs that the
JobClient creates during its lifetime.
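As an illustration using the current API, a client can override dfs.replication
on its own Configuration object; the setting affects only the files that this
client creates and leaves cluster-wide defaults untouched. A minimal sketch:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ClientReplicationExample {
    public static void main(String[] args) throws Exception {
      // Client-context override: applies only to this client's artifacts.
      Configuration conf = new Configuration();
      conf.setInt("dfs.replication", 2);

      FileSystem fs = FileSystem.get(conf);
      // The new file inherits this client's replication factor of 2; files
      // created by other clients keep the replication configured on the server.
      fs.create(new Path("/tmp/client-owned-file")).close();
    }
  }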
Apart from the configuration parameters used in Hadoop itself, individual jobs
or frameworks built on top of Hadoop may use their own configuration parameters
as a means of communication from the job client to the job. We need to make
sure that these parameters do not conflict with the parameters used in Hadoop.
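One simple convention is for a job or framework to keep its own parameters
under a dedicated prefix so they cannot shadow Hadoop's dfs.* or mapred.*
names. In the sketch below, the mystream.* keys are hypothetical:

  import org.apache.hadoop.mapred.JobConf;

  public class FrameworkParamsExample {
    public static void main(String[] args) {
      JobConf job = new JobConf();
      // Hypothetical framework-specific keys, kept under their own prefix.
      job.set("mystream.mapper.command", "/usr/bin/wc -l");
      job.set("mystream.reducer.command", "/bin/cat");
      // Inside a task, the framework reads the same keys back from the job
      // configuration that was shipped with the job.
      System.out.println(job.get("mystream.mapper.command"));
    }
  }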
Common parameters, such as dfs.replication, that are used in the server
context but can also be overridden per file from the client context should be
constrained by upper and lower bounds specified in the server configuration.
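A server-side check for such a bound might look like the sketch below; the
dfs.replication.min and dfs.replication.max keys and the checkReplication
helper are used for illustration only:

  import org.apache.hadoop.conf.Configuration;

  public class ReplicationBounds {
    // Hypothetical server-side validation of a client-requested replication
    // factor against bounds taken from the server configuration.
    static short checkReplication(Configuration serverConf, short requested) {
      int min = serverConf.getInt("dfs.replication.min", 1);
      int max = serverConf.getInt("dfs.replication.max", 512);
      if (requested < min || requested > max) {
        throw new IllegalArgumentException("Requested replication " + requested
            + " is outside the server-configured range [" + min + ", " + max + "]");
      }
      return requested;
    }
  }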
In order to implement the requirements outlined above, we propose the following
class hierarchy, along with the default and final resources that each class
loads.
Configuration (common-defaults.xml, common-final.xml)
+---ServerConfiguration (common-defaults.xml, server-defaults.xml, server-final.xml, common-final.xml)
+---ClientConfiguration (common-defaults.xml, client-defaults.xml, common-final.xml)
+---AppConfiguration (common-defaults.xml, client-defaults.xml, common-final.xml)
New configuration parameters and default overrides are specified between the
default resources and the final resources. If a parameter already exists in a
final resource, it cannot be overridden. Thus, server-final.xml and
common-final.xml together correspond to the current hadoop-site.xml.
common-defaults.xml should contain parameters that are used in both the server
and client contexts, such as the ipc.*, io.*, fs.*, and user.* parameters.
common-final.xml overrides selected parameters in common-defaults.xml.
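A minimal Java sketch of how the proposed classes could load these resources is
given below. The loadDefaultResource/loadFinalResource helpers and the
array-based constructor are assumptions; only the class names and resource
lists come from the hierarchy above:

  // Defaults are loaded first and may be overridden by later resources;
  // values set by a final resource can no longer be overridden.
  class Configuration {
    Configuration() {
      this(new String[] {"common-defaults.xml"}, new String[] {"common-final.xml"});
    }

    protected Configuration(String[] defaults, String[] finals) {
      for (String r : defaults) loadDefaultResource(r);
      for (String r : finals)   loadFinalResource(r);
    }

    private void loadDefaultResource(String name) { /* parse and record defaults */ }
    private void loadFinalResource(String name)   { /* parse and mark values final */ }
  }

  class ServerConfiguration extends Configuration {
    ServerConfiguration() {
      super(new String[] {"common-defaults.xml", "server-defaults.xml"},
            new String[] {"server-final.xml", "common-final.xml"});
    }
  }

  class ClientConfiguration extends Configuration {
    ClientConfiguration() {
      super(new String[] {"common-defaults.xml", "client-defaults.xml"},
            new String[] {"common-final.xml"});
    }
  }

  class AppConfiguration extends ClientConfiguration {
    // Loads the same resources as ClientConfiguration; in addition, it is the
    // only class meant to expose public get*/set* methods (see below).
  }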
The generated job.xml file would contain only those parameters not specified in
the *-defaults.xml resources.
In order to ensure that all configuration parameters used in the Hadoop
codebase are documented in the configuration files, the default values
currently passed to the Configuration.get* methods should be eliminated. This
ensures that ALL configuration parameters have exactly one default value,
defined in the configuration files. If a given parameter is somehow not defined
in any of the configuration resources, these methods would throw a
ConfigurationException.
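A sketch of what a defaultless getter might look like; ConfigurationException
and the internal Properties store are assumptions made for illustration:

  class ConfigurationException extends RuntimeException {
    ConfigurationException(String message) {
      super(message);
    }
  }

  class ConfigurationGetSketch {
    // Assumed to hold the merged view of all default and final resources.
    private final java.util.Properties properties = new java.util.Properties();

    // No in-code default: a key missing from every resource fails fast.
    String get(String key) {
      String value = properties.getProperty(key);
      if (value == null) {
        throw new ConfigurationException("Parameter " + key
            + " is not defined in any configuration resource");
      }
      return value;
    }
  }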
Direct use of the Configuration.get* and Configuration.set* methods should be
allowed only from classes that derive from Configuration; that is, these
methods should be protected. One should instead use static methods such as
JobConf.setNumMapTasks(ClientConfiguration conf, int num);
HdfsClient.setReplication(ClientConfiguration conf, int num);
in order to access or modify a Configuration. This allows us to change the
parameter names used in Hadoop without changing application code.
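The sketch below shows the intent; the ClientConfiguration stub, key names, and
helper classes are illustrative, and everything lives in one package so the
static helpers can reach the protected methods:

  class ClientConfiguration {
    private final java.util.Properties props = new java.util.Properties();
    protected void set(String key, String value) { props.setProperty(key, value); }
    protected String get(String key) { return props.getProperty(key); }
  }

  class JobConf {
    // Applications call this instead of conf.set("mapred.map.tasks", ...).
    public static void setNumMapTasks(ClientConfiguration conf, int num) {
      conf.set("mapred.map.tasks", Integer.toString(num));
    }
  }

  class HdfsClient {
    // Applications call this instead of conf.set("dfs.replication", ...).
    public static void setReplication(ClientConfiguration conf, int num) {
      conf.set("dfs.replication", Integer.toString(num));
    }
  }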
The AppConfiguration class is the only configuration class that allows direct
use of the get* and set* methods. However, the ClientConfiguration class is the
only way to communicate from the JobClient to the application. We would provide
a mechanism to merge the application (or framework) configuration with the
JobConf. This allows us to check that the application or framework
configuration does not try to reuse the same configuration parameters for
different purposes.
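A sketch of the kind of merge-with-conflict-check intended here; the method
name and Properties-based shape are assumptions:

  import java.util.Map;
  import java.util.Properties;

  class ConfigurationMerge {
    // Merge an application/framework configuration into the job configuration;
    // a key already set to a different value is reported instead of silently reused.
    static void mergeInto(Properties jobConf, Properties appConf) {
      for (Map.Entry<Object, Object> e : appConf.entrySet()) {
        String key = (String) e.getKey();
        String value = (String) e.getValue();
        String existing = jobConf.getProperty(key);
        if (existing != null && !existing.equals(value)) {
          throw new IllegalArgumentException("Application parameter " + key
              + " conflicts with an existing job parameter");
        }
        jobConf.setProperty(key, value);
      }
    }
  }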