Hadoop Common / HADOOP-785

Divide the server and client configurations

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 0.15.0
    • Component/s: conf
    • Labels: None

      Description

      The configuration system is easy to misconfigure and I think we need to strongly divide the server from client configs.

      An example of the problem was a configuration where the task tracker had a hadoop-site.xml that set mapred.reduce.tasks to 1. Therefore, the job tracker had the right number of reduces, but the map task thought there was a single reduce. This led to a failure that was hard to find and diagnose.

      Therefore, I propose separating out the configuration types as:

      class Configuration;
      // reads site-default.xml, hadoop-default.xml

      class ServerConf extends Configuration;
      // reads hadoop-server.xml, $super

      class DfsServerConf extends ServerConf;
      // reads dfs-server.xml, $super

      class MapRedServerConf extends ServerConf;
      // reads mapred-server.xml, $super

      class ClientConf extends Configuration;
      // reads hadoop-client.xml, $super

      class JobConf extends ClientConf;
      // reads job.xml, $super

      Note in particular, that nothing corresponds to hadoop-site.xml, which overrides both client and server configs. Furthermore, the properties from the *-default.xml files should never be saved into the job.xml.
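
      A minimal sketch of what the "$super" chaining above could look like in code; the constructor shape and the reuse of the existing addDefaultResource(String) call are assumptions for illustration, not part of the proposal itself:

      // Hypothetical sketch of the "reads X, $super" mechanics: each subclass simply
      // layers its own file on top of what the parent constructor already loaded.
      class ServerConf extends Configuration {
        public ServerConf() {
          super();                                  // reads site-default.xml, hadoop-default.xml
          addDefaultResource("hadoop-server.xml");  // server-only settings layered on top
        }
      }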

      1. HADOOP-785_1_20070903.patch
        9 kB
        Arun C Murthy
      2. HADOOP-785_2_20070906.patch
        17 kB
        Arun C Murthy
      3. HADOOP-785_3_20070908.patch
        16 kB
        Arun C Murthy
      4. HADOOP-785_4.patch
        19 kB
        Doug Cutting


          Activity

          Doug Cutting added a comment -

          I think this is the right direction. We logically have a tree. Each node corresponds to a config file that inherits and overrides its parent's files.

          The need is that users be able to easily (1) remember the tree, (2) know where to specify a property within the tree.

          I propose that the tree is organized around where in the cluster things are used, not what part of the code they configure (that's determined by the parameter name). This addresses the primary source of confusion, and thus is what we must clarify. In particular we should distinguish between things used only by servers, and things that clients may specify.

          I propose the following tree:

          default – read-only defaults for things that clients can override
          site – site-specific defaults
          server-default – read-only defaults for server-only configuration
          server – server overrides for this site
          client – user overrides

          The read-only default files serve as documentation of what parameters can be added to files lower in the tree. It is a configuration error to specify something that does not have a default value above it.

          Some examples of what might be in the three non-read-only files:

          site – site-specific defaults
          dfs.namenode.host&port
          dfs.block.size
          dfs.replication
          mapred.jobtracker.host&port
          mapred.map.tasks
          mapred.reduce.tasks

          server – server-specifics
          dfs.name.dir
          dfs.data.dir
          mapred.local.dir

          client – user can override defaults and site here, but not server
          dfs.replication – user overrides site
          mapred.map.tasks – user overrides site

          Following from this, we'd have the following instantiable classes:

          ServerConfiguration
          reads default, site, server-default, server, in that order.
          used by daemons

          ClientConfiguration
          reads default, site, client, in that order.
          used by client applications

          Rather than provide subclasses for different parts of the system, we should instead use static methods. For example, we might have:

          JobConf.setNumMapTasks(ClientConfiguration conf, int count);
          HdfsConf.setReplication(ClientConfiguration conf, int replicas);

          The point of these is compile-time checking of names and values while keeping the code well partitioned. When we add a new HDFS parameter we should not have to change code outside of HDFS, yet, without multiple-inheritance, we cannot have a single object that permits configuration of HDFS, MapReduce, etc.
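
          A minimal sketch of what such static setters might look like, assuming ClientConfiguration keeps today's Configuration-style setInt accessor; the classes below are illustrative holders, not existing code:

            // Hypothetical sketch: each subsystem owns the static setters for its own keys,
            // so adding a new HDFS parameter only touches HDFS code.
            class JobConf {
              public static void setNumMapTasks(ClientConfiguration conf, int count) {
                conf.setInt("mapred.map.tasks", count);   // name and type checked at compile time
              }
            }

            class HdfsConf {
              public static void setReplication(ClientConfiguration conf, int replicas) {
                conf.setInt("dfs.replication", replicas);
              }
            }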

          Thoughts?

          Arun C Murthy added a comment -

          +1

          Another view-point: One of the things I frequently wish I had goes something like this - for a particular job/task I'd like to tweak the log-level to 'debug' while developing/testing...

          Generalising: while we are at it, should we think about separating the 'ClientConfiguration' into static and dynamic parts? The 'dynamic' aspect would cover cases like: switch log-level, turn speculative execution on/off etc.

          I concede it might be more useful for developers rather than users in the short/medium term...

          Thoughts? Does it sound like a good direction?

          Doug Cutting added a comment -

          > should we think about separating the 'ClientConfiguration' into static and dynamic parts?

          To some degree we already have that: the static part is in the config file and the dynamic part is the object. We support specifying properties on the command line, so that one can run things like:

          bin/hadoop jar -Dlog.level=DEBUG ...

          I think this works with most commands since HADOOP-59 was committed.

          Milind Bhandarkar added a comment -

          Request for Comments
          --------------------------------

          Separating Server and Client Configuration
          ------------------------------------------

          The current mechanisms for configuring Hadoop daemons and specifying job-specific
          details using a single Configuration object are confusing and error-prone. The
          overall goal of this proposal is to make them more intuitive and, thus, less
          prone to errors.

          Detailed Goals:
          ---------------

          1. Separate configuration variables according to the contexts in which they are
          used. There are two contexts in which the configuration variables are used
          currently. Those that are used in the server context by Hadoop daemons
          (Namenode, Datanodes, Jobtracker and TaskTrackers) and those that are used in
          the client context by running jobs (either jobs that use the MapReduce
          framework or standalone jobs that are DFSClients).

          2. Allow job-specific configuration as a way to pass job-wide parameters from
          JobClient to individual tasks that belong to the job. This also includes
          frameworks built on top of the MapReduce framework, such as Hadoop Streaming.

          3. Provide documentation for all parameters used in both server and client
          contexts in default configuration resources.

          4. Examine the need for each configuration parameter used in the Hadoop code,
          and eliminate unnecessary parameters for which there is no need to override
          the default values.

          5. Provide mechanisms to detect configuration errors as early as possible.

          Configuration Parameters Used In Hadoop
          ---------------------------------------

          Configuration parameters used in the Hadoop codebase are either used in the
          server context (dfs.name.dir, mapred.local.dir), in the client context
          (dfs.replication, mapred.map.tasks), or in both (fs.default.name). All
          configuration parameters should have default values specified in the default
          configuration files. In addition, we need to enforce that server-context
          parameters cannot be overridden from the client context, and vice versa.

          Client configurations have effect during the lifetime of the client and the
          artifacts that it created. For example, the replication factor configured in
          the HDFS client would remain the default for only that client and for the files
          that the client created during its lifetime. Similarly, configuration of the
          JobClient would remain effective for the jobs that the JobClient created during
          its lifetime.

          Apart from the configuration parameters used in Hadoop, individual jobs or
          frameworks built on top of Hadoop may use their own configuration parameters as
          means of communication from job-client to the job. We need to make sure that
          these parameters do not conflict with parameters used in Hadoop.

          For common parameters, such as dfs.replication, which are used in the
          server-context and can be overridden in the client-context per file, we need to
          make sure that such parameters are bounded by upper and lower bounds specified
          in the server configuration.

          Class Hierarchy
          ---------------

          In order to implement the requirements outlined above, we propose the following
          class hierarchy, and the default and final resources that they load.

          Configuration (common-defaults.xml, common-final.xml)

          +---ServerConfiguration (common-defaults.xml, server-defaults.xml, server-final.xml, common-final.xml)

          +---ClientConfiguration (common-defaults.xml, client-defaults.xml, common-final.xml)

          +---AppConfiguration (common-defaults.xml, client-defaults.xml, common-final.xml)

          New configuration parameters and default-overrides are specified between the
          default resources and the final resources. If a parameter already exists in a
          final resource, it cannot be overridden. Thus, server-final and common-final
          correspond to the current hadoop-site.xml.

          common-defaults.xml should contain parameters that are used in both server and
          client contexts, such as the ipc., io., fs., and user. parameters. common-final.xml
          overrides selected parameters in common-defaults.xml.

          The generated job.xml file would contain parameters not specified in the
          *-defaults.xml resources.

          Other Proposals
          ---------------

          In order to ensure that all configuration parameters used in the Hadoop
          codebase are documented in the configuration files, the default value specified
          in the Configuration.get* methods should be eliminated. This ensures that ALL
          configuration parameters have exactly one default value in the configuration
          files. If a given parameter is somehow not defined in any of the configuration
          resources, these methods would throw an exception called ConfigurationException.
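
          A minimal sketch of the proposed lookup behaviour; StrictConfiguration and its properties field are illustrative stand-ins for the reworked Configuration, not existing Hadoop code:

            // Hypothetical sketch only.
            public class StrictConfiguration {
              public static class ConfigurationException extends RuntimeException {
                public ConfigurationException(String message) { super(message); }
              }

              private final java.util.Properties properties = new java.util.Properties();

              // No caller-supplied default: a missing key means the *-defaults.xml files
              // are incomplete, which is reported as a configuration error.
              public String get(String name) {
                String value = properties.getProperty(name);
                if (value == null) {
                  throw new ConfigurationException("No value or default defined for " + name);
                }
                return value;
              }
            }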

          Direct use of the Configuration.get* and Configuration.set* methods should be allowed
          only from classes that derive from Configuration. That is, these methods should be
          protected. One should use static methods such as

          JobConf.setNumMapTasks(ClientConfiguration conf, int num);

          or

          HdfsClient.setReplication(ClientConfiguration, int num);

          in order to access or modify a Configuration. This allows us to change parameter
          names used in Hadoop without changing application code.

          The AppConfiguration class is the only configuration class that allows usage of
          get* and set* methods directly. However, the ClientConfiguration class is the
          only way to communicate from JobClient to the Application. We would provide a
          static method:

          JobConf.setAppConfiguration(ClientConfiguration, AppConfiguration);

          to merge the application (or framework) configuration with the JobConf. This
          allows us to check that the application or framework configuration does not try
          to reuse the same configuration parameters for different purposes.
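
          A rough sketch of the conflict check such a merge could perform; plain java.util types stand in for the proposed JobConf/AppConfiguration classes, whose real shape is still open:

            // Hypothetical sketch of the check behind JobConf.setAppConfiguration().
            class AppConfigurationMergeSketch {
              static void setAppConfiguration(java.util.Properties jobParams,      // stand-in for the JobConf
                                              java.util.Properties appParams,      // stand-in for AppConfiguration
                                              java.util.Set<String> hadoopNames) { // names from *-defaults.xml
                for (String name : appParams.stringPropertyNames()) {
                  if (hadoopNames.contains(name)) {
                    throw new IllegalArgumentException(
                        "Application parameter reuses a Hadoop parameter name: " + name);
                  }
                  jobParams.setProperty(name, appParams.getProperty(name));
                }
              }
            }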

          Michael Bieniosek added a comment -

          Has anybody done any work on this issue? This seems like an important thing to fix.

          Arun C Murthy added a comment -

          I'd also like to see a set of hard rules for config parameters (wherever they make sense) which override all parameters, so as to ensure a reasonably consistent configuration.

          For example, MAX_TASK_FAILURES in HADOOP-1304 - it doesn't make sense to allow a TIP to fail thousands of times, so even if the user misconfigures it, it would make sense to check it against the hard rule and override the user-supplied config value.

          Doug Cutting added a comment -

          Some comments on Milind's proposal:

          I'm unclear on the difference between a ClientConfiguration and an AppConfiguration. I'm also not certain that configurations w/o setters will be practical. MapReduce's daemons do need to distinguish between the server's own configuration and the JobConf, but those should already be completely distinct, no?

          I prefer not using -final in the config file names. I'd vote for using -default for defaults and leaving overrides unmarked (server.xml, client.xml). Either that or we should use more clearly opposite terms, like initial/final, default/override, etc, but unmarked would be my first choice.

          I don't see the need for both client-default and common-default, nor both client-final and common-final. Can you give examples of lists of things that would go in each file? A major goal of this redesign is that it should always be very clear which file one should specify a parameter in. Many of our servers are also clients (e.g., JobTracker uses FileSystem) but pure client code is never a Hadoop daemon. So it's possible to determine which parameters should only be read by daemon code, but it's harder to determine parameters which should never be read by daemon code. Hence it's possible to have server-only configurations, but I'm not sure it makes sense to have client-only configurations.

          Michael Bieniosek added a comment -

          > I'd also like to see a set of hard rules for config parameters (wherever they make sense) which override all parameters, so as to ensure a reasonably consistent configuration.

          > For example, MAX_TASK_FAILURES in HADOOP-1304 - it doesn't make sense to allow a TIP to fail thousands of times, so even if the user misconfigures it, it would make sense to check it against the hard rule and override the user-supplied config value.

          I think it's also important not to try to overprotect users – I don't think people are capriciously or randomly changing hadoop configuration parameters. It is frustrating to try to change a parameter, only to have hadoop ignore me. In the case where the user submits a truly ridiculous configuration, perhaps it might be better to error out or print a very loud warning visible to the job submitter?

          Doug Cutting added a comment -

          > I think it's also important not to try to overprotect users [ ... ]

          I agree. Especially when doing so adds code that will need to be maintained. I think marking things that should not normally be used as "Expert" should be sufficient. The biggest problem with dangerous knobs in my experience is that they generate spurious bug reports from folks who don't understand what they're really for and abuse them. I think this is primarily a documentation problem. We should advise reasonable values, and folks who ignore that and provide ridiculous values deserve ridiculous behavior.

          Arun C Murthy added a comment -

          I'll try and take this forward from now on...

          After some hallway discussions here are some of my ideas, clearly they are fairly nascent and open to discussion...

          The whys for this issue are fairly clear and I'm not getting into them again...

          The hows are a mixture of ideas already thrown around here and some of my own... (so yeah, clearly there is a fair amount of plagiarism involved!).

          -

          Proposal:

          Like all previous proposals, I'm all for splitting up client and server configs; this would let the administrators of large clusters change them independently (e.g. configure dfs.client.buffer.dir separately on the actual cluster and on submission nodes - this is important in cases where the submission nodes lie outside the hadoop cluster itself). Also, I'm for separating configuration variables according to the contexts in which they are used, and not which part of the code they configure.

          One break from the past in this proposal is to split up hadoop-site.xml into hadoop-server.xml & hadoop-client.xml to reflect that we have separate configs for servers (hadoop daemons) and clients (job-clients or dfs-clients). Both of these are initially empty, as hadoop-site.xml is today.

          Thus the class hierarchy would look like:

          Configuration (reads hadoop-default.xml)

          ServerConfiguration (reads hadoop-default.xml & hadoop-server.xml)

          ClientConfiguration (reads hadoop-default.xml & hadoop-client.xml)

          JobConfiguration (reads hadoop-default.xml & hadoop-client.xml & maybe a user-defined job config file)

          Thus hadoop daemons, i.e. servers, only use ServerConfiguration, and clients (e.g. DFSClient) use ClientConfiguration, to ensure they don't get polluted by the others. Clearly, mapred jobs use JobConfiguration as they do today.

          To ensure users know where to override specific config values (i.e. should I override fs.default.name in hadoop-server.xml or hadoop-client.xml to make sure my clients pick up the right values?), I propose we add a context tag (or just an attribute) to each property, which is either server, client or job.

          E.g.

          <property> 
            <context>client</context>
            <name>dfs.client.buffer.dir</name>
            <value>/tmp/dfs/bufdir</value>
            <description>...</description>
          </property>
          

          This could alternatively be done via a comment for each property in hadoop-default.xml, but I reckon the tag (or attribute) sort of institutionalizes it.

          Similarly, we could also add a level tag (or attribute) to each property, one of expert, intermediate or beginner, to let users know how much of an effect changing a specific knob has... (again this could be just a comment, and at the risk of repeating myself... yadda, yadda, yadda).

          -

          Overall the idea is to have a simple, reasonably error-resistant configuration system without falling into the trap of over-generalising the same.

          Thoughts?

          Michael Bieniosek added a comment -

          Arun,

          Your proposal sounds reasonable. Thanks for looking at this issue.

          Currently, hadoop-default.xml is not supposed to be changed by users. Would you relax this convention in your proposal? There might be a few variables that I'd like to set for client and server at the same time (eg. namenode address).

          Why don't you want to split up namenode vs. jobtracker and datanode vs. tasktracker? I understand that it's desirable to keep things simple, but dfs and mapreduce don't interact very much in terms of their configs, so there is a natural separation.

          Instead of dividing configs into "beginner" and "advanced", we should think about dividing into "things you probably need to change" (at the top of the file) and "things you probably don't need to change" (at the bottom of the file). This division could be done with xml comments – I don't think it needs to be so formal as to need a new field.

          Arun C Murthy added a comment -

          Currently, hadoop-default.xml is not supposed to be changed by users. Would you relax this convention in your proposal? There might be a few variables that I'd like to set for client and server at the same time (eg. namenode address).

          Hmm... how about allowing both server and client values for fs.default.name's context tag, to let people know it can be specified in both hadoop-server.xml and hadoop-client.xml and will be used appropriately? Would that help? I'd rather keep hadoop-default.xml sacrosanct, though we don't prevent you from editing it even today - thus it serves as a gold-standard for everyone.

          Why don't you want to split up namenode vs. jobtracker and datanode vs. tasktracker?

          I did think about this, and I really don't see what value a {HDFS|MR}ServerConfiguration and {HDFS|MR}ClientConfiguration will provide, which is why I didn't take this route... but I'm open to arguments. Just separation of physical files doesn't seem enough to warrant 4 classes rather than 2.

          This division could be done with xml comments - I don't think it needs to be so formal as to need a new field.

          I agree, yet it's my take that it is better to institutionalise this by adding another tag, same with the context tag. Again this depends on whether or not we can reach a common ground...

          Devaraj Das added a comment -

          I'd rather keep hadoop-default.xml sacrosanct, though we don't prevent you from editing it even today - thus it serves as a gold-standard for everyone

          +1

          This division could be done with xml comments - I don't think it needs to be so formal as to need a new field.

          +1

          Why don't you want to split up namenode vs. jobtracker and datanode vs. tasktracker? I understand that it's desirable to keep things simple, but dfs and mapreduce don't interact very much in terms of their configs, so there is a natural separation.

          This probably could be addressed by having a clear (documentation wise) separation in the configuration file(s). This is already done today in the hadoop-default.xml file via the three sections "global properties", "map/reduce properties" and "file system properties".

          Having the classes {Client, Server, Job}Configuration seems interesting, but one issue that needs to be looked at is what Michael points out. Some config items would be needed by both server and client. Items like fs.default.name can be handled fairly easily, though it amounts to having duplicate config items in the files. The other (more semantic) issue that needs to be looked at is for things like ipc.client.connection.maxidletime. This config item is used, for example, by the TaskTracker to set its client-side connection idle timeout for the RPCs to the JobTracker. However, it also affects the timeout that the Tasks (Map/Reduce) would see, and, unless we have different values for this item in the server and client config files, both entities would see the same timeout value. This could be an issue (since for Tasks I would have the value set to a very high number - ref HADOOP-1651).
          To summarize, we might end up having a couple of duplicate config items, potentially having different values. Does this seem like a problem? I am okay with such an arrangement but just wanted to bring out this issue while we are designing the system. By the way, this brings us to Doug's comment on whether it makes sense to have a separate client-only configuration.

          Also, the current framework has a bug - for example, if I programmatically set speculative execution to false in the JobConf, it is not considered by the framework. The framework has already read the value from the config files it has before I submitted my job and doesn't take notice of my requirement. Now this is a good thing for some config items like fs.default.name, where we DON'T want clients to tell us what the namenode is, but not so for things like mapred.speculative.execution. This issue probably needs to be handled in this redesign.

          Arun C Murthy added a comment - edited

          To summarize, we might end up having a couple of duplicate config items, potentially having different values. Does this seem like a problem?

          Sorry if I wasn't clear before, but this proposal is explicitly designed keeping in mind tricky knobs such as ipc.client.connection.maxidletime and this is how it would work:

          One can, and should, specify different values for ipc.client.connection.maxidletime in hadoop-server.xml and hadoop-client.xml. Now, the value seen by ServerConfiguration (which reads hadoop-default.xml and hadoop-server.xml), i.e. as used by the TaskTracker, is clearly different from that seen by the child VM, which uses ClientConfiguration.

          So, yes, in summary this proposal encourages and facilitates the same config knobs being set up separately for use in different contexts...

          dhruba borthakur added a comment -

          +1. I like this current proposal of splitting up the config into three categories.

          Enis Soztutar added a comment - edited

          BTW, a related issue about configuration parameters is that the names of the parameters are used in different contexts (in hadoop-xxx.xml files and in the code), sometimes resulting in misspellings and confusion. How about the following slightly ugly way to programmatically access parameter names?

          public class ParameterNames {
            private static final int prefixLength 
              = ParameterNames.class.getCanonicalName().length() + 1;
            
            public static final class fs {
              public static final class trash {
                public static final class root { }
                public static final class interval { }
              }
              public static final class file {
                public static final class impl { }
              }
            }
            public static String getName(Class<?> clazz) {
              return clazz.getCanonicalName().substring(prefixLength);
            } 
          }
          

          and the code to use the name would be:

          String trashPath = conf.get(ParameterNames.getName(ParameterNames.fs.trash.root.class));
          
          Owen O'Malley added a comment -

          I think Arun's proposal looks good, except for having explicit context. I'd much rather have that either in xml comments or in the description.

          I think the ParameterName proposal would be too much complexity for the payback, especially given that we make getter/setter methods for most of the properties anyways and thus there shouldn't be duplicate strings running around.

          On a side note, I would prefer if the xml looked more like:

          <property name="foo" value="bar"/>

          with description being an optional subelement. That however would break the config files even more than the current proposal. One of the advantages to using attributes rather than subelements is that it is pretty clear that:

          <property name=" foo" value="bar"/>

          is wrong, while:

          <property><name> foo</name><value>bar</value></property>

          looks right. (The spaces are much more obvious, to programmers at least, when enclosed in quotes.)

          Arun C Murthy added a comment -

          I'd like to assuage some concerns and clarify/change certain aspects of the above proposal; here we go...

          Existing issues:
          a) Easy to misconfigure (e.g. Owen's original description) and hard to debug.
          b) No way to separately configure the servers and clients (both dfs & mapred clients).
          c) Ease administration of large clusters, ensuring that administrators have a fair amount of control over some configuration knobs, while at the same time ensuring users can control *relevant* config knobs.
          d) No way to override hadoop-site.xml from job.xml (e.g. turn on/off speculative execution from job.xml).

          Proposal:

          Organise the config tree around where things in cluster are used i.e. things used only by server, and knobs clients (both hdfs and mapred clients) may tweak.

          To help clarify the above I propose we re-work the config files as follows:

          a) hadoop-default.xml - Read-only defaults.
          b) hadoop-site.xml - General site-specific values e.g. you'd put fs.default.name and mapred.job.tracker here. Important change: values here can be overridden (code-speak: we'd do addDefaultResource(hadoop-site) rather than addFinalResource(hadoop-site) as done today).
          c) hadoop-server.xml - Hadoop servers (daemons) are configured via this, and values here cannot be overridden. One would, for e.g., put dfs.replication here.
          d) hadoop-client.xml - Hadoop clients (both hdfs and mapred clients) are configured via this file, which, again, cannot be overridden. One would, for e.g., put dfs.client.buffer.dir here.
          e) job.xml - Job-specific configuration at run-time (as today).

          Thus hadoop-site.xml, which is non-overridable today, is kept around as a general, site-specific, overridable configuration file where any config knob can be tweaked. We could completely do away with hadoop-site.xml, but we keep it to stay backward compatible and to ensure that common configuration isn't duplicated (else we would have to force users to specify fs.default.name/mapred.job.tracker in both hadoop-server.xml & hadoop-client.xml, which seems redundant).

          However, hadoop-server.xml and hadoop-client.xml are 2 new config files for hadoop servers (daemons) and clients which cannot be overridden in the server and client context respectively.

          The class hierarchy now looks like this:

            abstract public class Configuration {
              // ...
            }
          
            public class ServerConfiguration extends Configuration {
              // reads hadoop-default.xml, hadoop-site.xml & hadoop-server.xml, in that order
            }
            
            public class ClientConfiguration extends Configuration {
              // reads hadoop-default.xml, hadoop-site.xml & hadoop-client.xml, in that order
            }
            
            public class JobConf extends ClientConfiguration {
              // reads hadoop-default.xml, hadoop-site.xml, hadoop-client.xml & job.xml in that order
            }
          
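
          For illustration, here is one way the read order above and the "cannot be overridden" behaviour could map onto the existing addDefaultResource()/addFinalResource() calls mentioned earlier; treating hadoop-server.xml and hadoop-client.xml as final resources is an assumption of this sketch, not a settled decision:

            // Sketch only; constructors shown schematically.
            public class ServerConfiguration extends Configuration {
              public ServerConfiguration() {
                addDefaultResource("hadoop-default.xml");
                addDefaultResource("hadoop-site.xml");   // overridable, per (b) above
                addFinalResource("hadoop-server.xml");   // server values cannot be overridden
              }
            }

            public class ClientConfiguration extends Configuration {
              public ClientConfiguration() {
                addDefaultResource("hadoop-default.xml");
                addDefaultResource("hadoop-site.xml");
                addFinalResource("hadoop-client.xml");   // client values cannot be overridden
              }
            }

            public class JobConf extends ClientConfiguration {
              public JobConf() {
                addDefaultResource("job.xml");           // job-specific values, read last; final client values still win
              }
            }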

          I propose we tighten the interfaces of servers/clients to reflect that they are using a specific type of configuration. E.g.

            class JobTracker {
              public JobTracker(JobConf conf) {
                // ...
                }
            }
          

          becomes:

            class JobTracker {
              public JobTracker(ServerConfiguration conf) {
                // ...
                }
            }
          

          and

            public class DFSClient {
              public DFSClient(Configuration conf) {
                // ...
              }
            }
          

          becomes:

            public class DFSClient {
              public DFSClient(ClientConfiguration conf) {
                // ...
              }
            }
          

          and so on... (Map/Reduce public APIs are already correctly defined in terms of JobConf).

          Thus, we would create different types of configuration objects (ServerConfiguration/ClientConfiguration/JobConf) and use them in the relevant sub-systems (NameNode/JobTracker: ServerConfiguration, DFSClient/MapReduceBase: ClientConfiguration/JobConf) etc.

          This has the following benefits:
          a) It matches the context and use cases of the designated configuration files.
          b) Users have a fair amount of control over relevant knobs, e.g. they can specify io.buffer.size in the JobConf and it will be used, for example, by SequenceFile via InputFormat.getRecordReader(InputSplit, JobConf, Reporter).
          c) It eases administration of large clusters and ensures that admins have a fair amount of control over the various configuration knobs. E.g. there is no way for a user to tweak dfs.client.buffer.dir via the JobConf, since DFSClient is defined in terms of ClientConfiguration, which never looks at job.xml.

          Also, I'd like to use this redo to go ahead and implement Doug's suggestion of using static methods, where applicable, to set up the actual values, and thereby stop dumping so many get/set methods into Configuration/JobConf.

          Last, but not least, is how we go about communicating where the right place for each config knob is; in this regard the consensus has been not to introduce new xml tags/attributes, and hence to use the description tag, placing this guidance there along with the 'level' (expert/intermediate etc.).

          Doug Cutting added a comment -

          A few comments:

          1. A job is created by a client, so it gets values from hadoop-client.xml as well as job-specific things set by the application, right? Then the job is serialized as a job.xml and sent to a server. When it is read on the server, should any other configuration files be read at all? I think perhaps not. Job defaults, site specifics, etc. should all be determined at the client. If a value from hadoop-server.xml is to be considered, then the parameter is not client-overrideable. Conversely, if a value is client-overrideable, then the value in hadoop-server.xml will not be consulted; only the value in job.xml will be seen. A job.xml should contain a complete, standalone set of values, no? So there are two ways to create a JobConfiguration: one that reads hadoop-default.xml, hadoop-site.xml and hadoop-client.xml, and one that only reads job.xml (see the sketch after this list).

          2. Many parameters should either be in hadoop-client.xml or hadoop-server.xml, but not both. Thus we can organize the defaults into separate sections for client and server. Parameters that are used by both clients and servers can be in a "universal" section: these may be meaningfully added to the client, server or site configuration. The top-level organization of hadoop-default.xml can be by technology (hdfs, mapred, etc.) and within that sub-sections for universal, client and server parameters. This can provide folks a guide for where things are intended to be overridden.
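
          As a concrete sketch of point 1 above, the two construction paths might look roughly like the following. This is an illustration only: the JobConfiguration class and its factory method names are hypothetical, addDefaultResource is assumed to behave like the existing Configuration method, and the base constructor is assumed to pre-load nothing.

          // Hypothetical sketch of the two ways to build a job's configuration
          // discussed in point 1; all names here are illustrative only.
          import org.apache.hadoop.conf.Configuration;

          public class JobConfiguration extends Configuration {

            /** Client side: layer the shared defaults, site and client files, in order. */
            public static JobConfiguration forSubmission() {
              JobConfiguration conf = new JobConfiguration();
              conf.addDefaultResource("hadoop-default.xml");
              conf.addDefaultResource("hadoop-site.xml");
              conf.addDefaultResource("hadoop-client.xml");
              return conf;
            }

            /** Server side: the serialized job.xml is a complete, standalone set of values. */
            public static JobConfiguration fromJobFile(String jobXmlPath) {
              JobConfiguration conf = new JobConfiguration();
              conf.addDefaultResource(jobXmlPath);   // read only the job.xml, nothing else
              return conf;
            }
          }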

          Hide
          Arun C Murthy added a comment -

          Ok, how is this for an about turn...

          I had a long, soul-crushing discussion with Doug last night about the config rejig where he basically blew my above proposal to smithereens while I lamely nodded. smile

          Here is the crux of Doug's arguments:

          Essentially we need 3 config files:
          a) Read-only defaults (existing hadoop-defaults.xml).
          b) A file where the admin specifies config values which can be overridden (existing mapred-defaults.xml).
          c) A file where the admin specifies a set of hard, sane limits for some config values which cannot be overridden (existing hadoop-site.xml).

          Clearly we have issues when users/admins specify configs in the wrong place e.g. set mapred.speculative.execution in hadoop-site.xml, thereby robbing users of the opportunity to override it and so on, and those are just that: mistakes while configuring hadoop.

          That being said, clearly we have a documentation issue and, worse, a naming issue. It is hardly apparent to users that mapred-defaults.xml is a generic, overridable config file, and it is clearly not their fault that it is hardly used.

          Overall there isn't any missing functionality, but rather a lack of clarity and understanding; it's primarily a nomenclature/documentation issue.

          Hence, here is a much simpler way to go about this:
          a) Keep hadoop-defaults.xml as the read-only default config file.
          b) Rename hadoop-site.xml and mapred-defaults.xml to better reflect what they are: non-overridable & overridable site-specific configs. Some options are:
          i) hadoop-initial.xml (overridable) and hadoop-final.xml (non-overridable)
          ii) hadoop-site-defaults.xml (overridable) and hadoop-site-limits.xml (non-overridable)

          I strongly feel we do need to rename existing config files just to get the message across...

          Clearly existing Configuration and JobConf classes handle these quite well and hence there is hardly any reason to change them. OTOH we really need to shout from the rooftops w.r.t the various config files and their roles and uses (i.e. better documentation).

          Overall, less change the better. So much for my earlier proposal... sigh! smile
          (I know Owen has some thoughts on the same... watch this space!.)

          -

          Thoughts?

          Hide
          Sameer Paranjpye added a comment -

          Essentially we need 3 config files:
          a) Read-only defaults (existing hadoop-defaults.xml).
          b) A file where the admin specifies config values which can be overridden (existing mapred-defaults.xml).
          c) A file where the admin specifies a set of hard, sane limits for some config values which cannot be overridden (existing hadoop-site.xml).

          I don't think we need 3 config files or a hierarchy of configs. The above 3 categories of configuration need to exist, but can be expressed in many different ways. What if we had the following files:

          • hadoop-defaults.xml, the read-only default config file
          • hadoop-client.xml, specifies client behavior, resides on a client machine and is processed by clients
          • hadoop-server.xml, specifies server behavior and is processed by servers

          The one place where the client and server configs interact is when tasks are localized and clients are running in a server-controlled context. Here some of the client's configuration can be overridden by values in the server's config. The variables to be overridden can be hard-coded. If this means we're overprotecting users, then the list of variables to override can itself be placed in the server config, say in the hadoop.client.overrides config variable.

          The treatment of the 3 categories of config values would be as follows:

          • Read-only defaults - hadoop-defaults.xml
          • Admin specified config values which can be overridden - This set of values no longer exists, everything can be overridden by clients with a few exceptions, all client configuration appears in hadoop-client.xml
          • Admin specified set of hard, sane limits for some config values which cannot be overridden - This is a set of exceptions listed in hadoop-server.xml, represented by the config value hadoop.client.override
          Hide
          Doug Cutting added a comment -

          Sameer, I think your proposal is mostly isomorphic to Arun's, but with redundancy. The stuff in your hadoop-server.xml and hadoop.client.override is the same as in Arun's hadoop-final.xml on servers. hadoop-final.xml would not normally exist on clients, only on servers. So the difference between the two proposals is (a) the name of the files; and (b) that you want to also list non-overrideable parameter names in a parameter. The latter seems fragile and hard to maintain to me.

          It does make sense to have overrideable values on the server too, e.g., to determine the default block size for client programs which don't override it. Under Arun's proposal this would be in hadoop-initial.xml on the servers. Where would it be in your proposal? As items in hadoop-server.xml that are not named in hadoop.client.override? Is this really less confusing?

          Another issue with your proposal is that it requires different Configuration construction code on clients and servers. Do we always know, everywhere that a Configuration is created, whether we are running as a client or a server? Our servers use much of our client code: a MapReduce server is an HDFS client, etc. I think this is more reliably done by using uniform Configuration construction code, and simply configuring server hosts differently from client hosts, if different configurations are even required. In most cases this should not be required, since clients have little need to specify non-overrideable values, hence hadoop-final.xml should generally only exist on servers.

          We're mostly talking about host-specific settings, not server/client distinctions. Some things should not be overridden because they're specific to the host. Thus they should be overridden by a file on that host whose sole purpose is to do this. This concept makes sense on both client and server machines.

          Hide
          Sameer Paranjpye added a comment -

          It does make sense to have overrideable values on the server too, e.g., to determine the default block size for client programs which don't override it. Under Arun's proposal this would be in hadoop-initial.xml on the servers. Where would it be in your proposal? As items in hadoop-server.xml that are not named in hadoop.client.override? Is this really less confusing?

          The default block size for client programs would be in hadoop-client.xml; settings in this file would override those in hadoop-defaults.xml.

          Another issue with your proposal is that it requires different Configuration construction code on clients and servers. Do we always know, everywhere that a Configuration is created, whether we are running as a client or a server?

          I proposed the client-server nomenclature because I feel it makes the system more comprehensible. Admittedly, the distinction between clients and servers isn't always clear, but the proposed filenames are intended to map elements of configuration to system components and the people that configure them. The file hadoop-client.xml is supplied by "users" - people that run map/reduce jobs and is read by "clients" i.e. jobs, tasks and the shell. The file hadoop-server.xml is supplied by "admins" - people that keep Hadoop clusters up and running and is read by servers. Depending on the context either hadoop-client.xml or hadoop-server.xml would be the "final resource" read by a Configuration object. There is no technical reason for these files to be named differently, indeed currently they are not, hadoop-site.xml is the final resource read by both clients and servers. We could even have 3 files, hadoop-client.xml, hadoop-mapred.xml and hadoop-dfs.xml read by clients, map/reduce servers and HDFS servers respectively. It would require some differences in Configuration construction code, but these don't appear to be too convoluted. The name of the final resource consumed could be set by clients and servers upon start-up and then used by all Configuration objects constructed by the servers. The final resource could also be overridden by values supplied on the command line.

          Hide
          Doug Cutting added a comment -

          The default block size for client programs would be in hadoop-client.xml [ ...]

          Where would the default block size for server programs be set? In hadoop-server.xml?

          It sounds like you want to break what Arun's calling hadoop-initial.xml into two files: a client and server version, and replace hadoop-final.xml with a parameter that names those values which may not be overridden, but that parameter is only used on "servers"? Is that a fair comparison?

          My belief is that the primary reason we've seen misconfiguration is that folks don't understand that hadoop-site.xml is not overrideable on servers by jobs. We've encouraged folks to put most things in that file (hadoop-site.xml), when in fact it should only be used for very limited purposes, mostly for host-specific paths. This has caused many serious problems. But we shouldn't overreact. We should fix this issue. We should make it clearer where most things belong, and what particular things should not be overrideable.

          The root of the problem might be:

          http://lucene.apache.org/hadoop/api/overview-summary.html#overview_description

          This is where we've first encouraged all users of Hadoop to edit the wrong file.

          I don't think that, long-term, client and server are fundamental distinctions in Hadoop; we run clients on servers and will probably do the converse someday, so I am hesitant to hardwire these in as fundamental concepts in the configuration system, which is itself fundamental. I think the notion of host-specific settings which cannot be overridden is a universal concept and would rather focus on making that distinction clear to users.

          Hide
          Owen O'Malley added a comment -

          I think that Sameer's proposal of having a fixed or configurable list of properties that is overridden when the task is localized for a given server is a very, very good thing. It ensures that the override happens at exactly one spot, namely the switch-over between the server code and the client code on the task tracker. Otherwise, we end up with the current situation where the hadoop-site.xml on each of the nodes that the configuration has passed through can override various properties. In particular, I don't want the hadoop-site equivalent on the launching node to be consulted at all. It just provides one more location where things can be broken.

          On the other hand, I've changed religions and I'm convinced that we want exactly one config file: hadoop-site.xml. All settings both client and server should go in there.

          So my proposed dataflow for JobConfs looks like:

          1. Client creates JobConf on the submit node, which reads hadoop-site.xml (and the readonly hadoop-default.xml).
          2. Client fills in their desired values and submits it (by serializing it).
          3. It is never ever modified by any other config files on any of the servers.
          4. When the Task is starting, it is localized by looking in the server's config for hadoop.client.override for a list of properties to be copied over to the task's JobConf from the server's configuration.
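
          A minimal sketch of what step 4 could look like, assuming the proposed hadoop.client.override property (which does not exist today) and only existing Configuration/JobConf getters and setters; the helper class and its method name are illustrative, not part of any patch.

          // Hypothetical task-localization step on the tasktracker: copy only the
          // properties named in the server's hadoop.client.override into the task's
          // JobConf. The property name and this helper are proposals, not real code.
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.mapred.JobConf;

          class TaskLocalizer {
            static void applyServerOverrides(Configuration serverConf, JobConf taskConf) {
              String overrideList = serverConf.get("hadoop.client.override", "");
              for (String name : overrideList.split(",")) {
                name = name.trim();
                if (name.length() == 0) {
                  continue;                              // empty or missing list entry
                }
                String value = serverConf.get(name);
                if (value != null) {
                  taskConf.set(name, value);             // the server's value wins for this key
                }
              }
            }
          }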

          The only piece that is missing is how to set the default number of reduces. And I think the best way is to introduce a new pair of attributes:
          mapred.map.tasks.default
          mapred.reduce.tasks.default
          which are used if the specific values aren't set.
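
          For illustration, the fallback could be as simple as the helper below; getInt is the existing Configuration accessor, while the *.default property name is the new proposal, not something that exists today.

          // Proposed semantics, as a tiny helper: use the job's value if set, else the
          // site-wide default-count property, else a hard-coded fallback of 1.
          static int reduceCount(org.apache.hadoop.conf.Configuration conf) {
            return conf.getInt("mapred.reduce.tasks",
                               conf.getInt("mapred.reduce.tasks.default", 1));
          }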

          Also note that Path.getFileSystem() should take a Configuration so that it is compatible with both server configs and JobConfs.

          Hide
          Doug Cutting added a comment -

          It sounds like we're mostly in agreement. We agree that there should be a file with values that jobs can override, and that's where most things should go. There also needs to be a way for mapreduce daemons to list parameters that may not be overridden by jobs. Where we differ is what the files should be named and how the non-overrideable parameters should be named. These seem like mostly cosmetic differences that should be easily resolved by reasonable folks.

          I'd prefer it if:

          • The override mechanism is not specific to mapreduce, since other daemons may wish to use it in the future. We should also avoid the terms 'client' and 'server', since these are relative, not universal.
          • The override specification format is merge-friendly, since, e.g., both mapreduce and hdfs may have values that jobs should not override, and changes to the file should be easy to see with, e.g., 'diff'. In other words, different parameters and values should be on different lines.

          Finally, it makes sense to me that the value which cannot be overridden could be set in the same place where it is declared to be not overrideable. So, unless we want to invent a new file format, this sounds a lot like a special config file.

          Hide
          Arun C Murthy added a comment -

          It sounds like we're mostly in agreement.

          Phew! smile

          With the only bone of contention being:

          Where we differ is what the files should be named and how the non-overrideable parameters should be named [...]

          and given:

          The override mechanism is not specific to mapreduce, since other daemons may wish to use it in the future. We should also avoid the terms 'client' and 'server', since these are relative, not universal. [...]

          I'd still vote to go the hadoop-initial/hadoop-final way:
          a) Introducing a new hadoop.client.overrides parameter is a pretty big deal, given it isn't around today. It would only hold a few parameters today, but imagine if the list keeps growing...
          b) Doing overrides through code seems like a brittle solution, imho.
          c) Most importantly: imagine trying to explain hadoop.client.overrides to people (users/admins) new to hadoop... just seems like we would put a lot more onus on them to understand internals.

          while:

          d) the hadoop-initial/hadoop-final way seems, to me, a lot more generic and simple, future-proof, conceptually backward compatible, and easier on admins.

          Having said that, I promise this is my last comment on this debate. smile

          Clearly I'd love to hear from other users about what they think is easier on them...

          Hide
          Devaraj Das added a comment -

          I am mostly convinced that Owen's last summary of things about having a single hadoop-site.xml is a good way to go. But I think having a separate special config file for the hard-coded config items is a better approach than having a new config item hadoop.client.overrides. The name of the file could be hadoop-site-final.xml. This could be the final resource for all configuration constructions. This file is created by admins and put in the cluster nodes.

          Hide
          Devaraj Das added a comment -

          My last comment means that we have two config files, hadoop-site.xml and hadoop-site-final.xml. Arun's hadoop-initial and hadoop-final serve the same purpose, I guess.

          Hide
          Doug Cutting added a comment -

          My last comment means that we have two config files hadoop-site.xml and hadoop-site-final.xml. Arun's hadoop-initial and hadoop-final serve the same purpose i guess.

          From a back-compatibility standpoint it might be simpler to keep the name hadoop-site.xml. But, is that name so misleading that we should change it anyway? If its semantics are changed (as proposed), to no longer override job-specified parameters, then its name may no longer be so misleading, and we might not need to use a new name after all. And it would be nice to not force folks to rename all their configuration files when they upgrade.

          On the other hand, if we change the semantics of hadoop-site, will that silently break folks whose configurations depended on the old semantics? That would argue for new names, so that folks are forced to address the incompatible change. Thoughts?

          Hide
          Owen O'Malley added a comment -

          I'm pretty confused about why you want a second file that contains a list of attribute names.

          It absolutely can not be a list of attribute name/value pairs, because that would put us back in the current mess of two places to set things.

          Having a single attribute that just lists the attribute names that need to be set will be simple and direct. It is very unlikely to be misconfigured. Furthermore, it is addressing the user delegation/switching that is pretty unique to map/reduce and other batch systems. I can't imagine what Nutch or Hbase would want to add to the list.

          Hide
          Andrzej Bialecki added a comment -

          I rather like the suggestion to change the name of hadoop-site.xml to something more explicit, e.g. hadoop-final.xml.

          Also, I have a general comment on the format: I like the way that Solr folks solved the configuration format issue, where property names are XPaths. This allows one to add arbitrary properties, as we can do, but also encourages much more sensible property names, because they form a hierarchy. It's also much easier to find related properties.

          Of course, this would be a big incompatible change to make ...

          Hide
          Doug Cutting added a comment -

          Owen is concerned that having different files with different semantics (initial versus final) is confusing, and that we should rather just have a list of files (e.g., hadoop-default.xml, hadoop-site.xml, job.xml) that are all treated identically. That has merit. It is simpler.

          But how do we specify that some parameters may not be overridden by files later in the list? Instead of having separate files for that, perhaps we can annotate the parameters themselves, adding a <final/> tag or somesuch to their definitions. The first 'final' value found for a parameter when processing the files would determine the value: no values in subsequent files would modify the value of that parameter. Thus, in a tasktracker's hadoop-site.xml, the dfs.client.buffer.dir would be set final, and a job would not be able to override it, while the job could override the non-final dfs.block.size set there.

          Owen, does this address your concern?

          Hide
          Devaraj Das added a comment -

          If we really don't want to have separate files, IMO, from the admin point of view, what Doug proposed seems simpler and a bit easier to maintain than the hadoop.client.overrides approach (especially when the list of items becomes big).

          Hide
          Owen O'Malley added a comment -

          Sorry, I thought I was clearer. I was proposing a much more radical structure. Under your proposal, the source of config values in a task looks like:

          a few attributes localized in (eg. task id, etc.)
          server's hadoop-final.xml
          user's setting
          client's hadoop-final.xml
          client's hadoop-initial.xml
          client's hadoop-default.xml (readonly)
          server's hadoop-initial.xml
          server's hadoop-default.xml (readonly)
          

          and the server's configuration looks like:

          server's hadoop-final.xml
          server's hadoop-initial.xml
          server's hadoop-default.xml (readonly)
          

          I'm proposing a much simpler structure, with client task configs looking like:

          a few attributes localized in (eg. task id, mapred.local.dir, dfs.client.buffer.dir, etc.)
          user's settings
          client's hadoop-site.xml
          client's hadoop-default.xml (readonly)
          

          Server's configs in my view would look like:

          server's hadoop-site.xml
          server's hadoop-default.xml (readonly)
          

          I think reducing the number of places where a given setting can be changed will dramatically help the usability of the config system. I think using different orderings of the config files based on the context makes things really confusing.

          Hide
          Doug Cutting added a comment -

          Owen, yes, I was agreeing with your simplification, as was I think Devaraj, so that, e.g., the only file that mapreduce users need edit is hadoop-site.xml.

          The only difference from your proposal is that, instead of having a single attribute to name non-overrideable attribute names, we have a way to mark any attribute as final, so that it cannot be subsequently overridden. I proposed adding a <final/> tag to the definition, so that, e.g., one might have something like:

          <parameter>
          <name>dfs.client.buffer.dir</name>
          <value>/foo/bar</value>
          <final/>
          </parameter>

          If this were in the hadoop-site on a tasktracker, then jobs would not be able to override this value.

          Hide
          Owen O'Malley added a comment -

          I'm ok with marking the attributes that need to be forced onto users from the server's hadoop-site.xml, although I think it should be "force" rather than "final".

          My personal preference in xml is also for attributes, so it would look like:

          <parameter force="yes">
          <name>....</name>
          <value>...</value>
          </parameter>

          but that is just a preference.

          Hide
          Arun C Murthy added a comment -

          Here is my final (final final) say on the topic, basically re-stating my preference:

          If I were maintaining more than one large cluster, I'd prefer to have a separate hadoop-final.xml where I'd put not only the varying paths, but also some of the hardware-specific stuff (e.g. mapred.tasktracker.tasks.maximum would depend on how beefy a node is in a given cluster).

          Why?

          Personally I'd like the convenience of being able to put a small set of hard limits in different hadoop-final.xml files (hadoop-final-cluster{1,2,3}.xml), stick them up in subversion/cvs, and be able to quickly glean insights into the various limits.

          IMHO it's harder to do such stuff with a single, large hadoop-site.xml, especially without the ability to quickly diff (it's harder to diff XML files via command-line diff), etc.

          Having said that, I'd like to know what real admins think of course! smile

          Hide
          Sameer Paranjpye added a comment -

          +1 for marking a parameter final with an attribute

          I think we should give more thought to naming. The proposal so far is certainly simpler than the current system, but the idea of having a single name, hadoop-site.xml, still bothers me. It feels a lot cleaner to have config files map to the components they affect. The 'client/server' nomenclature is ambiguous, so why not have a scheme that isn't? What about having the following names for files:

          • hadoop-user.xml - this affects the behavior of user code
          • hadoop-mapred.xml - this affects the behavior of the map/reduce servers
          • hadoop-dfs.xml - this affects the behavior of the HDFS servers

          I believe that this would make configuration and administration easier. A few advantages that come to mind:

          • In order to affect the behavior or examine/debug the configuration of a particular part of the system you need only know the name of the file to look at, instead of having to figure out which hadoop-site.xml is the right one
          • It makes it much less likely that you'll edit the wrong file by mistake
          • Admins can keep all Hadoop configuration in a single directory, for instance a directory from which it gets distributed to the cluster, or in a directory in a version control system

          It will need different code for constructing configuration in different places. But the code changes don't seem to be very cumbersome. It seems like a very reasonable price to pay for ease of administration.

          Hide
          Sameer Paranjpye added a comment -

          Another option could be to have the default name of the 'final resource' be hadoop-site.xml, but give users the ability to override it on the command line.

          Hide
          Doug Cutting added a comment -

          > It makes it much less likely that you'll edit the wrong file by mistake

          I don't follow how replacing one file with three will decrease the probability of editing the wrong file. In the current proposal, there's only one user-editable file, hadoop-site.xml. The version installed on the tasktracker may differ from that installed on the job client, and, when those differences are specified as final parameters on the tasktracker, then the version installed on the tasktracker will win. Hadoop-default.xml is already divided into sections, but far more than three sections, so I don't think your three categories would be definitive. I'm not convinced that we should add a config file for, e.g., each java package.

          Hide
          Allen Wittenauer added a comment -

          I'll openly admit that I'm still processing the contents of this thread, but that isn't going to stop me from throwing this into the ring:

          If I think about a heterogeneous cluster, the thought of having to have multiple copies of the same config file/directories with just slight differences because one machine is a dual core opteron and this other one over here is a 32-thread capable ultrasparc t1 doesn't really fill me with glee. I see significant admin headaches keeping track of which values need to get modified for what architectures.

          Also:

          I think it is fair to say that in our experiences, most of our configuration headaches actually come from hadoop-env.sh. I realize that doesn't appear to be the topic du jour, but... I'm just sayin'....

          Hide
          Arun C Murthy added a comment -

          Is it safe to say we are collectively converging on something along these lines:

          a) A single, site-specific hadoop-site.xml to configure hadoop (in addition to hadoop-defaults.xml of course).
          b) A final attribute/tag which the admin uses to specify a hard-limit.
          c) Throw away existing mapred-defaults.xml

          Shout now or hold your silence forever... smile

          Hide
          Doug Cutting added a comment -

          +1. This sounds like a good plan to me.

          Hide
          Owen O'Malley added a comment -

          +1

          Hide
          Arun C Murthy added a comment -

          Here is a reasonable first shot at this... I plan to fix the documentation once we get this in.

          Hide
          Hadoop QA added a comment -

          +1 http://issues.apache.org/jira/secure/attachment/12365024/HADOOP-785_1_20070903.patch applied and successfully tested against trunk revision r572580.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/675/testReport/
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/675/console

          Hide
          Doug Cutting added a comment -

          You've added support for final attributes in Configuration, but only enabled them in JobConf. I'd assumed they'd be implemented entirely in Configuration, that any attribute specified as final could not be set again by any subsequently loaded file, and perhaps not even by Configuration#set(String, String).

          Wouldn't a HashMap be faster than a TreeMap to represent the set of final attributes?

          There are a few whitespace-only changes in the patch.

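          For concreteness, a minimal sketch of how final attributes might be tracked entirely inside Configuration, along the lines suggested above. The class, field and method names below are assumptions, not the patch; a HashSet is used since only membership tests are needed, so no ordering (and no TreeMap) is required.

            import java.util.HashSet;
            import java.util.Properties;
            import java.util.Set;

            // Sketch only; not the actual org.apache.hadoop.conf.Configuration.
            public class ConfigurationSketch {
              private final Properties properties = new Properties();
              // HashSet: O(1) membership checks, no ordering required.
              private final Set<String> finalParameters = new HashSet<String>();

              // Called for each <property> read from a resource file.
              void loadedProperty(String name, String value, boolean isFinal) {
                if (finalParameters.contains(name)) {
                  return;                      // an earlier resource declared it final
                }
                properties.setProperty(name, value);
                if (isFinal) {
                  finalParameters.add(name);   // later resources can no longer change it
                }
              }

              public String get(String name) {
                return properties.getProperty(name);
              }
            }
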
          Hide
          Arun C Murthy added a comment -

          > You've added support for final attributes in Configuration, but only enabled them in JobConf.

          Could you please clarify, Doug? In the current patch I process 'final' parameters only from hadoop-site.xml since:
          a) the admins control hadoop-site.xml on the server side
          b) 'final' params make sense only on the server side
          c) hence, there is no point transmitting 'final' attributes from the client to the server (i.e. in job.xml)

          Thanks!

          Hide
          Doug Cutting added a comment -

          I think we should implement a simple, consistent policy:

          • a configuration processes a single list of configuration files;
          • any file can contain parameters labeled 'final';
          • final parameters may not be altered by subsequent files;
          • serializations of a Configuration, like job.xml, will not contain any 'final' declarations.

          There need be no special cases for hadoop-default.xml or hadoop-site.xml. They're just the first and second files in the list. Thus if someone specifies a final parameter in hadoop-default.xml it is effectively a constant. If someone specifies a final value in a client-side hadoop-site.xml, then that value may still be overridden in a task process, where the local hadoop-site.xml file will be loaded before the final-free job.xml.

          We should deprecate the Configuration methods addFinalResource and addDefaultResource, and replace them with a public addResource method. For back-compatibility, we must still keep track of the list of resources added, and of the position of the last default resource in that list, and addDefaultResource must insert its resource in the list after the last default resource and then trigger reloading of all resources. But this back-compatibility code should be clearly marked for removal when these deprecated methods are removed. In the long term we should no longer need to re-load resources or even track the list of resources added. The addResource method will simply load the file into the configuration.

          Does that sound reasonable?

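          A short usage sketch of the loading order this policy implies for a task process. The explicit addResource calls and relative file paths below are illustrative stand-ins for the automatic loading, not the literal behaviour of the patch:

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.Path;

            public class LoadOrderSketch {
              public static void main(String[] args) {
                Configuration conf = new Configuration();
                // One list of resources, loaded in order:
                conf.addResource(new Path("hadoop-default.xml")); // read-only defaults, first
                conf.addResource(new Path("hadoop-site.xml"));    // local site file; anything it
                                                                  // marks 'final' is now fixed
                conf.addResource(new Path("job.xml"));            // serialized JobConf: carries
                                                                  // values, never 'final' flags
                System.out.println(conf.get("mapred.reduce.tasks"));
              }
            }
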
          Hide
          Owen O'Malley added a comment -

          The semantics are pretty close and I can live with either. I still think the semantics of not loading the server's hadoop-site.xml into the JobConf are cleaner, but Doug convinced me that his approach:
          1. is pretty close (it only differs in the values that the JobConf didn't define)
          2. may work better if hadoop-default.xml differs between client and server.
          So I'll give Doug's proposal a +1. The only thing that you can't do is put secret keys into the server's hadoop-site.xml, because they'll be passed along to the Tasks and that would be bad. Frown

          Hide
          Arun C Murthy added a comment -

          Ok, looks like we have a reasonable consensus... +1.

          W.r.t. the add{Default|Final}Resource methods, it's a little tricky... clearly we'll deprecate them for now, and here is the interim plan:
          a) Keep the 2 separate default/final lists
          b) Load default resources first (front-to-back as is)
          c) Load final resources next (front-to-back, not back-to-front as today)
          d) Remove all this code in the next release and have a single list of resources, thus doing away with the notion of default/final resources. I'll file a follow-on JIRA for this task.

          Clearly, we'll process parameters marked as 'force' from all configs, but once a param is marked as 'force', subsequent definitions from other configs will not override it. Thus a 'force' config-param from hadoop-default.xml is essentially a 'constant'.

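          A rough sketch of what that interim, deprecated behaviour could look like — the class, field and helper names below are assumptions made for illustration, not the committed code:

            import java.util.ArrayList;
            import java.util.HashSet;
            import java.util.List;
            import java.util.Properties;
            import java.util.Set;

            public class InterimConfigurationSketch {
              private final List<String> defaultResources = new ArrayList<String>();
              private final List<String> finalResources = new ArrayList<String>();
              private final Properties properties = new Properties();
              private final Set<String> finalParameters = new HashSet<String>();

              @Deprecated
              public void addDefaultResource(String name) {
                defaultResources.add(name);
                reloadConfiguration();           // re-read everything in the new order
              }

              @Deprecated
              public void addFinalResource(String name) {
                finalResources.add(name);
                reloadConfiguration();
              }

              private void reloadConfiguration() {
                properties.clear();
                finalParameters.clear();
                for (String r : defaultResources) loadResource(r);  // defaults first, front-to-back
                for (String r : finalResources)   loadResource(r);  // then the 'final' list, front-to-back
              }

              private void loadResource(String name) {
                // parse the XML resource and record each property, honouring 'final'
                // (omitted here; see the Configuration sketch earlier in this thread)
              }
            }
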
          Hide
          Arun C Murthy added a comment -

          Here's another go at this... addresses Doug's concerns.

          I've also filed HADOOP-1843 to remove methods deprecated by this patch.

          Hide
          Doug Cutting added a comment -

          > The only thing that you can't do is put secret keys into the server's hadoop-site.xml [...]

          Where do we expect to keep secret keys? In the Configuration? If so, this could be a serious problem. For jobs, the JobConf is the obvious place to put keys. I don't know that the tasktracker will need its own keys, but let's assume it does. JobConfs are written by JobClient to mapred.system.dir. We could make that directory world-writable but readable by only the jobtracker, so users could securely put their keys in a JobConf. The users' keys would normally overwrite the server's keys when the configurations are merged, but that's not 100% reliable: if, e.g., the client somehow manages to include no keys in a JobConf, then it would see the server's keys, which would be bad.

          Hide
          Arun C Murthy added a comment - - edited

          I'm not super-sure about use-cases for secret keys (Nutch?), but one way to get around it would be to have a separate directory with config file(s) which are read by the server's config via Configuration.addResource(Path).

          Would that be acceptable?

          Hide
          Doug Cutting added a comment -

          > I'm not super-sure about use-cases for secret keys

          Once we have user permissions (HADOOP-1298) we will need to handle secret keys.

          I don't think the tasktracker should need to access HDFS except on behalf of the job (reading the job.xml and job.jar), but I'm not yet quite ready to embrace that principle in the architecture.

          This also raises another issue: if we keep the jobtracker's system dir world-unreadable, since it contains jobs' keys, it would be nice if a job could read its own data there. That would argue for per-file permissions rather than per-directory. Hmm. Time to comment on HADOOP-1298...

          Hide
          Doug Cutting added a comment -

          Another approach would be to not keep secret keys in the configuration or ever in the filesystem. Rather, the client would pass its keys to the jobtracker as a parameter, the jobtracker would pass them to the tasktracker, the tasktracker to the task, and the task would somehow set them for the process.

          Hide
          Doug Cutting added a comment -

          > That would argue for per-file permissions rather than per-directory. Hmm. Time to comment on HADOOP-1298...

          Nevermind. We can simply store all job-specific data in a sub-directory of the system dir that's owned by the job's user. The client writes it there, and the task reads it. We just need to make the system directory world-writable so that clients can create jobs there, and the clients can set their job directories world-unreadable. We can use group permissions to permit the jobtracker to read the job.xml. So per-directory permissions are sufficient.

          Hide
          Owen O'Malley added a comment -

          I withdraw my comment about the secret keys. While it is true, there are ways around it, including reading side files and manually putting the keys into the server's in-memory configuration.

          Hide
          Doug Cutting added a comment -

          I prefer calling these parameters "final" rather than "force". The "final" keyword here has a meaning similar to the one it has in Java: it cannot be overridden.

          Hide
          Arun C Murthy added a comment -

          Here is another take; changed the nomenclature from 'force' to 'final'...

          Hide
          Arun C Murthy added a comment -

          Note: since this patch removes mapred-default.xml completely, we'll need:

          $ svn remove conf/mapred-default.xml.template src/contrib/test/mapred-default.xml src/test/mapred-default.xml

          Hide
          Hadoop QA added a comment -

          -1, build or testing failed

          2 attempts failed to build and test the latest attachment http://issues.apache.org/jira/secure/attachment/12365395/HADOOP-785_3_20070908.patch against trunk revision r573777.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/719/testReport/
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/719/console

          Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.

          Hide
          Arun C Murthy added a comment -

          Resubmitting to Hudson... the previous error was due to an unrelated SocketTimeoutException in org.apache.hadoop.dfs.TestDecommission.

          Hide
          Hadoop QA added a comment -

          +1 http://issues.apache.org/jira/secure/attachment/12365395/HADOOP-785_3_20070908.patch applied and successfully tested against trunk revision r573777.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/720/testReport/
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/720/console

          Hide
          Doug Cutting added a comment -

          This looks good. But we need to update the documentation, add a unit test, and warn folks when they attempt to override a final parameter.

          Hide
          Doug Cutting added a comment -

          Here's a new version of the patch that updates the javadoc (providing an example), adds a unit test, and logs whenever someone attempts to override a final parameter.

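          As a rough illustration of such a unit test — a sketch under the assumptions that the markup is <final>true</final> and that Configuration exposes addResource(Path); it is not the committed test, and the test class and property names are invented for the example:

            import java.io.File;
            import java.io.FileWriter;
            import junit.framework.TestCase;
            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.Path;

            // Sketch only; not the committed test.
            public class TestFinalParametersSketch extends TestCase {

              // Write a one-property config file, optionally marked final.
              private File write(String prefix, String value, boolean isFinal) throws Exception {
                File f = File.createTempFile(prefix, ".xml");
                FileWriter out = new FileWriter(f);
                out.write("<?xml version=\"1.0\"?><configuration><property>"
                    + "<name>a.b.c</name><value>" + value + "</value>"
                    + (isFinal ? "<final>true</final>" : "")
                    + "</property></configuration>");
                out.close();
                return f;
              }

              public void testFinalIsNotOverridden() throws Exception {
                Configuration conf = new Configuration();
                conf.addResource(new Path(write("first", "site-value", true).getAbsolutePath()));
                conf.addResource(new Path(write("second", "job-value", false).getAbsolutePath()));
                assertEquals("site-value", conf.get("a.b.c"));  // the final value wins
              }
            }
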
          Hide
          Hadoop QA added a comment -

          +1 http://issues.apache.org/jira/secure/attachment/12365505/HADOOP-785_4.patch applied and successfully tested against trunk revision r574404.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/730/testReport/
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/730/console

          Hide
          Arun C Murthy added a comment -

          Thanks for the new patch, Doug! I definitely planned on fixing the documentation etc. (e.g. http://wiki.apache.org/lucene-hadoop/HowToConfigure) but appreciate your help! smile

          Hide
          Doug Cutting added a comment -

          I just committed this. Thanks, Arun!

          Hide
          Hudson added a comment -

          Integrated in Hadoop-Nightly #231 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/231/ )
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Nightly #311 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/311/ )

            People

            • Assignee:
              Arun C Murthy
            • Reporter:
              Owen O'Malley
            • Votes:
              1
            • Watchers:
              2
