Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Configuration parameters should be fully validated before name nodes or data nodes begin service.

      Required parameters must be present.
      Required and optional parameters must have values of proper type and range.
      Undefined parameters must not be present.

      (I was recently observing some confusion whose root cause was a mis-spelled parameter.)

        Issue Links

          Activity

          Hide
          Doug Cutting added a comment -

          > I would prefer creating new classes solely dedicated to configuration logic [ ... ]

          I think this varies, case-by-case. For a complex subsystem like HDFS, it may make sense to have dedicated configuration classes. For a standalone classes, like an InputFormat or compression codec, it probably makes sense to put configuration accessors directly on the class in question.

          Show
          Doug Cutting added a comment - > I would prefer creating new classes solely dedicated to configuration logic [ ... ] I think this varies, case-by-case. For a complex subsystem like HDFS, it may make sense to have dedicated configuration classes. For a standalone classes, like an InputFormat or compression codec, it probably makes sense to put configuration accessors directly on the class in question.
          Hide
          Konstantin Shvachko added a comment - - edited

          I agree, visibility of accessors should depend on whether they are used publicly or internally.
          Say getters for parameters used only in FSNamesystem should be package private.
          Some getters may be public, but the corresponding setters may not.

          > find the most-specific public class that encompasses the use and add the accessor there.

          I would prefer creating new classes solely dedicated to configuration logic rather then including implementation of accessors in existing public classes. Imo this makes a better structured code.

          Show
          Konstantin Shvachko added a comment - - edited I agree, visibility of accessors should depend on whether they are used publicly or internally. Say getters for parameters used only in FSNamesystem should be package private. Some getters may be public, but the corresponding setters may not. > find the most-specific public class that encompasses the use and add the accessor there. I would prefer creating new classes solely dedicated to configuration logic rather then including implementation of accessors in existing public classes. Imo this makes a better structured code.
          Hide
          Doug Cutting added a comment -

          > This means that the configuration classes should be public then, right?

          Yes, if the parameters they access should be publicly accessible. One might argue that certain parameters are only consumed internally and don't need public accessors, but, more typically, parameter accessors are on public classes.

          > And it doesn't matter where the get/setters are.
          > Particularly we can combine all of them in one class
          > or even place them in the Configuration class. Is it what you want?

          They shouldn't be all in one place or all in Configuration for the same reason that we don't put everything in a single file: we should attempt to keep related things together, to localize changes. So an HDFS-specific parameter accessor should be on an HDFS-specific class. How fine-grained we localize isn't clear. Generally, finer is better: find the most-specific public class that encompasses the use and add the accessor there. So if something's only used in the Datanode, but used in a few different classes there, then it might best be on Datanode.

          > What I meant is that we keep placing logically independent
          > code inside e.g. FSNamesystem, which makes it bigger, while it could easily be made a separate class.
          > And configuration is just an example of such logically independent part.

          If configuration stuff is not specific to FSNamesystem (i.e., logically independent) then it shouldn't go there. If it is specific to FSNamesystem then it could go there, or perhaps on a new class that's used only by FSNamesystem, e.g., FSNamesystemParams. If it's used equally by FSNamesystem and other classes then it could either go on an existing shared class (e.g., Namenode) or a new shared class (NamenodeParams).

          Show
          Doug Cutting added a comment - > This means that the configuration classes should be public then, right? Yes, if the parameters they access should be publicly accessible. One might argue that certain parameters are only consumed internally and don't need public accessors, but, more typically, parameter accessors are on public classes. > And it doesn't matter where the get/setters are. > Particularly we can combine all of them in one class > or even place them in the Configuration class. Is it what you want? They shouldn't be all in one place or all in Configuration for the same reason that we don't put everything in a single file: we should attempt to keep related things together, to localize changes. So an HDFS-specific parameter accessor should be on an HDFS-specific class. How fine-grained we localize isn't clear. Generally, finer is better: find the most-specific public class that encompasses the use and add the accessor there. So if something's only used in the Datanode, but used in a few different classes there, then it might best be on Datanode. > What I meant is that we keep placing logically independent > code inside e.g. FSNamesystem, which makes it bigger, while it could easily be made a separate class. > And configuration is just an example of such logically independent part. If configuration stuff is not specific to FSNamesystem (i.e., logically independent) then it shouldn't go there. If it is specific to FSNamesystem then it could go there, or perhaps on a new class that's used only by FSNamesystem, e.g., FSNamesystemParams. If it's used equally by FSNamesystem and other classes then it could either go on an existing shared class (e.g., Namenode) or a new shared class (NamenodeParams).
          Hide
          Konstantin Shvachko added a comment -

          JobConf does not set hdfs parameters, but I agree users might want to set/access inter-component parameters.
          My idea was to use carefully designed hierarchy of configuration classes and interfaces.
          Like TaskTrackerConfiguration can be inherited from DFSClientConfiguration.
          But I see the static approach you propose is simpler.
          This means that the configuration classes should be public then, right?
          And it doesn't matter where the get/setters are. Particularly we can combine all of them in one class
          or even place them in the Configuration class. Is it what you want?

          > I don't find the argument that FSNamesystem is already too big compelling.

          Yes the size is not important. What I meant is that we keep placing logically independent
          code inside e.g. FSNamesystem, which makes it bigger, while it could easily be made a separate class.
          And configuration is just an example of such logically independent part.
          If you write a converter for a parameter or add a verification constraint - these changes belong to
          the configuration only, namely to the implementation of getters, why modify FSNamesystem if it only calls them.

          Show
          Konstantin Shvachko added a comment - JobConf does not set hdfs parameters, but I agree users might want to set/access inter-component parameters. My idea was to use carefully designed hierarchy of configuration classes and interfaces. Like TaskTrackerConfiguration can be inherited from DFSClientConfiguration. But I see the static approach you propose is simpler. This means that the configuration classes should be public then, right? And it doesn't matter where the get/setters are. Particularly we can combine all of them in one class or even place them in the Configuration class. Is it what you want? > I don't find the argument that FSNamesystem is already too big compelling. Yes the size is not important. What I meant is that we keep placing logically independent code inside e.g. FSNamesystem, which makes it bigger, while it could easily be made a separate class. And configuration is just an example of such logically independent part. If you write a converter for a parameter or add a verification constraint - these changes belong to the configuration only, namely to the implementation of getters, why modify FSNamesystem if it only calls them.
          Hide
          Doug Cutting added a comment -

          > Why setters need to be static?

          Users need to, e.g., be able to set HDFS parameters on a JobConf. We can get away with a single subclass of Configuration that has setters, but once we add a second, it would be impossible to create a single configuration instance that can configure multiple components.

          > Why per-package, not per-component?

          That's fine too. You seemed to be complaining that classes were too specific for this case, so I said I was okay with per-package if you thought that more appropirate here, although perhaps that's too general for your taste in this case, and you'd rather separate, e.g., Namenode from Datanode parameters. That's fine with me too. However I don't find the argument that FSNamesystem is already too big compelling. That's a separate issue: it should perhaps be decomposed into multiple classes, and when that's done, configuration accessors might move around, but if there are FSNamesystem-specific configuration accessors then I'd argue they belong in FSNamesystem, regardless of that class's current size.

          Show
          Doug Cutting added a comment - > Why setters need to be static? Users need to, e.g., be able to set HDFS parameters on a JobConf. We can get away with a single subclass of Configuration that has setters, but once we add a second, it would be impossible to create a single configuration instance that can configure multiple components. > Why per-package, not per-component? That's fine too. You seemed to be complaining that classes were too specific for this case, so I said I was okay with per-package if you thought that more appropirate here, although perhaps that's too general for your taste in this case, and you'd rather separate, e.g., Namenode from Datanode parameters. That's fine with me too. However I don't find the argument that FSNamesystem is already too big compelling. That's a separate issue: it should perhaps be decomposed into multiple classes, and when that's done, configuration accessors might move around, but if there are FSNamesystem-specific configuration accessors then I'd argue they belong in FSNamesystem, regardless of that class's current size.
          Hide
          Konstantin Shvachko added a comment -

          Why setters need to be static? What is wrong with this

          class NameNodeConfig extends Configuration {
          	void setHttpBindAddres(String host, int port) {
          		set("dfs.http.bindAddress", host + ":" + port);
          	}
          }
          

          Why per-package, not per-component?

          Serialization is common for all of them: it's the one defined in Configuration.
          And the configuration files (hadooop-default.xml, hadoop-site.xml) are all the same.
          Just the classes are different because their accessors are different.

          Show
          Konstantin Shvachko added a comment - Why setters need to be static? What is wrong with this class NameNodeConfig extends Configuration { void setHttpBindAddres( String host, int port) { set( "dfs.http.bindAddress" , host + ":" + port); } } Why per-package, not per-component? Serialization is common for all of them: it's the one defined in Configuration. And the configuration files (hadooop-default.xml, hadoop-site.xml) are all the same. Just the classes are different because their accessors are different.
          Hide
          Doug Cutting added a comment -

          > The Configuration itself should remain the same for each component.
          > It just exposes get methods specific to the component.

          Yes, that would work for getters, but not for setters. In many cases we need setters too, and it would be confusing to implement getters and setters using different styles. Setters are best implemented as static methods, thus, for symmetry, getters must be also.

          > I do not support the idea of placing static getters for configuration parameters in the (top-level) component

          I'm okay having per-package config classes (e.g.m DFSConfig) that centralizes configuration setters and getters for that package, since, in some cases, the classes which consume these (e.g., FSNamesystem) are not public classes.

          Show
          Doug Cutting added a comment - > The Configuration itself should remain the same for each component. > It just exposes get methods specific to the component. Yes, that would work for getters, but not for setters. In many cases we need setters too, and it would be confusing to implement getters and setters using different styles. Setters are best implemented as static methods, thus, for symmetry, getters must be also. > I do not support the idea of placing static getters for configuration parameters in the (top-level) component I'm okay having per-package config classes (e.g.m DFSConfig) that centralizes configuration setters and getters for that package, since, in some cases, the classes which consume these (e.g., FSNamesystem) are not public classes.
          Hide
          Konstantin Shvachko added a comment -

          We already have JobConf derived from Configuration.
          I am just proposing to extend this approach to other components.
          The Configuration itself should remain the same for each component.
          That is it reads the same config files with the same override rules, and with the same properties inside.
          It just exposes get methods specific to the component.
          Conversion from one class to another can be easily provided by constructors like

           FsConfiguration(Configuration conf) 

          And then you can pass RPCCongifuration as a parameter into it since the latter
          is a subclass of Configuration and since it contains all properties related to rpc.
          So configuration class for each component is just a wrapper exposing methods and
          providing validation and potentially conversion of the parameters related to the component.
          I do not support the idea of placing static getters for configuration parameters in the (top-level) component
          classes because I would prefer to separate all configuration issues from the component code.
          FSNamesystem is again almost 4K lines long, JobTracker and TaskTracker over 2K lines.

          Show
          Konstantin Shvachko added a comment - We already have JobConf derived from Configuration. I am just proposing to extend this approach to other components. The Configuration itself should remain the same for each component. That is it reads the same config files with the same override rules, and with the same properties inside. It just exposes get methods specific to the component. Conversion from one class to another can be easily provided by constructors like FsConfiguration(Configuration conf) And then you can pass RPCCongifuration as a parameter into it since the latter is a subclass of Configuration and since it contains all properties related to rpc. So configuration class for each component is just a wrapper exposing methods and providing validation and potentially conversion of the parameters related to the component. I do not support the idea of placing static getters for configuration parameters in the (top-level) component classes because I would prefer to separate all configuration issues from the component code. FSNamesystem is again almost 4K lines long, JobTracker and TaskTracker over 2K lines.
          Hide
          Doug Cutting added a comment -

          > I'd prefer if we had a separate configuration class for each component, say NamenodeConfiguration, DatanodeConfiguration, ...

          This can be a problem when a configuration is passed between components. For example, if we had an FsConfiguration, RpcConfiguration, etc., then calls from, e.g., mapred code to these would require the creation of a new configuration. So this only works well for top-level components, when all calls from a component are to components that are configured by Configuration. But it can be hard to know when something is top-level, e.g., we'd like to be able to run datanode within other daemons.

          > That reminds me about one abandoned issue HADOOP-24 - an attempt to make a configuration interface.

          Hmm. It looks like I dropped the ball on that one! It's not back-compatible, since any code that calls 'new Configuration()' would break, and it's missing some javadoc, etc. But it might still be a nice change, although I don't see how it would help us here.

          > We should also use xml schema for the verification purposes.

          I don't see how that would do sufficient type checking any more easily. Do you?

          Show
          Doug Cutting added a comment - > I'd prefer if we had a separate configuration class for each component, say NamenodeConfiguration, DatanodeConfiguration, ... This can be a problem when a configuration is passed between components. For example, if we had an FsConfiguration, RpcConfiguration, etc., then calls from, e.g., mapred code to these would require the creation of a new configuration. So this only works well for top-level components, when all calls from a component are to components that are configured by Configuration. But it can be hard to know when something is top-level, e.g., we'd like to be able to run datanode within other daemons. > That reminds me about one abandoned issue HADOOP-24 - an attempt to make a configuration interface. Hmm. It looks like I dropped the ball on that one! It's not back-compatible, since any code that calls 'new Configuration()' would break, and it's missing some javadoc, etc. But it might still be a nice change, although I don't see how it would help us here. > We should also use xml schema for the verification purposes. I don't see how that would do sufficient type checking any more easily. Do you?
          Hide
          Konstantin Shvachko added a comment -

          I think this should be related to all hadoop components including hdfs client, secondary-node, job-tracker and task-tracker.
          They all should independently verify parameters they depend upon.

          I'd prefer if we had a separate configuration class for each component, say NamenodeConfiguration, DatanodeConfiguration, ...
          derived from the base Configuration class, with explicit getters and setters for each parameter used in the component.
          That way it will be easy to provide verification and backward compatibility.

          That reminds me about one abandoned issue HADOOP-24 - an attempt to make a configuration interface.
          What was the reason it had never been committed?

          We should also use xml schema for the verification purposes. That would make verification automatic, and there will be no
          need to write verification code.

          Show
          Konstantin Shvachko added a comment - I think this should be related to all hadoop components including hdfs client, secondary-node, job-tracker and task-tracker. They all should independently verify parameters they depend upon. I'd prefer if we had a separate configuration class for each component, say NamenodeConfiguration, DatanodeConfiguration, ... derived from the base Configuration class, with explicit getters and setters for each parameter used in the component. That way it will be easy to provide verification and backward compatibility. That reminds me about one abandoned issue HADOOP-24 - an attempt to make a configuration interface. What was the reason it had never been committed? We should also use xml schema for the verification purposes. That would make verification automatic, and there will be no need to write verification code.
          Hide
          Doug Cutting added a comment -

          > Configuration parameters should be fully validated before name nodes or data nodes begin service.

          Is the idea that namenode or datanode startup should fail fast when misconfigured? Or that we be able to check parameters without starting daemons?

          If the former, we can just be more careful to validate all critical parameters when they're first used. If the latter, we could add an option to the main() that simply validates critical parameters and exits. In either case, configuration validation should not be centralized in Hadoop, but rather implemented in each component that is configured.

          Show
          Doug Cutting added a comment - > Configuration parameters should be fully validated before name nodes or data nodes begin service. Is the idea that namenode or datanode startup should fail fast when misconfigured? Or that we be able to check parameters without starting daemons? If the former, we can just be more careful to validate all critical parameters when they're first used. If the latter, we could add an option to the main() that simply validates critical parameters and exits. In either case, configuration validation should not be centralized in Hadoop, but rather implemented in each component that is configured.
          Hide
          Owen O'Malley added a comment -

          I can see the motivation/frustration for this, but it doesn't work very well with the Hadoop configuration model, which is very open. In particular, it would have to be a very inclusive and configurable list, since the same configuration file is used for map/reduce and hdfs.

          I would think that the best tradeoff would be to have a configuration file that the namenode and datanodes use to say that all attributes under dfs.* can only be on this list.

          Show
          Owen O'Malley added a comment - I can see the motivation/frustration for this, but it doesn't work very well with the Hadoop configuration model, which is very open. In particular, it would have to be a very inclusive and configurable list, since the same configuration file is used for map/reduce and hdfs. I would think that the best tradeoff would be to have a configuration file that the namenode and datanodes use to say that all attributes under dfs.* can only be on this list.

            People

            • Assignee:
              Unassigned
              Reporter:
              Robert Chansler
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:

                Development