  1. Hadoop Common
  2. HADOOP-16064

Load configuration values from external sources



    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved


      This is a proposal to improve Configuration.java so that it can load configuration from external sources (a Kubernetes ConfigMap, an external HTTP request, any cluster manager such as Ambari, etc.).

      I will attach a patch to illustrate the proposed solution, but please comment on the concept first; the patch is only a PoC and is not fully implemented.


      • Load the configuration files (core-site.xml/hdfs-site.xml/...) from external locations instead of the classpath (classpath remains the default)
      • Make the configuration loading extensible
      • Do it in a backward-compatible way, with minimal changes to the existing Configuration.java


       1.) Load configuration from the namenode (http://namenode:9878/conf). With this approach only the namenode has to be configured; other components need only the URL of the namenode.

       2.) Read configuration directly from a Kubernetes ConfigMap (or from Mesos).

       3.) Read configuration from any external cluster management tool (such as Apache Ambari or an equivalent).

       4.) As of now, in the Hadoop Docker images we transform environment variables (such as HDFS-SITE.XML_fs.defaultFs) into configuration XML files with the help of a Python script. With the proposed implementation it would be possible to read the configuration directly from the system environment variables.


      The existing Configuration.java can read configuration from multiple sources, but most of the time it is used to load predefined config names ("core-site.xml" and "hdfs-site.xml") without an explicit location. In this case the files are loaded from the classpath.

      I propose adding an option to define the default location of core-site.xml and hdfs-site.xml (any configuration that is identified by a string name), so that external sources can be used instead of the classpath.

      Configuration loading requires an implementation plus its own configuration (where the external configs are located). We can't use the regular configuration to configure the config loader (a chicken-and-egg problem).

      I propose to use a new environment variable, HADOOP_CONF_SOURCE.

      The environment variable could contain a URL, where the scheme of the URL defines the config source and all the other parts configure the access to the resource.
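
      As a rough illustration of how such a value could be interpreted, the sketch below parses HADOOP_CONF_SOURCE with java.net.URI and prints its parts. The class name ConfSourceUri and the helper parse are hypothetical and are not taken from the patch; the example URL is the namenode address mentioned in the use cases.

```java
import java.net.URI;
import java.util.Optional;

// Hypothetical helper, not part of the attached patch.
final class ConfSourceUri {

  /**
   * Returns the parsed URI, or empty if the variable is unset or blank.
   * URI.create throws IllegalArgumentException for a malformed value.
   */
  static Optional<URI> parse(String value) {
    if (value == null || value.trim().isEmpty()) {
      return Optional.empty(); // no external source: keep the classpath default
    }
    return Optional.of(URI.create(value.trim()));
  }

  public static void main(String[] args) {
    Optional<URI> source = parse(System.getenv("HADOOP_CONF_SOURCE"));
    if (!source.isPresent()) {
      System.out.println("HADOOP_CONF_SOURCE not set, using the classpath");
      return;
    }
    URI uri = source.get();
    // The scheme would select the source type; the rest configures access to it.
    System.out.println("scheme=" + uri.getScheme());
    System.out.println("host=" + uri.getHost() + " port=" + uri.getPort());
    System.out.println("path=" + uri.getPath() + " query=" + uri.getQuery());
  }
}
```

      When the variable is unset the sketch falls back to the classpath, which matches the goal of keeping the classpath as the default.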





      The ConfigurationSource interface can be as simple as:

      /**
       * Interface to load Hadoop configuration from a custom location.
       */
      public interface ConfigurationSource {
        /**
         * Called once with the defined configuration URL.
         * @param uri the configuration source URL
         */
        void initialize(URI uri) throws IOException;
        /**
         * Called to load a specific configuration resource.
         * @param name of the configuration resource (e.g. hdfs-site.xml)
         * @return list of loaded configuration keys and values
         */
        List<ParsedItem> readConfiguration(String name);
      }

      We can choose the right implementation based on the scheme of the URI, using the Java Service Provider Interface mechanism (META-INF/services/org.apache.hadoop.conf.ConfigurationSource).
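
      A minimal sketch of scheme-based selection is shown below. It assumes each provider can report the scheme it handles through a scheme() method; that method, the simplified Source interface, and the stub helper are illustrative assumptions and not part of the proposed ConfigurationSource. java.util.ServiceLoader is the standard JDK entry point for the SPI mechanism.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.Arrays;
import java.util.Optional;
import java.util.ServiceLoader;

// Hypothetical selector, not part of the attached patch.
final class SourceSelector {

  /** Simplified stand-in for the proposed interface; scheme() is an assumption. */
  interface Source {
    String scheme();
    void initialize(URI uri) throws IOException;
  }

  /** Initializes and returns the first candidate whose scheme matches the URI scheme. */
  static Optional<Source> select(Iterable<Source> candidates, URI uri) {
    for (Source candidate : candidates) {
      // URI schemes are case-insensitive; a null scheme never matches.
      if (candidate.scheme().equalsIgnoreCase(uri.getScheme())) {
        try {
          candidate.initialize(uri);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
        return Optional.of(candidate);
      }
    }
    return Optional.empty();
  }

  /** Test double that only remembers its scheme. */
  static Source stub(String scheme) {
    return new Source() {
      @Override public String scheme() { return scheme; }
      @Override public void initialize(URI uri) { /* nothing to set up */ }
    };
  }

  public static void main(String[] args) {
    URI uri = URI.create("http://namenode:9878/conf");
    // ServiceLoader finds providers listed in META-INF/services/<interface name>;
    // with no provider file on the classpath this simply yields no candidates.
    Optional<Source> viaSpi = select(ServiceLoader.load(Source.class), uri);
    System.out.println("SPI provider found: " + viaSpi.isPresent());
    Optional<Source> viaStubs = select(Arrays.asList(stub("env"), stub("http")), uri);
    System.out.println("stub selected: " + viaStubs.map(Source::scheme).orElse("none"));
  }
}
```

      Because ServiceLoader implements Iterable, the same select method works for discovered providers and for hand-built lists in tests.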

      This could be done with minimal modification to Configuration.java (see the attached patch as an example).

       The patch contains two example implementations:


      This can load configuration from environment variables based on a naming convention (e.g. HDFS-SITE.XML_hdfs.dfs.key=value).
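
      The naming convention could be handled roughly as in the sketch below. It assumes the prefix is the upper-cased resource name followed by an underscore, and that everything after the prefix is the configuration key; the exact rule used by the patch may differ. The class name EnvConfigSketch is hypothetical, and a plain Map is returned in place of the patch's ParsedItem list.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not the implementation from the attached patch.
final class EnvConfigSketch {

  /**
   * Collects the entries of env whose names start with the upper-cased
   * resource name plus "_" (assumed convention) and strips that prefix.
   */
  static Map<String, String> readConfiguration(Map<String, String> env, String resourceName) {
    String prefix = resourceName.toUpperCase(Locale.ROOT) + "_";
    Map<String, String> result = new TreeMap<>();
    for (Map.Entry<String, String> entry : env.entrySet()) {
      if (entry.getKey().startsWith(prefix)) {
        result.put(entry.getKey().substring(prefix.length()), entry.getValue());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Reads the real process environment; prints an empty map if nothing matches.
    System.out.println(readConfiguration(System.getenv(), "hdfs-site.xml"));
  }
}
```

      Passing the environment in as a parameter keeps the lookup testable without changing the process environment.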


       This implementation can load the configuration from the /conf servlet of any Hadoop component.
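
      A rough sketch of the HTTP variant follows. It assumes the servlet returns the usual Hadoop configuration XML (a configuration element containing property elements with name and value children) and parses it with the JDK's DOM parser. The class name ConfServletSketch is hypothetical, and the patch may fetch or parse the response differently.

```java
import java.io.InputStream;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch, not the implementation from the attached patch.
final class ConfServletSketch {

  /** Parses Hadoop-style configuration XML into an ordered key/value map. */
  static Map<String, String> parse(InputStream in) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
      Document doc = factory.newDocumentBuilder().parse(in);
      NodeList props = doc.getElementsByTagName("property");
      Map<String, String> result = new LinkedHashMap<>();
      for (int i = 0; i < props.getLength(); i++) {
        Element prop = (Element) props.item(i);
        String name = childText(prop, "name");
        String value = childText(prop, "value");
        if (name != null && value != null) {
          result.put(name, value); // skip properties without a name or value
        }
      }
      return result;
    } catch (Exception e) {
      throw new IllegalStateException("Cannot parse configuration XML", e);
    }
  }

  /** Returns the trimmed text of the first child element with the given tag, or null. */
  private static String childText(Element parent, String tag) {
    NodeList nodes = parent.getElementsByTagName(tag);
    return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 1) {
      System.err.println("usage: ConfServletSketch <conf-url>");
      return;
    }
    // Example argument: http://namenode:9878/conf
    try (InputStream in = URI.create(args[0]).toURL().openStream()) {
      parse(in).forEach((key, value) -> System.out.println(key + "=" + value));
    }
  }
}
```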



        1. HADOOP-16064.001.patch
          31 kB
          Marton Elek



            elek Marton Elek

