Issue Details (XML | Word | Printable)

Key: HADOOP-3479
Type: New Feature New Feature
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Hemanth Yamijala
Reporter: Hemanth Yamijala
Votes: 0
Watchers: 5
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Implement configuration items useful for Hadoop resource manager (v1)

Created: 02/Jun/08 12:56 PM   Updated: 20/Nov/08 11:38 PM
Return to search
Component/s: conf
Affects Version/s: None
Fix Version/s: 0.19.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works 3479.1.patch 2008-06-19 05:47 PM Hemanth Yamijala 30 kB
Text File Licensed for inclusion in ASF works 3479.2.patch 2008-06-30 04:11 PM Hemanth Yamijala 41 kB
Text File Licensed for inclusion in ASF works 3479.3.patch 2008-07-01 05:21 AM Hemanth Yamijala 41 kB
Text File Licensed for inclusion in ASF works 3479.4.patch 2008-07-02 03:55 PM Hemanth Yamijala 42 kB
Text File Licensed for inclusion in ASF works 3479.5.patch 2008-07-03 10:56 AM Hemanth Yamijala 42 kB
Text File Licensed for inclusion in ASF works 3479.patch 2008-06-13 05:52 AM Hemanth Yamijala 14 kB
Issue Links:
Blocker
 
Incorporates
 

Hadoop Flags: Reviewed
Resolution Date: 07/Jul/08 10:43 AM


 Description  « Hide
HADOOP-3421 lists requirements for a new resource manager for Hadoop. Implementation for these will require support for new configuration items in Hadoop. This JIRA is to define such configuration, and track it's implementation.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Hemanth Yamijala added a comment - 02/Jun/08 01:20 PM
The current list of configuration items, gleaned from HADOOP-3421, are the following:
  • Organization (org), identified by a string
    • Guarenteed capacity of resources per org.
    • Time limit within which resources redistributed from this org will be reclaimed by it.
  • Queue, identified by a string - org-name.queue-name should be unique
    • Organization to which queue belongs
    • Ordered list of users who can or cannot submit jobs to the queue
    • Default priority that will be assigned to all jobs submitted to this queue. This is optional, and if it is not defined, it means the queue does not support priorities.
    • Resource limit range per user.

Additional requirements that are related to configuration:

  • All of these configuration items can be modified only by the admin.
  • It should be possible to reflect modifications to these values without a JT restart.
  • There must be good default values for these items or they should be optional so that users who do not want to setup organizations, multiple queues etc will, ideally, not have to configure anything, or at worst, should configure very little.
  • Further the default values should also help in setting up new organizations, queues, etc. For instance, we can define a global default priority value that could apply to any newly created queue. And if required we should be able to define a different value for this queue, which would override the global value.

Hemanth Yamijala added a comment - 04/Jun/08 05:47 AM
Some initial thoughts, combining ideas Vivek expressed on a mail to core-dev.

We have the following options to consider:

  • Format for the new configuration options
  • Where to define them in. Could be related to the format.

Looking at format first, I can think of 2 options. Given that there is a set of properties (for e.g. user list, default priority, resource limit, etc) that are related, we could have a nested XML format - something like:

<Grid>
  <Organization>
    <Name>org1</Name>
    <MaxCapacity>100</MaxCapacity>
    <Queues>
      <Queue>
        <Name>queue1</Name>
        <AllowedUsers>u1,u2,u3</AllowedUsers>
        <DisallowedUsers>u3,u4,u5</DisallowedUsers>
        <AllowedOverrides>False</AllowedOverrides>
      </Queue>
    </Queues>
</Grid>

IMO, this is an intuitive model to capture the configuration. We can use a DOM parser like what we use currently in Configuration.java to construct Scheduler config objects.

The main drawbacks of this approach are:

  • A new format to administer. Could be a pain for administrators.
  • Parity expectations with other features provided by standard Hadoop configuration.

The other format tries to retain the same structure as current Hadoop configuration, which is truly like a list of key and value pairs. Here's an example:

<property>
    <name>hadoop.scheduler.orgs</name>
    <value>Org1,Org2</value>
    <description>Comma separated list of Org names</description>
</property>
<property>
    <name>hadoop.scheduler.Org1.max-capacity</name>
    <value>100</value>
</property>
<property>
    <name>hadoop.scheduler.Org1.queues</name>
    <value>q1,q2</value>
</property>
<property>
    <name>hadoop.scheduler.Org1.q1.allowedusers</name>
    <value>u1,u2,u3</value>
</property>

As shown above, the keys for the properties are used to indicate the grouping. All the properties for an Org would be under hadoop.scheduler.org-name, likewise for Queues. This implies that we need property names to be dynamically built by code and that we may need ways of listing all children of a given property - something that can possibly be solved using HADOOP-3407.

This format is much less intuitive, and maybe error prone to administer ? Also, we may have unnecessary restrictions or special handling on names. For e.g. a queue name cannot contain a '.'

However, it allows us to have the same format as in Hadoop and use the same features related to configuration. It would also help us to reuse the basic code for parsing and reading in configuration.

Regarding where to define this configuration, the first format will necessitate a new file, as the current hadoop configs are truly a single level hierarchy. The second option allows us to continue to use the current Hadoop config files: hadoop-defaults.xml and hadoop-site.xml.

Having a separate file has the benefits that we can define policies around how to manage updates to the file (for e.g. by reading it periodically, etc). However, it would add admin overhead, in that there is now one more file to administer.

Personally, I prefer the more intuitive format of Option 1. Though some learning is involved, it may be easier to learn this format.

Comments from others ? Any other options ?


Alejandro Abdelnur added a comment - 04/Jun/08 11:08 AM
Having the default implementation for authorization being file based is OK.

IMO there should be an interface that allows to enforce authorization based on external sources (if you want to leverage exiting stuff this could be based on JAAS for example, else it could be a simple interface that given a userID and a queue returns true or false with the default implementation reading for the config file).


Hemanth Yamijala added a comment - 04/Jun/08 11:45 AM

IMO there should be an interface that allows to enforce authorization based on external sources (if you want to leverage exiting stuff this could be based on JAAS for example, else it could be a simple interface that given a userID and a queue returns true or false with the default implementation reading for the config file).

This makes sense. We could have a property in the configuration that defines the implementation class we wish to you, so that we can plug-in different schemes. The default, simple scheme could be file based in which case the properties I've described above such as 'allowed users' could be used by the implementation class.


Hemanth Yamijala added a comment - 06/Jun/08 11:20 AM
Some points in favor of a separate file for queue configuration:
  • All parameters related to queues are controlled by administrators. These are needed only by the Jobtracker for V1 of the Scheduler implementation
  • The file needs to be monitored for updates periodically
  • The format is more nested (> 1 level of hierarchy)
  • In case the scheduler becomes separate from the JT (conceivably sometime in the future), a separate file helps because we won't have to separate the configuration items out of hadoop configuration then.

Given all these, I am thinking of implementing the following:

  • Define a hadoop.scheduler.config variable in hadoop configuration that will define the name of the config file for the Scheduler.
  • From Jobtracker read this value and create a SchedulerConf class that will parse this file and populate itself with the configuration parameters.
  • The SchedulerConf class will provide APIs to get and set Scheduler related configuration params, and also implement functionality such as looking for updates etc.

Please comment if you prefer a different approach.


Hemanth Yamijala added a comment - 06/Jun/08 11:39 AM
One comment from Vivek was to use 'ResourceManager' instead of Scheduler as a string at relevant places, as Scheduler may refer to a specific portion of the larger piece.

Hemanth Yamijala added a comment - 13/Jun/08 05:52 AM
The attached patch is a work-in-progress version of the resource manager configuration changes following requirements in HADOOP-3421. It is incomplete and only intended to get some early feedback on the basic approach.

The following changes are included:

  • Introduced a resource-manager-conf.xml which will be the basic configuration file for all resource manager related configuration. By default it ships with one default queue. The values for the parameters in this queue may need to be relooked and decided carefully.
  • Introduced a hadoop.resource-manager.config variable in hadoop-default.xml. User can specify a name of a file for this variable.
  • Introduced a ResourceManagerConf class which is responsible for parsing the config file and exposing the configuration through basic APIs. The main APIs are:
    Set<String> getQueues();
    String get(String queueName, String propertyName);

    There are convenience APIs for getting integers, booleans etc. More could be added.

  • The ResoureManagerConf reads from either the value specified in hadoop.resource-manager.config if defined, or from the resource-manager-conf.xml if the former is undefined.
  • The ResourceManagerConf object is initialized from the JobTracker constructor and exposed with a getter method. For illustration, I've just printed some LOG messages printing the queues and the properties.

Unit tests, documentation and even API set need to be completed before the patch can be submitted. If anyone feels strongly on an alternative approach, please comment now.


Hemanth Yamijala added a comment - 13/Jun/08 06:09 AM
One of the alternative designs that we considered is to see if the Configuration class can be reused.

The basic difference, as explained in comments above, is in the nested nature of the configuration - sets of properties are grouped under a queue configuration element, for e.g. We could possibly enhance Configuration to support parsing and exposing such nested configuration also, without changing the basic API. One approach could be to support nesting that occurs in the following manner:

<configuration>
  <level1 key="value">
    <level2>
      <property>
        <!-- usual key-value elements here -->
      </property>
    </level2>
  </level1>
</configuration>

We could then build a Map of an XPath type expression to a Properties object, where the XPatch expression takes us till the parent of a list of property elements. The current configuration is then basically a Properties object mapped for the element /configuration. Queue related properties could be looked up like /configuration/queues/queue[@name="default"]. We could even modify current configuration to be stored as a hierarchy:

<configuration>
  <hadoop>
    <mapred>
       <!-- mapred configuration -->
    </mapred>
    <hdfs>
       <!-- hdfs configuration -->
    </hdfs>
    <resource-manager>
       <!-- resource-manager configuration -->
    </resource-manager>
  </hadoop>
</configuration>

... and so on.

However, given this is the first attempt for building resource manager functionality, it seems prudent to keep things simple, if at the cost of duplicating a little code between Configuration and ResourceManagerConf, and refactor if things are working out fine. Also, Configuration is a religious topic and it would take a while to converge of how and what to do. smile. Only if people feel very strongly about this approach, should we consider it for Hadoop 0.19 though.


Steve Loughran added a comment - 13/Jun/08 11:39 AM
As someone whose product does configuration and deployment, this is somehting that I do feel strongly about.

While I like hierarchical representations, I dont think XPath should be used unless everyone understand the consequences. For example, what happens if more than one thing comes back? What happens if someone plays xml namespace games. I have tried to use XPath based configuration schemas and it soon gets complex. What if I want to do property expansion in the Path, like /configuration/queues/queue[@name="${user.name}"] ? what if I set a path that exists -is that an append or an overwrite?

Equally importantly, it forces a requirement for the XML DOM to be the only representation of content. That is a tree, not a graph, and not that agile. If you were to use LDAP or anything else then things wouldnt work.

So . I'm:
+1 for a consistent way of configuring things.
+1 for hierarchical representation of information.
+1 for decoupling the configuration source from the Configuration class, so that any provider of a hierarchical configuration name-value pairs can return values on name lookups.
-1 for XPath


Hemanth Yamijala added a comment - 13/Jun/08 12:33 PM
Steve, thanks for your comments. I'd thrown the XPath idea only to describe that an alternative design was considered, than what is implemented in the attached patch. That said, it is nice to hear early feedback that an XPath based scheme could be bad, and so not go down that path, if at all.

Some questions on your comments:

  • When you say 'consistent way', do you mean across Hadoop configuration and the Resource manager configuration ? That we define a similar format to represent both of them ?
  • Likewise, can you also explain a bit more about the decoupling of configuration source from the implementation ?

Steve Loughran added a comment - 13/Jun/08 03:42 PM
'consistent': how the things are configured and how the configs are read are similar. You don't want to go the java webapp way with an ear.xml, a war.xml, a persistence xml file, each with their own syntax and a need to try and keep the things coherent.

As of 5 minutes ago, I have checked in the outline slides for the august UK talk, in which I talk about managing the config and lifecycle of Hadoop nodes:

http://smartfrog.svn.sourceforge.net/viewvc/*checkout*/smartfrog/trunk/core/components/hadoop/doc/managing_hadoop.ppt

To handle hadoop configuration we had to subclass JobConf (as too much does instanceof JobConf tests), binding to our own configuration source and replicating any private methods of Configuration needed to serialize the configuration into the wire format, which is an XML document:

http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/components/hadoop/src/org/smartfrog/services/hadoop/conf/ManagedConfiguration.java?view=markup

There was some discussion on HADOOP-24 on doing things to Configuration; what would seem best would be to let different things be a source of configuration data, with the actual Configuration class or subclass (including JobConf and HBaseConf) getting their input from that source.


Hemanth Yamijala added a comment - 13/Jun/08 05:06 PM
One of the problems I am facing in visualizing a consistent syntax and API is that the Hadoop configuration is a single list of key-value pairs expressed in an XML format. So, the basic Configuration API is a String get(String property). With the resource manager changes, we have to support groups of such key-value pairs, where each instance of such a group represents related configuration for a single queue. In other words, a simple get(String property) no longer works. You need to tell for which instance you have to get the property. Any ideas on how to define a consistent API to represent both ?

Hemanth Yamijala added a comment - 16/Jun/08 07:00 AM
I looked at HADOOP-24 in detail. IMO, I feel there are 2 different issues we are discussing in these two JIRAs:
  • What HADOOP-24 described and concluded was to provide a way to read Configuration from multiple data sources, including files, LDAP etc. and separate the issue of filling up Configuration from a source (aka ConfSource approach) from answering queries of Configuration itself, which is the Configuration clas.
  • What we are trying to evaluate here is whether the configuration format (rather than it's source) be made generic enough to support a very generalized structure of configuration - multiple hierarchies, named groups, etc.

For doing a good job of the latter, IMO, we should just try and define a format that fits the current Hadoop configuration and the new Resource Manager configuration, if at all. Trying to build something too generic without fully understanding what else is required may not work at this level.

Even there, I feel like going with my original proposal of a separate configuration for the resource manager part. We are still figuring out with how the resource management functionality all turns out. If our attempts work well, then we can always refactor to make the Configuration class capable of handling both formats. Then we can make the ResourceManagerConf I defined in the patch a sub-class of the main Configuration class, and reuse the code.

Comments from others on this issue ?


Hemanth Yamijala added a comment - 17/Jun/08 10:31 AM
I filed HADOOP-3579 (a hundred JIRAs away - coincidence smile) to add support for groups in Configuration. I feel that will help greatly in having a consistent way of defining configuration. Keeping these two issues separate will help to move them forward. It will be very easy to migrate the ResourceManagerConf to the format I've specified in HADOOP-3579, if we decide that's the right way to move.

Hemanth Yamijala added a comment - 19/Jun/08 05:47 PM
This new patch adds Javadocs, unit tests and a more complete API for queue related configuration.

Elsewhere, (for e.g. on HADOOP-3579) we are discussing changes to Configuration to have one consistent format and mechanism. When we arrive at a consensus on that, we can implement changes in Configuration, and modify the ResourceManagerConf class to extend from Configuration and retain the APIs like getQueues etc for convenience. These calls will be delegated to the Configuration class to get the values. Until then, this patch will help to move things forward for other issues tied to HADOOP-3444.

Requesting comments...


Doug Cutting added a comment - 19/Jun/08 06:28 PM
This new class unfortunately duplicates a lot of code from Configuration. We need more configuration classes, but rather fewer. Long-term we should get rid of JobConf, and replace it with, e.g., setter and getter methods on a Job class, where Job has a Configuration field and does not subclass Configuration.

It also seems to me that your needs here could be met with the existing Configuration, by either:

  • a separate config file per queue; and/or
  • a property-naming convention, e.g., queue.default.allowed-users, for the "default" queue.
    If you need a list of all queues, you could either scan, or add a property that lists queue names.

This might not be optimal, but it would be functional. Revamping Configuration should not be done by a proliferation of new, incompatible configuration mechanisms, but rather by enhancing or replacing the existing mechanism.


Vivek Ratan added a comment - 20/Jun/08 04:03 AM
Doug, when we (mostly Hemanth and I) thought about configuration for queues, we felt we had a somewhat unique situation: the number of queues is dynamic, and each queue has different values for a common set of attributes . We felt we had two fundamental implementation choices:
  • use the 'flatness' of the existing Configuration class and config format to reflect a more hierarchical config structure that queues inherently need (by 'flatness', I mean that we don't have an obvious way today to express hierarchies or a dynamic number of entries).
  • extend the Configuration class and its supporting classes to handle hierarchies and dynamic number of properties

We looked hard at the first option and considered some strategies, including some of the ones you've mentioned. In fact, I sent out a mail to code-dev on Tue, 6/3, suggesting an option of having a property that lists comma-separated queue names or number of queues, and then building the property name string for each queue attribute and getting its value from the config file. The pros are obvious: we use the existing framework and do not duplicate code. But we felt like the configuration framework needed to fundamentally handle hierarchies and dynamic number of properties. Arguably, it's only for queues today, but later, if/when we support hierarchies of Orgs/queues/users and perhaps overwriting of default values (lower-level entities in the hierarchies can override higher-level defaults; for example, queues can specify some defaults for users which some users can override), we will need such functionality.

Now, if we need this functionality, we felt we had two options:

  • we could alter the Configuration class to support the new features, then perhaps have a QueueConf subclass, just like we have one for JobConf.
  • we could build this functionality in the QueueConf class separately to see how well it works

There is no doubt that the first approach is a better longer-term solution. However, we didn't really want to change the Configuration class too much at this stage. It's a core class, and given that this whole stuff about queues and orgs is fairly new and we don't know how much it will change over time, we felt that we should restrict our modifications to the QueueConfig class for now. This isolates the more stable Configuration class from too many changes. It's also something we can do faster. Once we feel we have it right, we do want to do the right thing long-term, which is to build support in the Configuration class. Furthermore, people haven't responded very much to our proposal, and given that configuration is usually one of those areas where there are lots of string views, we wanted to keep the code impact minimal, pending further discussions. The flip side is duplication of code, which you have pointed out. And there is always the danger that once things get into the code base, they're often not modified according to their original intent, but that's something we need to be disciplined about. But eventually, we do want to go with the first option.

I'm personally not against using the current flat config structure to handle queues, at least until we have more use cases for hierarchical configuration. But I think that by limiting the new code to a separate class, and the new configuration to a separate file, we isolate ourselves against too many changes to code in the future, till we get our use cases right. And the configuration format that Hemanth has proposed is more compact and easier to understand than having a separate property for every attribute of every queue.

These are the reasons for our proposal. We'd love to hear more. Hemanth's been soliciting comments for a while


Hemanth Yamijala added a comment - 20/Jun/08 03:08 PM
Doug, I fully agree with you on the need to ultimately keep all configuration management with the Configuration class and not proliferate random formats. I opened HADOOP-3579 with the intention of doing just that. It would be nice if you can take a look at what I propose there and comment.

However, I feel having a wrapper around such a Configuration class that provides a closer API for accessing Queue related configuration is a good idea. I imagine the ResourceManagerConf class will evolve to that. It would have a Configuration class as a member variable and provide APIs such as getQueues which will delegate calls to Configuration. I'm concerned that it would be a while until we can arrive at a generic format that a majority of us agree with, and the wrapper helps to abstract the rest of the code from these details.

If you look at the 2nd comment I made on this issue, I mention the format you proposed as an option. There are disadvantages I've mentioned there (less intuitive, restrictions on queue names, etc). However, it will have the advantage of reusing well tested code. The other option is what I've implemented. It does repeat code, and so is undesirable. However, it is a more intuitive format IMO. I selected this option because in my mind, I see it as a temporary arrangement until we fix Configuration to handle hierarchies.

Honestly, if there are no major concerns from admins on the format you proposed, I would be happy to implement it that way. Does this sound reasonable ? Any other ideas ?


Doug Cutting added a comment - 20/Jun/08 08:17 PM
> by limiting the new code to a separate class, and the new configuration to a separate file, we isolate ourselves against too many changes to code in the future

Yes, but at a high cost. Not only is code duplicated, but a new public API is added. New public APIs should only be added when we intend to support them long-term. If you want to implement something short-term, then it ought to be package-private, located with the rest of your implementation, not grouped with other generic, public configuration APIs.

Until HADOOP-3579 is resolved, I would prefer an approach that leverages the existing Configuration mechanism, perhaps with minor modifications if needed. (For example, one might reasonably wish to be able to create a Configuration from a file without reading hadoop-default.xml and hadoop-site.xml.)


Hemanth Yamijala added a comment - 24/Jun/08 03:22 PM
After some internal discussion, and considering Doug's suggestions in detail, I am summarizing the choices we have, and some recommendations.

There are 3 things considered: storage, format and API.

For storage, there are the following options:

  • Store all resource manager configuration in a separate file.
  • Store the configuration in different files, one per queue, and maybe one for the default. This could help us completely reuse the Configuration class. However, the implementation may require multiple Configuration objects to be created, and does not scale as well as the other option. Also, managing multiple files could be a problem.

For format, there are the following options:

  • Store in a hierarchical fashion as described above (and in the first patch uploaded). This is an intuitive format, but requires special parsing and code duplication with the Configuration class.
  • Store in a flat format using the property naming convention as suggested by Doug. For e.g., it would look something like this:
    <property>
        <name>hadoop.rm.queues</name>
        <value>q1,q2</value>
    </property>
    <property>
        <name>hadoop.rm.max-capacity</name>
        <value>200</value>
    </property>
    <property>
        <name>q1.hadoop.rm.max-capacity</name>
        <value>100</value>
    </property>
    <property>
        <name>q2.hadoop.rm.max-capacity</name>
        <value>100</value>
    </property>

    This format has the advantage that the existing Configuration mechanism can be reused to a large extent. However, it may be cumbersome for admins to manage.

For the API, there should be a separate class like ResourceManagerConf that provides an API like below:

Set<String> getQueues();
int getMaxCapacity(String queueName);
void setMaxCapacity(String queueName, int capacity);

Having such an API makes it easy for the client, currently the JobTracker to access Queue configuration.
We have two options:

  • Make this package private. Has the advantage and flexibility that it can be changed until we get sufficient use cases for making it public, and/or the API becomes stable.
  • Make this a new public API. Personally, I don't see a big advantage of doing this.

As a side note, I made a mistake of adding this to the core conf package in my earlier patch. This is not required, and should be in the package along with other resource manager classes.

In summary, I would prefer a design that looks as follows:

  • Use a separate file to store the configuration
  • Store properties using a special naming convention, so as to keep the hierarchy flat and allow for reuse of the Configuration class.
  • Define a new package-private class that exposes the properties using getters and setters.
    The implementation of this class could use the Configuration class underneath to parse and store the values, and retrieve them using Configuration's API.

Comments ?


Hemanth Yamijala added a comment - 24/Jun/08 03:30 PM
Of course, please note that once HADOOP-3579 is fixed, we can make the file format also intuitive for administrators. We can retain the API and the storage in a separate file.

Rob Weltman added a comment - 24/Jun/08 03:58 PM
Hierarchical is much better than flat/generic for the configuration format. It is possible and even likely that human beings will be editing the config files manually for some time. Having all properties associated with a queue under that queue instead of spread around the file with no hard boundaries will lead to errors such as changing the wrong property, deleting some but not all properties for a queue to be removed, deleting properties from a queue that was not supposed to be touched, etc.

Strongly typed configuration allows for quick and efficient automatic validation as well, so we can reduce the likelihood of typos in deployed configuration.


Owen O'Malley added a comment - 24/Jun/08 04:18 PM
I like having them in the same file, with the naming scheme.

I'd propose flipping the names around so that they form a hierarchy:

<property>
    <name>hadoop.rm.queues</name>
    <value>q1,q2</value>
</property>
<property>
    <name>hadoop.rm.max-capacity</name>
    <value>200</value>
</property>
<property>
    <name>hadoop.rm.q1.max-capacity</name>
    <value>100</value>
</property>
<property>
    <name>hadoop.rm.q2.max-capacity</name>
    <value>100</value>
</property>

It would be trivial to write a pretty printer for the queue configuration that would help spot problems with bad edits before they went live. Would that address your concerns Rob?

In terms of API, I think that Hemanth means something like:

class ResourceManager {
   private Configuration conf = ...
   public Container<String> getQueueNames()
   public static int getMaxCapactiry(String queueName);
...
}

which sounds reasonable.


Steve Loughran added a comment - 24/Jun/08 04:19 PM
+1 to hierarchical; it can be generally useful. Going with long.property.names. is simple, but doesnt scale and leads to files like log4j.properties.

=0 to strong typing. That may seem odd as SmartFrog does have strong types integers, floats, strings but note that Hadoop Configuration currently does java property expansion in its work, so it expects to get raw strings and post-process them. FWIW, we work in our types and then toString() everything in the hadoop binding, so you can do integer maths (1024 * 1024 * 1024) and boolean logic, and hadoop doesn't need to care.

If the XML format adopts XML Schema types then you couldnt have a reference like ${http.Proxy.Port} in your (integer) port property, as the string would still be unexpanded. during XSD validation.

Rob, what you need is "preflight validation", which takes the conf files + other settings and validates them, either locally or in the context of the target system

I am offering the engineering effort needed to get some of this working with a couple of back ends SmartFrog and JNDI as both are hierarchical and so test things better. That's about 30% of my time, depending on other crises.


Owen O'Malley added a comment - 24/Jun/08 04:22 PM
I should also add that if you had a lot of queues, it would probably make sense to do:
bin/hadoop queueadmin -print > orig
emacs $HADOOP_CONF/hadoop-site.xml
bin/hadoop queueadmin -print > new
diff -wc orig new

to make sure that you changed precisely the desired queue.


Hemanth Yamijala added a comment - 24/Jun/08 04:22 PM

In terms of API, I think that Hemanth means something like: ...

Yes, that's what I meant. Thanks for clarifying, Owen.


Owen O'Malley added a comment - 24/Jun/08 05:26 PM
I like the hierarchical solution, if it is applied to all configs, but that seems like a bigger ticket item.

I'd suggest that Hemanth take this approach for now and when HADOOP-3579 is done, we can change this over to use the new structure. I'd rather not have a custom format & parser that is only used for queue configuration.


Doug Cutting added a comment - 24/Jun/08 06:10 PM
> I am offering the engineering effort needed to get some of this working with a couple of back ends [...]

Great! It seems like HADOOP-3579 should be the focus of your efforts, not this issue, right? Perhaps you can make a concrete proposal there?


Rob Weltman added a comment - 24/Jun/08 09:29 PM
Pretty-printing tools may be nice but do not replace proper structure in the document itself, especially if it is going to be edited with an off-the-shelf tool (XML editor or text editor) or viewed with an off-the-shelf tool (HTML browser, for example).

I really don't see any advantage to the generic name-value structure over one that reflects the semantics of the content. If it is a matter of this release or the next one and the decision is time-based, it could be that hierarchical organization has to slip to the next release. There is a risk that initial decisions stick, though. It isn't necessarily the case that "there will always be time to redo this later".

If the proposal is to switch to hierarchical structure when HADOOP-3579 is addressed, maybe that is the right compromise.


Doug Cutting added a comment - 24/Jun/08 10:51 PM
> If the proposal is to switch to hierarchical structure when HADOOP-3579 is addressed, maybe that is the right compromise.

Yes, I think that is the proposal: if we need a new configuration feature, let's not add a new configuration mechanism, but rather fix the current mechanism to incorporate that feature and use it. The feature (support for hierarchical properties) seems generic, not specific to resource management and probably deserves a generic solution.


Hemanth Yamijala added a comment - 25/Jun/08 02:45 PM
So, we go with the compromise mentioned here until HADOOP-3579 is addressed. I will work on a new patch and upload. Thanks for your comments and help in reaching a decision.

Hemanth Yamijala added a comment - 30/Jun/08 04:11 PM
Attaching new file implementing the proposal above.

Hemanth Yamijala added a comment - 30/Jun/08 04:21 PM
Summary of the changes in the patch:

2 changes are made to the core Configuration class:

  • Added a new constructor that can be used to avoid loading the default resources (hadoop-site and hadoop-defaults)
  • Added a new API to reload configuration, by clearing the current list of properties, overlay and final parameters. The next set of calls will cause the data structures to be filled again from the resources that are already added. We needed this functionality for the resource manager configuration. I don't know enough about the Configuration class's usage to check if this API addition is valid or not. If there are things I have missed, and this API is not suitable for Configuration, it can be implemented only for the resource manager class.

In addition to the above, I've added a ResourceManagerConf class that is package private to org.apache.hadoop.mapred, and provides getter and setter methods for various configuration parameters needed for the resource manager. It also provides an option to print out the options to a standard output. This will facilitate building the pretty printing tools that Owen mentioned above.

Added unit tests for both Configuration and ResourceManagerConf.

Comments please.


Hemanth Yamijala added a comment - 01/Jul/08 04:00 AM
This must be committed for the core scheduler functionality to be made PA.

Hemanth Yamijala added a comment - 01/Jul/08 05:21 AM
New patch that corrects typos in some variable names and documentation. Also renames a few methods to better reflect the meaning. These are based on feedback from Vivek.

Doug Cutting added a comment - 01/Jul/08 06:34 PM
Overall this looks fine. A few minor issues:
  • addResource() should call reloadConfiguration, to eliminate duplicate code
  • I can't see where the 'overlay' feature is ever used. Should we eliminate it?
  • The first parameter of 'addResource' isn't needed.

The latter two are not your fault, but should probably be cleaned up anyway...


Hemanth Yamijala added a comment - 01/Jul/08 11:51 PM

I can't see where the 'overlay' feature is ever used. Should we eliminate it?

I thought this was being used to note which are the values that are stored through 'set' operations, because the set API adds the values to the overlay and then to the properties instance. But I realize now there is no way to retrieve these, because getOverlay is a private method. So, I can remove this. Just to be safe though, should I do it through a separate JIRA ?

addResource() should call reloadConfiguration, to eliminate duplicate code

Agreed. I didn't have common code because I was clearing the overlay in reloadConfiguration and addResource was not. But if we remove overlay as discussed above, then it is easy to reuse. Even if not, I can provide a private reloadConfiguration(boolean clearOverlay) and do it still.

The first parameter of 'addResource' isn't needed.

You mean the 'resources' parameter, right ? I wondered why it was being used like that. Can remove it. Again, can be part of the cleanup JIRA ?


Hemanth Yamijala added a comment - 02/Jul/08 03:51 PM
It appears we need the overlay feature, after all. A failed test case in TestConfiguration when I made the change highlighted its use.

Consider the following sequence of operations:

conf.addResource(someResource);
conf.set("somekey", "somevalue");
conf.addResource(someOtherResource);

At step 3, the properties object is cleared. If the overlay is not present, then it basically means that the value of "somekey" is lost at this point. The presence of the overlay makes the set values to come back when the resources are reloaded. Also, the reason it is called overlay seems to be because the values set via the setter methods override everything else. This is because they are added at the last step in the getProps method.

Thinking about this, it seemed to make sense to retain this behavior in reloadConfiguration, as in some way it is identical to addResource of all resources again in order.

So, in summary, the changes I made based on Doug's comments are:

  • Modified the behavior of reloadConfiguration to not clear the overlaid properties
  • Enhanced the test case of reload to make sure the overlay behavior is retained. This makes the testcase similar to the testOverlay method in TestConfiguration, which, very usefully, highlighted the need for the overlay feature.
  • Calling reloadConfiguration from addResource
  • Removed the first parameter of addResource, and since now the names would clash, renamed that method to addResourceObject.

Hemanth Yamijala added a comment - 02/Jul/08 03:55 PM
Patch incorporating Doug's comments.

Hemanth Yamijala added a comment - 02/Jul/08 03:56 PM
Running through Hudson, with the hope that the last patch is acceptable.

Hadoop QA added a comment - 02/Jul/08 05:38 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12385129/3479.4.patch
against trunk revision 673378.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 6 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to introduce 2 new Findbugs warnings.

-1 release audit. The applied patch generated 206 release audit warnings (more than the trunk's current 205 warnings).

+1 core tests. The patch passed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2782/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2782/artifact/trunk/current/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2782/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2782/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2782/console

This message is automatically generated.


Doug Cutting added a comment - 02/Jul/08 07:07 PM
+1 This looks good to me.

Hemanth Yamijala added a comment - 03/Jul/08 10:56 AM
Will attach a new patch for fixing the findbugs warning.
The release audit warning is because there is a new configuration file. It is expected.
Contrib tests failure is not related to this patch.

Hemanth Yamijala added a comment - 03/Jul/08 10:56 AM
Fixed findbugs warnings.

Hadoop QA added a comment - 03/Jul/08 02:13 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12385195/3479.5.patch
against trunk revision 673517.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 6 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

-1 release audit. The applied patch generated 206 release audit warnings (more than the trunk's current 205 warnings).

+1 core tests. The patch passed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2791/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2791/artifact/trunk/current/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2791/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2791/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2791/console

This message is automatically generated.


Devaraj Das added a comment - 07/Jul/08 10:43 AM
I just committed this. Thanks, Hemanth!

Hudson added a comment - 22/Aug/08 12:34 PM