Hadoop Common / HADOOP-7359

Pluggable interface for cluster membership

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Currently Hadoop uses local files to determine cluster membership. With HDFS for example, dfs.hosts and dfs.hosts.exclude are used.

      To enable tighter integrations cluster membership should be an interface, with the current file-based functionality provided as the default implementation. The common case would be no functional change, however, sites could plug an alternative implementation in, such as pulling the machine lists from a machine database.

      DETAILS:

      Two machine lists, includes and excludes, are used to define cluster membership and state. HostsFileReader currently handles reading these lists from files, whose names are passed in by FSNamesystem for HDFS and JobTracker for MR.

      The proposed change is adding a HostsReader interface to common, and changing HostsFileReader to an abstract class that functions the same as today.

      Two new classes, DFSHostsFileReader and MRHostsFileReader, extend HostsFileReader and simply pass the appropriate file names in. These new classes are needed because config key names live outside common.

      Two new conf keys, defaulting to the file-based readers, would be added to choose a different hosts reader:

      dfs.namenode.hosts.reader.class
      mapreduce.jobtracker.hosts.reader.class
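As a hedged sketch of the proposal above (the HostsReader name and config keys come from this description, but the method names and the one-host-per-line file format are assumptions on my part), the interface and its default file-based reader might look like:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

// Hypothetical shape of the proposed HostsReader interface; the real
// patch lives in the attached diff, and these method names are guesses.
interface HostsReader {
    Set<String> getHosts() throws IOException;           // includes list
    Set<String> getExcludedHosts() throws IOException;   // excludes list
}

// Default file-based implementation, mirroring today's behavior of
// reading dfs.hosts / dfs.hosts.exclude style files (one host per line).
class FileHostsReader implements HostsReader {
    private final Path includesFile;
    private final Path excludesFile;

    FileHostsReader(Path includesFile, Path excludesFile) {
        this.includesFile = includesFile;
        this.excludesFile = excludesFile;
    }

    private static Set<String> readHosts(Path file) throws IOException {
        Set<String> hosts = new HashSet<>();
        for (String line : Files.readAllLines(file)) {
            String host = line.trim();
            if (!host.isEmpty()) {
                hosts.add(host);  // skip blank lines, keep everything else
            }
        }
        return hosts;
    }

    @Override
    public Set<String> getHosts() throws IOException {
        return readHosts(includesFile);
    }

    @Override
    public Set<String> getExcludedHosts() throws IOException {
        return readHosts(excludesFile);
    }
}
```

An alternative reader (ZK, machine database) would implement the same two methods, and the conf key would select the class to instantiate.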

      Comments/suggestions? I have most of this written already but would love some feedback on the general idea before posting the diff.

      1. HADOOP-7359.diff
        59 kB
        Travis Crawford

        Issue Links

          Activity

          Todd Lipcon added a comment -

          Sounds like a reasonable design to me!

          Travis Crawford added a comment -

          Common portion of proposed change.

          Travis Crawford added a comment -

          Would anyone object to allowing the HostsReader to trigger refreshNodes? That would let Hadoop scan for or be notified of cluster membership changes and automagically do the Right Thing.

          DETAILS

          Taking a step back, this change would be the most useful if your authoritative source for machine roles is stored Somewhere Else and you want Hadoop to integrate. The posted diff simply lets you pull the lists of included/excluded hosts from such a source, but does not activate the new lists - you still need to refreshNodes.

          Imagine you update the authoritative source with new/removed machines and want Hadoop to learn about the change (ZK watch, polling, etc.). It would be very handy for cluster membership & state changes to propagate without manual intervention as is needed today. Permitting HostsReader to call refreshNodes would accomplish this goal.

          PROPOSED IMPLEMENTATION

          Introduce a "Refreshable" interface that both FSNamesystem and JobTracker implement, that only defines a refreshNodes method. HostsReader would have an initialize method that takes a Refreshable and users could choose to call refreshNodes.

          The current file-based cluster membership would continue to work exactly as it does today.

          Sort of a bigger change, but potentially very useful at larger sites. If there's general agreement this would be useful I'll post a diff. If not, I still think there's value in this change as it means no more copy/pasting lists of machines from the machine database
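A minimal sketch of the Refreshable idea described above; only the names Refreshable, initialize, and refreshNodes appear in the comment, so the exact signatures here are assumptions, and the membership change is simulated rather than driven by a real ZK watch or poller:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical callback interface; FSNamesystem and JobTracker would
// each implement the single refreshNodes() method.
interface Refreshable {
    void refreshNodes();
}

// A hosts reader that is handed a Refreshable at initialization and may
// call refreshNodes() when it detects a membership change.
class NotifyingHostsReader {
    private Refreshable target;
    private Set<String> hosts = Collections.emptySet();

    void initialize(Refreshable target) {
        this.target = target;
    }

    Set<String> getHosts() {
        return hosts;
    }

    // Simulates the authoritative source changing; a real implementation
    // would be driven by a ZooKeeper watch or a polling thread.
    void setHosts(Set<String> newHosts) {
        this.hosts = newHosts;
        if (target != null) {
            target.refreshNodes();  // propagate without manual intervention
        }
    }
}
```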

          Aaron T. Myers added a comment -

          Would anyone object to allowing the HostsReader to trigger refreshNodes? That would let Hadoop scan for or be notified of cluster membership changes and automagically do the Right Thing.

          In the abstract I think this is a fine change to make.

          Introduce a "Refreshable" interface that both FSNamesystem and JobTracker implement, that only defines a refreshNodes method. HostsReader would have an initialize method that takes a Refreshable and users could choose to call refreshNodes.

          I think the name "Refreshable" isn't the best. Seems a little too generic to me. How about something like "NodeListRefreshable" ?

          Also, the NN and the JT already implement the interfaces o.a.h.hdfs.protocol.ClientProtocol and o.a.h.mapred.AdminOperationsProtocol, respectively, both of which require implementation of a refreshNodes() method which happen to have the same signature. You could just make these interfaces extend your new interface and then you'd get the genericity you'd need without actually having to touch the NN or JT classes at all.

          The current file-based cluster membership would continue to work exactly as it does today.

          That seems wise to me. This proposed change would also make it easy to potentially make the HostsFileReader do something like periodically check the mtime of the hosts files and re-read them automatically if they've changed and call refreshNodes() on the relevant NodeListRefreshable.
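As a rough illustration of that mtime idea (the class and method names here are invented for the example), the reader would remember each file's last-modified time and re-read, then call refreshNodes(), only when it changes:

```java
import java.io.File;

// Hypothetical mtime watcher: a periodic thread would call
// changedSinceLastCheck() and trigger refreshNodes() when it returns true.
class MtimeWatcher {
    private final File file;
    private long lastSeenMtime;

    MtimeWatcher(File file) {
        this.file = file;
        this.lastSeenMtime = file.lastModified();
    }

    // Returns true exactly once per observed modification of the file.
    boolean changedSinceLastCheck() {
        long mtime = file.lastModified();
        if (mtime != lastSeenMtime) {
            lastSeenMtime = mtime;
            return true;
        }
        return false;
    }
}
```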

          E. Sammer added a comment -

          I like the idea of this, but I would not necessarily attach it to the notion of files, explicitly. I would propose a ClusterTopology SPI-style interface and a ClusterTopologyListener interface for those interested in topology changes. Ideally, all clients (either internal to Hadoop daemons or external tools) would ask implementations of ClusterTopology for the list of hosts.

          Off the top of my head API:

          ClusterTopology <<interface>>
          getNodes() : Set<Node>
          refresh()
          getListeners() : Set<ClusterTopologyListener>
          addListener(ClusterTopologyListener) : boolean (wasAdded)
          removeListener(ClusterTopologyListener) : boolean (wasRemoved)

          ClusterTopologyListener <<interface>>
          onTopologyChange(ClusterTopology)

          And then have a single class that implements ClusterTopology. Configure the class two ways (no need to have an inheritance hierarchy).

          HostFileClusterTopology <<class>>
          /*
            A private member with a base file name. The implementation
            automatically looks for baseFileName + ".{include,exclude}"
          */
          baseFileName : File

          This, to me, seems like it would support the current file based membership but also things like an RDBMS or ZK. In the case of files and an RDBMS (and other non-event based systems) listeners wouldn't be notified of changes until a refresh() occurred. Alternatively, implementations could include a poller which automatically called refresh() when a change is detected or something like that. This is also easy to mock out and test.
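The API outline above might translate into Java roughly as follows; Node is simplified to a hostname String, and the in-memory implementation stands in for HostFileClusterTopology purely for illustration (listeners are notified only when refresh() observes an actual change):

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hedged sketch of the ClusterTopology / ClusterTopologyListener pair.
interface ClusterTopologyListener {
    void onTopologyChange(ClusterTopology topology);
}

interface ClusterTopology {
    Set<String> getNodes();
    void refresh();
    boolean addListener(ClusterTopologyListener listener);     // wasAdded
    boolean removeListener(ClusterTopologyListener listener);  // wasRemoved
}

// Minimal in-memory implementation; the "authoritative source" is a
// pending set that a real version would load from files, an RDBMS, or ZK.
class InMemoryClusterTopology implements ClusterTopology {
    private final Set<ClusterTopologyListener> listeners = new LinkedHashSet<>();
    private Set<String> nodes = new LinkedHashSet<>();
    private Set<String> pending = new LinkedHashSet<>();

    // Stand-in for the external source changing.
    void setPendingNodes(Set<String> newNodes) {
        pending = new LinkedHashSet<>(newNodes);
    }

    public Set<String> getNodes() { return nodes; }

    public void refresh() {
        if (!nodes.equals(pending)) {
            nodes = new LinkedHashSet<>(pending);
            for (ClusterTopologyListener l : listeners) {
                l.onTopologyChange(this);  // notify only on real changes
            }
        }
    }

    public boolean addListener(ClusterTopologyListener l) { return listeners.add(l); }
    public boolean removeListener(ClusterTopologyListener l) { return listeners.remove(l); }
}
```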

          Travis Crawford added a comment -

          Attached patch contains changes for common, HDFS, and MR. I can break this into 3 patches if necessary (started a thread on common-dev about changes spanning projects, post merge).

          This patch adds automatic refreshing, enabled with the config keys:

          dfs.namenode.hosts.reader.refresh.sec
          mapreduce.jobtracker.hosts.reader.refresh.sec

          By default refreshing is not enabled, so no functional change from today. However, users could enable auto-refreshing, or plug their own HostsReader implementation in.
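As an illustration, enabling periodic refresh for HDFS might look like the following hdfs-site.xml fragment; the key name comes from the patch description, but the 60-second value is only an example:

```xml
<property>
  <name>dfs.namenode.hosts.reader.refresh.sec</name>
  <!-- Example value: poll the hosts source every 60 seconds.
       Unset by default, so no automatic refresh occurs. -->
  <value>60</value>
</property>
```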

          dhruba borthakur added a comment -

          This patch is a very good thing. The default, from what I understand, is still the file-based implementation. Do you have any other implementation where the data is pulled via JDBC/ODBC from a MySQL backend? If so, is the alternative implementation also open-source-able?

          Travis Crawford added a comment -

          @dhruba - Correct, by default there's no change from today; files are used, and they do not refresh.

          The version I'll use internally gets data from ZooKeeper, so it's not particularly reusable (hierarchy, data format, etc). I could see about posting to github or something as an example. You're right though, a jdbc-based version would likely be reusable by other sites.

          Steve Loughran added a comment -

          I played for a while with a dynamic subclass of JobConf that could be changed from CM tooling in a live system. There are a lot of places where those configs get serialized, and it's surprisingly hard to change things on the fly: the nodes generally assume things are static for the life of the process. That's a hard thing to deal with in the workers.

          LDAP/JNDI might be a better option for central sites over JDBC. This doesn't mean that I'm a fan of JNDI, only that it's easier to write back ends for than a generic JDBC driver, and LDAP is designed with better redundancy from the outset.

          Steve Loughran added a comment -

          Some more comments after looking at the code

          • It'd be good to split cleanup (imports, better iteration) from the cluster changes, and put the cleanup in first.
          • I'm not sure about logging excludes data at info level; it seems over-verbose. If it does go in, it should link to a wiki page on ExcludesFile to say "don't panic, this is optional"
          • Following on with the new API model, I think the clustering should be a class, not an interface
          • There's an assumption in the code that get[Excluded]Hosts() never fails; probably an implicit one that it's fast. It'd make sense for the calls to be able to throw IOEs, as they could be triggering live directory lookups, and if bounded execution time is a requirement, that should be in the javadocs "must return in under 100 milliseconds"
          • I wouldn't mark the various AdminOperationsProtocols as stable, as they are clearly moving around.

          Related to this, I could imagine another JIRA issue of a kill -something that would trigger a refresh on any/all registered services in the VM. That way even if you don't have a refresh rate, you can manually trigger a reload.

          Travis Crawford added a comment -

          @steve - Great comments, thanks for taking a look. I agree the patch became quite large and a safer approach is separating refactoring from functional changes. I'll do that.

          Regarding get[Excluded]Hosts() never failing, an explicit goal was no change in today's behavior unless you choose to plug in your own HostsReader, and today's reader assumes the file is quick to read and available. Hopefully anyone implementing a plugin uses automatic refresh, and only triggers a refresh when the lists were successfully updated.

          When manually calling hadoop dfsadmin -refreshNodes it may throw an IOException indicating a failure to refresh, and plugins that choose to automatically refresh should only do so after successfully updating.


            People

            • Assignee: Unassigned
            • Reporter: Travis Crawford
            • Votes: 0
            • Watchers: 13
