Bigtop / BIGTOP-635

Implement a cluster-abstraction, discovery and manipulation framework for iTest

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.4.0
    • Fix Version/s: None
    • Component/s: Tests
    • Labels: None

      Description

      We've come to a point where our tests need a uniform way of interfacing with the cluster under test. It is no longer acceptable to assume that a test can be executed on a particular node (and thus has access to the services running on it). It is also less than ideal for tests to assume a particular type of interaction with the services, since such assumptions tend to break in different deployment scenarios.

      The framework to be put in place has to be capable of the following, regardless of where a test using it is executed (a rough API sketch follows the list):

      1. representing the abstract configuration of the cluster
      2. representing the abstract topology of the entire cluster (services running on a cluster, nodes hosting the daemons, racks, etc).
      3. giving tests an ability to query this topology
      4. giving tests an ability to affect the nodes in that topology in a particular way (refreshing configuration, restarting services, etc.)
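      For illustration only, a minimal sketch of what such an API could look like (all names here are hypothetical, not taken from any attached patch):

          package org.apache.bigtop.itest.cluster;   // hypothetical package

          import java.util.List;

          /** Hypothetical sketch of the four capabilities above. */
          public interface ClusterUnderTest {
            // 1. abstract configuration of the cluster
            String getConfigValue(String key);
            void setConfigValue(String key, String value);

            // 2./3. abstract topology of the cluster and the ability to query it
            List<String> getHosts();                          // host names
            List<String> getHostsRunning(String daemonName);  // e.g. "NameNode"
            String getRackOf(String host);

            // 4. affecting nodes in the topology
            void restartDaemon(String host, String daemonName);
            void refreshConfiguration(String host);
          }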

      Of course, the ideal solution here would be to give Bigtop tests programmatic access to a Hadoop cluster management framework such as Cloudera's CM or Apache Ambari.

      As with most ideal solutions, I don't think it is realistic, though, so we have to cook something up. At this point I'm really focused on getting the API right, and I'm totally fine with the implementation of that API being something as simple as a bunch of ssh-based scripts.

      This JIRA is primarily focused on coming up with such an API. Anybody who's willing to help is welcome to.

      1. bigtop-635.patch
        38 kB
        Sujay Rau
      2. bigtop-635.patch
        37 kB
        Sujay Rau
      3. BigtopClusterManager.zip
        11 kB
        Sujay Rau
      4. BigtopClusterManagerv2.zip
        18 kB
        Sujay Rau
      5. ClusterManagerAPI.pdf
        27 kB
        Sujay Rau

        Issue Links

          Activity

          Roman Shaposhnik made changes -
          Fix Version/s 0.6.0 [ 12323895 ]
          Roman Shaposhnik made changes -
          Fix Version/s 0.6.0 [ 12323895 ]
          Fix Version/s 0.5.0 [ 12321865 ]
          Bruno Mahé made changes -
          Link This issue relates to BIGTOP-710 [ BIGTOP-710 ]
          Stephen Chu made changes -
          Assignee Sujay Rau [ sujay.rau ] Stephen Chu [ schu ]
          Matteo Bertozzi added a comment -

          I think that we can share some code between this and HBASE-6241.

          For example, I like the ClusterManager class, but I would personally rename it to Service, breaking each piece down like:

          • abstract Service
            • name/user/conf
            • metrics
            • start/stop/restart/signal/kill/suspend/resume
          • HBaseRegionServer extends Service
          • HBaseMaster extends Service
          • HDFSNamenode extends Service
          • HDFSDataNode extends Service
          • MRJobTracker extends Service
          • MRTaskTracker extends Service

          On top of that you can create your "Cluster" classes like:

          • abstract Cluster
          • HDFSCluster extends Cluster
            • add/remove/get DataNode
            • add/remove/get NameNode
          • HBaseCluster extends Cluster
            • add/remove/get Master
            • add/remove/get Region Servers
          • MRCluster
            • add/remove/get JobTracker
            • add/remove/get TaskTracker

          At this point you can implement each service to perform its operations via something like
          /etc/init.d/service start|stop|...

          And you have just one implementation of add/remove/get nodes for each cluster type,
          and one cluster manager that groups them all together: add/remove HBaseCluster, add/remove HDFSCluster...
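
          A minimal Java sketch of that breakdown, assuming the /etc/init.d-based implementation mentioned above (all class names hypothetical, not from the attached patch):

              import java.util.ArrayList;
              import java.util.List;
              import java.util.Map;

              // Hypothetical sketch of the Service/Cluster split described above.
              abstract class Service {
                protected String name;   // e.g. "hadoop-hdfs-namenode"
                protected String user;   // e.g. "hdfs"
                protected String host;
                protected Map<String, String> conf;

                public abstract Map<String, Long> metrics();

                // One shared implementation of the lifecycle commands.
                public void start()   { initd("start"); }
                public void stop()    { initd("stop"); }
                public void restart() { initd("restart"); }

                protected void initd(String action) {
                  // run "/etc/init.d/<name> <action>" on this service's host;
                  // the execution mechanism (ssh, local shell, ...) is left abstract here
                }
              }

              abstract class Cluster { }

              class HDFSCluster extends Cluster {
                private List<Service> nameNodes = new ArrayList<Service>();
                private List<Service> dataNodes = new ArrayList<Service>();

                void addNameNode(Service nn) { nameNodes.add(nn); }
                void addDataNode(Service dn) { dataNodes.add(dn); }
                List<Service> getDataNodes() { return dataNodes; }
              }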

          In the current patch HADOOP_HOME comes from the environment while users are hardcoded (like "hdfs"); one idea could be to have a setUser on the service and use the hardcoded/env value as the default.

          Another nice thing to add is exposing service metrics. I think a simple Map is enough; it would allow performance tests such as manually calling a region split when the number of files or the region size reaches X, or when the latency is too high, and other nice stuff.

          Also, you can grep the HBase code for how the mini clusters are used. Most of the tests use the mini cluster to simulate crashes or GC pauses by suspending the process, and so on.

          Sujay Rau made changes -
          Attachment bigtop-635.patch [ 12537753 ]
          Sujay Rau made changes -
          Attachment bigtop-635.patch [ 12537750 ]
          Sujay Rau made changes -
          Attachment bigtop-635.patch [ 12537750 ]
          Sujay Rau added a comment -

          Stephen: updating the patch with your comments. Thanks.

          Stephen Chu added a comment -

          Thanks, Sujay.

          Some initial comments/questions:

          +   <dependency>                                                                                                                        
          +     <groupId>org.apache.bigtop.itest</groupId>                                                                                        
          +     <artifactId>itest-common</artifactId>                                                                                             
          +     <version>0.2.0-incubating</version>                                                                                               
          +   </dependency>
          

          Should this be 0.5.0-incubating instead? Bigtop trunk test artifacts are using 0.5.0-incubating.

          +     <dependency>                                                                                                                      
          +         <groupId>org.apache.hadoop</groupId>                                                                                          
          +         <artifactId>hadoop-mapreduce-client-core</artifactId>                                                                         
          +         <version>2.0.0-alpha</version>                                                                                                
          +       </dependency>                                                                                                                   
          +     <dependency>                                                                                                                      
          +         <groupId>org.apache.hadoop</groupId>                                                                                          
          +         <artifactId>hadoop-common</artifactId>                                                                                        
          +         <version>2.0.0-alpha</version>                                                                                                
          +     </dependency>                                                                                                                     
          +     <dependency>                                                                                                                      
          +         <groupId>org.apache.hadoop</groupId>                                                                                          
          +         <artifactId>hadoop-common</artifactId>                                                                                        
          +         <version>2.0.0-alpha</version>                                                                                                
          +         <type>test-jar</type>                                                                                                         
          +     </dependency> 
          

          Bigtop trunk hadoop tests are using 2.0.0-SNAPSHOT.

          +public interface ClusterAdapter {                                                                                                      
          + /**                                                                                                                                   
          +  * Cluster Daemons: NameNode, DataNode, JobTracker, TaskTracker, SecondaryNameNode, HRegionServer, HMaster                            
          +  /** 
          

          The "Cluster Daemons:" comment seems unnecessary because the specific daemons are not referenced in the rest of the class.

          +  /**                                                                                                                                  
          +   * Shuts down HBase cluster                                                                                                          
          +   */                                                                                                                                  
          +  void hbaseShutdown();   
          

          In HDFSAdapter, there is stopHDFSservice, startHDFSservice, and restartHDFSservice (MRAdapter follows the same style, too). Seems like we should have a startHBaseService, stopHBaseService, and restartHBaseService. Also, should we truncate these names to stopHDFS/startHDFS? Tagging "Service" on the end might be unnecessary. I think most people will know what you mean.
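
          For example (hypothetical, just illustrating the naming suggestion; not code from the patch):

              // Mirroring the HDFSAdapter/MRAdapter style, with the shorter names suggested above.
              public interface HBaseAdapter {
                void startHBase();
                void stopHBase();
                void restartHBase();
              }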

          private LinkedList<Host> cluster = new LinkedList<Host>();
          

          Perhaps rename to clusterHosts? If I'm reading "cluster" in other parts of the code, I might not quickly remember that it's a collection of Hosts.

          +         //dataNode.refreshDaemons(); 
          

          Remove if unnecessary.

          + public void waitUntilStarted(String daemon, Host hostname, long timeout) throws Exception {                                                                                    
          +   assertTrue(hostname != null);                                                                                                                                                
          +   long endTime = System.currentTimeMillis() + timeout;                                                                                                                         
          +   boolean done = false;                                                                                                                                                        
          +     while (!done) {                                                                                                                                                            
          +         if (System.currentTimeMillis() > endTime) {   
          +             throw new Exception("Timeout value reached");                                                                                                                      
          +         }                                                                                                                                                                      
          +       for (Daemon d : hostname.getDaemons()) {                                                                                                                                 
          +         if (d.getName().equalsIgnoreCase(daemon)) {                                                                                                                            
          +           done = true;                                                                                                                                                         
          +         }                                                                                                                                                                      
          +       }                                                                                                                                                                        
          +     }                                                                                                                                                                          
          + }                                                                                                                                                                              
          + /**                                                                                                                                                                            
          +  * Stalls thread until specified daemon is stopped on specified machine or timeout value is reached.                                                                           
          +  * @throws Exception                                                                                                                                                           
          +  */                                                                                                                                                                            
          + public void waitUntilStopped(String daemon, Host hostname, long timeout) throws Exception {                                                                                    
          +   assertTrue(hostname != null);                                                                                                                                                
          +   long endTime = System.currentTimeMillis() + timeout;                                                                                                                         
          +   boolean done = false;                                                                                                                                                        
          +     while (!done) {                                                                                                                                                            
          +         if (System.currentTimeMillis() > endTime) {                                                                                                                            
          +             throw new Exception("Timeout value reached");                                                                                                                      
          +         }                                                                                                                                                                      
          +       boolean isStopped = true;                                                                                                                                                
          +       for (Daemon d : hostname.getDaemons()) {                                                                                                                                 
          +         if (d.getName().equalsIgnoreCase(daemon)) {                                                                                                                            
          +           isStopped = false;                                                                                                                                                   
          +         }                                                                                                                                                                      
          +       }                                                                                                                                                                        
          +       if (isStopped) {                                                                                                                                                         
          +         done = true;                                                                                                                                                           
          +       }                                                                                                                                                                        
          +     }                                                                                                                                                                          
          + } 
          

          Seems like we can refactor these 2 methods because they share a lot in common.
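
          A possible shape for that refactor (sketch only; it keeps the patch's polling-until-timeout behavior and reuses its Host/Daemon types):

              // One shared helper replaces the duplicated loop in waitUntilStarted/waitUntilStopped.
              private boolean isDaemonRunning(String daemon, Host host) {
                for (Daemon d : host.getDaemons()) {
                  if (d.getName().equalsIgnoreCase(daemon)) {
                    return true;
                  }
                }
                return false;
              }

              private void waitForDaemonState(String daemon, Host host, long timeout,
                                              boolean expectRunning) throws Exception {
                assertTrue(host != null);
                long endTime = System.currentTimeMillis() + timeout;
                while (isDaemonRunning(daemon, host) != expectRunning) {
                  if (System.currentTimeMillis() > endTime) {
                    throw new Exception("Timeout value reached");
                  }
                }
              }

              public void waitUntilStarted(String daemon, Host host, long timeout) throws Exception {
                waitForDaemonState(daemon, host, timeout, true);
              }

              public void waitUntilStopped(String daemon, Host host, long timeout) throws Exception {
                waitForDaemonState(daemon, host, timeout, false);
              }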

          +   if (onNamenode) {                                                                                                                                                            
          +                                                                                                           
          +   }                                                                                                                                                                            
          +   else {                                                                                                                                                                       
          +     runShellCommand("sudo -u hdfs hdfs haadmin -failover " + active + " " + standby, active_host, false, false);                                                               
          +   }        
          

          I think we can just call shHDFS.exec("hdfs haadmin -failover " + active + " " + standby); If we successfully get hdfs user's shell on any node in the cluster, we should be able to perform failover using it.
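
          For reference, a sketch of that call (assuming the Shell(shell, user) pattern used by existing Bigtop tests; `active` and `standby` are the HA service IDs from the patch):

              import org.apache.bigtop.itest.shell.Shell;

              // Sketch: perform the failover as the hdfs user from wherever the test runs,
              // instead of routing the command to a specific NameNode host.
              Shell shHDFS = new Shell("/bin/bash", "hdfs");
              shHDFS.exec("hdfs haadmin -failover " + active + " " + standby);
              assertEquals("haadmin failover failed", 0, shHDFS.getRet());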

          +++ bigtop-test-framework/src/main/groovy/org/apache/bigtop/itest/clustermanager/distributions/VersionAClusterManager.java
          

          Should we start thinking of a different name for this? Maybe BigtopClusterManager like you mentioned before.

          +++ bigtop-test-framework/src/test/groovy/org/apache/bigtop/itest/clustermanager/HAMRBCMHelperThread.java                               
          
          +++ bigtop-test-framework/src/test/groovy/org/apache/bigtop/itest/clustermanager/TestHAMRBCM.java                                       
          

          We should move these tests into the Bigtop Hadoop test artifacts.

          Sujay Rau made changes -
          Attachment bigtop-635.patch [ 12537620 ]
          Sujay Rau made changes -
          Attachment bigtop-635.patch [ 12537622 ]
          Sujay Rau made changes -
          Attachment bigtop-635.patch [ 12537620 ]
          Sujay Rau added a comment -

          Attached is a patch with my latest zip file integrated into Bigtop.

          Stephen:
          I added optional methods for the starting and stopping of namenodes, datanodes, jobtrackers, and tasktrackers.

          By serviceName, I actually meant serviceID and I have changed all occurrences:
          Usage: HAAdmin [-getServiceState <serviceId>]

          I will check out whether a cluster needs to be restarted after using Configuration.set and make changes accordingly.

          Thanks for taking a look.

          Stephen Chu added a comment -

          Thanks, Sujay.

          Some comments/questions:

          • Where will this code live? In bigtop-test-framework? I think it'd be useful to attach a patch with this code integrated into Bigtop source.
          • In HBaseAdapter, you have start/stopRegionServer and start/stopHMaster. In HDFSAdapter, I don't see start/stopNameNode or start/stopDataNode. Perhaps they'd be useful? Same with the MRAdapter - start/stopTaskTracker and start/stopJobTracker. However, for the purposes of supporting the HDFS HA tests, maybe you don't have to worry about MR for now.
          • Host has a serviceName. What will this eventually be used for? When I think of a service name, I think of HDFS, MR, HBase, JobTracker, NameNode, etc. Will a Host be tied to one of those service names?
          • For ClusterManager's get/setConfiguration, do we have to restart services for changes to be in effect?
          Sujay Rau made changes -
          Attachment BigtopClusterManagerv2.zip [ 12537378 ]
          Sujay Rau added a comment -

          Uploaded is a prototype of the BigtopClusterManager that has enough implemented to solve the problems of BIGTOP-614.

          bc:

          • Initializing the manager now creates an object for each host that contains its properties such as rack assignment. All hosts are currently accessed via ssh commands.
          • Daemons are also discovered through "ps" and stored so that the user can query the current state of the daemons on each host object (see the sketch after this list). These daemons are refreshed after a command is run to keep the state up to date. The discovery code still needs to be updated, as it doesn't properly discover HBase daemons, but it works fine for killing NameNodes and issuing commands from NameNode hosts (which was needed in BIGTOP-614).
          • Configuration values can be set and obtained through the Bigtop configuration class.
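
          A rough sketch of that discovery step (hypothetical helper and type names; the actual code is in the attached zip):

              import java.util.ArrayList;
              import java.util.List;

              // Sketch: list java processes on a host over ssh and map command lines to daemons.
              // sshExec(...) stands in for whatever remote-execution helper the prototype uses.
              List<Daemon> discoverDaemons(String host) {
                List<Daemon> daemons = new ArrayList<Daemon>();
                String psOutput = sshExec(host, "ps -eo pid,args | grep [j]ava");
                for (String line : psOutput.split("\n")) {
                  if (line.contains("NameNode")) {
                    daemons.add(new Daemon("NameNode", host));
                  } else if (line.contains("DataNode")) {
                    daemons.add(new Daemon("DataNode", host));
                  }
                  // ... JobTracker, TaskTracker, HMaster, HRegionServer, etc.
                }
                return daemons;
              }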

          Cos:
          Thanks for looking at the code.

          • I have not updated the packages yet but it will definitely be done.
          • I changed Shim to Adapter.
          • Node commissioning/decommissioning is not included in the scope currently, but I believe it would be possible to add later on.

          TestHAMRBCM stands for High Availability, MapReduce, with BigtopClusterManager. I have tested it on a 4-node cluster and it does what it is supposed to do.

          Konstantin Boudnik added a comment -

          A couple more questions:

          • is node commissioning included in the scope?
          • is node decommissioning included?
            Basically, do we want the ability to kick-start a cluster from within the iTest driver?
          Konstantin Boudnik added a comment -

          Sujay, I quickly ran over the sample code. Pretty much +1 on BC's points. Also:

          • packages in Java should have namespaces (like org.apache.bigtop.clustermanagement, etc.)
          • the term Shim doesn't really exist anywhere outside of Hive, and I am sure it was only called that there for lack of the proper, widely accepted CS name. The name is Adapter.

          Other than that I can't comment much at this moment, because
          a) BC made fine points already
          b) there's not much to comment on right now

          Thanks for starting the ball rolling. I am sure this ticket will get a lot of attention.

          Bruno Mahé added a comment -

          Eric> It would be very nice to have Apache Bigtop (incubating) integrated with Apache Ambari (incubating).
          Provided Apache Ambari (incubating) can work on the Apache Bigtop (incubating) supported platforms as all the other components do.
          In any case, whoever is interested in such work should feel free to open tickets or ask questions.

          Cos> Whether it is Apache Ambari (incubating) or puppet, these are just some possible implementations and out of scope for this ticket.

          Other than that, +1 to what BC said. I was about to say the very same things.

          Sujay Rau added a comment -

          Devaraj: I did look at that HBase patch while I was plotting out this design. I think the fact that the scopes of the HBase JIRA and this Bigtop JIRA are slightly different accounts for the variances between the two posted patches, but at their core they share a lot of the same structure. ClusterManager is effectively the same, while HBASE-6241's HBaseCluster has the same functionality as HBaseShim. HBaseClusterManager is an example of an implementation for a particular distribution that uses service scripts as its way of communicating with the cluster.

          Eric: That's an interesting idea. In this JIRA I'm going to start with this simple implementation and try to solve the original problem created by BIGTOP-614, but other implementations like Ambari could definitely be used in the future.

          Konstantin Boudnik added a comment -

          Ambari started with the BigTop puppet scripts, so it seems like it would be natural to settle on a common framework. I'm sure we could wrestle up some interested contributors from the AmbariVerse...

          Great idea, Eric! And, of course, contributions are the most important part of
          any vibrant open source community!

          Bigtop providing a pre-built package for Ambari? Yes, why not! It won't be
          included in the 0.4 release, apparently. And I guess it won't be ready in time for
          the 0.3.1 update either, but I am sure with the right amount of contribution it
          can make it into 0.5 (subject to the release vote, of course).

          Here are a couple of general ideas behind this work, as a head start for potential
          contributors:

          BigTop needs the ability to control and monitor the internal state of a cluster
          from within the tests. This basically puts a couple of requirements on any
          cluster management framework that might be considered/developed:

          • simplicity of Java APIs (after all, iTest is written in Groovy, and that is
            done for a very good reason);
          • agility of the management framework: for the sake of validation efficiency
            and a clean control flow in the testing scenarios, an API call from a test
            to the framework should be synchronous;
          • and most importantly, I want programmatic control without needing to modify
            Puppet recipes every time something needs to be tweaked, nor to bring up a
            separate Puppet master, because masterless mode won't work in this
            particular case.

          Cos

          Andrew Purtell added a comment -

          If BigTop could build Ambari, then it could use it as a test framework easily as well.

          +1

          We have this on our roadmap to do next quarter unless it's done elsewhere first.

          eric baldeschwieler added a comment -

          Maybe we should dig into how to integrate BigTop with Ambari. It is a very important Ambari goal to be able to deploy Bigtop stacks. If BigTop could build Ambari, then it could use it as a test framework easily as well.

          Ambari started with the BigTop puppet scripts, so it seems like it would be natural to settle on a common framework. I'm sure we could wrestle up some interested contributors from the AmbariVerse...

          Devaraj Das added a comment -

          Sujay, have you looked at the ClusterManager API that has been posted as a patch on HBASE-6241? https://issues.apache.org/jira/secure/attachment/12533933/HBASE-6241_v1.patch

          bc Wong added a comment -

          Thanks for the progress update, Sujay!

          Have you thought about how to model the rest? It's a bit hard to comment on the current design without knowing your overall plan.

          • A cluster has hosts. Perhaps the API should expose that?
            • Hosts have properties, like rack assignment. You may want to consider exposing those as well.
          • A cluster has services, like HDFS, MR, HBase, ZK, etc. How does the API let callers discover them?
          • A service has daemons (NN, DN, etc.). Should each shim expose what daemons have been set up and where they're running?
          • A service also has other properties and operations:
            1. Configuration, like `fs.defaultFS'. Probably useful for tests to know, and to change.
            2. Run state, like started/stopped.
            3. Commands, like start/stop/restart.
          • A daemon has its own:
            1. Configuration, like `hadoop.security.authentication'. For example, tests would probably need to set this for any Kerberos testing.
            2. Run state. Useful for testing failover.
            3. Commands, like start/stop/restart, decommission.
          • Currently, you're modelling a daemon instance as a (daemon_type, hostname) tuple. I'd promote it to be an interface class, because daemons seem more complex than that (see the sketch after this list).
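
          For illustration, a minimal sketch of what promoting the daemon to an interface could look like (all names hypothetical):

              // Hypothetical: a Daemon interface instead of a (daemon_type, hostname) tuple.
              public interface Daemon {
                String getType();        // e.g. "NameNode"
                String getHostname();

                // per-daemon configuration, e.g. "hadoop.security.authentication"
                String getConfig(String key);
                void setConfig(String key, String value);

                // run state, useful for failover tests
                boolean isRunning();

                // commands
                void start();
                void stop();
                void restart();
                void decommission();
              }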

          It's useful for me to think in concrete terms. For example, to test things that break after you turn on HA (like HIVE-3056), you probably need the capability to:

          1. Make a configuration change, to turn on HA in the middle of the test.
          2. Trigger commands, which is restart in this case. You already have that.
          3. Query the run state, to assert that other components are still running. Specific service-level tests are even better.

          I'm new to Bigtop. Let me know if that makes sense.

          Sujay Rau made changes -
          Attachment ClusterManagerAPI.pdf [ 12535760 ]
          Attachment BigtopClusterManager.zip [ 12535761 ]
          Sujay Rau added a comment -

          Attached is a potential API for cluster-abstraction. Thoughts?

          Enis Soztutar made changes -
          Link This issue is related to HBASE-6201 [ HBASE-6201 ]
          Roman Shaposhnik made changes -
          Fix Version/s 0.5.0 [ 12321865 ]
          Konstantin Boudnik added a comment -

          +1 on going with Puppet APIs: this seems to be pretty much an industry standard and we have a number of recipes already.
          As for other frameworks: they are either closed source or not mature enough to spend the time on.
          However, a proper abstraction should allow 3rd parties to develop their own adapters, compatible with iTest yet kept separate.

          Roman Shaposhnik made changes -
          Link This issue blocks BIGTOP-614 [ BIGTOP-614 ]
          Roman Shaposhnik made changes -
          Field Original Value New Value
          Assignee Roman Shaposhnik [ rvs ] Sujay Rau [ sujay.rau ]
          Roman Shaposhnik added a comment -

          @Andrew, thanks for pointing HADOOP-8468 out. This is indeed what we have to keep an eye on.

          Regarding your other comment – I agree that it is the way to go here: make it pluggable and start small. I'm reassigning this to Sujay, who expressed his willingness to prototype. I think the focus for him is going to be gradually building this API while refactoring some of the tests he's already created.

          Andrew Purtell added a comment -

          Of course, the ideal solution here would be to give Bigtop tests programmatic access to a Hadoop cluster management framework such as Cloudera's CM or Apache Ambari. As with most ideal solutions, I don't think it is realistic, though.

          This is true but it may be possible to abstract the means of querying topology to plug in support for those management frameworks later.

          A default plugin for native iTest capability could be one that queries the Puppetmaster via its REST API. Resources can be declared in the scripts or Facter can be extended to publish facts on service locations; the former makes more sense if Puppet is already being used to manage service deployment in the test cluster.
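
          A sketch of how such a plugin could slot behind a topology abstraction (the provider interface and the facts URL below are hypothetical placeholders, not a documented Puppet API):

              import java.io.InputStream;
              import java.net.URL;
              import java.util.Scanner;

              // Hypothetical: a pluggable topology/facts source. A Puppet-backed implementation
              // would query the Puppet master (or Facter-published facts) per host; the URL pattern
              // here is a placeholder, and a real plugin would handle SSL, auth, and parsing.
              interface TopologyProvider {
                String getFact(String host, String factName) throws Exception;
              }

              class PuppetMasterTopologyProvider implements TopologyProvider {
                private final String baseUrl;   // e.g. "https://puppetmaster:8140"

                PuppetMasterTopologyProvider(String baseUrl) {
                  this.baseUrl = baseUrl;
                }

                public String getFact(String host, String factName) throws Exception {
                  URL url = new URL(baseUrl + "/facts/" + host + "/" + factName);
                  InputStream in = url.openStream();
                  try {
                    Scanner s = new Scanner(in).useDelimiter("\\A");
                    return s.hasNext() ? s.next() : "";
                  } finally {
                    in.close();
                  }
                }
              }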

          representing the abstract topology of the entire cluster (services running on a cluster, nodes hosting the daemons, racks, etc).

          A 4-layer topology like the one described in HADOOP-8468?

          Roman Shaposhnik added a comment -

          This is very much related to the following HBase discussion: HBASE-6201

          Roman Shaposhnik created issue -

            People

            • Assignee:
              Stephen Chu
            • Reporter:
              Roman Shaposhnik
            • Votes:
              0
            • Watchers:
              25

              Dates

              • Created:
                Updated:
