HBase
  HBASE-4448

HBaseTestingUtilityFactory - pattern for re-using HBaseTestingUtility instances across unit tests

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Setting up and tearing down HBaseTestingUtility instances in unit tests is very expensive. On my MacBook it takes about 10 seconds to set up a MiniCluster, and 7 seconds to tear it down. When multiplied by the number of test classes that use this facility, that's a lot of time in the build.

      This factory assumes that the JVM is being re-used across test classes in the build; otherwise this pattern won't work.

      I don't think this is appropriate for every use, but I think it can be applicable in a great many cases - especially where developers just want a simple MiniCluster with 1 slave.

      1. HBaseTestingUtilityFactory.java
        6 kB
        Doug Meil
      2. hbase_hbaseTestingUtility_uses_2011_09_22.xlsx
        39 kB
        Doug Meil
      3. java_HBASE_4448.patch
        12 kB
        Doug Meil
      4. java_HBASE_4448_v2.patch
        12 kB
        Doug Meil


          Activity

          Doug Meil added a comment -

          Attached a prototype of HBaseTestingUtilityFactory. This is not ready for prime time yet, but I'd like to solicit comments for the general idea.

          Noted issues: there needs to be a configurable "wait" period when the ref-counts get to zero. That should be set from the build, but how? System property? The reason is that while this pattern can be useful for many cases, it won't be suitable for all. Therefore, there could be periods of non-use when another test is running, and we don't want to be too aggressive in tearing down the instances; otherwise we'll be back where we started.
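
          For illustration, a minimal sketch of how such a wait period could be read from a system property (the property name below is just a placeholder, not something that exists in the patch):

            // Sketch only: "hbase.testing.factory.shutdown.wait" is a hypothetical property name.
            private static final long DEFAULT_SHUTDOWN_WAIT_MS = 30 * 1000L;

            private long getShutdownWaitMillis() {
              // Could be passed from the build, e.g. -Dhbase.testing.factory.shutdown.wait=60000
              return Long.getLong("hbase.testing.factory.shutdown.wait", DEFAULT_SHUTDOWN_WAIT_MS);
            }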

          Jesse Yates added a comment -

          Initial thoughts: in general, I'm digging the idea (but I'd +1ed it before).

          Concerns:
          Major:
          The timeout removal could be a pretty dangerous thing to do. In the general case, it might not be too bad since it seems like a node should request a cluster and use it in a timeframe. But what about long running tests? It would be nice to modify the mini cluster (or maybe subclass it into a TestingUtilMiniCluster) that actually tracks recent usage (last access time?) and then use that as the 'timeout' number for cleanup.

          Minor:
          Going to need to make sure we tie back requests for a certain size cluster with the type (right now hard coding 1)

          Besides those, I'm liking the patch Doug!

          Doug Meil added a comment -

          The timeout behavior was intended to be "since the usage counts went to zero", so I think we're generally talking about the same idea. How to pass this variable from the build? System property?

          Jesse Yates added a comment -

          I was more worried about if people don't clean up their tests properly and leave the cluster hanging around. But I guess we can just assume that they do it right?

          We could make it a system property (maybe settable via maven on run) or do it with a special test-config.xml

          Doug Meil added a comment -

          The exposure of not stopping the miniclusters exists now. You need to explicitly stop it, so it's not really any different this way. System property works for me.

          stack added a comment -

          How would we pass this factory from test to test?

          How is this different from a fat class of tests that has a @Before that spins up the cluster and then an @After to shut it down as TestAdmin or TestFromClientSide do currently?

          Jesse Yates added a comment -

          I think that these tests would be run all in the same jvm (non-forked mode) - that way they can all reuse the same static testing util.

          Running it in forked mode really wouldn't help with this issue. Not sure how running in parallel is actually managed - I'm assuming it's all out of the same JVM, just on different threads. Using the cluster the proposed way would again be a win.

          Doug Meil added a comment -

          As Jesse said, we have to reuse JVMs for this to work. Rather than doing this...

  @BeforeClass
  public static void setUpBeforeClass() throws Exception {
    TEST_UTIL.startMiniCluster(1);
  }

          ... you would do something like this...

  @BeforeClass
  public static void setUpBeforeClass() throws Exception {
    TEST_UTIL = HBaseTestingUtilityFactory.get().getMiniCluster(1);
  }

          ... and it would already be started.

          And rather than calling an explicit shutdown on the HBaseTestingUtility instance, you'd call a "return" on the factory...

          HBaseTestingUtilityFactory.get().returnMiniCluster(instance);
          

          ... where it would also blow away any tables that have been created so it's clean for the next person that uses it.
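
          To make the pattern concrete, here is a rough sketch of the kind of ref-counted factory being described (illustrative only - this is not the attached prototype, and the field and method details are assumptions):

            import java.util.HashMap;
            import java.util.Map;

            import org.apache.hadoop.hbase.HBaseTestingUtility;

            public class HBaseTestingUtilityFactory {

              private static final HBaseTestingUtilityFactory INSTANCE = new HBaseTestingUtilityFactory();

              // Started MiniClusters keyed by slave count, plus a usage count per instance.
              private final Map<Integer, HBaseTestingUtility> clusters =
                new HashMap<Integer, HBaseTestingUtility>();
              private final Map<HBaseTestingUtility, Integer> usageCounts =
                new HashMap<HBaseTestingUtility, Integer>();

              public static HBaseTestingUtilityFactory get() {
                return INSTANCE;
              }

              public synchronized HBaseTestingUtility getMiniCluster(int slaves) throws Exception {
                HBaseTestingUtility util = clusters.get(slaves);
                if (util == null) {
                  util = new HBaseTestingUtility();
                  util.startMiniCluster(slaves);   // expensive, so do it once per slave count
                  clusters.put(slaves, util);
                }
                Integer count = usageCounts.get(util);
                usageCounts.put(util, count == null ? 1 : count + 1);
                return util;
              }

              public synchronized void returnMiniCluster(HBaseTestingUtility util) throws Exception {
                usageCounts.put(util, usageCounts.get(util) - 1);
                // Here the factory would drop any tables the test created, and a background
                // cleaner would shut the cluster down only after the count has sat at zero
                // for the configurable wait period discussed above.
              }
            }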

          Doug Meil added a comment -

          I'm also working on some analysis to show the different uses of HBaseTestingUtility - not all the tests use it the same way. Some use MiniClusters with 1 or 3 slaves, and some use ZkClusters.

          Doug Meil added a comment -

          Still working on this... it's taking me a bit longer than I thought.

          Doug Meil added a comment -

          Attaching spreadsheet documenting HBaseTestingUtility configurations by package

          Jesse Yates added a comment -

          Doug, had the same thoughts you did on whether or not all the classes really need to have all the region servers, etc.

          With the reusability, in some cases I have some doubts about what is actually practical. In some cases (e.g. coproc), they are doing a bunch of configuration on the table, which has implications on whether or not the table/cluster can be immediately/concurrently reused.

          There are a couple things we could do to make usage easier.

          (1) A lot of times people are setting the configuration statically, so we want to remove that if people are going to reuse it (though different JVMs solve that for the moment).

          (2) Create unique clusters - these would not be cached and really good for situations where people are injecting faults, etc.

          (3) Add some modularization for table configuration etc - a lot seems to be setting up properties for the db, but really making properties that you are just testing on some table. This may be a little pie in the sky...

          (4) Reset properties method - so you are allowed to make changes to a table when you use it, but then when done/released, we just reset the properties/config.

          Doug Meil added a comment -

          I agree that some tests will need special configuration for special situations.

          But based on the sheet, I think there is still a benefit in going this route. There are still 39 MiniCluster instances that could be shared between tests, and that's not even considering MiniZk instances which could also be shared. That's roughly 10 or 11 minutes of extra startup & teardown time right there, and we're just getting started.

          I think it might be worth reviewing the tests in terms of what the tests need vs. what they were coded for, especially with Client and REST packages. Does the REST unit test really need a MiniCluster with 3 slaves? I would hazard a guess that there was some copy-paste going on.

          There isn't any single thing that can fix the build, but I still think this approach seems like a reasonable start.

          Jesse Yates added a comment -

          Definitely agree that doing this still makes sense.

          I was just thinking about adding some functionality to help people testing down the road.

          Doug Meil added a comment -

          Agreed.

          Doug Meil added a comment -

          Uploading updated spreadsheet to describe HBaseClusterTestCase usage

          Doug Meil added a comment -

          Submitting patch for HBaseTestingUtilityFactory. Also, I changed one existing unit test to use this to show example usage.

          Jesse Yates added a comment -

          I've got a couple concerns about the patch - is it already applied to trunk or can we put it up in RB?

          Off the bat:
          (1) A lot of times tests set the conf on the minicluster before starting it - this could lead to some issues when requesting the cluster. Still thinking a requestUnique with a conf would be a good addition.
          (2) If we are going to just reuse the TestingUtility, it would make sense to also allow some reuse of the dfs and zk clusters.
          (3) We probably want to check for utils that haven't been used for the given time, rather than waiting for all the possible utils to not be used.

          If you want, we can open up sub tickets for these issues. I would be happy to help out with any/all of the above.

          Jesse Yates added a comment -

          Not that I don't think this is good stuff, Doug - I'm on board with a majority of the code. It's just that I think some review/discussion on the design and implementation would be good.

          Doug Meil added a comment -

          This is not applied on trunk. No apology required! Feedback is expected/necessary for something like this.

          re: #1 Per the spreadsheet, I'm not sure that customized MiniClusters can be re-used. I left those out on purpose at least for this pass to focus on the generic "Mini 1, 2, 3" ones. Also, a review of anything that isn't "Mini 1" is required - I don't think all those REST tests need "Mini 3" - I think they can be "Mini 1".

          re: #2 Re-using DFS and ZK Clusters is something that I thought could come later. I named the existing "getMiniCluster" (i.e., as opposed to 'get()') so that the DFS and ZK clusters could be obtained later. Probably just a follow-on ticket rather than a sub-ticket. I'd rather get the kinks out of this round first.

          re: #3 In terms of the waiting logic, I was trying not to over-complicate this by tracking last usage per specific MiniCluster config.

          Jesse Yates added a comment -

          Quick synopsis:
          re:re: #1
          So you are thinking that people would just use the TestingUtility if they need to have their own unique cluster? I guess I was thinking that there would be a gain by reusing those objects, but thinking about it, I doubt it (also thinking everything would be brokered by the Factory, but it really needn't).

          Also was thinking we need to do a review of mini-cluster usage on things like REST - had the same hunch when I was grepping through the tests.

          re:re #2
          +1

          re:re #3
          I don't think this actually runs multiple times - run() just has a try block and won't loop.
          I'm ok with making it simple right now, let's just make a note to complicate it later

          Overall, I'm ok with it, except for the looping in #3

          Ted Yu added a comment -

          I haven't gone through every line of the patch.
          Minor comment:

          +      i = new Integer(ONE);
          

          ONE is already an Integer, right?

          I wasn't involved in the early discussion of this ticket. So I hope Jesse and Doug can reach agreement and create other tickets if needed.

          If improvement in running time of tests by using the patch can be shown, that would be more convincing.

          Doug Meil added a comment -

          Jesse- sounds like we're cool on #1 and #2. Let me look at #3.

          Ted- performance changes were in the attached spreadsheet (and earlier in the ticket), but that won't all come with this change... this is about getting the utility in place and then the changes identified in the spreadsheet will come in later tickets. Probably 10 minutes just for the identified changes, with more possible.

          Ted Yu added a comment -

          I think the following method should be enhanced to deal with tests that use more than one cluster with the same number of slaves:

          + protected synchronized HBaseTestingUtility getMiniClusterImpl(int slaves) throws Exception {
          

          This can be done in another JIRA.

          Jesse Yates added a comment -

          Hmm, interesting idea Ted. Definitely worth looking into in another patch. Would definitely have to look into the performance benefits of that. Interestingly, it's probably going to be highly correlated to the order the tests are run in (which is really test naming).

          Doug Meil added a comment -

          More than one cluster in the same test? That would have to be a replication test. I'm not sure that's going to work; I think the best case is to re-use one cluster and then fire up another cluster from scratch.

          Doug Meil added a comment -

          Fixed point #3.

          Doug Meil added a comment -

          ... with the v2 patch

          Jesse Yates added a comment -

          Comments on the latest patch:

          In the usage,

          -  private final static HBaseTestingUtility TEST_UTIL = new HBaseTestingUtility();
          +  private static HBaseTestingUtility TEST_UTIL = new HBaseTestingUtility();
          

          Should probably not even assign anything, since it is done in setup.

          Going back to Ted's comment,

           Integer i = mcUsageCount.get(tu);  
          +    if (i == null) {
          +      i = ONE;
          +    } else {
          +      int j = i.intValue() + 1;
          +      i = new Integer(j);
          +    }
          +    mcUsageCount.put(tu, i);
          +  }
          

          This seems overly complex - just use autoboxing here. Maybe we should use specific names for what ONE and ZERO mean (e.g. ONE -> USE_INCREMENT, ZERO -> UNUSED).
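
          For example, a rough sketch of what the autoboxed version could look like (keeping the mcUsageCount name from the patch):

            Integer i = mcUsageCount.get(tu);
            mcUsageCount.put(tu, i == null ? 1 : i + 1);  // autoboxing handles the Integer wrapping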

          In the UtilityCleaner,

          +     try {
          +        while (true) {
          

          could lead to some issues. We just need to make sure that the thread is a daemon thread when we start.
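
          For example, something along these lines (a sketch; assumes UtilityCleaner is the Runnable from the patch):

            Thread cleaner = new Thread(new UtilityCleaner(), "HBaseTestingUtilityFactory-cleaner");
            cleaner.setDaemon(true);  // a daemon thread never keeps the test JVM alive on its own
            cleaner.start();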

          stack added a comment -

          Looking at startup I see we spend > 1 second deploying the UIs for hdfs, stuff we never look at. No way to turn them off. We could do some stuff around the bootstrapping where we'd just copy into place already made root and meta regions which could save a second or so.

          Could do this to save a second or so:

          stack-7:hbase stack$ git diff src/test/
          diff --git a/src/test/resources/hbase-site.xml b/src/test/resources/hbase-site.xml
          index 2a9ab29..b6ad732 100644
          --- a/src/test/resources/hbase-site.xml
          +++ b/src/test/resources/hbase-site.xml
          @@ -23,6 +23,10 @@
           -->
           <configuration>
             <property>
          +    <name>hbase.master.wait.on.regionservers.interval</name>
          +    <value>1000</value>
          +  </property>
          +  <property>
               <name>hbase.regionserver.msginterval</name>
               <value>1000</value>
               <description>Interval between messages from the RegionServer to HMaster
          

          But looking at this no big savings to be found in startup/shutdown.

          So, yeah, study of the tests themselves is way to go. We should be more parsimonious starting/stopping clusters and do more work once we spin one up (Sorry for stating the obvious).

          Thanks for doing the excel spreadsheet, Doug. Doesn't sound like there are big wins in getting rid of HBaseClusterTestCase (we should get rid of it though)... though perhaps in the hbase package, where we run 4 unit tests that each spin up the cluster, it could help.

          Doug Meil added a comment -

          Thanks Stack. In general, do you support the HBaseTestingUtilityFactory approach, where generic MiniClusters can be re-used (where-ever possible)?

          I think that HBaseClusterTestCase could be refactored to use HBaseTestingUtilityFactory to re-use the MiniCluster since they are all the same config. It's a "plus" rather than a "huge win", but it still should be done. Every little bit helps.

          stack added a comment -

          I agree with your "Every little bit helps." (I am working on hbase-4503 after going through your excel spreadsheet). I'm still going through this issue. On the factory, I'm not sure yet. I'm unclear on how it'll be passed from test to test and also how we prevent one test polluting another test's run. I also need to see for myself how this approach is different from the approach where we spin up the cluster at the start of a suite and then do a bunch of tests, as we do in TestFromClientSide and TestAdmin.

          Will be back with more comments.

          A two-hour test run is killing us. It means we only get a max of 12 runs a day (and never get that many anyways)... and we have four test branches run up on hudson, so the long-running test suite is a progress killer (as well as a productivity killer). Reminds me of the microsoft build.

          Doug Meil added a comment -

          re: "passed from test to test"

          This is described in the issue description: this approach depends on JVM re-use, otherwise, it won't work. The factory has a cache of MiniClusters based on the number of slaves, and it can be added down the road for DFS cluster or ZkCluster as needed.

          re: "pollution"

          I hate pollution!

          There are two cases: intra-test and inter-test.
          Inter-test is handled via all the tables getting disabled and whacked when the minicluster is returned to the factory.
          Intra-test (i.e., multiple test methods in the same class) is the same as is now. There is no automatic cleanup between test-methods in the same class even without this utility, so it's no worse.
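
          As an illustration, the inter-test cleanup on return could look roughly like this (sketch only, not the attached code; uses org.apache.hadoop.hbase.client.HBaseAdmin and assumes 'util' is the HBaseTestingUtility being returned):

            // Disable and delete every table so the next test starts from a clean cluster.
            HBaseAdmin admin = new HBaseAdmin(util.getConfiguration());
            for (HTableDescriptor desc : admin.listTables()) {
              byte[] tableName = desc.getName();
              if (admin.isTableEnabled(tableName)) {
                admin.disableTable(tableName);
              }
              admin.deleteTable(tableName);
            }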

          stack added a comment -

          I think that HBaseClusterTestCase could be refactored to use HBaseTestingUtilityFactory to re-use the MiniCluster since they are all the same config.

          Did as you suggested over in 'HBASE-4503 Purge deprecated HBaseClusterTestCase'

          Thanks Stack. In general, do you support the HBaseTestingUtilityFactory approach, where generic MiniClusters can be re-used (where-ever possible)?

          I haven't reviewed it yet – will do that later – but I'm thinking I'm not in favor.

          While playing with trying to have our tests run in parallel, I was reminded why we fork for each test.

          We do this so there is isolation between test runs, so less chance of a test clobbering those that run later. Having it so the detritus from one test run could dirty the context of the next test makes debugging test failures harder. Reusing the cluster also implies each test cleans up after itself. That might be too much to expect. Forking makes it hard to use this factory.

          Back with more later.

          stack added a comment -

          I took a look at the patch. Doesn't look bad. Jesse and Ted seem to have hit all the comments I'd have on the code. In my previous comment I raise my larger concerns about the direction this is going in (firing up the excel sheet to figure out what to attack next).

          Doug Meil added a comment -

          Thanks Stack.

          re: "Reusing the cluster also implies each test cleans up after itself."

          This is the situation now, though. Test classes already tear down the cluster in their @AfterClass method. The factory does all the table cleanup when the cluster is returned. Calling 'cluster.shutdown' and returning the cluster to the factory is the same amount of code (1 line).

          When tests run in parallel, are we talking Test-classes, or test-methods? It's the former, right? And would these be in separate JVMs? If so, I think we're ok.

          If there is a pool of JVMs that gets used throughout a build then this can work. If they are getting set up and torn down then it won't.

          By the way, I think this is exactly the right level of conversation to have on this issue (i.e., not detail code-level, but the larger build issues).

          I'd still like to lobby for something like this because unless cluster-startup and teardown can be amazingly fast, we're leaving 10-15 minutes on the floor easy every time we run the build - and that's just for the common cluster configs.

          stack added a comment -

          This is the situation now, though.

          Sort of. We kill the JVM out from under it so all context is wiped and for sure all is cleaned up – we're not completely dependent on tests being thorough about cleanup. Even still, it looks like we manage to keep running zk ensembles around, which has been responsible for some of the fails up on jenkins of late.

          When tests run in parallel, are we talking Test-classes, or test-methods?

          Classes and NOT in separate jvms. I couldn't make it work. Seems like surefire-plugin is a little buggy doing tests in parallel; it's a newish feature.

          If there are a pool of JVMs that get used throughout a build then this can work.

          This would imply surefire-plugin modifications.

          I'd still like to lobby for something like this because unless cluster-startup and teardown can be amazingly fast, we're leaving 10-15 minutes on the floor easy every time we run the build - and that's just for the common cluster configs.

          Yeah. Making cluster startup and shutdown faster looks like it would take a bit of work (see the notes above: a second is spent putting up hdfs web UIs that are never used, and another second could probably be saved if we didn't create meta and root regions on each startup; I didn't look at shutdown).

          Can we work on aggregating our tests more so that more tests get run every time we spin up a cluster (too many tests spin up a cluster and then run only one test)? I was looking through your spreadsheet to see what to coalesce.

          Jesse Yates added a comment -

          we're not completely dependent on tests being thorough about cleanup

          I don't think it would be completely unreasonable to expect tests to clean up after themselves. It can be a bit of a pain, but resetting static variables, etc. is good practice. That's why the cleanup method is there. If we do go through all the tests and make sure they don't leave a dirty jvm, we could switch to running tests in 'parallel' mode, rather than forked. This gets us faster running tests - you don't tear down and create a new jvm each time and things will more easily actually run in parallel (I think on the backend it just fires up a bunch of threads in the same jvm and starts ticking off tests). It also means we get a more meaningful output from maven - not just the number of fails, but also which classes failed and the stack traces (rather than having to grep through for <<<FAILURE); it's a minor inconvenience but nice to have.

          Classes and NOT in separate jvms. I couldn't make it work. Seems like surefire-plugin is a little buggy doing tests in parallel; its a newish feature.

          I don't think that maven is really good about parallelizing wrt forked tests, but I think we can also parallelize the work among several build servers (though I'm not familiar with the Apache build system, so we may not have multiple servers to distribute work to).

          Can we work on aggregating our tests more so that more tests get run everytime we spin up a cluster

          I just want to make sure we don't start slamming things together just so we cut down on creating clusters - their functionality should actually belong together. I would rather have a slower build and more confidence in the test coverage.

          stack added a comment -

          ...This gets us faster running tests - you don't tear down and create a new jvm each time and things will more easily actually run in parallel (I think on the backend it just fires up a bunch of threads in the same jvm and starts ticking off tests)

          Sure. I just think this will take a bunch of effort, and even then, going by my experience with running surefire in parallel, I think it'll be a while before that works properly... so I don't see this happening soon. I was hoping for some speedups before that.

          ...but I think we can also parallelize the work among several build servers

          I was hoping to speed up tests for all users, not just apache build

          I just want to make sure we don't start slamming things together just so we cut down on creating clusters....

          I should have added the qualification that we aggregate together tests that properly cohere (as you say above).

          I would rather a slower build and more confidence in the test coverage.

          Agreed

          Ioan Eugen Stan added a comment -

          @Jesse
          We also use the MiniCluster with added conf settings, but they usually stay the same in one set of tests. At the end of the tests we kill the MiniCluster and restart it for another batch with different confs.

          I think it is quite hard to create a perfect testing utility that will avoid restarting the MiniCluster all the time. I also think that reducing the number of restarts on a per-module basis is more practical. After all, users shut down the MiniCluster to clean the tables for the new tests.

          my 2cents.

          Doug Meil added a comment -

          I think the analysis was of assistance to the general cause, but I'm closing this as "won't fix" because this particular approach won't be taken.


            People

            • Assignee:
              Doug Meil
              Reporter:
              Doug Meil
            • Votes:
              0
              Watchers:
              4
