Bigtop / BIGTOP-1388

Use cluster failure tests during other tests with command line parametrization

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 1.0.0
    • Component/s: tests
    • Labels:

      Description

      User can run a series of cluster failures, such as killing or restarting a service and shutting down the network, during a mapr or longevity test. The goal is to verify that the tests still complete while the failures are occurring. The failures should be specifiable as command line parameters.

      1. BIGTOP-1388.patch
        27 kB
        Dawson Choong
      2. BIGTOP-1388.patch
        22 kB
        Dawson Choong

          Activity

          Dawson Choong added a comment - - edited

          Currently only written for TestDFSIO.groovy. Will apply to TestSLive and MapReduce.

          Added FailureVars - a class that manages variables, objects, and command line parameters for cluster failures.

          Added FailureExecutor - a runnable that is executed by a test such as TestDFSIO. This class spawns cluster failure threads that run in parallel to the hadoop/mapreduce tests.
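
As a rough illustration of the idea (this is not the FailureExecutor code from the patch), a failure-injecting runnable could be started on its own thread next to the test workload. The class name FailureInjectorSketch, the list of failure actions, and the two delay parameters are hypothetical stand-ins chosen for this sketch.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the FailureExecutor from the patch.
// Runs a list of failure actions on a background thread while a workload runs.
final class FailureInjectorSketch implements Runnable {
    private final List<Runnable> failures;   // assumed: each entry triggers one cluster failure
    private final long startDelaySeconds;    // wait before the first failure
    private final long failureDelaySeconds;  // wait between consecutive failures

    FailureInjectorSketch(List<Runnable> failures, long startDelaySeconds, long failureDelaySeconds) {
        this.failures = failures;
        this.startDelaySeconds = startDelaySeconds;
        this.failureDelaySeconds = failureDelaySeconds;
    }

    @Override
    public void run() {
        try {
            TimeUnit.SECONDS.sleep(startDelaySeconds);
            for (int i = 0; i < failures.size(); i++) {
                if (i > 0) {
                    TimeUnit.SECONDS.sleep(failureDelaySeconds);
                }
                failures.get(i).run();
            }
        } catch (InterruptedException e) {
            // Stop injecting failures if the thread is interrupted.
            Thread.currentThread().interrupt();
        }
    }

    // Starts the injector thread, runs the workload on the calling thread,
    // then waits for the injector to finish.
    static void runAlongside(Runnable workload, FailureInjectorSketch injector) {
        Thread injectorThread = new Thread(injector, "failure-injector");
        injectorThread.start();
        workload.run();
        try {
            injectorThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

In a real test the workload would be the Hadoop job itself, and each failure action would wrap whatever mechanism kills a service or drops the network.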

          (By default, all parameters are off or set to false.)
          Features:

          Use -Dhost=name to specify the host being tested on.
          Use -DremoteHost=name to specify the remote host being tested on.
          Use -DrunAll=true to run all cluster failures.
          Use -DserviceRestart=true to perform a cron or crond service restart.
          Use -DserviceKill=true to perform a service kill.
          Use -Dshutdown=true to perform a network shutdown and restart.
          Use -Dservice=name to specify which service is used for restart/kill (default is crond).
          Use -DfailureDelay=time to specify the time (in seconds) between failure functions.
          Use -DstartDelay=time to specify the time (in seconds) before the first failure.
          Use -DkillDuration=time to specify how long (in seconds) a service stays down.
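
A minimal sketch of how flags like the ones listed above might be read on the JVM side, assuming they arrive as system properties. This is not the FailureVars class from the patch: the class name FailureFlagsSketch is hypothetical, and the default of 0 for the three timing values is an assumption (the thread only states that flags default to off and that the service defaults to crond).

```java
import java.util.Properties;

// Hypothetical sketch of reading the -D flags; not the FailureVars class from the patch.
final class FailureFlagsSketch {
    final String host;
    final String remoteHost;
    final boolean runAll;
    final boolean serviceRestart;
    final boolean serviceKill;
    final boolean shutdown;
    final String service;
    final int failureDelay;
    final int startDelay;
    final int killDuration;

    FailureFlagsSketch(Properties p) {
        host = p.getProperty("host");              // null when not supplied
        remoteHost = p.getProperty("remoteHost");  // null when not supplied
        // Boolean flags are off unless explicitly set to "true".
        runAll = Boolean.parseBoolean(p.getProperty("runAll", "false"));
        serviceRestart = Boolean.parseBoolean(p.getProperty("serviceRestart", "false"));
        serviceKill = Boolean.parseBoolean(p.getProperty("serviceKill", "false"));
        shutdown = Boolean.parseBoolean(p.getProperty("shutdown", "false"));
        service = p.getProperty("service", "crond");
        // A default of 0 seconds is an assumption made for this sketch.
        failureDelay = Integer.parseInt(p.getProperty("failureDelay", "0"));
        startDelay = Integer.parseInt(p.getProperty("startDelay", "0"));
        killDuration = Integer.parseInt(p.getProperty("killDuration", "0"));
    }

    // Reads the values passed on the command line as -Dname=value.
    static FailureFlagsSketch fromSystemProperties() {
        return new FailureFlagsSketch(System.getProperties());
    }
}
```

Taking a Properties object in the constructor keeps the parsing testable without touching the real JVM-wide system properties.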

          Konstantin Boudnik added a comment -

          I haven't looked at the patch yet, but my first idea would be to provide a way to read these properties from a file, making the command line shorter and less cluttered.

          jay vyas added a comment - - edited

          Thanks for adding this patch, Dawson Choong. Possibly, adding your classes to the new bigtop-tests/smoke-tests/ framework and leveraging the Gradle smoke utilities could make for an easier-to-maintain system as well. Since it's one folder per test, you can easily put a stress.properties file into a folder dedicated to your specific tests, where any user can easily configure it.
          See smoke-tests/flume/ for an example of that (this is just a thought/suggestion).

          Konstantin Boudnik added a comment -

          stress.properties would be a runtime thing, as far as I see it. Hence, it doesn't matter where it is placed as long as the test case reads it.

          jay vyas added a comment - - edited

          I also see some Gradle modifications that seem to provide a new way of running the Maven-based tests. Is this somewhat redundant with the existing Gradle wrappers in smoke-tests/ (i.e. BIGTOP-1222)? Not a huge deal, but it would probably be good to note why the existing Gradle wrappers won't work for the task, so we can update them.

          Dawson Choong added a comment -

          Sure thing. I'll work on a patch that accommodates your recommendations. And jay vyas, you're right, there is a bit of redundancy in the Gradle wrappers, but we want to eventually replace the original, hardcoded wrappers with the dynamically generated tasks.

          jay vyas added a comment -

          Sounds good! A couple more questions, though:

          • Regarding the wrappers:

            replace the original, hardcoded wrappers

            Not sure what this is referring to. Can you provide some details, or maybe open a JIRA for this task?

          • Regarding this code: the comment says it allows the user to specify build artifacts, but the method is named "runIndividualTests", implying that it runs a test rather than building the tests. In the end, it seems to call mvn clean install, without running the tests (mvn clean verify).
             
            +/**
            + * Allows user to specify which artifacts to build by dynamically generating tasks.
            + */
            +def runIndividualTests = {
            
          • Finally, regarding duplication: if this relates to the smoke-tests module, let me know and we can join our efforts. Maybe my smoke tests can call your wrappers, or you can use mine...
          Dawson Choong added a comment - - edited

          In the root build.gradle you can see that the tasks wrap the Maven clean installs. To my knowledge this is temporary, and we want to eventually move away from Maven entirely. You are correct about the runIndividualTests method name. It should be called something else, as it is installing, not testing. This patch actually only pertains to the bigtop-test-framework module, because that's where all the cluster failure work lies. I hope that answers your questions!

          jay vyas added a comment -

          Okay, I'll wait for the cleaned-up patch and look again.
          I can help test if you add updates to the README in the next patch.

          Dawson Choong added a comment -

          Updated with a properties file for variable management and an updated README. jay vyas, I was a little unsure about what you meant in your first post about including the classes in bigtop-tests/smoke-tests. Did you mean bigtop-smoke-tests?

          jay vyas added a comment -

          Thanks, Dawson. These tests will be quite interesting to run in dynamic environments, like cloud Hadoop deployments.

          was a little unsure about what you meant in your first post about including the classes in bigtop-tests/smoke-tests. Did you mean bigtop-smoke-tests?

          Don't worry about it for now. Once this is done we can take a look at it. I just want to make sure the test-artifacts/smoke-tests don't recreate functionality that exists somewhere else, and vice versa.

          • Looks like you fixed the method name; it makes more sense now.
          • Possibly more updates to the README would be good? This is a pretty advanced test case.

          Unfortunately I don't have a machine on hand that I can test with, as I'm on paternity leave, but I will be back to reality within a few days. If nobody else gets to review the patch by then, I can review it.

          jay vyas added a comment -

          Actually, never mind: the README looks quite good.

          Roman Shaposhnik added a comment -

          So... anything else holding this up?

          Konstantin Boudnik added a comment -

          My review is holding this up: I was spending my cycles on 0.8.0, and now that it is behind us I should be able to review this in the next day or two. If someone feels like looking at it as well, please go nuts!

          jay vyas added a comment -

          Hi folks, is this ready to go in?
          If so, I can review it thoroughly tonight.

          jay vyas added a comment - - edited

          +1

          Hi Dawson Choong, the code looks good.

          • FYI, I applied the patch using patch -p1 < ...., so you might want to resubmit using git format-patch to make sure your author info is correct, etc. Otherwise, let me know what to use for the "--author" flag when we commit it.
          • Can we please accompany this feature with a comprehensive wiki page that details how to use the cluster failure tests? I've added the stubs for you at https://cwiki.apache.org/confluence/display/BIGTOP/Running+integration+and+system+tests and mentioned your name in the wiki page. I'd like to gate committing this patch on an update to cos' wiki page on how to run the integration tests.
          • I'm currently running the smoke tests after applying your patch and will confirm that they still work in the morning. In the meantime, can you update the wiki page above?

          Thanks!

          jay vyas added a comment - - edited

          Done running the smoke tests, and they still work, so I've confirmed that the patch applies and the existing tests still pass, which is the only concern I would have. Now, once you
          (1) give me the name for the commit (--author) flag, and
          (2) update the wiki page with an explanation of this new feature, why it is useful, etc. (don't replicate your README, just explain the cluster failure tests at a high level with an example)
          ... then we commit! Sounds good?

          jay vyas added a comment -

          Hi Dawson Choong, let me know (see above) if you want me to commit the patch.

          Dawson Choong added a comment -

          Hi jay vyas

          1. --author=dawson.choong@wandisco.com
          2. I am unable to edit the wiki page. It appears that I no longer have editing privileges. Should I reapply for these privileges, or maybe you could copy/paste it for me?

          (begin wiki)
          Cluster Failure Tests is a feature that allows users to test the completion of mapreduce jobs while the cluster's nodes undergo various failures. These failures include killing node services, restarting services, and dropping the network.

          To test with Cluster Failures, run a test with the "useProperties" parameter set to "true." For instance, we run TestDFSIO with -DuseProperties=true:

          mvn verify -f bigtop-tests/test-execution/longevity/pom.xml -DuseProperties=true -Dorg.apache.maven-failsafe-plugin.testInclude='**/TestDFSIO*'
          

          The behavior of the cluster failures as well as other node configurations can be modified in the properties file. This file can be found in bigtop-test-framework/.../resources/. Here is an example configuration for the file:

          testhost=localhost
          testremotehost=company.org
          runall=true
          servicerestart=true 
          servicekill=true 
          networkshutdown=true
          service=crond
          failuredelay=5
          startdelay=10
          killduration=10
          

          For more information, please refer to the readme (/bigtop-test-framework/README)
          (end wiki)
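
For readers who want to see how a file like the example above could be consumed, here is a small sketch using java.util.Properties. It is not taken from the patch: the class FailurePropertiesSketch and its helper methods are hypothetical, and the sketch parses text from a string rather than guessing at the file's real location. One detail worth knowing is that Properties.load keeps trailing whitespace in values, so the sketch trims before parsing.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch of parsing a cluster-failure properties file; not code from the patch.
final class FailurePropertiesSketch {
    // Parses properties-file text such as "servicekill=true\nkillduration=10".
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Missing keys count as false; values are trimmed because
    // Properties.load preserves trailing whitespace ("true " would parse as false).
    static boolean flag(Properties props, String key) {
        return Boolean.parseBoolean(props.getProperty(key, "false").trim());
    }

    // Missing keys count as 0 seconds (an assumption made for this sketch).
    static int seconds(Properties props, String key) {
        return Integer.parseInt(props.getProperty(key, "0").trim());
    }
}
```

To read an actual file, the same load call accepts a java.io.FileReader pointed at wherever the properties file lives in your checkout.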

          jay vyas added a comment -

          Hi: that's a good technical description, but I think a higher-level description of (1) how you're killing services and (2) what the expected behaviour is when services die is what I was looking for.

          I'm committing your patch now!

          jay vyas added a comment -

          Okay Dawson Choong, committed! Thanks. And as mentioned,

          • Please provide a nice description of this interesting new feature you've added here, and I can copy it into a wiki page for you.
          • Remember to also describe what the outcome of these tests should be: should the YARN jobs fail gracefully? Should they pass? Should they only pass in certain cases? If node services are being killed, how? Etc. I realize that some of the work you are providing controllers for isn't in this patch, but I assume you have a handle on how to describe the implementation.
          Dawson Choong added a comment -

          The purpose of this test is to check whether mapreduce jobs complete while nodes of the cluster performing the job are being failed. When these cluster failures are applied, the mapreduce job should complete with no issues. If mapreduce jobs fail as a result of any of the cluster failure tests, the user may not have a functional cluster or implementation of mapreduce.

          The node service behavior and the network connection are controlled by a series of shell commands. When the user chooses to kill and start a service, the program executes a pkill command followed by a service start command. Restarting the network is handled by a set of iptables commands.
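
To make the description above more concrete, here is a sketch of the kind of command sequences it describes. The exact arguments used by the patch are not shown in this thread, so everything below is an assumption: the class FailureCommandsSketch is hypothetical, and the pkill, service, and iptables invocations are just common forms of those commands. The sketch only builds argument lists; it does not execute anything.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: builds (but does not run) shell commands of the kind described above.
// The exact commands used by the patch are not shown in the thread.
final class FailureCommandsSketch {
    // Kill a service by process name, then start it again via the service manager.
    static List<List<String>> killThenStart(String service) {
        return Arrays.asList(
            Arrays.asList("pkill", service),
            Arrays.asList("service", service, "start"));
    }

    // Drop incoming traffic from a host, then delete the same rule to restore it.
    static List<List<String>> dropThenRestoreNetwork(String remoteHost) {
        return Arrays.asList(
            Arrays.asList("iptables", "-A", "INPUT", "-s", remoteHost, "-j", "DROP"),
            Arrays.asList("iptables", "-D", "INPUT", "-s", remoteHost, "-j", "DROP"));
    }
}
```

Each inner list could be handed to java.lang.ProcessBuilder, with a pause of the configured kill duration between the first and second command of a pair. Commands like these typically need root privileges on the target node.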

          Konstantin Boudnik added a comment - - edited

          The premise, I believe, is to have a non-failing job on an HA-configured cluster, right? Because if you drop the only NN then, for all practical purposes, you'll end up with a dead cluster.

          jay vyas added a comment -

          Konstantin Boudnik and Dawson Choong, let's follow up in BIGTOP-1487.

          Konstantin Boudnik added a comment -

          Dudes, are you aware that the patch isn't compilable because FailureExecutor is sitting under bigtop-tests/test-artifacts/longevity/src/main/groovy/org/apache/bigtop/itest/iolongevity/FailureExecutor.groovy ??

          jay vyas added a comment - - edited

          Thanks, Konstantin Boudnik, for catching this. It looks like we now have BIGTOP-1513 to fix it. I only checked that there wasn't a regression in the existing smoke tests; I didn't actually run the failure executor.

          Konstantin Boudnik added a comment -

          Understood. The compilation of hadoop smoke is failing, though.


            People

            • Assignee: Dawson Choong
            • Reporter: Dawson Choong
            • Votes: 0
            • Watchers: 4