BIGTOP-1192

Add utilities to facilitate cluster failure testing into bigtop-test-framework

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 0.8.0
    • Component/s: tests
    • Labels:

      Description

      The goal is to provide Bigtop module maintainers with a set of util classes to help develop smoke tests able to simulate certain failures during smoke test execution on a cluster.

      Summary of what is provided in the current patch.

      The following failure types are supported now:

      • Service stopped and restarted (on given set of nodes)
      • Service killed with 'kill -9' and started back up (on given set of nodes)
      • Node inbound/outbound connections are shut down and brought back up (via iptables).
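
      The failure types above ultimately reduce to a handful of remote commands executed as root over ssh. A minimal sketch of what those commands might look like; the helper name `build_failure_cmd` and the exact pkill/iptables invocations are illustrative assumptions, not code from the patch:

      ```shell
      #!/bin/sh
      # Hypothetical helper (not from the patch): map each failure type to
      # the remote command it would run on a target node.
      build_failure_cmd() {
        action="$1"; svc="$2"
        case "$action" in
          stop)    echo "service $svc stop" ;;
          start)   echo "service $svc start" ;;
          kill)    echo "pkill -9 -f $svc" ;;
          netdown) echo "iptables -A INPUT -j DROP && iptables -A OUTPUT -j DROP" ;;
          netup)   echo "iptables -D INPUT -j DROP && iptables -D OUTPUT -j DROP" ;;
          *)       return 1 ;;
        esac
      }

      # Each command would be run as root over ssh on every target node, e.g.:
      #   ssh -i "$BIGTOP_SMOKES_CLUSTER_IDENTITY_FILE" root@host "$(build_failure_cmd stop crond)"
      ```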

      System requirements to run smoke tests with failures.

      • password-less (PKI-based) root ssh to all nodes in the cluster being tested is assumed.
      • for local tests, like ClusterFailuresTest, one should have password-less root ssh to localhost.
      • the env variable BIGTOP_SMOKES_CLUSTER_IDENTITY_FILE should point to the corresponding private key file.
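
      For example, a minimal setup might look like this (the key path is a placeholder, not a value from the patch):

      ```shell
      #!/bin/sh
      # Point the framework at the cluster's private key (path is an example).
      export BIGTOP_SMOKES_CLUSTER_IDENTITY_FILE="$HOME/.ssh/bigtop_cluster_key"

      # Sanity-check password-less root ssh to a node before running the smokes:
      #   ssh -i "$BIGTOP_SMOKES_CLUSTER_IDENTITY_FILE" -o BatchMode=yes root@node1 true
      ```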

      Further thoughts (not included in this patch)
      Cluster provisioning

      • The Bigtop test framework (the failures part of it) doesn't need to know about the cluster topology, as it simply executes a set of SSH commands on remote hosts (whose addresses are provided by the specific
        module smoke test developer). But the actual tests do need to know about the cluster topology to run sophisticated failure scenarios.
      1. BIGTOP-1192.1.patch
        23 kB
        Mikhail Antonov
      2. BIGTOP-1192.2.patch
        28 kB
        Mikhail Antonov
      3. BIGTOP-1192.3.patch
        29 kB
        Mikhail Antonov
      4. BIGTOP-1192.4.patch
        36 kB
        Mikhail Antonov
      5. BIGTOP-1192.patch
        27 kB
        Konstantin Boudnik
      6. BIGTOP-1192.patch
        29 kB
        Mikhail Antonov
      7. BIGTOP-1192.patch
        28 kB
        Mikhail Antonov

        Activity

        rvs Roman Shaposhnik added a comment -

        This sounds extremely useful! Would love to see the patches (even if work-in-progress ones).

        Also, you may want to take a look at BIGTOP-635 and see if this is something you may be interested in tackling.

        mantonov Mikhail Antonov added a comment - edited

        I'll attach the patch as soon as I have my unit tests for this part of the test framework passing.

        Having taken a look at BIGTOP-635 - yes, that does sound very interesting to work on (and based on the number of watchers there's high demand for such functionality), and anyway some work in this direction is going on. Especially since nobody is going to run smoke tests manually but from Jenkins slaves, automation is needed.

        That JIRA, BIGTOP-635, however, looks way broader in scope and requires more design and discussion. Acting with a "GTD" attitude, I'd like to finish this one, which is a few simple util classes, and then discuss BIGTOP-635.

        rvs Roman Shaposhnik added a comment -

        Mikhail Antonov, makes perfect sense. I didn't mean to suggest that BIGTOP-635 is somehow a prerequisite. Just to let you know that it exists and awaits.

        mantonov Mikhail Antonov added a comment -

        First version of the patch for review. Details are also added in the JIRA description.

        cos Konstantin Boudnik added a comment - edited
        • expecting passwordless slogin as root in a system is too much to ask, in my opinion. Even development machines won't be lax'd like that. I'd suggest expecting passwordless sudo for certain commands for a specific user (e.g. jenkins, testuser).
        • usability-wise, it would be good if the test can make some assumptions about the location of the identity file, e.g. ~/.ssh/id_dsa or ~/.ssh/id_rsa, instead of always forcing the setting of the env variable. In other words, if the key can be found in the standard location - let's try to run the unit test with it. Otherwise, fall back to the current approach.
        • test requirements need to be expressed upfront, e.g. in a README file or similar
        • asserts should have meaningful messages
        • Minor formatting comment: there should be empty lines around ASL license URL (e.g. http://www.apache.org/licenses/LICENSE-2.0.html)
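
        The identity-file fallback suggested in the second bullet could be sketched roughly like this (`resolve_identity_file` is a hypothetical helper, not code from the patch):

        ```shell
        #!/bin/sh
        # Hypothetical sketch: explicit env variable wins, then the standard
        # OpenSSH key locations, otherwise fail so the caller can report it.
        resolve_identity_file() {
          if [ -n "$BIGTOP_SMOKES_CLUSTER_IDENTITY_FILE" ]; then
            echo "$BIGTOP_SMOKES_CLUSTER_IDENTITY_FILE"   # explicit override
          elif [ -f "$HOME/.ssh/id_rsa" ]; then
            echo "$HOME/.ssh/id_rsa"                      # standard location
          elif [ -f "$HOME/.ssh/id_dsa" ]; then
            echo "$HOME/.ssh/id_dsa"
          else
            return 1                                      # nothing found
          fi
        }
        ```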
        mantonov Mikhail Antonov added a comment -

        >>> usability-wise, it would be good if the test can make some assumptions about the location of the identity file, e.g. ~/.ssh/id_dsa or ~/.ssh/id_rsa, instead of always forcing the setting of the env variable. In other words, if the key can be found in the standard location - let's try to run the unit test with it. Otherwise, fall back to the current approach

        I thought it may be better to have people set it explicitly. If someone (like me) has many ssh keypairs and the default one is the wrong one, then ssh (on a typical setup) will prompt for a password, and the user will think something is wrong with his SSH server or so?

        mantonov Mikhail Antonov added a comment - edited

        Thanks for the comments!

        Second version of the patch attached; fixed everything described in Cos's comment (except the env var usability note, for now).

        mantonov Mikhail Antonov added a comment -

        Upon some thinking - should we have a bash script for all that setup?

        rvs Roman Shaposhnik added a comment -

        Mikhail Antonov A couple of high-level comments:

        1. ssh-ing & the like – this is where the connection with BIGTOP-635 comes into play. What I had in mind is that things like NetworkShutdownFailure would actually utilize a generic cluster manipulation framework instead of explicitly calling ssh or whatever. Now, I'm not saying that as a first cut ssh is bad – rather, instead of calling it directly there probably should be an interface (part of BIGTOP-635) whose methods get called to perform these actions. Otherwise you'll end up with things like new ServiceKilledFailure(["localhost"], "crond") in real tests – IOW things that need to know host names, etc., instead of simply referring to the topology of the cluster in an abstract manner and calling methods on various nodes that are part of that topology
        2. "should we have bash script for all that setup?" – absolutely! In fact, better yet – we probably should create a puppet module. That's how we set up our clusters anyway.
        mantonov Mikhail Antonov added a comment - edited

        Roman,

        1) Agree that a more convenient interface would be better. For now, though, within the scope of this JIRA and with the "get a specific problem addressed" approach, it's probably ok to have 2 layers of abstraction - the lower being the Shell class to execute arbitrary commands, the higher being like "restart service S1 on hosts H1, H2, H3". If there's demand for more types of failures, then it will definitely be worth further improvement.

        2) Regarding "where to keep and specify host names etc". The code in bigtop-test-framework (itest), at the level of abstraction in this patch, doesn't really need to know the cluster topology - it operates at the level of executing a certain logical command, like "restart service S1" on hosts H1, H2, H3. Where the host names come from, itest doesn't care for now. But the actual module smoke tests definitely should.

        So I would say - these are 2 different issues we need to address.

        One is having an API to execute a well-defined set of logical commands against a specified list of nodes. That is, I guess, what this JIRA is about (and probably there are other types of failures to be added to this list later on, as the need arises - for example, I don't know, "run some program on node H1 which eats up almost all memory/CPU/network bandwidth to softly shake the services").

        The second is to have a way to describe the cluster's logical topology - network, nodes, roles, services etc. - and be able to access it from the tests. That is what is needed to be able to run real complex smoke tests from Jenkins builds in a flexible way. I guess that's the next step (which I'd also be glad to contribute to).

        cos Konstantin Boudnik added a comment -

        many ssh keypairs and the default one is the wrong one

        Well, sure. But I am talking about the case when one runs iTest unit tests, e.g. in a localhost environment. In this particular case - according to the earlier comment about the use of sudo - you won't even need to slogin anywhere.

        For a truly distributed setup your argument might be valid and extra setup would be required. But I am concerned about unit tests - those should be effortless and not expect anything special to be set up, if possible.

        mantonov Mikhail Antonov added a comment - edited

        Agree. Will make changes and roll out the next version of the patch soon. Also I'd say that likewise, if the bigtop_smokes_user env var isn't set, it defaults to whoami.

        One thing to clarify - are you suggesting that the local unit tests (part of itest) shouldn't even go through ssh to localhost, but just directly execute commands?

        cos Konstantin Boudnik added a comment - edited

        Yes, I think local unit tests should be fast and simple and validate the functionality of your code, rather than the ability of an ssh client to connect somewhere. I.e. they should be unit, not system tests.

        Another consideration: in many cases laptops aren't even carrying ssh server. In which case the original implementation won't fly at all.

        mantonov Mikhail Antonov added a comment - edited

        3rd version of the patch attached. ClusterFailuresTest doesn't require any env vars or sshd running at all now; fixed documentation, refactoring.

        Made sure that after these changes the tests-over-ssh are still working.

        mantonov Mikhail Antonov added a comment -

        A question to folks - would it be useful for someone to have a failure type which generates high cpu/memory/network/IO load on the node? I can see value in it.

        cos Konstantin Boudnik added a comment -

        would it be useful for someone to have failure type, which generates high cpu/memory/network/IO

        Indeed, let's open a separate ticket for it.

        mantonov Mikhail Antonov added a comment -

        BIGTOP-1198
        cos Konstantin Boudnik added a comment -

        BTW, you don't need to explicitly mark your patches with suffixes or the like - JIRA will take care of everything based on the timestamp of the files.

        mantonov Mikhail Antonov added a comment -

        Thanks, will make a note for the future.

        cos Konstantin Boudnik added a comment - edited

        Ok, looks like the 3rd time isn't a charm yet.

        • on Ubuntu, crond doesn't exist. It is called cron
        • I'd recommend declaring a final variable for the service name and using it elsewhere instead of hardcoding the service name
        • the output of the service command might differ across distro variants. E.g. Ubuntu says cron start/running... whereas on CentOS it would be crond ... is running
        • for whatever reason, when I am looking into the rootShell output, all the text I'm getting is the 0th element, not the 2nd
        • the word running is coming out without parentheses. The regexp /.(running)./ doesn't match anything on Ubuntu (and it looks like it won't on CentOS either)
        • /.inactive (dead)./ won't work on Ubuntu, nor on CentOS (where it will say something like crond is stopped), if I am not mistaken.
        cos Konstantin Boudnik added a comment -

        I just checked Fedora 15 and the output of the service status command is like

        [root@localhost ~]# service crond status
        Redirecting to /bin/systemctl  status crond.service
        crond.service - Command Scheduler
                  Loaded: loaded (/lib/systemd/system/crond.service)
                  Active: active (running) since Thu, 31 Oct 2013 08:05:50 -0400; 2 months and 29 days ago
                Main PID: 853 (crond)
                  CGroup: name=systemd:/system/crond.service
                          └ 853 /usr/sbin/crond -n
        

        So, the 2nd line won't work on it either.

        I'd resort to using exit codes, perhaps?
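
        A minimal sketch of the exit-code approach (`is_service_running` is a hypothetical helper; LSB init scripts conventionally return 0 from `status` when the service is running and non-zero otherwise, which avoids parsing distro-specific status text):

        ```shell
        #!/bin/sh
        # Hypothetical sketch: decide running/stopped by the exit code of
        # `service <name> status`, not by matching its textual output.
        is_service_running() {
          service "$1" status >/dev/null 2>&1
        }

        # usage: is_service_running crond && echo "up" || echo "down"
        ```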

        mantonov Mikhail Antonov added a comment -

        Yep, true. Will update the patch soon.

        mantonov Mikhail Antonov added a comment -

        Attaching a new version of the patch, addressing the last feedback (the patch spans 3 local commits).

        cos Konstantin Boudnik added a comment -

        Looks better!

        • I guess this
              switch (OS.linux_flavor) {
                case ~/(?is).*(ubuntu|debian).*/:
                  CRON_SERVICE = "cron"
                  break
                case ~/(?is).*(redhat|centos|rhel|fedora|enterpriseenterpriseserver).*/:
                  CRON_SERVICE = "crond"
                  break
                case ~/(?is).*(suse|sles|sled).*/:
                  CRON_SERVICE = "cron"
                default:
                  CRON_SERVICE = "cron"
              }
          

          can be simplified to something like

              switch (OS.linux_flavor) {
                case ~/(?is).*(redhat|centos|rhel|fedora|enterpriseenterpriseserver).*/:
                  CRON_SERVICE = "crond"
                  break
                default:
                  CRON_SERVICE = "cron"
              }
          
        • I am seeing testServiceKilled failing on my machine with
          java.lang.AssertionError: cron hasn't been killed as expected:. Expression: this.isCronRunning()
          whenever it runs along with the other tests. Does it require a particular order of execution between this one and testServiceRestart? If so, iTest provides an extension for JUnit to set up test execution order.

        I think we are almost there!

        mantonov Mikhail Antonov added a comment - edited

        Hm, weird. It definitely shouldn't require a particular order of execution, and I've never seen this kind of error on my machine, running the tests for this class or the whole bigtop-test-framework.

        Are you just running it like mvn -Dtest=ClusterFailuresTest test?

        Oh, it's using timeouts, maybe that's the reason... But 3 seconds "should be enough for everyone"

        cos Konstantin Boudnik added a comment -

        I am just running all tests in failures package from my IDEA.

        cos Konstantin Boudnik added a comment -

        Well, as it turned out the test case is failing because Ubuntu's init is immediately re-spawning core services like cron and ssh. Hence the test is failing. I am not sure what to do about it.

        mantonov Mikhail Antonov added a comment -

        Unless someone can propose a better solution, I'd suggest adding a trivial check in the test case and not running this particular test method on Ubuntu, but on other OSes. Everything else should work.
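
        Such a check might look roughly like this (sketched in shell for brevity; in the actual Groovy test it would presumably match against OS.linux_flavor, and the function name here is hypothetical):

        ```shell
        #!/bin/sh
        # Hypothetical sketch: skip the kill -9 test on distros whose init
        # immediately re-spawns core services like cron and ssh.
        skip_kill_test() {
          case "$1" in
            *[Uu]buntu*|*[Dd]ebian*) return 0 ;;  # init re-spawns the service; skip
            *)                       return 1 ;;  # run the kill -9 test
          esac
        }
        ```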

        mantonov Mikhail Antonov added a comment -

        For an additional check, I spawned 2 small boxes in DigitalOcean with vanilla CentOS 6.4 and Ubuntu 13.10. The mentioned kill/restart stuff works on CentOS, doesn't work on Ubuntu (i.e. on Ubuntu it restarts). Don't have a SUSE VM handy.

        cos Konstantin Boudnik added a comment -

        I am fine with the special check stop-gap measure.

        mantonov Mikhail Antonov added a comment -

        Attached the last version of the patch, with the test disabled for Ubuntu and Debian users.

        mantonov Mikhail Antonov added a comment -

        Added some logging, reattaching.

        cos Konstantin Boudnik added a comment -

        I guess the one last question I have is whether the proposed solution works for our purposes on Ubuntu systems, e.g. for killing non-core services such as the hadoop namenode?

        I presume the answer is yes, because we don't do anything especially crazy about registering those with any sort of system watchdogs. But I would love to hear others' opinions. Also, I have a SLES VM and will check this fix on it tomorrow.

        mantonov Mikhail Antonov added a comment - edited

        The purpose of all that is to enable smoke tests which validate that after a failure of some process running on a node, the cluster service remains available, e.g. failover/quorum etc. are working - right? In that sense, if hadoop services are also being watch-dogged on Ubuntu, this particular "kill -9" test doesn't help much (it will not fail, but will just do nothing meaningful), and a restart-based test would be better used instead.

        (Generally, we should probably also add a failure which reboots or shuts down the whole box.)

        cos Konstantin Boudnik added a comment -

        Ok, the patch is ok and everything is running smoothly on my Ubuntu box.
        I have found a couple of whitespace changes in the README file, but I've fixed them locally (the patch will be attached in a minute) and will commit it myself.

        cos Konstantin Boudnik added a comment -

        Committed to master.

        Thanks Mikhail!

        cos Konstantin Boudnik added a comment -

        Fixing the whitespace formatting.


          People

          • Assignee:
            mantonov Mikhail Antonov
            Reporter:
            mantonov Mikhail Antonov
          • Votes:
            0
            Watchers:
            4
