Bigtop / BIGTOP-1521

Bigtop smoke-tests hierarchy and fast failure

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: 1.0.0
    • Component/s: tests
    • Labels:
      None

      Description

      Problem: Sometimes YARN jobs can hang indefinitely, and in the case of the smoke tests it appears we can also get an infinite hang.

      This can be reproduced by simply breaking or deleting the core Hadoop components in the Puppet configuration written by the provision script (provision.sh) in bigtop-deploy/vm/vagrant-puppet, and then running vagrant up.

      Solution: Let's add some smarts to the smoke tester, so that the basic YARN services are confirmed to be up before any YARN-based tests are run (I think hadoop-smoke in test-artifacts may already do this).

      1. BIGTOP-1521.patch
        4 kB
        Dasha Boudnik

          Activity

          cos Konstantin Boudnik added a comment -

          jay vyas, I think the easier way to deal with this sort of problem is to introduce timeouts to our tests. Thoughts?

          jayunit100 jay vyas added a comment -

          I agree that timeouts can prevent stalls, and they are a simple and easy way to solve this.

          But they are indirect, so let's make sure to add a good error message to timeout failures, so that when a test times out it is clear what the most likely cause is.

          cos Konstantin Boudnik added a comment -

          Absolutely: the errors should be as verbose and informative as possible.
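
A minimal sketch of the idea discussed above: a timeout whose failure message names the most likely cause. It is written in Python purely for illustration (the Bigtop tests themselves are Groovy), and `run_with_timeout`, `SmokeTestTimeout`, and the hint text are hypothetical names, not part of the attached patch.

```python
import subprocess
import sys


class SmokeTestTimeout(AssertionError):
    """Raised when a smoke-test command exceeds its time budget."""


def run_with_timeout(cmd, timeout_s, hint):
    """Run cmd and return the completed process.

    If the command does not finish within timeout_s seconds, the child is
    killed and a SmokeTestTimeout is raised whose message includes `hint`,
    so the reader of the test report sees a probable cause, not just a stall.
    """
    try:
        return subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired as exc:
        raise SmokeTestTimeout(
            f"{' '.join(cmd)} did not finish within {timeout_s}s. "
            f"Most likely cause: {hint}"
        ) from exc
```

A caller might wrap a command such as `hadoop fs -ls /` as `run_with_timeout(["hadoop", "fs", "-ls", "/"], 60, "HDFS services are not running")`; the 60-second budget and the hint wording are placeholders.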

          dasha.boudnik Dasha Boudnik added a comment -

          jay vyas, which tests did you have in mind for this, exactly?

          jayunit100 jay vyas added a comment -

          Dasha Boudnik, well, for example, I think hadoop fs -ls commands, if failing, should trigger all tests to fail.

          Likewise, if a simple pi 2 2 job fails, all other tests should fail.

          There is no point in testing Pig or Hive if MapReduce is not working.

          So we have a DAG of sorts. It should be quite possible to implement, given the fine-grained and easily scripted controls that Gradle gives us with task dependencies.
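
The dependency idea in this comment can be sketched as follows. This is an illustration in Python, not the Gradle task wiring the comment refers to; the test names, the `DEPENDS_ON` map, and `run_in_dependency_order` are all hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical dependency map: each test lists the tests it relies on.
DEPENDS_ON = {
    "hdfs_ls": set(),
    "mapreduce_pi": {"hdfs_ls"},
    "pig": {"mapreduce_pi"},
    "hive": {"mapreduce_pi"},
}


def run_in_dependency_order(checks, depends_on):
    """Run checks in dependency order.

    `checks` maps a test name to a zero-argument callable returning True
    on success. A test whose prerequisite did not pass is marked failed
    without being executed.
    """
    results = {}
    for name in TopologicalSorter(depends_on).static_order():
        broken = sorted(
            dep for dep in depends_on.get(name, ()) if results[dep] != "PASS"
        )
        if broken:
            results[name] = f"FAIL (prerequisite {broken[0]} did not pass)"
        else:
            results[name] = "PASS" if checks[name]() else "FAIL"
    return results
```

With this shape, a failing `mapreduce_pi` check causes `pig` and `hive` to be reported as failed immediately, without waiting on jobs that cannot succeed.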

          dasha.boudnik Dasha Boudnik added a comment -

          jay vyas, please correct me if I'm wrong, but that's a bit different from what you were initially talking about, isn't it? The problem with indefinitely hanging jobs is that they WON'T fail, so we need a way to MAKE them fail, right?

          cos Konstantin Boudnik added a comment -

          Actually, to build a DAG of sorts for the tests you'd have to expose the dependency knowledge about the tests all the way up to the build system, and then you'd have to find a way to keep it in sync with the changing tests. Also, arguably, a purpose of the integration tests is to run them all and then see what functionality (out of many) has failed. It's like configuring a C compiler not to fail at the first error.

          jayunit100 jay vyas added a comment - - edited

          Okay, you're right, guys. My original thought was:

          • Test if a component is basically functioning before doing deeper tests (i.e. confirm the NodeManager is up before running a YARN job).
          • Then somehow my twisted brain turned that into a DAG, which is a bit silly.

          Now, coming full circle: the purpose of this JIRA can be met, albeit inelegantly, with a simple timeout, as Konstantin Boudnik mentions; that should solve the problem for us. Let's leave the fancy test dependency and pre-testing ideas for another day.
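
For reference, the deferred pre-check idea (confirm a NodeManager is up before running a YARN job) could look roughly like the sketch below. It is in Python for illustration only. It assumes that `yarn node -list` prints a `Total Nodes:<n>` line for running nodes; that output format is an assumption to verify against the Hadoop version in use. The function names are hypothetical.

```python
import re
import subprocess


def count_running_nodemanagers(list_output):
    """Parse the node count from `yarn node -list` output.

    Assumes a line of the form 'Total Nodes:<n>'; returns 0 if absent.
    """
    match = re.search(r"Total Nodes:\s*(\d+)", list_output)
    return int(match.group(1)) if match else 0


def _yarn_node_list():
    # Bounded so that the pre-check itself cannot hang indefinitely.
    return subprocess.run(
        ["yarn", "node", "-list"], capture_output=True, text=True, timeout=60
    ).stdout


def require_nodemanager(run=_yarn_node_list):
    """Fail fast with a clear message if no NodeManager is reported."""
    count = count_running_nodemanagers(run())
    if count < 1:
        raise RuntimeError(
            "No running NodeManager reported by 'yarn node -list'; "
            "YARN-based tests would wait for resources indefinitely."
        )
    return count
```

The `run` parameter exists so the check can be exercised without a cluster by passing a function that returns canned output.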

          dasha.boudnik Dasha Boudnik added a comment -

          Ha, okay, cool! So now we come back again to my question: which tests did you have in mind?

          jayunit100 jay vyas added a comment -

          As the original purpose was to prevent infinite starvation, I think any test which waits on a resource (for example, any YARN job) is a candidate for patching up here.

          cos Konstantin Boudnik added a comment -

          As an action plan, I'd pick a small cluster (say 3 nodes), run all the Hadoop tests we have, and time them. Then set the timeout on them at, say, the 300% level. Something along these lines should be enough, no?
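
The arithmetic behind this plan is simple: each timeout is three times the measured runtime. A small Python sketch, with hypothetical test names and made-up timings (not real measurements from any cluster):

```python
import math


def timeout_ms(measured_seconds, factor=3.0):
    """Return a timeout in milliseconds: `factor` times the measured runtime."""
    if measured_seconds <= 0:
        raise ValueError("measured runtime must be positive")
    return math.ceil(measured_seconds * factor * 1000)


# Hypothetical timings in seconds, for illustration only.
measured = {"example_test_a": 40, "example_test_b": 12}
budgets = {name: timeout_ms(seconds) for name, seconds in measured.items()}
```

Milliseconds are used here because JUnit 4's `@Test(timeout = ...)` parameter takes milliseconds; whether the tests use that mechanism is not stated in this thread.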

          dasha.boudnik Dasha Boudnik added a comment -

          Patch attached. I timed each test and set its timeout to 300% of its measured runtime. Everything passes with these changes. Edited tests:

          bigtop-tests/test-artifacts/hadoop/src/main/groovy/org/apache/bigtop/itest/hadoop/mapreduce/TestHadoopExamples.groovy
          bigtop-tests/test-artifacts/hadoop/src/main/groovy/org/apache/bigtop/itest/hadoop/mapreduce/TestHadoopSmoke.groovy
          bigtop-tests/test-artifacts/hadoop/src/main/groovy/org/apache/bigtop/itest/hadoop/yarn/TestNode.groovy
          bigtop-tests/test-artifacts/hadoop/src/main/groovy/org/apache/bigtop/itest/hadoop/yarn/TestRmAdmin.groovy
          cos Konstantin Boudnik added a comment -

          +1

          dasha.boudnik Dasha Boudnik added a comment -

          jay vyas, anything we're missing here? If not, I'll commit.

          jayunit100 jay vyas added a comment -

          go for it

          dasha.boudnik Dasha Boudnik added a comment -

          Committed and pushed. Thanks!


            People

            • Assignee: Dasha Boudnik
            • Reporter: jay vyas
            • Votes: 0
            • Watchers: 4
