Hadoop Common / HADOOP-6248

Circus: Proposal and Preliminary Code for a Hadoop System Testing Framework

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: test
    • Labels: None
    • Environment: Python, bash

      Description

      This issue contains a proposal and preliminary source code for Circus, a Hadoop system testing framework. At a high level, Circus will help Hadoop users and QA engineers run system tests against a configurable Hadoop cluster or distribution of Hadoop. See the comment below for the proposal itself.

      1. HADOOP-6248.diff
        47 kB
        Alex Loddengaard
      2. HADOOP-6248_v3.diff
        47 kB
        Alex Loddengaard
      3. HADOOP-6248_v2.diff
        47 kB
        Alex Loddengaard


          Activity

          Alex Loddengaard added a comment -

          It has been brought to my attention that Yahoo! has been working on a Java system testing framework that has the same goals as Circus. I'm going to go ahead and resolve this issue as "Won't Fix." I would have appreciated it if a JIRA had been opened for this new system, to avoid duplicate work.

          Chris, Allen, Owen, Todd, and Steve, I think this new system testing framework would greatly benefit from your feedback. I hope you are all as involved in their JIRA as you were in mine.

          Thanks for all the feedback and involvement, everyone. I'm going to stop work on Circus with the assumption that this new framework will be adopted by the community and continually improved to ultimately make Hadoop's QA process better.

          Allen Wittenauer added a comment -

          For performance testing, I am not seeing what this gives me over gridmix. There simply isn't enough here to generate a load that can be measured and compared and contrasted.

          For QA testing of my application, why wouldn't I just try to run my job on a smaller chunk of representative data that has a known outcome? Also: if I've already written a mapreduce job, surely I'm competent enough to run get/put/mkdir/etc.

          The fact that it requires python: that's a big fail in my book. Maybe I missed it, but I'm not seeing anything that couldn't be handled in shell, never mind the fact that perl is much more universally installed everywhere. [Yes, I'm on a portability kick. ;) ]

          At this point, I don't think this is ready and/or has a clear goal in mind.

          Chris Douglas added a comment -

          Just to be clear: the current code requires more than a compelling proposal. Progress toward one or more of these roles must be implemented before a version of this tool is committed. I remain -1.

          Todd and Steve's suggestions characterize this tool as a mix of Chukwa (and related causal tracing projects), HOD, dynamic analysis, system benchmarks, Hudson, and unit tests for the bin scripts. Practically, a "scaled back" version of Circus aspires to fewer, concrete goals so that it may achieve some of them.

          > I think the two of us have different expectations for a tool like this, and perhaps we'll never agree. You want a strict framework and find an "execution engine" to be uninteresting for the Hadoop distribution. I think this tool doubles as a system testing tool that we as a community can contribute tests to (you even said that distcp tests at scale are manual; they don't have to be), in addition to an execution engine that Hadoop users can take advantage of for reasons I've already stated.

          The assertion that this is the correct form is speculative at best. While Todd also identifies its role as a magnet for other system tests as a win, I see no argument for why Hadoop should standardize on this particular driver for its integration tests.

          > I can tell you that Circus will be useful within Cloudera, and it will be useful for several of our support customers.

          I'll take your word for it, but that doesn't mitigate the burden of demonstrating why this should be included in every version of Hadoop.

          > I think I've done all I can to prove its merit, so perhaps others can weigh in on whether or not such a framework, execution engine, or what have you would be useful for them.

          Given the discussion so far, I don't think it's unfair to point out that "what have you" is where this seems to go off the rails. A blank canvas has unbounded potential, but that doesn't make it priceless. Tomorrow, why wouldn't we accept another, equally mature inner loop?

          > Lastly, I would hope that contrib projects (such as Circus) would be more easily accepted into the distribution, as they don't negatively impact the project at all. Their "optional" nature allows users who are interested to use contrib projects at will, while not dirtying or making any other real sacrifices to the rest of the code base.

          These criteria are unrealistically weak and the "adding to contrib is free" justification is patently false. Yes, contrib is not as rigorously screened as core, but it's not a public sandbox, either.

          > can you provide specific guidance on how I might scale back Circus to be something more useful?

          Since you ask, there's a huge space for QA tools, as Steve and Todd have demonstrated, but the "driver" space is boilerplate and saturated. Instead of starting from scratch, you might consider writing Hadoop-centric bindings for other tools, like Findbugs. A study of common mistakes made with the framework and corresponding scans of static user code would avoid wasting grid compute resources on, say, output key class mismatches or deserializations during compares. Such a contribution would have obvious applicability and could be included not only in QA pipelines, but also in submission queues. If Circus were a suite of validation tool configurations and extensions to be run over user jobs for performance and correctness violations (single-node), it could easily find a role. It's not everything Circus currently aspires to be, but it's clear how (and why) others would contribute to it and what its users can expect from it. Integrating with two or three tools would also refine its interfaces, so the "context" idea could be fleshed out a little more. Smoke tests, as Owen suggested, would also be useful (and easier to write).

          Todd Lipcon added a comment -

          > Performance testing is not something you can do on a virtual system, and tricky w/ functional tests. Benchmarking is a separate problem. You shouldn't be using your functional tests to assess performance, as the functional tests are looking at the corner cases, trying to break things, not simulate well-behaved code.

          +1 - however, I think Circus should focus on being a general enough testing framework that it can be used for both functional tests and performance tests. Some tests can certainly be geared towards testing corner cases - for example, starting a three-node pseudo-distributed cluster and then kill -STOPping one of the nodes at an inopportune time. Other tests can be handy wrappers for existing benchmark suites like TestDFSIO or Gridmix. Yahoo may have this already internally, but it would be swell to have a day-by-day graph of trunk's performance on a variety of standard jobs.
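
          (Purely as an illustration of the kill -STOP idea, not anything in the attached patch: the Python fragment below suspends a DataNode mid-run and later resumes it, assuming a local pseudo-distributed cluster and the stock jps tool on the PATH; all names here are hypothetical.)

            # Hypothetical sketch: freeze a daemon at an inopportune moment, then resume it.
            import os, signal, subprocess, time

            def pid_of(daemon_name):
                # jps prints lines like "12345 DataNode"
                for line in subprocess.check_output(["jps"]).decode().splitlines():
                    pid, _, name = line.partition(" ")
                    if name.strip() == daemon_name:
                        return int(pid)
                return None

            pid = pid_of("DataNode")
            if pid is not None:
                os.kill(pid, signal.SIGSTOP)   # suspend the DataNode mid-job
                time.sleep(30)                 # give the running job time to notice the stall
                os.kill(pid, signal.SIGCONT)   # resume; the test then checks the job still succeeds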

          Steve Loughran added a comment -
          1. I do functional testing with my smartfrog wrapper around the lifecycle-enabled version of Hadoop. What I have not done yet, but which fully automated test frameworks can do, is have the test framework try out different configuration options, to explore the configuration space. That's fairly bleeding edge in system testing, especially when that configuration space includes the (virtualized) network infrastructure too, but it is a fantastic way to find bugs.
          2. I do think it's critical that Hadoop tests the means by which it gets started up. For hadoop.sh that means the shell scripts need to run. For me, that means bringing up some machines with the right RPMs installed, pushing out the config to them, and then running functional tests against a live cluster.
          3. There's lots of scope for doing really interesting reporting here. You do want the logs from 8 different machines all displayed in the test reports, in the best temporal order that Lamport will allow.
          4. Long term, there's lots of scope for data mining the test results.
          5. Log analysis is tricky. You don't want tests that are brittle against log messages; that leads to many false failures.
          6. Performance testing is not something you can do on a virtual system, and tricky w/ functional tests. Benchmarking is a separate problem. You shouldn't be using your functional tests to assess performance, as the functional tests are looking at the corner cases, trying to break things, not simulate well-behaved code.
          Todd Lipcon added a comment -

          Hey Alex,

          Here are a couple ideas I think would make Circus a more useful tool beyond being a simple wrapper for scripts:

          • Automatic log analysis - if it can collect the logs from the daemons involved in a test and turn out a report of any WARN or ERROR level messages, that would be very useful. With this I think each test would need some kind of whitelist ability for expected errors (e.g. a test of write pipeline failure would expect warnings that the connection had been lost). This is starting to tread on territory of other contrib projects, so I think the differentiator would have to be that it's clearly aimed towards testing and scaling to historical data is not within scope.
          • Automatic performance metrics - the simplest of these would be keeping track of the wall clock time for the tests. This would help identify performance regressions on a macro scale. Additionally, one advantage of not using Java is that you can easily wrap the daemons themselves with the "time" command to keep track of CPU time, sys time, and wall clock time separately and reliably. In a similar vein, you could add hooks for running all of the daemons or tasks inside a profiler.
          • Automatic configuration parameter testing - users (and developers) often go through a workflow of Configure -> Deploy -> Test -> Reconfigure -> Deploy -> etc. Being able to very easily write a script that can run a given circus test with bracketed configurations and dump the results in a standard (e.g. TSV) format would be incredibly useful (a rough sketch follows this list).
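
          As a rough, non-authoritative sketch of that last idea: the loop below sweeps one parameter, reruns a single test per value, and prints one TSV row per run. bin/circus and its exit-status behavior come from the attached patch, but the --test and --conf flags and the "wordcount" test name are hypothetical, purely for illustration.

            # Hypothetical parameter sweep around bin/circus; the flags are illustrative only.
            import subprocess, sys, time

            PARAM = "io.sort.mb"
            VALUES = [50, 100, 200, 400]

            print("\t".join([PARAM, "exit_status", "wall_clock_seconds"]))
            for value in VALUES:
                start = time.time()
                # bin/circus exits non-zero when a test fails (per this patch).
                status = subprocess.call(["bin/circus", "--test", "wordcount",
                                          "--conf", "%s=%d" % (PARAM, value)])
                elapsed = time.time() - start
                sys.stdout.write("%d\t%d\t%.1f\n" % (value, status, elapsed))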

          That said, I agree that it's best to solicit feedback early and include this as a contrib project even in its current state. Though it currently does little beyond what a skilled user can do with bash, it holds a lot of promise for future extension. Getting users in the habit of developing their system tests within this framework now will benefit everyone down the road when some of the above features have been implemented.

          Alex Loddengaard added a comment -

          Thanks for the feedback, Chris. I think we're having a Buddhism vs. Christianity argument. Read: I think the two of us have different expectations for a tool like this, and perhaps we'll never agree. You want a strict framework and find an "execution engine" to be uninteresting for the Hadoop distribution. I think this tool doubles as a system testing tool that we as a community can contribute tests to (you even said that distcp tests at scale are manual; they don't have to be), in addition to an execution engine that Hadoop users can take advantage of for reasons I've already stated. I'm also unsure how a "scaled back" system would be any more interesting or useful. Perhaps you could provide some suggestions on how I might scale this back, and how such a scaled back version would be handy?

          I can tell you that Circus will be useful within Cloudera, and it will be useful for several of our support customers. I think I've done all I can to prove its merit, so perhaps others can weigh in on whether or not such a framework, execution engine, or what have you would be useful for them. Lastly, I would hope that contrib projects (such as Circus) would be more easily accepted into the distribution, as they don't negatively impact the project at all. Their "optional" nature allows users who are interested to use contrib projects at will, while not dirtying or making any other real sacrifices to the rest of the code base.

          In summary, Chris, can you provide specific guidance on how I might scale back Circus to be something more useful? Others, could you please comment on whether or not Circus would be useful to you, and at least provide some guidance for how I can make it better?

          Thanks again for your feedback, Chris.

          Chris Douglas added a comment -

          > As the proposal states, this is a framework, with enough context examples and tests to show how the framework is used

          Frameworks impose a discipline on the end user. They make decisions about the admissible form of a solution and propose a model for conceiving of problems within its space. In return, the user isn't merely relieved of the burden of writing boilerplate code, but they're offered a compelling way to think about their problem. Map/reduce is a good example. It forces a particular model of parallel execution on the user, frustrating people who want to use it as a resource allocator for a different parallel model, but for some problems, it's an admissible, productive abstraction in addition to a way to avoid writing all the intermediate code. The latter is nice, but the former is what makes it successful.

          The concepts "context" and "test" in Circus are too vague to admit the possibility of discipline, and because the tool makes no bold choices, it has no taste. It's an execution engine equal to any other, a generic "for" loop with semantics. What is the case for selecting these semantics over any others?

          > Circus will let an organization write a context that uses a development cluster of some sort, along with tests that emulate their production jobs, to ensure that their jobs are running as expected on their development cluster. Then, by simply switching contexts, the organization can run all of their jobs on a different version of Hadoop.

          This solves the wrong side of the problem, unless the deltas are small, e.g. one is trying to test whether a release of Hadoop 0.x from provider P will work like release 0.x from provider Q: a contribution of questionable interest to Apache Hadoop. Cross-version compatibility still has too many corner cases to usefully distill into a "context", and similarly "as expected" has too many dimensions to express as a binary state. Whether performance is acceptable, configuration appropriate, results accurate, SLAs satisfied, etc. are all useful questions to ask. "The end user can write a shell script to verify any of these" is exactly the point I make above. Organizations need to evaluate all these factors, but I'm skeptical of an attempt to roll all of these questions into a single, automated tool, particularly if the tool begins with this ambition.

          > My game plan is to use Circus to write some interesting system tests that aren't currently in Hadoop's test plan. [...] I expect to tackle testing distcp across different versions of Hadoop and HDFS upgrades.

          Of course this is tested. It's often tested manually and at scale, but the problem is deployment and any necessary investigation, not selecting the distribution and configuring the submitting client.

          > What are your specific objections to calling bin/hadoop-daemon.sh and bin/hadoop, except that doing so is one more level of indirection?

          It's only one more layer of indirection. As I said earlier, part of this is a packaging problem: if we had a service API to start/stop/etc. Hadoop from the client, then one could more easily develop tools like this while adhering to some sort of contract. Because services are started and stopped via opaque shell scripts, Hadoop is failing in the way I describe above, by not providing tool writers with a coherent model for the service. This is some of the motivation behind the service lifecycle branch.

          So this doesn't just need "more;" its premise is unlikely to yield a tool that can be evaluated and included in the distribution. If it were scaled back to solve a particular problem and propose a model for it, it would be more likely to find success, acceptance, and adoption.

          Alex Loddengaard added a comment -

          > This is not ready to commit. It may be useful at some point, but there is currently no value there. It is hard to evaluate what you are trying to accomplish, since this is just an empty shell. There needs to be substantial added value over the current distribution to be worth including a patch in our code base.

          Thanks for the feedback, Owen. My game plan is to use Circus to write some interesting system tests that aren't currently in Hadoop's test plan. Hopefully then I'll have added good value, both from the tests themselves, but also from the framework. I expect to tackle testing distcp across different versions of Hadoop and HDFS upgrades. I'd appreciate your input on other interesting tests I could write.

          > Please review the MiniMRCluster and MiniHDFSCluster. We already have a lot of system tests.

          What I like about Circus is that it uses the shell scripts, which currently don't get tested at all. I also think that Circus is more usable than these Java classes for people that aren't familiar with the Hadoop code base.

          Again, thanks for your feedback. Stay tuned.

          Owen O'Malley added a comment -

          This is not ready to commit. It may be useful at some point, but there is currently no value there. It is hard to evaluate what you are trying to accomplish, since this is just an empty shell. There needs to be substantial added value over the current distribution to be worth including a patch in our code base.

          Please review the MiniMRCluster and MiniHDFSCluster. We already have a lot of system tests.

          It would actually be more useful to have a mock object framework to support real unit tests...

          I'd suggest setting less ambitious goals. A smoke test script that submits some of the examples might make sense, but it would just look like:

          • Write a file to HDFS
          • Run some of the examples from the examples jar (wordcount, rand-writer, sort, sort-validation)
          • Remove files

          But it has nothing to do with system testing, just a light smoke test.
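
          (A minimal sketch of that kind of light smoke test, driving bin/hadoop from Python; the examples-jar location, the input file, and the /smoke directory below are illustrative assumptions, not part of the attached patch.)

            # Hypothetical smoke test: write to HDFS, run a couple of bundled examples, clean up.
            import os, subprocess, sys

            EXAMPLES_JAR = os.environ.get("HADOOP_EXAMPLES", "hadoop-examples.jar")

            def hadoop(*args):
                # Any non-zero exit from bin/hadoop fails the whole smoke test.
                if subprocess.call(["hadoop"] + list(args)) != 0:
                    sys.exit("smoke test failed: hadoop %s" % " ".join(args))

            hadoop("fs", "-mkdir", "/smoke")
            hadoop("fs", "-put", "README.txt", "/smoke/input")   # any small local file will do
            hadoop("jar", EXAMPLES_JAR, "wordcount", "/smoke/input", "/smoke/wc-out")
            hadoop("jar", EXAMPLES_JAR, "pi", "2", "10")          # the canned pi example
            hadoop("fs", "-rmr", "/smoke")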

          It certainly shouldn't be doing downloads or installs. That is just overkill.

          On the other hand, it would be nice to have a patch to build.xml to generate RPMs (HADOOP-6255).

          Alex Loddengaard added a comment -

          Attaching a new diff that fetches a distribution from apache.org, instead of mirror.cloudera.com. Also making one small change to the hdfs-basic test (changing directory to $(dirname $0) at the start).

          Alex Loddengaard added a comment -

          Thanks for the feedback, Chris.

          > The idea is interesting and could beget a useful tool, but the current version is principally a wrapper for default scripts and settings.

          As the proposal states, this is a framework, with enough context examples and tests to show how the framework is used. I agree with you that it is currently a wrapper, but it will immediately cease to be a wrapper when more interesting contexts and tests are written. Since you are a large contributor to Hadoop itself, I would love to hear how you think this tool could make your job easier, if at all. Some of us here at Cloudera, along with at least a few of our customers and users, would value a framework like this. Circus will let an organization write a context that uses a development cluster of some sort, along with tests that emulate their production jobs, to ensure that their jobs are running as expected on their development cluster. Then, by simply switching contexts, the organization can run all of their jobs on a different version of Hadoop. Perhaps I should write a new, more interesting context to prove my point.

          More responses:

          > Don't cut and paste code such as examples.

          Agreed it's silly to copy-paste the word count example. This test is a demonstration that users can compile Java MapReduce programs in their tests. I find it useful in that regard, but I can write a new MapReduce job that isn't an example to demonstrate the compilation use case if you'd like. I chose the word count example specifically so users interested in writing tests would have access to a very simple MapReduce program that is compiled on the fly.

          > Don't wrap the shell scripts with another level of indirection; they do enough of that on their own

          I assume you're referring to the bin/hadoop-daemon.sh and bin/hadoop scripts, right? I argue that not using these scripts would greatly complicate creating new contexts and tests. I want users of Circus to write contexts and tests in a way that they're familiar with; namely, command line tools. Additionally, Circus is meant to test Hadoop end-to-end. Using the shell scripts helps to achieve this goal, especially because Hadoop's unit tests do not test the shell scripts. What are your specific objections to calling bin/hadoop-daemon.sh and bin/hadoop, except that doing so is one more level of indirection?

          > We try not to include references to specific companies. Certainly Hadoop should not be fetched from anywhere but Apache in this distribution.

          Good catch here. While scanning the Apache mirror page, I didn't notice a link to an apache.org site. My mistake.

          Chris Douglas added a comment -
          +RELEASE_URL = "http://mirror.cloudera.com/apache/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz"
          

          -1.

          In its nascent state, this seems to leave a lot as an exercise for the end user. The idea is interesting and could beget a useful tool, but the current version is principally a wrapper for default scripts and settings. It automates some of the lore, which new users will appreciate, but the model is still mostly an idea. Similar issues, where users submit scripts they found themselves circulating, have been rejected in the past.

          Wrapping shell scripts with other scripts is a problem inherent in our current model; if we had Java drivers (e.g. HADOOP-61; yea, it's been discussed), then frameworks like this one could be written more elegantly using other driver tools.

          General issues:

          • Don't cut and paste code such as examples.
          • Don't wrap the shell scripts with another level of indirection; they do enough of that on their own
          • We try not to include references to specific companies. Certainly Hadoop should not be fetched from anywhere but Apache in this distribution.
          Alex Loddengaard added a comment -

          Attaching a new patch with a few modifications:

          1. Changed the canned distribution from Cloudera's distribution to vanilla Hadoop
          2. Changed hdfs-basic tests to not use /etc/passwd and instead create a temporary file
          3. bin/circus will now exit with the appropriate exit status when tests pass or fail

          I should also say a little about my intentions for this patch, and Circus in general. The code I've attached is preliminary, and my proposal has lots of open questions. I'm confident that Circus, as it exists right now, is already useful for Hadoop users and QA engineers. My posed questions are intended to gather useful feedback to improve Circus even more. I'd be completely happy if the preliminary code made it into trunk, with the expectation that I will continue to submit further patches to improve Circus.

          Thanks.

          Alex Loddengaard added a comment -

          It seems as though the release audit warnings have to do with input files, output files, and README files not having Apache license headers. I noticed that other contrib projects lack license headers for these files as well, so I'll leave the patch as is, unless someone suggests otherwise.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12419097/HADOOP-6248.diff
          against trunk revision 812740.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 29 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          -1 release audit. The applied patch generated 5 release audit warnings (more than the trunk's current 0 warnings).

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/27/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/27/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/27/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/27/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-h4.grid.sp2.yahoo.net/27/console

          This message is automatically generated.

          Alex Loddengaard added a comment -

          Attaching preliminary source code.

          Alex Loddengaard added a comment -

          Overview

          This proposal defines a system testing framework for Hadoop, named Circus. Circus will let Hadoop testers and users run one or more tests on a Hadoop "context." A context might be a local pseudo-distributed cluster, an EC2 cluster, two Hadoop clusters (each running a different version of Hadoop), or anything else. A test might be the canned pi example, HDFS manipulations, an HDFS upgrade, custom MapReduce code that is compiled and checked for validity, a distcp across two different versions of Hadoop, or anything else. This proposal will provide more examples of contexts and jobs in its later sections.
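
          (The README in the attached patch defines the actual interfaces. Purely to illustrate the context/test split described above, a hypothetical "local" context driver might look roughly like the sketch below, starting a pseudo-distributed cluster with the stock bin scripts and handing each test nothing but its environment. The function names and the ./tests/hdfs-basic path are invented for this example; only the hdfs-basic test name and the bin scripts come from this issue.)

            # Illustrative only; not the interface defined in the attached README.
            import os, subprocess

            HADOOP_HOME = os.environ["HADOOP_HOME"]   # assumed to be provided by the user
            daemon = os.path.join(HADOOP_HOME, "bin", "hadoop-daemon.sh")

            def start_cluster():
                for d in ("namenode", "datanode", "jobtracker", "tasktracker"):
                    subprocess.check_call([daemon, "start", d])

            def stop_cluster():
                for d in ("tasktracker", "jobtracker", "datanode", "namenode"):
                    subprocess.check_call([daemon, "stop", d])

            def run_test(test_script):
                # Tests are plain executables; the context only sets up their environment.
                return subprocess.call([test_script])

            start_cluster()
            status = run_test("./tests/hdfs-basic")   # hdfs-basic ships with the patch; the path is assumed
            stop_cluster()
            raise SystemExit(status)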

          System Testing vs. Unit Testing

          One might wonder how Circus is different from a unit testing framework such as JUnit. A unit testing framework is meant to run short-lived tests, one after another. Unit tests, by definition, are meant to test small, independent pieces of a larger entity. A system test, in general, is larger than a unit test, in that a system test might run for several hours, analyze logs, run on a large cluster of machines, or test the runtime and performance of a job. We will see more use cases later.

          Motivation and Use Cases

          1. Hadoop users want to know if their MapReduce jobs are compatible with new Hadoop releases. They should have a tool that lets them easily spin up a cluster running a particular version of Hadoop and compile and run MapReduce jobs on that cluster. Ideally users will put their MR jobs in Circus and put Circus in their QA pipeline. Then, when considering an upgrade, they can make minimal changes to test those same jobs in a newer release of Hadoop.
          2. Hadoop users want to know how their MapReduce jobs will perform in a new Hadoop release.
          3. Hadoop QA engineers want to run complicated tests such as HDFS upgrades, distcp jobs between two different versions of HDFS, etc. Testers want these tests to all be part of the same framework so a suite of regression tests can be run with a single command.
          4. Hadoop QA engineers want to continually be running Hadoop jobs to ensure API compatibility between versions. Circus could power a community-driven website, where community members upload Hadoop jobs and sample data that help generate reports on backwards compatibility, both at the API level and at the runtime level.

          Expected Usage

          I imagine that large users of Hadoop can port their MapReduce jobs to work in this framework, and add Circus to their QA pipeline. They might also configure the local context to work with their dev cluster. I imagine Hadoop QA engineers can use Circus to test some of the tricky use cases of Hadoop, such as cross-version distcp jobs, HDFS upgrades, or anything else. Testing these use cases will require that the QA engineer write contexts and tests.

          Implementation Details

          Please see the source code. A README file exists that explains interfaces and other implementation details.

          Open Issues and Questions

          1. Can this framework be useful to Hadoop developers? And if so, how?
            1. Should Circus provide hooks for inducing failure in a Hadoop cluster?
            2. Might a trunk context be useful so developers can spin up a cluster very quickly from their local SVN checkout?
            3. Can the local context allow the user to specify a core jar to be used, instead of a release?
          2. What other canned contexts would be useful besides the local context? Probably an EC2 context.
          3. How can this framework cater to performance testing? Is just providing time information sufficient?
          4. Should tests specify the contexts they are supposed to run in? Seems like tests should at least have the option to. This will be particularly interesting for tests that need to run on two Hadoop clusters.
          5. Should the environment variables specified by a context be validated? That is, should the interface for contexts be more rigid?
          6. Should Circus provide a facility for analyzing the Hadoop logs generated during the run of a test?

            People

            • Assignee: Unassigned
            • Reporter: Alex Loddengaard
            • Votes: 0
            • Watchers: 32
