Just to be clear: the current code requires more than a compelling proposal. Progress toward one or more of these roles must be implemented before a version of this tool is committed. I remain -1.
Todd and Steve's suggestions characterize this tool as a mix of Chukwa (and related causal tracing projects), HOD, dynamic analysis, system benchmarks, Hudson, and unit tests for the bin scripts. Practically, a "scaled-back" version of Circus aspires to fewer, concrete goals so that it may achieve some of them.
I think the two of us have different expectations for a tool like this, and perhaps we'll never agree. You want a strict framework and find an "execution engine" to be uninteresting for the Hadoop distribution. I think this tool doubles as a system testing tool that we as a community can contribute tests to (you even said that distcp tests at scale are manual; they don't have to be), in addition to an execution engine that Hadoop users can take advantage of for reasons I've already stated.
The assertion that this is the correct form is speculative at best. Todd also counts its potential role as a magnet for other system tests as a win, but I see no argument for why Hadoop should standardize on this particular driver for its integration tests.
I can tell you that Circus will be useful within Cloudera, and it will be useful for several of our support customers.
I'll take your word for it, but that doesn't mitigate the burden of demonstrating why this should be included in every version of Hadoop.
I think I've done all I can to prove its merit, so perhaps others can weigh in on whether or not such a framework, execution engine, or what have you would be useful for them.
Given the discussion so far, I don't think it's unfair to point out that "what have you" is where this seems to go off the rails. A blank canvas has unbounded potential, but that doesn't make it priceless. Tomorrow, why wouldn't we accept another, equally mature inner loop?
Lastly, I would hope that contrib projects (such as Circus) would be more easily accepted into the distribution, as they don't negatively impact the project at all. Their "optional" nature allows interested users to adopt contrib projects at will, without dirtying the rest of the code base or demanding any other real sacrifices from it.
These criteria are unrealistically weak and the "adding to contrib is free" justification is patently false. Yes, contrib is not as rigorously screened as core, but it's not a public sandbox, either.
Can you provide specific guidance on how I might scale back Circus to be something more useful?
Since you ask, there's a huge space for QA tools, as Steve and Todd have demonstrated, but the "driver" space is boilerplate and saturated. Instead of starting from scratch, you might consider writing Hadoop-centric bindings for other tools, like FindBugs. A study of common mistakes made with the framework and corresponding scans of static user code would avoid wasting grid compute resources on, say, output key class mismatches or deserializations during compares (a sketch of the former follows below). Such a contribution would have obvious applicability and could be included not only in QA pipelines, but also in submission queues. If Circus were a suite of validation tool configurations and extensions to be run over user jobs for performance and correctness violations (single-node), it could easily find a role. It's not everything Circus currently aspires to be, but it's clear how (and why) others would contribute to it and what its users can expect from it. Integrating with two or three tools would also refine its interfaces, so the "context" idea could be fleshed out a little more. Smoke tests, as Owen suggested, would also be useful (and easier to write).
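To make the "output key class mismatch" case concrete, here is a minimal sketch (class and job names are hypothetical) of the kind of user bug such a static scan could flag before a job ever reaches the grid. The Mapper declares Text as its output key type, but the job is configured for LongWritable; Hadoop only reports the disagreement at runtime, as a type-mismatch error from the map task, after cluster resources have already been spent.

    // Hypothetical example of a bug a Hadoop-aware static scan could catch:
    // the Mapper emits Text keys, but the job is configured for LongWritable.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class KeyClassMismatch {

      // Declares Text as the map output key type...
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            context.write(new Text(token), new IntWritable(1));
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "key-class-mismatch");
        job.setMapperClass(TokenMapper.class);
        // ...but the job is configured with LongWritable keys. The mismatch is
        // only detected when the map task writes its first record, not here.
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(IntWritable.class);
        // Input/output paths and job submission omitted for brevity.
      }
    }

A scan over compiled user jars that resolves the Mapper's type parameters and diffs them against the job configuration could reject this at submission time, which is exactly the kind of single-node validation step that could slot into a submission queue.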