I think the better solution at this point is to move to Hadoop 0.21 as part of the next release.
-1 on this for now. (If I recollect correctly, Ted had concerns about this move as well.)
At the risk of sounding like a stuck record: nobody I know of is using 0.21. 0.21 is not production grade, which was acknowledged even by the Hadoop team.
It is true that 0.21 is roughly a superset of CDH, but it potentially contains features CDH doesn't have, so building against 0.21 does not guarantee everything will work with CDH, and it almost certainly guarantees that bulk jobs will not work on EMR.
We use both EMR and CDH. If you bump up the dependencies, as things stand now, it will absolutely preclude us from using future versions of Mahout. I could probably adapt some of the code we use with CDH to verify it still works there, but not en masse. If I really wanted to use some of the migrated algorithms and take advantage of various fixes, I would have to maintain massive private patches to keep things working (similar to what Cloudera does). We probably don't have the capacity for that, so I would just have to stop using trunk or future Mahout distributions until better times.
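For concreteness, the kind of private patching I mean starts with a dependency override along these lines — a hedged sketch of a Maven pom fragment pinning Hadoop to a CDH artifact instead of stock 0.21 (the version string and Cloudera repository URL here are illustrative assumptions, not something Mahout's pom actually ships with):

```xml
<!-- Hypothetical pom.xml fragment: build against a CDH Hadoop artifact
     instead of stock Apache 0.21. Version and repository URL are
     assumptions for illustration only. -->
<repositories>
  <repository>
    <id>cloudera-repos</id>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>0.20.2-cdh3u0</version> <!-- assumed CDH version string -->
  </dependency>
</dependencies>
```

And that is just the build side; any code using 0.21-only APIs would still need source-level patches on top of this.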
I know for sure we will never use 0.21 the way it is released.
There's probably more hope for the next generation of Hadoop, which would combine the ability to run old MR, new MR, or something else entirely. In fact, I am looking forward to porting to and using that future Hadoop generation, as it would allow us to scrap many unnecessary limitations that MR imposes on parallel use and that are holding back performance on many algorithms (especially the linear algebra ones).