What follows are some thoughts I have on the general situation in Hadoop of adding big projects like Hive to core/contrib. I don't think this is a scalable way forward and would like to use this submission as an opportunity to discuss the general challenges involved in welcoming new projects into the Hadoop family.
We've now seen 3 Hadoop projects take different courses with different results:
1) HBASE - This went into contrib. It sat there for a number of months in active development before becoming a subproject. ADVANTAGES: Good publicity for project. DISADVANTAGES: Since it was very active, it frequently broke the hadoop core build and became a significant fraction of hadoop-dev message traffic. This was somewhat disruptive to core development. IMO this does not scale. If we had several such projects running at once in core/contrib they would drown out the main dev community.
2) Pig - Pig went directly into the apache incubator and has ambitions to graduate to a Hadoop sub-project. ADVANTAGES: Low overhead to the hadoop community, lots of training for the committers on the Apache way. DISADVANTAGES: Less visible than HBASE, high upfront investment in project setup, review, committer training, approval, ...
3) ZooKeeper - Its developers shared it outside of Apache under the BSD and then Apache licenses, first as a posting on the Yahoo Research website and then as a SourceForge project. ADVANTAGES: Super low cost to start, fewer restrictions on sharing code than incubation, ... DISADVANTAGES: Less visible than HBASE.
From these experiences, I think checking major projects that build on Hadoop into core/contrib is not the most productive way to host them. If they are active, they can be very disruptive during their formation. A project should have its own email lists, tests, branches, etc., independent of Hadoop mainline.
The main advantage of putting projects in core seems to be increasing their visibility to the Hadoop community. I'd suggest we discuss other mechanisms. Long term I hope to see something like cpan.org emerge for Hadoop. But short term we have not identified an entity to host such a site.
Absent that, I'd suggest a project like Hive take either the path ZooKeeper or Pig took. As a community we could take some simple steps to address the shortcomings of these approaches. An obvious step would be to invest in a well-linked Wiki section that provides a directory of such projects.
What do folks think of this? Other thoughts? Suggestions?