Attached dot and png files are what I have figured out so far (rectangular boxes represent capabilities that will be provided by actual packages, and dotted lines represent "optional/recommended" dependencies). Now, I still have a few concerns:
1. I think it is pretty clear by now that the mapreduce dependency has to be on a capability, not an actual package (and then we'll have hadoop-mapreduce declare "Provides:" for that capability). The question is whether we are ready to do the same with hadoop-hdfs and what those capabilities should be called (my proposal is to call them "mapreduce" and "dfs" respectively and make the actual packages hadoop-mapreduce and hadoop-hdfs provide those capabilities for now).
2. For pig, hive, sqoop and mahout the real hard dependency is mapreduce. The dependency on dfs is an optional one (they can run just fine in local mode without ever talking to HDFS). The question is: what's the best mechanism to "recommend" dfs? I know we can do that with Debian packages (the Recommends field), but what about RPM? Finally, are we doing the right thing here by treating dfs as an optional dependency, or should we enforce it to begin with?
3. HBase is a weird case here: at the Maven level they package all of their dependencies (optional or not) into lib/*, so they end up with a whole bunch of jars there that we're currently replacing with symlinks. Not all of those dependencies are needed by HBase in all cases (in fact the only hard dependency there is ZooKeeper), but having dangling symlinks doesn't seem appealing. The question is: what do we do?
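
To make the proposal in points 1 and 2 concrete, here is a rough sketch of what the Debian side could look like. It is only an illustration under the assumptions above: the capability names "mapreduce" and "dfs" are the proposed ones, not something that exists yet, and the stanzas are abbreviated (real binary stanzas in debian/control also need fields such as Architecture and Description).

```
# Abbreviated debian/control stanzas; illustrative only.
# "mapreduce" and "dfs" are the proposed capability (virtual package) names.

Package: hadoop-mapreduce
Provides: mapreduce

Package: hadoop-hdfs
Provides: dfs

# pig needs some mapreduce provider, but only recommends a dfs provider,
# since it can run in local mode without HDFS.
Package: pig
Depends: mapreduce
Recommends: dfs
```

On the RPM side the hard part of this maps directly onto "Provides: mapreduce" in the hadoop-mapreduce spec and "Requires: mapreduce" in the pig spec; what to use in place of the Recommends line is exactly the open question in point 2.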