IMO the root issue is that we are not using dependencies correctly.
Absolutely. Hadoop's dependency setup is atrocious in 0.20.205 and 0.22. I haven't looked at 0.23 in enough detail yet, but I would love to see the situation fixed.
I have a project that needs to read from and write to HDFS. Declaring the hadoop dependency pulls in all of Jetty, the Tomcat JSP compiler, and a dozen other jars that I have to exclude manually.
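For concreteness, this is roughly what my POM looks like today (the version and exact exclusion list are from my project and may not match yours):

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>0.20.205.0</version>
      <exclusions>
        <!-- none of these are needed just to talk to HDFS -->
        <exclusion>
          <groupId>org.mortbay.jetty</groupId>
          <artifactId>jetty</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.mortbay.jetty</groupId>
          <artifactId>jetty-util</artifactId>
        </exclusion>
        <exclusion>
          <groupId>tomcat</groupId>
          <artifactId>jasper-compiler</artifactId>
        </exclusion>
        <exclusion>
          <groupId>tomcat</groupId>
          <artifactId>jasper-runtime</artifactId>
        </exclusion>
        <!-- ...and so on for the rest of the server-side jars -->
      </exclusions>
    </dependency>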
The same mess needs to be avoided for MapReduce.
Building larger jars that package dependencies inside them is OK for some use cases, but worthless for any real application that has any chance of a dependency conflict. Things like Jetty should be marked as provided scope, not compile (or perhaps optional).
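A sketch of what that could look like in Hadoop's own POM (the Jetty version shown is the one 0.20.205 ships with, as far as I can tell). Since provided-scope dependencies are not propagated transitively, clients would no longer inherit Jetty:

    <!-- Sketch: only a node actually running the framework needs
         Jetty on its classpath, so clients should not inherit it. -->
    <dependency>
      <groupId>org.mortbay.jetty</groupId>
      <artifactId>jetty</artifactId>
      <version>6.1.26</version>
      <scope>provided</scope>
    </dependency>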
There should be a hadoop-client artifact that allows me to code and run HDFS/MR client apps, with exactly the set of transitive dependencies those apps need (i.e., none of the Jetty stuff).
IMO, we need an hdfs-api.jar and a mapreduce-api.jar that pull in only what is needed to build an application that uses those APIs as a client. A user should be able to declare those in their project and have only the transitive dependencies needed for those use cases pulled in, nothing extra. One could even go to the extreme of having a mapred-api.jar and a mapreduce-api.jar with the old and new APIs separated (plus a mapreduce-common-api.jar they both depend on) if that were a bigger use case. More modularization will be a great benefit to users when combined with using dependencies properly in Hadoop itself.
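A client POM under that proposal would be as simple as this (the artifactIds are the hypothetical ones named above, and the version is illustrative):

    <!-- Hypothetical: these artifacts do not exist yet -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hdfs-api</artifactId>
      <version>0.23.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>mapreduce-api</artifactId>
      <version>0.23.0</version>
    </dependency>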
The fact that under the hood these 'hadoop-client' & 'hadoop-test' components pull in 1 or 100 Hadoop JARs is irrelevant (although IMO we have too many JARs).
Yes, if the artifacts are configured properly, with the right dependencies in the correct scope (e.g., Jetty in provided scope, since only someone trying to run the framework needs it, not clients), then there is only one artifact to declare for each use. It is not the total number of jars that matters, it is the total size of the jars. Finer-grained control of dependencies by users is a good thing. As a user I want to declare what I need as simply as possible ("I need to launch a mini-MR cluster during tests, so I need hadoop-mr-test.jar"; "I need to submit a job to a cluster, so I need mr-client.jar"). What that means behind the scenes in the total jar count of transitive dependencies is a different issue entirely, as long as it pulls in only what is needed and no useless baggage (Jetty, Tomcat's compiler, etc.).
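The first of those two declarations would then look like this (again, a hypothetical artifactId taken from the quote above, with an illustrative version):

    <!-- Hypothetical: mini-MR harness wanted only at test time -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mr-test</artifactId>
      <version>0.23.0</version>
      <scope>test</scope>
    </dependency>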
There is no need to package 'fat jars' unless you wish to have a single artifact for uses where tooling does not build the classpath for you.
Regarding the second bullet item in my previous comment: it seems this is possible via a classifier ( http://maven.apache.org/plugins/maven-shade-plugin/examples/attached-artifact.html ), though this is kind of uncommon for commonly used artifacts.
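The configuration from that page boils down to keeping the plain jar as the main artifact and attaching the fat jar under a classifier; the classifier name here is just an example:

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <!-- publish the plain jar as the primary artifact and
                 the jar-with-dependencies under a classifier -->
            <shadedArtifactAttached>true</shadedArtifactAttached>
            <shadedClassifierName>jar-with-dependencies</shadedClassifierName>
          </configuration>
        </execution>
      </executions>
    </plugin>

Consumers that want the bundled jar then add <classifier>jar-with-dependencies</classifier> to their dependency declaration; everyone else gets the plain jar by default.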
I support using an attached artifact with a classifier for any jars that bundle their dependencies. However, it is an anti-pattern to put a jar with dependencies into a Maven repo as the primary artifact (unless you relocate those dependencies into a private package namespace to avoid conflicts).
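By a private namespace I mean shade-plugin relocation, which rewrites the bundled classes' package names so they cannot collide with another copy on the user's classpath. Roughly (Jetty is just the example here, and the shaded prefix is made up):

    <configuration>
      <relocations>
        <!-- rewrite bundled Jetty packages so they cannot conflict
             with another Jetty copy on the application classpath -->
        <relocation>
          <pattern>org.mortbay.jetty</pattern>
          <shadedPattern>org.apache.hadoop.shaded.org.mortbay.jetty</shadedPattern>
        </relocation>
      </relocations>
    </configuration>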