I am a little late to the party here but:
It is a big issue for downstream users. Projects that use Hadoop already pick up a lot of jars, and increasing that set when all of the versions are the same is a problem. We'll also have users mixing different versions of the jars, which won't help.
Having a source structure that requires an IDE to use isn't making the code easy for people to browse, use, and modify. It will also become a maintenance problem as the dependency graph between the components changes.
Yes, you can munge the results together into a single jar as part of the build, but I don't see how having lots of little directories makes development easier or faster.
I disagree. It is a huge issue as a downstream user when the jar granularity is not fine enough. You don't have to manually pick each jar, so the total number is not the issue. If set up correctly, a user will only pick the one or maybe two jars needed for their use case, and maven/ivy/etc. pulls in the transitive dependencies for you with the correct versions. It is a MUCH bigger risk if, as a user, I can't build the package I want, excluding the stuff I don't need, without a lot of trouble. It is not the number of jars that is the problem; it is the total size of all of them and the likelihood of version mismatches with transitive dependencies. The current issue is not that projects that use Hadoop 'pick up a lot of jars', it is that they 'pick up a lot of jars that are not needed at all'.
A few 'top level' jars that are useful for various use cases as single points of inclusion would be perfect. This does not imply few jars total; it implies a few that you choose to declare for your use cases, and they can pull in any number of other shared Hadoop jars that are required for those use cases. It doesn't matter if they are 'the same version'; the user does not need to know, since maven handles that, and maven best practices make many jars with the 'same version' a non-issue.
A user pulls in a mapreduce client jar, and that might also pull in a couple 'common' jars from the same project. That is the intended maven best practice. If the mapreduce client jar were to bundle common stuff in it, and that same common stuff were bundled in, say, an hdfs-client jar, then you risk all sorts of trouble as a downstream user: multiple colliding classes on your classpath, the inability to have the tooling (maven) detect and deal with conflicts appropriately, etc. If it were to bundle stuff that is not useful as a client, that would bloat client application jars and potentially pull in useless transitive dependencies.
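To make that concrete, a downstream POM under this scheme would only need something like the declaration below, and maven resolves the project's 'common' jars transitively at the versions the client jar declares. The artifact name here is hypothetical, just to show the shape:

    <!-- Downstream user's pom.xml: declare only the client-facing artifact.
         Artifact name is hypothetical; common jars come in transitively. -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client</artifactId>
      <version>0.23.0-SNAPSHOT</version>
    </dependency>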
If the jars are reduced into only a few big blobs, it will end up more like the absolutely atrocious maven dependency management in 0.20.205 and 0.22.x, where a user who just wants to build a mapreduce program pulls in 20MB of dependency jars that are not needed, unless they manually exclude them.
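For reference, this is roughly the boilerplate downstream users are forced into today just to keep server-side baggage out of a job jar; the exclusion targets shown are only illustrative:

    <!-- What a downstream POM ends up looking like when the artifact is too
         coarse: manually excluding server-only dependencies one by one.
         Exclusion targets are illustrative. -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>0.20.205.0</version>
      <exclusions>
        <exclusion>
          <groupId>org.mortbay.jetty</groupId>
          <artifactId>jetty</artifactId>
        </exclusion>
        <exclusion>
          <groupId>tomcat</groupId>
          <artifactId>jasper-runtime</artifactId>
        </exclusion>
        <!-- ...and so on for every server-only dependency -->
      </exclusions>
    </dependency>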
Having more source trees is a slight development burden, but it enforces the right encapsulation and organization of dependencies. One of the benefits of organizing modules in maven is that the end result almost always leads to clearer code boundaries and better architectural separation of concerns. It also helps define API boundaries and prevents accidentally creating leaky abstractions/APIs.
Thinking more on it, I am inclined to keep the modules separate as they currently are, instead of combining the source trees.
I count the number of modules to be 10-12, so the number of source trees should not be 59, or am I missing something?
The separate modules do help identify the boundaries more clearly and help in enforcing them. Separation based only on java packages is loose; I know this from the unnecessary pain I went through when I was working on the project split 2 years ago. In the future, refactoring code or doing things like rewriting the NM in C++ will be least intrusive with the current module structure.
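As a rough sketch of what that enforcement looks like in practice (module names are illustrative), a module can only compile against the modules it explicitly declares, so a boundary violation fails the build instead of slipping in through a package import:

    <!-- Parent aggregator pom.xml: module names are illustrative -->
    <modules>
      <module>hadoop-yarn-common</module>
      <module>hadoop-yarn-server-nodemanager</module>
      <module>hadoop-mapreduce-client</module>
    </modules>

    <!-- hadoop-yarn-server-nodemanager/pom.xml: the NM module only sees
         classes from modules it declares, keeping the boundary honest. -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-yarn-common</artifactId>
      <version>${project.version}</version>
    </dependency>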
If the number of jars is the problem, can we just merge the jars at build time the way we want, using the maven shade plugin or some such?
I agree. You can use the shade plugin to make a few 'fat' jars for some use cases that live alongside the normal artifacts, which do not embed any dependencies.
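Something along these lines would produce an attached 'fat' jar under a classifier, next to the normal thin artifact rather than replacing it (plugin version is just for illustration):

    <!-- maven-shade-plugin configured to attach the fat jar under a
         classifier, so the primary artifact stays dependency-free. -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>1.4</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <shadedArtifactAttached>true</shadedArtifactAttached>
            <shadedClassifierName>fat</shadedClassifierName>
          </configuration>
        </execution>
      </executions>
    </plugin>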
Please, please don't put any jars in a maven repo that bundle dependencies unless they are attached artifacts and not the primary artifact.
Please, please declare the dependencies properly, using 'optional' or 'provided' scope as appropriate to prevent downstream users from pulling in artifacts transitively that a client user does not need.
I believe that too few jars is worse than too many, when the two items above are done correctly (i.e. maven best practices are followed). Then, as a downstream user, I can easily select the features I want and trust that the dependencies pulled into my project transitively as a consequence of, say, pulling in a mapreduce client jar are only the jars needed as a mapreduce client and not the entire freaking hadoop framework or any other extra unnecessary baggage.
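To spell out what 'declare the dependencies properly' looks like in a POM: 'provided' keeps a compile-time-only dependency off downstream classpaths, and 'optional' stops a feature-specific dependency from propagating transitively. The second artifact below is hypothetical, just to show the shape:

    <!-- The container supplies the servlet API at runtime, so 'provided'
         keeps it off downstream users' classpaths. -->
    <dependency>
      <groupId>javax.servlet</groupId>
      <artifactId>servlet-api</artifactId>
      <version>2.5</version>
      <scope>provided</scope>
    </dependency>
    <!-- Hypothetical feature-specific dependency: 'optional' means downstream
         projects only get it if they declare it themselves. -->
    <dependency>
      <groupId>org.example</groupId>
      <artifactId>optional-codec</artifactId>
      <version>1.0</version>
      <optional>true</optional>
    </dependency>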