Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.23.0
    • Fix Version/s: None
    • Component/s: mrv2
    • Labels:
      None

      Description

      Currently the MR-279 mapreduce project generates 59 jars from 59 source roots, which can be dramatically simplified.

        Issue Links

          Activity

          Owen O'Malley added a comment -

          I'd propose that we have:

          mr-client/* -> src/java, src/test
          yarn/yarn-api,yarn-common -> yarn/client
          yarn/yarn-server/* -> yarn/server

          so that we end up with yarn-client, yarn-server, and mapreduce jars. Of course the Java package structure will still separate the different servers from each other.

          Sharad Agarwal added a comment -

          I would prefer to break mr-client into two modules instead of one:

          • mr-client -> jobclient and other user libraries
          • mr-runtime -> MR ApplicationMaster, MapTask, ReduceTask etc.

          This will ensure clear separation between user facing libraries and the runtime.

          Sharad Agarwal added a comment -

          Thinking more on it, I am inclined to keep the modules separate as they are currently, instead of combining the source tree.
          I count the number of modules at 10-12, so the source tree should not have 59 source roots, or am I missing something?

          The separate modules do help identify the boundaries more clearly and help in enforcing them. Separation based only on Java packages is loose; I know this from the unnecessary pain I went through when I was working on the project split 2 years ago. In the future, refactoring code or doing things like rewriting the NM in C++ will be least intrusive with the current module structure.

          If the number of jars is the problem, can we just merge the jars at build time the way we want, using the Maven shade plugin or some such?
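          The merge-at-build-time idea suggested here could be sketched roughly as follows with the maven-shade-plugin. This is a hypothetical configuration, not taken from the actual Hadoop poms; the artifact pattern is illustrative.

```xml
<!-- Sketch: combine the project's own module jars into one artifact at
     package time, while leaving third-party libraries as ordinary
     transitive dependencies. The include pattern is an assumption. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <artifactSet>
          <includes>
            <!-- only the project's own modules go into the combined jar -->
            <include>org.apache.hadoop:yarn-*</include>
          </includes>
        </artifactSet>
      </configuration>
    </execution>
  </executions>
</plugin>
```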

          Luke Lu added a comment -

          If the number of jars is the problem, can we just merge the jars at build time the way we want, using the Maven shade plugin or some such?

          I agree with Sharad; the current module layout is fine. It makes working on individual features faster and easier. People who complain about the number of source roots should improve their IDE fu and/or use a better IDE, IMO.

          From recent conversations with the people involved, I got the impression that it's just a packaging issue, i.e., having 3 combined jars plus dependencies in the distribution tarball: yarn-client, yarn-servers and hadoop-mapreduce. So some maven-shade-plugin fu would suffice.

          Owen O'Malley added a comment -

          It is a big issue for downstream users. Projects that use Hadoop already pick up a lot of jars and increasing the set when all of the versions are the same is a problem. We'll also have users using different versions of the jars, which won't be useful.

          Having a source structure that requires an IDE to use isn't making the code easy for people to browse, use and modify. It will also become a maintenance problem as the dependency graph between the components changes.

          Yes, you can munge the results together into a single jar as part of the build, but I don't see how it makes development easier or faster to have lots of little directories.

          That said, I don't have cycles to do the work right now. If no one else does either, we can postpone the debate.

          Luke Lu added a comment -

          I don't see how it makes development easier or faster to have lots of little directories.

          A smaller module means a smaller code base to start from for a typical feature, and much faster recompiles if one doesn't use an IDE, i.e., just mvn clean install in the module directory. yarn only has 5 modules (including an integration test module); the mapreduce runtime has 6 modules. Is this "lots of little directories" that's out of control?

          Alejandro Abdelnur added a comment -

          59 JARs for a project seems a bit too much; it seems that JARs are being used instead of Java packages to separate classes.

          IMO, a more logical set of JARs is along the lines of what Owen described when opening the JIRA: api, client, server, utils.

          Even if IDEs can handle several source roots, 59 becomes cumbersome. Plus, on the Maven side, that means the reactor has to do much more work to resolve module dependencies, slowing down the build.

          Finally, I advise against merging JARs into one; it significantly complicates troubleshooting.

          Luke Lu added a comment -

          59 JARs for a project seems a bit too much; it seems that JARs are being used instead of Java packages to separate classes.

          No, we don't have 59 jars or 59 source roots. We only have 11 source roots/modules; the rest are dependencies. In fact, we do separate modules mostly at package boundaries.

          Luke Lu added a comment -

          OK, 12 modules as of now:

          1. yarn-api
          2. yarn-common
          3. yarn-server-common
          4. yarn-server-nodemanager
          5. yarn-server-resourcemanager
          6. yarn-server-tests (an integration test module)
          7. hadoop-mapreduce-client-core
          8. hadoop-mapreduce-client-common
          9. hadoop-mapreduce-client-shuffle (shuffle plugin for node manager)
          10. hadoop-mapreduce-client-app (MR app master)
          11. hadoop-mapreduce-client-hs (MR job history server)
          12. hadoop-mapreduce-client-jobclient
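          In Maven terms, the twelve modules above would be wired together by aggregator poms along these lines. This is a sketch only: it assumes directory names match the artifact ids and that the modules are split across two aggregators, neither of which is confirmed by this thread.

```xml
<!-- Hypothetical yarn aggregator pom fragment; directory names are
     assumed to equal the artifact ids listed in the comment above. -->
<modules>
  <module>yarn-api</module>
  <module>yarn-common</module>
  <module>yarn-server-common</module>
  <module>yarn-server-nodemanager</module>
  <module>yarn-server-resourcemanager</module>
  <module>yarn-server-tests</module>
</modules>
```

          A sibling aggregator would list the six hadoop-mapreduce-client-* modules the same way.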
          Alejandro Abdelnur added a comment -

          That seems better.

          I'm not familiar with the MR2 code distribution, but where do we find the MapReduce APIs? That should be a separate JAR, just the MR interface, no?

          Also, when using the client API, do I have to define the dependency for one artifact and I'm done (all the others come as transitive dependencies and are implementation-specific, not exposed to the user)?

          Luke Lu added a comment -

          when using the client API, do I have to define the dependency for one artifact and I'm done (all the others come as transitive dependencies and are implementation-specific, not exposed to the user)?

          If you just need the client API, a dependency on org.apache.hadoop:hadoop-mapreduce-client-jobclient should suffice. If that's not the case, we need to fix it.
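          In pom.xml terms, a client build would then declare just that one artifact and let Maven pull in the rest transitively. The version shown is illustrative (taken from the issue's Affects Version), not a recommendation.

```xml
<!-- Single client-side dependency; everything else arrives transitively. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
  <version>0.23.0</version> <!-- illustrative; use the release you build against -->
</dependency>
```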

          Robert Joseph Evans added a comment -

          There has been no discussion on this for over a month. Does that mean the issue is decided and we are not going to reduce the number of jars? If we do want to change it, we should do it sooner rather than later, because it will be a big refactor and disrupt development again. I personally think we have had enough movement in the code layout and would prefer not to rock the boat any more. Maven and ivy seem to be handling the transitive dependency resolution just fine already, so I don't see a big reason to make the change.

          Luke Lu added a comment -

          I talked to Owen at his office a few weeks ago, before my vacation. I recall we agreed that the ideal modules/jars separation would be yarn-client/server and mapreduce-client/server (mapreduce-server would contain the job history server). But as he mentioned here, he's not pushing the change for 0.23, since the current layout works and he's not working on it.

          We also talked about the dependency issue between shuffle and nodemanager: NodeManager loads a specific version of ShuffleHandler that depends on a specific version of mapreduce-client-core for a specific version of ShuffleHeader. Even though the current separation is made possible via a service plugin mechanism, the undesirable dependency still exists. The solution to that problem is to have a generic shuffle service.

          Luke Lu added a comment -

          I'd like to get some consensus on this issue. What do people think? This is a fairly large source code reorg that's potentially disruptive. We'd better do it sooner rather than later, or not do it at all.

          Mahadev konar added a comment -

          I'd suggest skipping this; it's too late.

          Scott Carey added a comment -

          I am a little late to the party here, but:

          It is a big issue for downstream users. Projects that use Hadoop already pick up a lot of jars and increasing the set when all of the versions are the same is a problem. We'll also have users using different versions of the jars, which won't be useful.

          Having a source structure that requires an IDE to use isn't making the code easy for people to browse, use and modify. It will also become a maintenance problem as the dependency graph between the components changes.

          Yes, you can munge the results together into a single jar as part of the build, but I don't see how it makes development easier or faster to have lots of little directories.

          I disagree. It is a huge issue as a downstream user when the jar granularity is not fine enough. You don't have to manually pick each jar, so the total number is not the issue. If set up correctly, a user will only pick the one or maybe two jars needed for their use case, and maven/ivy/etc. pulls in the transitive dependencies with the correct versions. It is a MUCH bigger risk if, as a user, I can't build the package I want, excluding the stuff I don't need, without a lot of trouble. It is not the number of jars that is the problem; it is the total size of all of them and the likelihood of version mismatches with transitive dependencies. The current issue is not that projects that use Hadoop 'pick up a lot of jars'; it is that they 'pick up a lot of jars that are not needed at all'.

          A few 'top level' jars that are useful for various use cases as single points of inclusion would be perfect. This does not imply few jars total; it implies a few that you choose to declare for your use cases. They can pull in any number of other shared hadoop jars required for those use cases; it doesn't matter if they are 'the same version', since the user does not need to know: maven handles that, and maven best practices make many jars with the 'same version' a non-issue.

          A user pulls in a mapreduce client jar, and that might also pull in a couple 'common' jars from the same project. That is the intended best practice of maven. If the mapreduce client jar were to bundle common stuff in it, and that same common stuff were bundled in say, an hdfs-client jar, then you risk all sorts of trouble as a downstream user with multiple colliding classes on your classpath, the inability to have the tooling (maven) detect and deal with conflicts appropriately, etc. If it were to bundle stuff that is not useful as a client, that would bloat client application jars and potentially pull in useless transitive dependencies.

          If the jars are reduced into only a few big blobs, it will end up more like the absolutely atrocious maven dependency management in 0.20.205 and 0.22.x, where a user who just wants to build a mapreduce program pulls in 20MB of downstream jars that are not needed, unless they manually exclude them.

          Having more source trees is a slight development burden, but enforces the right encapsulation and organization of dependencies. One of the benefits of organizing modules in maven is that the end result almost always leads to more clear code boundaries and better architectural separation of concerns. It also helps define API boundaries and prevent creating leaky abstractions / apis by accident.

          Thinking more on it, I am inclined to keep the modules separate as they are currently, instead of combining the source tree.
          I count the number of modules at 10-12, so the source tree should not have 59 source roots, or am I missing something?

          The separate modules do help identify the boundaries more clearly and help in enforcing them. Separation based only on Java packages is loose; I know this from the unnecessary pain I went through when I was working on the project split 2 years ago. In the future, refactoring code or doing things like rewriting the NM in C++ will be least intrusive with the current module structure.

          If the number of jars is the problem, can we just merge the jars at build time the way we want, using the Maven shade plugin or some such?

          I agree. You can use the shade plugin to make a few 'fat' jars for some use cases that live alongside the normal artifacts that do not embed any dependencies.

          Please, please don't put any jars in a maven repo that bundle dependencies, unless they are attached artifacts and not the primary artifact.
          Please, please declare the dependencies properly, using 'optional' or 'provided' scope as appropriate, to prevent downstream users from pulling in artifacts transitively that a client user does not need.
          I believe that too few jars is worse than too many, when the two items above are done correctly (i.e., maven best practices are followed). Then, as a downstream user, I can easily select the features I want and trust that the dependencies pulled into my project transitively as a consequence of, say, pulling in a mapreduce client jar are only the jars needed as a mapreduce client, and not the entire freaking hadoop framework or any other extra unnecessary baggage.
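          The two practices above could look roughly like this in a pom. This is a hedged sketch: the plugin options are real maven-shade-plugin parameters, but the classifier name and the example dependency are invented for illustration and are not taken from the Hadoop poms.

```xml
<!-- 1. Keep the plain jar as the primary artifact; attach the shaded
     "fat" jar under a classifier instead of replacing it in the repo. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <shadedArtifactAttached>true</shadedArtifactAttached>
    <shadedClassifierName>bundle</shadedClassifierName> <!-- hypothetical name -->
  </configuration>
</plugin>

<!-- 2. Scope server-only dependencies so they are not dragged into a
     client's classpath transitively (example dependency is illustrative). -->
<dependency>
  <groupId>javax.servlet</groupId>
  <artifactId>servlet-api</artifactId>
  <scope>provided</scope>
</dependency>
```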


            People

            • Assignee: Luke Lu
            • Reporter: Owen O'Malley
            • Votes: 0
            • Watchers: 18