Hadoop Map/Reduce
  MAPREDUCE-3378

Create a single 'hadoop-mapreduce' Maven artifact

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.23.0
    • Fix Version/s: None
    • Component/s: build
    • Labels: None

      Description

      In 0.23.0 there are multiple artifacts (hadoop-mapreduce-client-app, hadoop-mapreduce-client-common, hadoop-mapreduce-client-core, etc). It would be simpler for users to declare a dependency on hadoop-mapreduce (much like there's hadoop-common and hadoop-hdfs). (This would also be a step towards MAPREDUCE-2600.)
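For illustration, such an aggregate module would let users replace the per-module declarations with a single dependency (artifact name as proposed by this issue; not yet published):

```xml
<!-- hypothetical single artifact in place of the hadoop-mapreduce-client-* modules -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapreduce</artifactId>
  <version>0.23.0</version>
</dependency>
```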

        Issue Links

          Activity

          Jeff Hammerbacher made changes -
          Link: This issue is related to HADOOP-8278
          Jeff Hammerbacher made changes -
          Link: This issue is related to HADOOP-8009
          Jeff Hammerbacher made changes -
          Link: This issue is related to MAPREDUCE-2600
          Tom White made changes -
          Status: Open → Resolved
          Resolution: Won't Fix
          Tom White added a comment -

          I've opened HADOOP-8278 to track item 1. HADOOP-8009 addressed item 2. So I'm closing this JIRA now.

          Tom White added a comment -

          Scott, thanks for the great feedback! To summarize, we should do the following (highest priority first):

          1. Make sure that the transitive dependencies are correct for the artifacts that we publish.
          2. Publish client API JARs and document how to use them.
          3. Possibly publish a 'fat jar' for all of Hadoop, but it should have a different classifier.

          Currently in 0.23 the situation for item 2 is that hadoop-mapreduce-client-core contains the API, hadoop-mapreduce-client-common contains the local job runner, and hadoop-mapreduce-client-jobclient contains the YARN client, but they all pull in too many dependencies (item 1). I've just noticed that hadoop-mapreduce-client-core depends on HDFS classes, which isn't right.

          Scott Carey added a comment -

          IMO the root issue is that we are not using dependencies correctly.

          Absolutely. Hadoop's dependency setup is atrocious in 0.20.205 and 0.22. I haven't looked at 0.23 in enough detail yet, but I would love for the situation to be fixed.

          I have a project that needs to read and write from HDFS. Declaring hadoop pulls in all of Jetty, the tomcat compiler, and a dozen other jars that I manually have to exclude.

          The above needs to be avoided for mapreduce.
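The manual exclusions Scott describes look roughly like this (a sketch against the 0.20-era hadoop-core artifact; the excluded group/artifact IDs are illustrative, not taken from a specific POM):

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>0.20.205.0</version>
  <exclusions>
    <!-- server-side baggage a pure HDFS client never needs -->
    <exclusion>
      <groupId>org.mortbay.jetty</groupId>
      <artifactId>jetty</artifactId>
    </exclusion>
    <exclusion>
      <groupId>tomcat</groupId>
      <artifactId>jasper-compiler</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```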

          Building larger jars that package dependencies in them is OK for some use cases, but absolutely worthless for any real application that has any chance of a dependency conflict. Things like Jetty should be marked as provided scope, not compile (or perhaps made optional).
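As a sketch of that suggestion, a framework-only dependency such as Jetty would be declared like this in Hadoop's own POMs (version illustrative):

```xml
<!-- provided: on Hadoop's own compile/test classpath, but not exported
     transitively to client projects; optional would have a similar effect -->
<dependency>
  <groupId>org.mortbay.jetty</groupId>
  <artifactId>jetty</artifactId>
  <version>6.1.26</version>
  <scope>provided</scope>
</dependency>
```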

          There should be a hadoop-client that allows me to code and run HDFS/MR client apps (with the exact set of transitive dependencies, ie you don't need jetty stuff there).

          :-D YES!
          IMO, we need an hdfs-api.jar and mapreduce-api.jar that pull in only what is needed to build an application that uses those APIs as a client. A user should be able to declare those in their project, and have only the transitive dependencies needed for those use cases pulled in, and nothing extra. One could even go to the extreme of having a mapred-api.jar and mapreduce-api.jar with the old and new apis separated (and a mapreduce-common-api.jar they both depend on) if that was a bigger use case. More modularization will be a great benefit to users, when combined with using dependencies properly in hadoop itself.

          The fact that under the hood these 'hadoop-client' & 'hadoop-test' component pull 1 or 100 hadoop JARs is irrelevant (although IMO I think we have too many JARs).

          Yes, if the artifacts are configured properly with the right dependencies in the correct scope (e.g. Jetty in provided scope, since only someone trying to run the framework needs it, not clients), then there is only one artifact to declare for each use. It is not the total number of jars that matters, it is the total size of jars. Finer-grained control of dependencies by users is a good thing. As a user I want to declare what I need as simply as possible ("I need to launch a mini-MR during a test, so I need hadoop-mr-test.jar"; "I need to submit a job to a cluster, so I need mr-client.jar"); what that means behind the scenes in the total jar count of transitive dependencies is a different issue entirely, as long as it pulls in only what is needed and not useless baggage (Jetty, Tomcat's compiler, etc.).

          There is no need to package 'fat jars' unless you wish to have a single artifact for uses where tooling does not build the classpath for you.

          Regarding my prev second bullet item, it seems via a classifier this is possible ( http://maven.apache.org/plugins/maven-shade-plugin/examples/attached-artifact.html ), still this is kind of uncommon for commonly used artifacts.

          I support using an attached artifact with a classifier for any jars containing dependencies. It is an anti-pattern to put a jar with dependencies into a maven repo as the primary artifact however (unless you move those dependencies into a private scope to avoid conflicts).
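The approach from the linked maven-shade-plugin example, attaching the jar-with-dependencies under a classifier so the plain JAR remains the primary artifact, looks like this (classifier name illustrative):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <!-- publish the shaded JAR as artifact-version-shaded.jar
             alongside, rather than replacing, the primary artifact -->
        <shadedArtifactAttached>true</shadedArtifactAttached>
        <shadedClassifierName>shaded</shadedClassifierName>
      </configuration>
    </execution>
  </executions>
</plugin>
```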

          Alejandro Abdelnur added a comment -

          Regarding my prev second bullet item, it seems via a classifier this is possible ( http://maven.apache.org/plugins/maven-shade-plugin/examples/attached-artifact.html ), still this is kind of uncommon for commonly used artifacts.

          Alejandro Abdelnur added a comment -

          About repackaging multiple JARs into a single one, I see the following issues with this:

          • Developers unknowingly adding (other versions) of the grouped JARs to the classpath.
          • AFAIK there is no POM for the aggregate JARs with all the correct dependencies.

          IMO the root issue is that we are not using dependencies correctly.

          There should be a hadoop-client that allows me to code and run HDFS/MR client apps (with the exact set of transitive dependencies, ie you don't need jetty stuff there).

          There should be a hadoop-test that allows me to run an HDFS/MR minicluster for integration testing.

          The fact that under the hood these 'hadoop-client' & 'hadoop-test' component pull 1 or 100 hadoop JARs is irrelevant (although IMO I think we have too many JARs).
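In Maven terms, the two aggregators Alejandro sketches would be ordinary modules that users consume like any other dependency, with the fan-out to the many internal Hadoop JARs hidden behind them (artifact names and versions hypothetical at this point):

```xml
<!-- coding and running HDFS/MR client apps -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>0.23.0</version>
</dependency>
<!-- minicluster-based integration testing -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-test</artifactId>
  <version>0.23.0</version>
  <scope>test</scope>
</dependency>
```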

          Vinod Kumar Vavilapalli added a comment -

          Also, because the targeted artifact is a hadoop-all jar, can this be moved to common? Thanks!

          Vinod Kumar Vavilapalli added a comment -

          IIUC, this is only for convenience. As long as the single artifact exists only to make things simpler for those who need everything, we should be fine. We still need the fine-grained artifacts for those who want to selectively include modules, like YARN + DistributedShell as an example app outside of mapreduce.

          Allen Wittenauer added a comment -

          So we broke them apart so that we can merge them all again?

          Tom White made changes -
          Attachment: MAPREDUCE-3378.patch
          Tom White added a comment -

          Here's a patch that does something slightly more general: it creates a hadoop-all JAR that's convenient for users to consume. E.g. I tested with the following Maven dependency in another project:

          <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-all</artifactId>
            <version>0.24.0-SNAPSHOT</version>
          </dependency>
          
          Alejandro Abdelnur added a comment -

          Another issue I'm facing with the multiple JARs is that the test JARs when included for testing do not pull test scope dependencies, thus I have to include all of them one by one.

          Maybe a solution would be to have a hadoop-mapreduce-test artifact that declares all the deps necessary to run test cases in compile scope.
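The underlying Maven behavior: a module's test JAR is consumed via the test-jar type, and Maven does not resolve the producing module's test-scope dependencies for it, which forces the one-by-one listing described above (the module shown is one of the existing 0.23 artifacts):

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
  <version>0.23.0</version>
  <!-- pulls in the tests JAR, but none of that module's test-scope deps -->
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
```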

          Tom White created issue -

            People

            • Assignee: Unassigned
            • Reporter: Tom White
            • Votes: 0
            • Watchers: 9

              Dates

              • Created:
              • Updated:
              • Resolved:
