Hadoop Common
  HADOOP-6629

versions of dependencies should be specified in a single place

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: build
    • Labels:
      None

      Description

      Currently the Maven POM file is generated from a template file that includes the versions of all the libraries we depend on. The versions of these libraries are also present in ivy/libraries.properties, so that, when a library is updated, it must be updated in two places, which is error-prone. We should instead only specify library versions in a single place.
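      For illustration, the duplication looks roughly like this (the property name and template line are a hypothetical sketch, not copied from the actual build files):

```properties
# ivy/libraries.properties -- one copy of the version number
log4j.version=1.2.15

# ...and the POM template hard-codes the same number a second time:
#   <version>1.2.15</version>
# Bumping log4j means remembering to edit both files in lock-step.
```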

      1. HADOOP-6629.patch
        25 kB
        Doug Cutting
      2. HADOOP-6629.patch
        21 kB
        Doug Cutting

        Issue Links

          Activity

          Doug Cutting added a comment -

          Here's a patch that generates the POM file from Ivy. It also moves a few dependencies from the "default" category to the "test" category so that they're made optional in the main jar and only required by the test jar.

          Doug Cutting added a comment -

          A problem with the approach in my patch is that it doesn't (yet) support some exclusions that are in the pom template.

          Doug Cutting added a comment -

          It looks like Ivy's makepom task doesn't copy exclusions from ivy.xml into the generated POM.

          So we might instead:

          1. Add excludes to the ivy.xml of projects that include this. Every project that directly depends on log4j already needs to do this, so it's not really a huge imposition, but it still seems non-optimal.
          2. Add a module named "hadoop-log4j" with a manually authored POM that excludes things correctly (requires an upgrade to Ivy 2.1.0 to get IVY-974). Then we can depend on that instead of log4j, which has broken dependencies.

          My instinct is to try (2). Thoughts?
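          For reference, option (1) would amount to something like the following ivy.xml fragment. This is only a sketch: the exact list of log4j 1.2.15's problematic transitive dependencies shown here (the JMX/JMS artifacts its POM references) is an assumption, and any conf mapping is omitted:

```xml
<dependency org="log4j" name="log4j" rev="1.2.15">
  <!-- log4j 1.2.15's POM references JMX/JMS artifacts that are not
       available in public repositories; exclude them explicitly -->
  <exclude org="com.sun.jdmk" module="jmxtools"/>
  <exclude org="com.sun.jmx" module="jmxri"/>
  <exclude org="javax.jms" module="jms"/>
</dependency>
```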

          Doug Cutting added a comment -

          Here's a version of the patch that avoids log4j's dependencies by making it an optional dependency of hadoop-common. The hadoop-mapreduce module already directly depends on log4j so this generates a pom that works with hadoop-mapreduce, i.e., one can use 'ant mvn-install -Dresolvers=local' and an ivy-created pom is published such that mapreduce can be compiled and tested.

          That said, I still don't get Ivy's configurations, and may have made a hash of them.

          Owen O'Malley added a comment -

          Actually, there is a third copy that must also be handled in .eclipse_templates/.classpath.

          Is there a mechanism to generate the eclipse .classpath too?

          Does it work with mvn-install and mvn-publish?

          Doug Cutting added a comment -

          > Actually, there is a third copy that must also be handled in .eclipse_templates/.classpath.

          That's addressed by HADOOP-6407, HDFS-1035 and MAPREDUCE-1592.

          > Does it work with mvn-install and mvn-publish?

          It does work with mvn-install. I've tested this patch with mapreduce consuming the generated pom. But I'd like someone who really understands Ivy categories (Giri?) to look this patch over, since I just hacked the categories to generate the right required dependencies in the pom without fully understanding how the categories are meant to work.

          It seems we're not totally consistent in the three projects about inheriting dependencies from common versus declaring them explicitly. We can have common's pom not require log4j, since hdfs and mapreduce already require log4j explicitly in their ivy.xml, and they both block log4j's bad dependencies (as does every other project in the world that uses log4j). That works, but it doesn't quite feel right. Maybe we're stuck with it until log4j fixes its dependencies?

          Alejandro Abdelnur added a comment -

          How about mavenizing Hadoop? Then all this would go away. Plus, tools like Eclipse, NetBeans and IntelliJ understand POM files as project files.

          Doug Cutting added a comment -

          > How about mavenizing Hadoop [ ... ]

          That would certainly make distributing jars to other java projects easier. But Hadoop is not used purely as a Java library; it also incorporates shell scripts, native code, etc. Perhaps someone familiar with Maven can get all of these to work correctly for Hadoop?

          Avro's moved to a model where the primary release artifact is a tarball of the source tree. Other derived, binary artifacts are also published: Java jars to maven, python eggs to pypi, ruby gems, C & C++ pristine tarballs, etc. For Hadoop, .deb and .rpm packages are useful artifacts, but I'm not sure they'd replace the desire to also publish Maven artifacts, nor am I convinced that Maven can build adequate .rpm or .deb packages.

          So I don't yet see that Maven will satisfy all of Hadoop's build needs, but I'd be happy to be proven wrong.

          Alejandro Abdelnur added a comment -

          Maven allows you to do all that, via project types and assemblies. Look for example at Glassfish; they use Maven.

          It would not be that difficult to mavenize Hadoop.

          The catch is that the directory structure of the source would have to change to use Maven layout.

          But there are many things you get out of Maven for free.

          IMO it is worth the move.

          steve_l added a comment -
          1. You can create pom files containing the right versions of dependencies by having a template POM that is then copied with property expansion. Yes, you may duplicate stuff in the ivy files, but it ensures the downstream dependencies are things you want people to get, not random transitive noise
          2. I'm clearly biased against Maven, so taking a step back here. One limitation I've found in the past is that its view of testing is/was fairly simplistic. No test-cases-bring-up-VM-clusters here. It may have moved on.
          3. There is nothing to stop the Hadoop scripts being published as a tar artifact on their own, into the repository. It's not common, but the apache repository charter allows them to do more than just JAR files.
          4. RPM and deb files could go in as something downstream from the main projects; take in the JARs, the script tarball, and the test and example JAR files and create a set of RPM/deb files for installation. Then you need a test process that test installs these into the target VMs, walks the scripts through their lifecycle (ssh can do this), etc, etc. If the Hadoop teams do think that RPM and deb files are the right way to install hadoop artifacts in production, and don't want to hand off creating these files to others (so complicating support calls), then it's worth doing. As usual, I have no knowledge of where Maven stands here; I use the <rpmbuild> task to create my RPMs, then <scp> and <ssh> to test.
          Paolo Castagna added a comment -

          The slf4j-log4j12 v1.4.3 POM (http://repo2.maven.org/maven2/org/slf4j/slf4j-log4j12/1.4.3/slf4j-log4j12-1.4.3.pom) does not specify a specific version of Log4J.
          The POM file for Log4J v1.2.14 (http://repo2.maven.org/maven2/log4j/log4j/1.2.14/log4j-1.2.14.pom) does not have the unnecessary/broken dependencies that Log4J v1.2.15 has.
          So, an alternative to solve the issues related to Log4J dependencies could be to downgrade Log4J from v1.2.15 to v1.2.14... until Log4J fixes its dependencies.

          Or, perhaps this isn't necessary, since now the Log4J dependency is correctly set to <optional>true</optional>:

          <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.15</version>
            <optional>true</optional>
          </dependency>

          As well as:

          <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.4.3</version>
            <optional>true</optional>
          </dependency>

          Also, if Hadoop Core has been renamed Hadoop Common, shouldn't the artifacts be consistent and named hadoop-common-{version}[-sources,-javadocs].jar?

          Paolo Castagna added a comment -
            <target name="mvn-test-pom">
              <ivy:makepom ivyfile="${basedir}/ivy.xml" pomfile="${hadoop-core-test.pom}">
                <mapping conf="test" scope="compile"/>
              </ivy:makepom>
            </target>
          

          Shouldn't it be scope="test" instead?
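          If so, the corrected target would read as follows (same target as above, with only the scope mapping changed):

```xml
<target name="mvn-test-pom">
  <ivy:makepom ivyfile="${basedir}/ivy.xml" pomfile="${hadoop-core-test.pom}">
    <!-- map Ivy's "test" configuration to Maven's "test" scope,
         so test-only dependencies don't leak into compile scope -->
    <mapping conf="test" scope="test"/>
  </ivy:makepom>
</target>
```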

          Paolo Castagna added a comment -

          There is another place where dependencies and version numbers are "duplicated": .eclipse.templates/.classpath, which is used to "generate" the Eclipse .classpath.

          One possibility is to do:

            <target name="eclipse-files" depends="init"
                    description="Generate files for Eclipse">
              <pathconvert property="eclipse.project">
                <path path="${basedir}"/>
                <regexpmapper from="^.*/([^/]+)$$" to="\1" handledirsep="yes"/>
              </pathconvert>
              <copy todir="." overwrite="true">
                <fileset dir=".eclipse.templates">
                	<exclude name="**/README.txt"/>
                </fileset>
                <filterset>
                  <filter token="PROJECT" value="${eclipse.project}"/>
                </filterset>
                <filterset begintoken="{" endtoken="}">
                	<filtersfile file="${ivy.dir}/libraries.properties"/>
                </filterset>
              </copy>
            </target>
          

          ... and have in the .classpath template:

              [...]
          	<classpathentry kind="lib" path="build/ivy/lib/Hadoop-Core/common/avro-{avro.version}.jar"/>
          	<classpathentry kind="lib" path="build/ivy/lib/Hadoop-Core/common/commons-cli-{commons-cli.version}.jar"/>
          	<classpathentry kind="lib" path="build/ivy/lib/Hadoop-Core/common/commons-codec-{commons-codec.version}.jar"/>
          	<classpathentry kind="lib" path="build/ivy/lib/Hadoop-Core/common/commons-el-{commons-el.version}.jar"/>
              [...]
          

          I had a few problems with JUnit, Jetty, JSP APIs and Slf4j dependencies.

          It seems to me that the libraries.properties file, as it is right now, is killing the benefit of having a dependency engine such as Ivy to transitively resolve dependencies.

          I have not had time to replicate Avro's approach to generating Eclipse files, but IMHO it's better.

          By the way, with Maven, once you have your pom.xml file, to generate Eclipse files you run: mvn eclipse:eclipse -DdownloadSources=true.

          Tom White added a comment -

          > I have not had time to replicate Avro's approach to generating Eclipse files, but IMHO it's better.

          See HADOOP-6407, which follows Avro's way of doing it.

          Paolo Castagna added a comment -

          Thanks Tom. It's better now.
          The same can be done for HDFS-1063 and MAPREDUCE-1619.

          One last comment on this...

          > That said, I still don't get Ivy's configurations, and may have made a hash of them.

          I find Ivy's configurations confusing as well, and I do not understand the benefit of having so many. Steve?

          Also, I am not sure how to map them to Maven's scopes.
          Should we map the "test" configuration to the "test" scope, instead of "compile"?

          <ivy:makepom ivyfile="${basedir}/ivy.xml" pomfile="${hadoop-core-test.pom}">
             <mapping conf="test" scope="compile"/>
          </ivy:makepom>
          

            People

            • Assignee: Doug Cutting
            • Reporter: Doug Cutting
            • Votes: 1
            • Watchers: 6
