Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.5.0
    • Component/s: java
    • Labels: None

      Description

      MapReduce should not depend on any jars (eg. avro's main jar) that also depend on the MapReduce jar.

      1. avro-545.patch
        2 kB
        Owen O'Malley

        Issue Links

          Activity

          Scott Carey added a comment -

          Fixed as part of AVRO-647

          Doug Cutting added a comment -

          > If you are really going to ignore my patch...

          I did not ignore your patch. I reviewed it. It is incomplete. Folks who use the MR API in Avro 1.4.0, retrieving it from Maven central, need to have an upgrade path that lets them still get the code from Maven central. A new jar needs a new POM, javadoc, source, plus updates to Ant targets that publish artifacts, etc. I would also vastly prefer that we make such an incompatible change once, rather than piecemeal, providing a complete fix for AVRO-647.

          Owen O'Malley added a comment -

          This is a regression. The 1.4.0 pom does not accurately reflect the dependencies. That is broken. Users should be able to depend on your pom and automatically get the transitive dependencies. Clearly, if you weren't breaking the rules by introducing a cycle in the dependency graph you would include the Hadoop jars in your pom.

          I also noticed that your avro-tool jar includes everything from the build classes, which includes everything from the main avro jar. I assume that is unintentional.

          If you are really going to ignore my patch and not fix this until the end of the year, I'll need to fork Avro back into Hadoop, and I'd really rather avoid that. I already have more than enough work cleaning up Hadoop.

          Doug Cutting added a comment -

          > if Hadoop changes a method, it can require a new release of avro to use the two together

          That seems true regardless of whether Avro's jars are split. If Hadoop changes a method, Hadoop could continue to use its existing version of Avro, since Hadoop itself does not call the Avro methods that might call that changed method. It would break users of Avro's MR API, but they'd be broken anyway until Avro's MR code is updated. Also, the APIs that Avro uses are APIs that Hadoop has promised not to change, and Avro will not consider updating to newer Hadoop APIs until 0.22 has mostly replaced 0.20 in production, i.e., not for a while.

          I don't see that upgrading Hadoop to Avro 1.4.0 would create a regression for users of Hadoop. Without a regression, I don't see how this should be a blocking issue.

          The argument for breaking up jars that Scott made in AVRO-647 is that it would permit our POMs to more precisely specify dependencies, so that, e.g., folks who don't use RPC don't pull in Jetty, which sounds like a useful improvement. So I'd be okay committing a patch that creates POMs for new jars, etc. But I'd rather see this done once, as a complete fix for AVRO-647 rather than piecemeal here, so that we only force Avro users to update their dependencies once. I also believe that such an incompatible change would be more appropriate for Avro 1.5.0 than 1.4.1. By the time that Hadoop removes its 0.20 APIs we'll long since have split Avro into multiple jars.

          So my proposal is that we remove this as a blocker for Avro 1.4.1 but add AVRO-647 to the plan for Avro 1.5.0, which should be out before the end of the year.

          Scott Carey added a comment -

          Related to: https://issues.apache.org/jira/browse/AVRO-647

          None of the bits of Avro that Hadoop will use depend on Hadoop.
          None of the bits of Avro that use Hadoop will be called by Hadoop.
          If this was not so, then moving it out of the jar would not be a possible solution to the problem.

          There is no circular dependency between Hadoop itself and Avro unless Hadoop decides to use classes in o.a.a.mapred, or unless a user calls those classes in a Task. But in that case the inclusion might actually be desired!

          Because of the way that Hadoop works, putting all of its dependencies at the front of the classpath for a Task, no user will be able to run a newer version of Avro than what is in Hadoop. With the mapred package broken out, at least a user might have a chance of using a different version of that, provided it was compatible with the 'avro-core' version Hadoop was using; but the safe bet would be to force the exact same version and bundle it with Hadoop.

          So at first I thought this was important to break out for Hadoop's sake, but now I don't. It's important for Avro's sake, for users and applications that don't use Hadoop.

          It might be the other way around. Using Avro in Hadoop is blocked by Hadoop sorting out its classpath issues since it currently forces all user Tasks to run with its dependencies (there is no separate classloader for Tasks, for instance).
          There is a Hadoop ticket for that, I can't seem to locate it right now.

          Avro does not lie about its dependency on Hadoop. There is never a time that an Avro user needs Hadoop. Although "provided" scope might be more appropriate than "optional", it's functionally the same in this case.
          The only way that a user can execute any avro.mapred code is to run from inside Hadoop, where the Hadoop jars and dependencies come from Hadoop and not from any packaging the user may attempt. Specifying the dependency as a runtime dependency would be a lie: the execution context (Hadoop) is expected to provide it.
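A minimal sketch of the "provided" alternative Scott mentions, reusing the hadoop-core coordinates from the snippet quoted later in this issue. With `provided` scope the dependency is on the compile classpath but, like `optional`, is not inherited transitively and is not bundled into packaged artifacts, since the runtime environment (Hadoop) is expected to supply it:

```xml
<!-- Hypothetical sketch: "provided" scope instead of "optional".
     Coordinates mirror the dependency snippet quoted in this issue;
     the execution context (Hadoop) supplies the jar at runtime. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>0.20.2</version>
  <scope>provided</scope>
</dependency>
```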

          Owen O'Malley added a comment -

          The problem is the standard one that jars shouldn't recursively depend on each other. It leads to lots of issues. In particular, if Hadoop changes a method, it can require a new release of avro to use the two together. This is a classic anti-pattern for software. You never want projects to recursively depend on each other.

          The fact that your pom file lies about its dependencies makes this obvious. If you made the correct pom file, it would be impossible to release either Avro or Hadoop except simultaneously.

          Doug Cutting added a comment -

          A new jar should be a separately publishable maven artifact. So ivy.xml needs to be updated to add this as a separate configuration, a pom and source jar need to be generated, the mvn-install and mvn-deploy targets need to be updated, etc.
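The Ivy side of the build changes Doug lists could look roughly like this. The configuration and artifact names (`mapred`, `avro-mapred`) are illustrative placeholders, not taken from the actual build files:

```xml
<!-- Hypothetical ivy.xml fragment: a new configuration for the
     split-out artifact and a matching publication entry, so the jar
     can be resolved and published independently of avro's main jar. -->
<configurations>
  <conf name="mapred" description="Avro MapReduce API"/>
</configurations>
<publications>
  <artifact name="avro-mapred" type="jar" conf="mapred"/>
</publications>
```

A POM and source jar for the new artifact, plus updates to the mvn-install and mvn-deploy targets, would still be needed on top of this.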

          I also don't yet see the functional problem that Avro's jar referring to Hadoop classes creates for Hadoop. Can you please describe the problem scenario in more detail? Thanks!

          Owen O'Malley added a comment -

          This blocks Avro 1.4.1's adoption by Hadoop.

          Owen O'Malley added a comment -

          This patch moves the mapred classes to a different jar.

          Owen O'Malley added a comment -

          Note that I'm just asking that the Avro jar that Hadoop depends on should not depend on any of the Hadoop jars. Moving the classes to the tools jar or some other jar would be fine.

          Owen O'Malley added a comment -

          I understand the mechanics of how it has been accomplished. It is bad system design to make circular dependencies between projects. I consider this a blocker for Avro 1.4 being used by Hadoop Common, HDFS, or MapReduce.

          Scott Carey added a comment -

          The mapreduce dependency in Avro is specified as "optional", which means that consumers of Avro do not automatically pull it in as a dependency. In other words, it's not a transitive dependency.

          We can move it to another jar "avro-mapred.jar", but there is only a KB or two difference in jar size at the moment.

              <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-core</artifactId>
                <version>0.20.2</version>
                <optional>true</optional>
              </dependency>
          

          In short, if Hadoop 0.22 depends on Avro 1.4, it does not imply that Hadoop 0.22 depends on Hadoop 0.20.2 due to the above "optional" flag.

          http://maven.apache.org/guides/introduction/introduction-to-optional-and-excludes-dependencies.html

          From the maven site:

          How do optional dependencies work?

          Project-A -> Project-B

          The diagram above says that Project-A depends on Project-B. When A declares B as an optional dependency in its POM, this relationship remains unchanged. It's just like a normal build where Project-B will be added to its classpath.

          Project-X -> Project-A

          But when another project (Project-X) declares Project-A as a dependency in its POM, the optional dependency takes effect. You'll notice that Project-B is not included in the classpath of Project-X; you will need to declare it directly in your POM in order for B to be included in X's classpath.
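Concretely, the redeclaration the Maven guide describes would look something like this in Project-X's POM (the coordinates here are placeholders for illustration):

```xml
<!-- Hypothetical Project-X POM fragment: because B is optional in A,
     X must declare B itself to get it onto its classpath. -->
<dependencies>
  <dependency>
    <groupId>example</groupId>
    <artifactId>project-a</artifactId>
    <version>1.0</version>
  </dependency>
  <dependency>
    <!-- must be redeclared; not inherited transitively from A -->
    <groupId>example</groupId>
    <artifactId>project-b</artifactId>
    <version>1.0</version>
  </dependency>
</dependencies>
```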


            People

            • Assignee:
              Scott Carey
              Reporter:
              Owen O'Malley
            • Votes: 1
            • Watchers: 2
