AVRO-1170: Avro's new mapreduce APIs don't work with Hadoop 2

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7.1
    • Fix Version/s: 1.7.3
    • Component/s: java
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Avro does not compile against Hadoop 2, since some classes were changed to interfaces between Hadoop 1 and 2 (e.g. TaskAttemptContext).

      Attachments

      1. AVRO-1170.patch (14 kB) - Tom White
      2. AVRO-1170.patch (17 kB) - Tom White
      3. AVRO-1170.patch (17 kB) - Tom White
      4. AVRO-1170.patch (12 kB) - Tom White
      5. AVRO-1170.patch (4 kB) - Tom White

        Activity

        Tom White added a comment -

        Here are the compilation failures:

        [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.3.2:compile (default-compile) on project avro-mapred: Compilation failure: Compilation failure:
        [ERROR] /Users/tom/workspace/avro-trunk/lang/java/mapred/src/main/java/org/apache/hadoop/io/SequenceFileBase.java:[44,6] cannot find symbol
        [ERROR] symbol  : constructor BlockCompressWriter(org.apache.hadoop.fs.FileSystem,org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path,java.lang.Class,java.lang.Class,int,short,long,org.apache.hadoop.io.compress.CompressionCodec,org.apache.hadoop.util.Progressable,org.apache.hadoop.io.SequenceFile.Metadata)
        [ERROR] location: class org.apache.hadoop.io.SequenceFile.BlockCompressWriter
        [ERROR] /Users/tom/workspace/avro-trunk/lang/java/mapred/src/main/java/org/apache/hadoop/io/SequenceFileBase.java:[58,6] cannot find symbol
        [ERROR] symbol  : constructor RecordCompressWriter(org.apache.hadoop.fs.FileSystem,org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path,java.lang.Class,java.lang.Class,int,short,long,org.apache.hadoop.io.compress.CompressionCodec,org.apache.hadoop.util.Progressable,org.apache.hadoop.io.SequenceFile.Metadata)
        [ERROR] location: class org.apache.hadoop.io.SequenceFile.RecordCompressWriter
        [ERROR] /Users/tom/workspace/avro-trunk/lang/java/mapred/src/main/java/org/apache/avro/mapreduce/AvroMultipleOutputs.java:[425,37] org.apache.hadoop.mapreduce.TaskAttemptContext is abstract; cannot be instantiated
        [ERROR] /Users/tom/workspace/avro-trunk/lang/java/mapred/src/main/java/org/apache/avro/mapreduce/AvroMultipleOutputs.java:[498,18] org.apache.hadoop.mapreduce.TaskAttemptContext is abstract; cannot be instantiated
        

        The first two are because the constructors for SequenceFile.BlockCompressWriter and SequenceFile.RecordCompressWriter have changed between Hadoop 1 and 2. I'll file a Hadoop JIRA for this.

        The second two are because TaskAttemptContext changed from a class in Hadoop 1 to an interface in Hadoop 2. This can be solved via reflection and separate Maven artifacts for the mapred JAR. The same problem was fixed in MRUnit; see MRUNIT-31 and MRUNIT-56 for background.
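
        A minimal sketch of the reflection approach (illustrative only: the factory class and its names here are assumptions, not Avro's actual code):

        import java.lang.reflect.Constructor;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.mapreduce.TaskAttemptContext;
        import org.apache.hadoop.mapreduce.TaskAttemptID;

        public final class TaskAttemptContextFactory {
          private TaskAttemptContextFactory() {}

          // Instantiate whichever TaskAttemptContext implementation the Hadoop
          // version on the classpath provides.
          public static TaskAttemptContext create(Configuration conf, TaskAttemptID id)
              throws Exception {
            Class<?> impl;
            try {
              // Hadoop 2: TaskAttemptContext is an interface; the concrete
              // class lives in the task subpackage.
              impl = Class.forName("org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl");
            } catch (ClassNotFoundException e) {
              // Hadoop 1: TaskAttemptContext is itself a concrete class.
              impl = Class.forName("org.apache.hadoop.mapreduce.TaskAttemptContext");
            }
            Constructor<?> ctor = impl.getConstructor(Configuration.class, TaskAttemptID.class);
            return (TaskAttemptContext) ctor.newInstance(conf, id);
          }
        }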

        Tom White added a comment -

        I filed HADOOP-8825 for the SequenceFile changes.

        Here's a patch that uses patched Hadoop JARs with the HADOOP-8825 fix. Avro now compiles against Hadoop 2; however, more work is needed to produce separate Maven artifacts (probably via profiles).

        Doug Cutting added a comment -

        For the createTaskAttemptContext stuff, we should add a comment explaining its purpose: working around incompatible API changes.

        For the pom.xml changes, note that the parent pom.xml sets the default Hadoop version for Avro to 0.20.205.0, so the change here applies only to the mapred module, not to the other modules that depend on Hadoop (tools and trevni). Perhaps that's intended, so that we end up testing against both versions of Hadoop. If so, we should add a comment to that effect. If it's not intended, then the version should probably be set in the parent pom.xml. That said, we should not add a dependency on a SNAPSHOT pom, so we'll probably not commit this until there's a Hadoop release that contains the required changes or we figure out another way to fix this.

        Tom White added a comment -

        Here's an updated patch that does away with SequenceFileBase and requires no changes to SequenceFile in Hadoop 2.

        I've also added a comment to createTaskAttemptContext.

        For the pom.xml version numbers - I'll fix that to use a single Hadoop version when I do the work to build dual mapred artifacts.

        Doug Cutting added a comment -

        Nice work, Tom!

        To be clear, the current patch could be committed without the changes to the pom and everything would still work with Hadoop 1, right? Would Hadoop 2 require recompilation?

        Tom White added a comment -

        To be clear, the current patch could be committed without the changes to the pom and everything would still work with Hadoop 1, right?

        Yes. I haven't tested that yet, but that's the idea.

        Would Hadoop 2 require recompilation?

        Unfortunately it would. I found this out in MRUNIT-56 - the bytecode for method invocation is different for classes and interfaces, so a separate JAR is needed for each version.
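
        To illustrate with a hypothetical snippet (not from the patch): the same source line compiles to different JVM opcodes depending on which Hadoop version is on the compile classpath.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.mapreduce.TaskAttemptContext;

        public class InvokeExample {
          // Against Hadoop 1, where TaskAttemptContext is a class, javac emits
          // invokevirtual for this call; against Hadoop 2, where it is an
          // interface, javac emits invokeinterface. Running bytecode compiled
          // against one version on the other throws IncompatibleClassChangeError.
          static Configuration confOf(TaskAttemptContext context) {
            return context.getConfiguration();
          }
        }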

        Doug Cutting added a comment -

        If we build multiple jars then we might also run the tests twice, once with each version.

        Tom White added a comment -

        Here's a new patch with the Maven changes.

        For testing, you can run mvn test -Dhadoop.version=2 to test the mapred module with Hadoop 2. If you don't specify the hadoop.version property, it defaults to 1, matching the current behaviour. The other modules (i.e. not mapred) all build against Hadoop 1, since they use APIs that are binary compatible.
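
        For example:

        mvn test                     # defaults to Hadoop 1, as before
        mvn test -Dhadoop.version=2  # tests the mapred module against Hadoop 2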

        For building, the idea is to create a mapred JAR with a hadoop1 or hadoop2 classifier. We also create a Hadoop 1 artifact with no classifier, which is the default (for backwards compatibility). For this to work we build against Hadoop 2 first, then Hadoop 1, so that the JAR with no classifier is the last one built (Hadoop 1). I've changed the top-level build script to implement this.
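
        A sketch of that ordering (hypothetical: the lang/java module path is inferred from the file paths above, and the actual build.sh may differ):

        # Build the hadoop2-classified JAR first...
        mvn install -Dhadoop.version=2 -f lang/java/pom.xml
        # ...then the default build, so the unclassified JAR is the Hadoop 1 one.
        mvn install -f lang/java/pom.xml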

        For deployment, the instructions at https://cwiki.apache.org/confluence/display/AVRO/How+To+Release would need to change to deploy the JARs with classifiers. Locally deploying twice worked, although I'm not sure if this would work with a repository manager like Nexus:

        mvn deploy -DskipTests=true -Dhadoop.version=2 -DaltDeploymentRepository=mine::default::file:///tmp/myrepo
        mvn deploy -DskipTests=true -DaltDeploymentRepository=mine::default::file:///tmp/myrepo
        

        For consumers of the Maven artifacts, if you didn't specify a classifier in your dependency section then it would use Hadoop 1, as before:

        <dependency>
          <groupId>org.apache.avro</groupId>
          <artifactId>avro-mapred</artifactId>
          <version>1.7.2</version>
        </dependency>
        

        To use Hadoop 2, you would specify a hadoop2 classifier:

        <dependency>
          <groupId>org.apache.avro</groupId>
          <artifactId>avro-mapred</artifactId>
          <version>1.7.2</version>
          <classifier>hadoop2</classifier>
        </dependency>
        
        Doug Cutting added a comment -

        When I apply this patch and run 'mvn test -Dtest=TestAvroMultipleOutputs', it fails with "java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.task.TaskAttemptContext". When I add '-Dhadoop.version=2', the tests pass. When I change this to '-Dhadoop.version=1', compilation fails.

        Tom White added a comment -

        There was a mistake in the package name, which I've now fixed (org.apache.hadoop.mapreduce.TaskAttemptContext). I've also added a profile that matches when you specify '-Dhadoop.version=1': in the previous patch this value matched neither the hadoop1 nor the hadoop2 profile, which caused a compilation error.
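
        For reference, a sketch of why neither profile matched in the previous patch (hypothetical activation blocks, not the actual POM):

        <!-- hadoop1 activated only when the property was absent... -->
        <profile>
          <id>hadoop1</id>
          <activation>
            <property><name>!hadoop.version</name></property>
          </activation>
        </profile>
        <!-- ...and hadoop2 only when it was exactly 2, so -Dhadoop.version=1
             matched neither profile. -->
        <profile>
          <id>hadoop2</id>
          <activation>
            <property><name>hadoop.version</name><value>2</value></property>
          </activation>
        </profile>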

        You need to do a 'mvn clean' before running the tests for a different Hadoop version, otherwise the classes are not recompiled against the new Hadoop classes and you get an IncompatibleClassChangeError. Thus:

        mvn clean test -Dtest=TestAvroMultipleOutputs -Dhadoop.version=1
        mvn clean test -Dtest=TestAvroMultipleOutputs -Dhadoop.version=2
        
        Doug Cutting added a comment -

        Tests now pass for me with hadoop.version unspecified or with it set to 1 or 2.

        It's unfortunate that we need three profiles. I now see that your previous patch was intended to work only with the version either unspecified or set to 2. That might be preferable, as it removes redundancies in the pom that might later cause problems. Regardless, we should probably add a comment describing the possible values.

        We should update the top-level build.sh so that it builds jars with both classifiers. Also the release instructions will need updating so that we publish both types to Maven.

        Where should we document these classifiers for client applications?

        Thanks for all your work on this, Tom!

        Tom White added a comment -

        Here's a new patch with just two profiles. I've added a comment to the top-level Java POM explaining the usage.

        We should update the top-level build.sh so that it builds jars with both classifiers. Also the release instructions will need updating so that we publish both types to Maven.

        I updated the top-level build.sh to build both JARs. I'm not sure exactly how the release instructions change (see earlier comment), but it should be possible to figure that out at release time. We might need to put a note in them as a reminder.

        Where should we document these classifiers for client applications?

        How about adding a question to the FAQ? It would look a bit like the last part of https://issues.apache.org/jira/browse/AVRO-1170?focusedCommentId=13459646&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13459646. I can add that once a release is made.

        Doug Cutting added a comment -

        +1 The latest patch looks great. Thanks, Tom.

        Tom White added a comment -

        I just committed this. Thanks for the reviews, Doug.

        (BTW the last patch didn't include the removal of lang/java/mapred/src/main/java/org/apache/hadoop/io/SequenceFileBase.java, so I did that manually when I committed.)

        Josh Spiegel added a comment -

        I downloaded both jars here:
        wget http://apache.claz.org/avro/avro-1.7.3/java/avro-mapred-1.7.3-hadoop1.jar
        wget http://apache.claz.org/avro/avro-1.7.3/java/avro-mapred-1.7.3-hadoop2.jar

        Using either JAR with cdh3u3 (Hadoop 0.20.2) I get this error:
        java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
        at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)

        So I think both JARs are incompatible with anything before 0.23.0-mr1-cdh4b1, which appears to be when TaskAttemptContext became an interface.

        Tom White added a comment -

        I decompiled AvroKeyValueOutputFormat from both JARs with javap and indeed they both have the interface form of TaskAttemptContext:

        % javap -c -classpath . org/apache/avro/mapreduce/AvroKeyValueOutputFormat
        ...
           16:	invokeinterface	#5,  1; //InterfaceMethod org/apache/hadoop/mapreduce/TaskAttemptContext.getOutputKeyClass:()Ljava/lang/Class;
        ...
        

        For the Hadoop 1 JAR it should be invokevirtual. I opened AVRO-1230 to fix this. Thanks for the report Josh.

        Josh Spiegel added a comment -

        Thanks Tom.


          People

          • Assignee: Tom White
          • Reporter: Tom White
          • Votes: 0
          • Watchers: 3
