Avro / AVRO-1439

MultipleInputs equivalent for Avro MR

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.7.6
    • Fix Version/s: 1.7.7
    • Component/s: java
    • Labels:
      None

      Description

      We have MultipleOutputs-like functionality for Avro today, but lack a MultipleInputs equivalent, which would make pure-MR joins possible with Specific/Reflect Avro MR.
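To illustrate the kind of join the description has in mind, here is a minimal plain-Java sketch of a reduce-side join over two differently shaped inputs. It does not use Avro or Hadoop; the `Name` and `Balance` types, the `N:`/`B:` tags, and all method names are hypothetical stand-ins chosen for illustration, with each input getting its own map function the way a MultipleInputs-style API would assign one mapper per path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class JoinSketch {
    // Hypothetical record types standing in for two different Avro schemas.
    static final class Name {
        final int id;
        final String name;
        Name(int id, String name) { this.id = id; this.name = name; }
    }

    static final class Balance {
        final int id;
        final long balance;
        Balance(int id, long balance) { this.id = id; this.balance = balance; }
    }

    // Map side for the first input: emit (id, tagged name).
    static void mapNames(List<Name> input, Map<Integer, List<String>> shuffle) {
        for (Name n : input) {
            shuffle.computeIfAbsent(n.id, k -> new ArrayList<>()).add("N:" + n.name);
        }
    }

    // Map side for the second input: emit (id, tagged balance).
    static void mapBalances(List<Balance> input, Map<Integer, List<String>> shuffle) {
        for (Balance b : input) {
            shuffle.computeIfAbsent(b.id, k -> new ArrayList<>()).add("B:" + b.balance);
        }
    }

    // Reduce side: for each key, pair up the values from both inputs (inner join).
    static List<String> reduce(Map<Integer, List<String>> shuffle) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> e : shuffle.entrySet()) {
            String name = null;
            String balance = null;
            for (String v : e.getValue()) {
                if (v.startsWith("N:")) {
                    name = v.substring(2);
                } else if (v.startsWith("B:")) {
                    balance = v.substring(2);
                }
            }
            if (name != null && balance != null) {
                out.add(e.getKey() + "," + name + "," + balance);
            }
        }
        return out;
    }

    // Runs both map functions into one sorted "shuffle", then reduces.
    static List<String> join(List<Name> names, List<Balance> balances) {
        Map<Integer, List<String>> shuffle = new TreeMap<>();
        mapNames(names, shuffle);
        mapBalances(balances, shuffle);
        return reduce(shuffle);
    }

    public static void main(String[] args) {
        List<Name> names = Arrays.asList(new Name(1, "apple"), new Name(2, "banana"));
        List<Balance> balances = Arrays.asList(new Balance(2, 50L), new Balance(3, 10L));
        System.out.println(join(names, balances)); // prints [2,banana,50]
    }
}
```

Only key 2 appears in both inputs, so it is the only joined row. In a real job the tagging and grouping would be done by the framework's shuffle rather than an in-memory map.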

      1. AVRO-1439.patch
        33 kB
        Harsh J
      2. AVRO-1439.patch
        35 kB
        Harsh J

          Activity

          Hudson added a comment -

          SUCCESS: Integrated in AvroJava #439 (See https://builds.apache.org/job/AvroJava/439/)
          AVRO-1439. Fix to work when -Dhadoop.version=2.
          Hadoop2 depends on Commons Codec 1.3, while Hadoop1 depends on 1.4. (cutting: rev 1564903)

          • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/AvroMultipleInputs.java
          ASF subversion and git services added a comment -

          Commit 1564903 from Doug Cutting in branch 'avro/trunk'
          [ https://svn.apache.org/r1564903 ]

          AVRO-1439. Fix to work when -Dhadoop.version=2.
          Hadoop2 depends on Commons Codec 1.3, while Hadoop1 depends on 1.4.

          Doug Cutting added a comment -

          The cause of this failure may be that Jenkins now tests against Hadoop 2, but I only tested against Hadoop 1 before committing.

          Hudson added a comment -

          FAILURE: Integrated in AvroJava #438 (See https://builds.apache.org/job/AvroJava/438/)
          AVRO-1439. Java: Add AvroMultipleInputs for mapred. Contributed by Harsh J. (cutting: rev 1564562)

          • /avro/trunk/CHANGES.txt
          • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/AvroMultipleInputs.java
          • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/DelegatingInputFormat.java
          • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/DelegatingMapper.java
          • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/HadoopMapper.java
          • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/MapCollector.java
          • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/TaggedInputSplit.java
          • /avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapred/TestAvroMultipleInputs.java
          Doug Cutting added a comment -

          I committed this. Thanks, Harsh!

          ASF subversion and git services added a comment -

          Commit 1564562 from Doug Cutting in branch 'avro/trunk'
          [ https://svn.apache.org/r1564562 ]

          AVRO-1439. Java: Add AvroMultipleInputs for mapred. Contributed by Harsh J.

          Harsh J added a comment -

          Thanks for the comments! I've addressed most of them in this patch.

          We should be able to compatibly support input format configuration via a new addInputPath call in the future if there's a need for it.

          I believe GenericData too may work if it works with AvroMapper, on which this is currently dependent.

          Doug Cutting added a comment -

          Harsh, this looks great and is a wonderful addition. Some quick comments:

          • 'schemata' is a more common plural than 'schemae', but we've generally stuck to simply 'schemas' in Avro
          • Some public stuff is missing javadoc. We should probably also expand the AvroMultipleInputs javadoc to include a pseudo-code example, like AvroMultipleOutputs does.
          • I'm okay with the limitations (specific/reflect, no inputformats, etc.) as long as we are convinced that we can compatibly remove them later.
          • We might commit the mapred version first and add the mapreduce version in a subsequent issue.
          Harsh J added a comment -

          Here is a functional patch for the mapred (Old) APIs with a reflect-based test case that illustrates a sample join operation.

          I've not yet delved into the mapreduce (New) APIs, but it would be implemented in nearly the same way.

          Any comments on the approach before I begin work on the mapreduce equivalent?

          Here are some implementation points:

          • Only works for Specific- and Reflect-based MR that use the mapred.AvroInputFormat and mapred.AvroMapper/mapred.AvroReducer classes.
            • Only the schema and mapper class can be configured per path.
            • No input format class flexibility, unlike its Apache Hadoop equivalent.
          • Passing a schema when adding an input path is mandatory.
          • Passing a mapper class when adding an input path is also mandatory.
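The constraints listed above (a schema and a mapper per path, both mandatory, no per-path input format) can be modelled with a small plain-Java sketch. This is not the code from the patch and does not use Avro or Hadoop types: `PerPathInputsSketch`, `TaggedSplit`, and the string-based mapper and schema arguments are hypothetical simplifications, loosely inspired by the TaggedInputSplit and DelegatingMapper class names in the commit.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class PerPathInputsSketch {
    /** Hypothetical tag carried with each input: which mapper and schema apply to it. */
    static final class TaggedSplit {
        final String path;
        final String mapperName;
        final String schemaJson;

        TaggedSplit(String path, String mapperName, String schemaJson) {
            this.path = path;
            this.mapperName = mapperName;
            this.schemaJson = schemaJson;
        }
    }

    // Insertion-ordered, so splits come back in registration order.
    private final Map<String, TaggedSplit> byPath = new LinkedHashMap<>();

    /** Registers an input path; the mapper and the schema are both required. */
    public void addInputPath(String path, String mapperName, String schemaJson) {
        Objects.requireNonNull(path, "path");
        Objects.requireNonNull(mapperName, "a mapper is mandatory for every path");
        Objects.requireNonNull(schemaJson, "a schema is mandatory for every path");
        byPath.put(path, new TaggedSplit(path, mapperName, schemaJson));
    }

    /** Returns one tagged split per registered path (real splits are per file block). */
    public List<TaggedSplit> splits() {
        return new ArrayList<>(byPath.values());
    }

    public static void main(String[] args) {
        PerPathInputsSketch inputs = new PerPathInputsSketch();
        inputs.addInputPath("/data/names", "NamesMapper", "\"string\"");
        inputs.addInputPath("/data/balances", "BalancesMapper", "\"long\"");
        // A delegating mapper would look at the tag to decide which mapper to run.
        for (TaggedSplit s : inputs.splits()) {
            System.out.println(s.path + " -> " + s.mapperName);
        }
    }
}
```

Because the tag travels with the split, a single delegating mapper can pick the right mapper and schema per record source, which is why only those two settings need to be configurable per path in this model.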

            People

            • Assignee:
              Harsh J
              Reporter:
              Harsh J
            • Votes:
              0
              Watchers:
              4