AVRO-1052

Support MultipleOutputFormat in Avro

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.1
    • Fix Version/s: 1.7.0
    • Component/s: java
    • Environment:

      Avro 1.6.1, Hadoop 0.20.205.0.3

    • Release Note:
      Adding a new feature to support writing output into multiple files.
    • Tags:
      MultipleAvroOutputFormat

      Description

      APIs for Avro to write records into multiple files, similar to MultipleTextOutputFormat. All the files have the same output schema.

      1. AVRO-1052.patch
        12 kB
        Ashish Nagavaram
      2. AVRO-1052.patch
        27 kB
        Ashish Nagavaram
      3. AVRO-1052.patch
        27 kB
        Ashish Nagavaram

        Activity

        Doug Cutting made changes -
        Status: Resolved [ 5 ] -> Closed [ 6 ]
        Ashish Nagavaram added a comment -

        Thanks, Doug.

        Doug Cutting made changes -
        Fix Version/s 1.7.0 [ 12318848 ]
        Affects Version/s 1.7.0 [ 12318848 ]
        Doug Cutting made changes -
        Status: Patch Available [ 10002 ] -> Resolved [ 5 ]
        Assignee Ashish Nagavaram [ nagav.ashish ]
        Resolution Fixed [ 1 ]
        Doug Cutting added a comment -

        I committed this. Thanks, Ashish!

        Ashish Nagavaram made changes -
        Attachment AVRO-1052.patch [ 12528196 ]
        Ashish Nagavaram added a comment -

        Patch containing the modifications to the Javadoc and test cases.

        Ashish Nagavaram added a comment -

        Thanks for the feedback. Yes, MOAvroMap and MOAvroReducer are the mapper and reducer classes that extend AvroMapper and AvroReducer. I will make these changes and upload a new patch.

        Doug Cutting added a comment -

        Thanks for this!

        A few comments:

        • the tests fail for me (on JDK7) unless I add testProjection1() to runTestsInOrder(). testProjection1() depends on the output of testJob(), so to guarantee that testJob() is run first this change must be made.
        • I find the javadoc in AvroMultipleOutputs to be confusing. First, it's indented poorly. Second, it names classes like MOAvroMap, MOAvroReduce and MOReduce which are not clear to me. Perhaps you just mean these to be names like MyAvroMapper and MyAvroReducer, which extend AvroMapper and AvroReducer? Is that correct? It appears so from the tests.
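
The ordering problem in the first point can be sketched without any test framework. This is a minimal illustration, not the code from the patch: the class name OrderedSteps and its fields are hypothetical, and only the method names testJob(), testProjection1() and runTestsInOrder() come from the comment above. The second step reads what the first one produced, so a single entry point calls them explicitly rather than relying on the order in which a runner happens to discover the methods.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: one entry point runs dependent test steps in a fixed order. */
class OrderedSteps {
    /** Records which steps ran, in order. */
    final List<String> log = new ArrayList<>();

    /** Stand-in for the files that the job writes and the projection test reads. */
    private String jobOutput;

    void testJob() {
        jobOutput = "records";
        log.add("testJob");
    }

    void testProjection1() {
        // Fails fast when the step it depends on has not produced its output yet.
        if (jobOutput == null) {
            throw new IllegalStateException("testJob() must run first");
        }
        log.add("testProjection1");
    }

    /** Calls the steps explicitly, so the order does not depend on the test runner. */
    void runTestsInOrder() {
        testJob();
        testProjection1();
    }
}
```

Calling runTestsInOrder() always succeeds, while calling testProjection1() on a fresh instance throws, which mirrors the failure described for JDK7.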
        Ashish Nagavaram made changes -
        Affects Version/s 1.7.0 [ 12318848 ]
        Ashish Nagavaram added a comment -

        I have uploaded a new patch for AvroMultipleOutputs support. It also allows specifying different schemas for different namedOutputs. Can anyone review the code?

        Thanks

        Ashish Nagavaram made changes -
        Attachment AVRO-1052.patch [ 12527959 ]
        Ashish Nagavaram added a comment -

        This patch adds AvroMultipleOutputs, which provides the features of MultipleOutputs for Avro. Different schemas can be specified for different files.
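
The idea of named outputs that each carry their own schema can be sketched in plain Java. This is a simplified illustration under stated assumptions, not the AvroMultipleOutputs API from the patch: the class NamedOutputs and all of its methods are hypothetical, schemas are kept as opaque JSON text, and records are collected in memory instead of being written to Avro files.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of named outputs, each registered with its own schema. */
class NamedOutputs {
    private final Map<String, String> schemas = new LinkedHashMap<>();
    private final Map<String, List<Object>> records = new LinkedHashMap<>();

    /** Registers a named output; the schema is kept as opaque JSON text here. */
    void addNamedOutput(String name, String schemaJson) {
        if (schemas.containsKey(name)) {
            throw new IllegalArgumentException("Named output already defined: " + name);
        }
        schemas.put(name, schemaJson);
        records.put(name, new ArrayList<>());
    }

    /** Sends one datum to a named output; unknown names are rejected. */
    void collect(String name, Object datum) {
        List<Object> sink = records.get(name);
        if (sink == null) {
            throw new IllegalArgumentException("Undefined named output: " + name);
        }
        sink.add(datum);
    }

    /** Returns the schema registered for a named output, or null if undefined. */
    String schemaOf(String name) {
        return schemas.get(name);
    }

    /** Returns a read-only view of the data collected for a defined named output. */
    List<Object> recordsOf(String name) {
        return Collections.unmodifiableList(records.get(name));
    }
}
```

A job would register every output once during setup, for example one output with a string schema and another with a long schema, and then pick the output by name for each datum it emits.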

        Ashish Nagavaram added a comment -

        I pretty much have an implementation of AvroMultipleOutputs, but I was wondering which approach makes more sense: modifying the existing Hadoop MultipleOutputs code (with Avro as a special case) to support this, or having a separate AvroMultipleOutputs (which works only for AvroOutputFormat)? Any suggestions?

        Thanks

        Ashish Nagavaram made changes -
        Attachment AVRO-1052.patch [ 12521529 ]
        Ashish Nagavaram added a comment -

        Patch for MultipleAvroOutputFormat, written by extending the FileOutputFormat class. It provides APIs similar to MultipleTextOutputFormat.

        Ashish Nagavaram made changes -
        Status: Open [ 1 ] -> Patch Available [ 10002 ]
        Affects Version/s 1.6.1 [ 12318847 ]
        Ashish Nagavaram added a comment -

        With small changes, the previous patch works fine with Hadoop 0.20. I will start working on multiple outputs for newer Hadoop versions.

        Ashish Nagavaram made changes -
        Status: Patch Available [ 10002 ] -> Open [ 1 ]
        Ashish Nagavaram added a comment -

        I tried to find one but was not able to. I will start working on an Avro-specific MultipleOutputs and see how it goes.

        Harsh J added a comment -

        MultipleOutputFormat is being replaced by MultipleOutputs as the API evolves upstream in Hadoop.

        Can MultipleOutputs be enhanced upstream to accommodate this request? Or perhaps Avro can provide its own enhanced super-version (I think one was provided recently, dunno if it got merged – will search the JIRA/mail later).

        Ashish Nagavaram made changes -
        Status: Open [ 1 ] -> Patch Available [ 10002 ]
        Tags: AvroMultipleOutputFormat -> MultipleAvroOutputFormat
        Ashish Nagavaram added a comment -

        This patch contains an implementation for writing output to multiple files. It provides APIs to choose the output file based on the key-value pair, to name the final output leaf file, and so on. The code was also tested on actual data with the Avro 1.6.1 library.
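
The two hooks mentioned above, choosing a file from the key-value pair and naming the leaf file, can be sketched with the standard library alone. This is an illustration of the general technique, not the code in the patch: the class KeyRoutedWriter and its method names are hypothetical, and it writes plain text lines rather than Avro records.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: routes each record to a file chosen from its key and value. */
class KeyRoutedWriter implements AutoCloseable {
    private final Path outputDir;
    private final String leafName; // for example "part-00000"
    private final Map<Path, Writer> writers = new HashMap<>();

    KeyRoutedWriter(Path outputDir, String leafName) {
        this.outputDir = outputDir;
        this.leafName = leafName;
    }

    /** Hook: relative path for a record; the default groups records by key. */
    protected String fileNameFor(String key, String value, String leaf) {
        return key + "/" + leaf;
    }

    /** Hook: final leaf file name; the default keeps the given name unchanged. */
    protected String leafFileName(String leaf) {
        return leaf;
    }

    /** Writes one value as a line, opening the target file on first use. */
    void write(String key, String value) throws IOException {
        Path target = outputDir.resolve(fileNameFor(key, value, leafFileName(leafName)));
        Writer writer = writers.get(target);
        if (writer == null) {
            Files.createDirectories(target.getParent());
            writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8);
            writers.put(target, writer);
        }
        writer.write(value);
        writer.write('\n');
    }

    /** Closes every file that was opened. */
    @Override
    public void close() throws IOException {
        for (Writer writer : writers.values()) {
            writer.close();
        }
        writers.clear();
    }
}
```

With the default hooks, records with key "us" end up in us/part-00000 under the output directory. A subclass can override either hook to change the layout without touching the writing logic.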

        Ashish Nagavaram made changes -
        Field: Description
        Original Value: Api's for Avro to write into multiple files similar to the MultipleTextOutputFormat. All the files have the same schema (output schema)
        New Value: Api's for Avro to write records into multiple files similar to the MultipleTextOutputFormat. All the files have the same output schema
        Ashish Nagavaram created issue -

          People

          • Assignee: Ashish Nagavaram
          • Reporter: Ashish Nagavaram
          • Votes: 0
          • Watchers: 4

              Time Tracking

              Estimated: 432h
              Remaining: 432h
              Logged: Not Specified
