  1. Avro
  2. AVRO-1262

Provide access to the writer schema from the mapper

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.7.4
    • Fix Version/s: None
    • Component/s: java
    • Labels:
      None

      Description

      When using an Avro InputFormat like AvroKeyInputFormat, the writer schema of the container file should be accessible from the mapper. This is useful in cases where a reader schema is not specified.

      A workaround is to use FileSplit#getPath() to access the container file and manually pull out the schema. This workaround is not ideal because internally the writer schema has already been read (see AvroRecordReaderBase#createAvroFileReader(...)) - it is awkward and inefficient for the user to repeat this work.
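
      The workaround described above can be sketched roughly as follows. This is a minimal sketch, not part of any patch on this issue: the class name WriterSchemaMapper, the helper readWriterSchema, and the output types are hypothetical, and it assumes the job uses the new (org.apache.hadoop.mapreduce) API with an input format whose splits are plain FileSplit instances.

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.FsInput;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical mapper illustrating the workaround: it re-opens the container
// file behind the current split and reads the writer schema from its header.
public class WriterSchemaMapper
    extends Mapper<AvroKey<GenericRecord>, NullWritable, Text, NullWritable> {

  private Schema writerSchema;

  // Opens the container file only long enough to read the schema stored in
  // its header. Closing the reader also closes the underlying FsInput.
  static Schema readWriterSchema(Path path, Configuration conf) throws IOException {
    try (DataFileReader<GenericRecord> reader =
        new DataFileReader<>(new FsInput(path, conf),
                             new GenericDatumReader<GenericRecord>())) {
      return reader.getSchema();
    }
  }

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Assumption: the split is a FileSplit. Wrapped or combined splits
    // would make this cast fail.
    FileSplit split = (FileSplit) context.getInputSplit();
    writerSchema = readWriterSchema(split.getPath(), context.getConfiguration());
  }

  @Override
  protected void map(AvroKey<GenericRecord> key, NullWritable value, Context context)
      throws IOException, InterruptedException {
    // Illustrative use only: emit the writer schema's name for every record.
    context.write(new Text(writerSchema.getFullName()), NullWritable.get());
  }
}
```

      The extra file open in setup() repeats the header read that the record reader has already performed, which is the inefficiency this issue asks to remove.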

      See also:
      http://mail-archives.apache.org/mod_mbox/avro-user/201302.mbox/%3CCAOF3b61nFw4ztOo9Q5pHHtoUDFZ3sRrvEdRGbXGV_cscTqd5LA%40mail.gmail.com%3E

      1. AVRO-1262.patch
        3 kB
        Doug Cutting

        Activity

        cutting Doug Cutting added a comment -

        Here's a patch that implements this.

        jojspieg@gmail.com Josh Spiegel added a comment -

        Thanks! It looks like you exposed the writer schema on the RecordReader. Is the RecordReader accessible from the Mapper? I can see that the RecordReader is referenced in the MapContext (*.hadoop.mapreduce.MapContext) but access to it seems to be private. Am I missing something?

        qwertymaniac Harsh J added a comment -

        Josh,

        You're correct. One does not get access to the RR via a Mapper in the new API, but one can get it by using the old API's MapRunner implementation.

        I guess one other way would be to have the RR load a defined config key during its initialization, which can be fetched from the Mapper. This would be more "inelegant" (i.e. no API) but would work with both APIs.
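
        The configuration-key idea could look roughly like the sketch below. The class name WriterSchemaConf and the key "example.avro.writer.schema" are hypothetical and not defined by Avro. The sketch also assumes that the record reader and the Mapper see the same Configuration instance within a task, which has not been verified here.

```java
import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configuration;

// Hypothetical helper: passes the writer schema through the job
// Configuration as a JSON string.
public final class WriterSchemaConf {

  // Hypothetical key name; Avro does not define this key.
  public static final String WRITER_SCHEMA_KEY = "example.avro.writer.schema";

  private WriterSchemaConf() {}

  // Would be called by the record reader once it has opened the container
  // file, e.g. from its initialize(...) method.
  public static void publish(Configuration conf, Schema writerSchema) {
    conf.set(WRITER_SCHEMA_KEY, writerSchema.toString());
  }

  // Would be called from the Mapper, e.g. in setup(...). Returns null if no
  // schema has been published. Each call re-parses the JSON text.
  public static Schema load(Configuration conf) {
    String json = conf.get(WRITER_SCHEMA_KEY);
    return json == null ? null : new Schema.Parser().parse(json);
  }
}
```

        A mapper would call WriterSchemaConf.load(context.getConfiguration()) once in setup() and cache the result, since every call serializes nothing but does re-parse the schema text.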

        jojspieg@gmail.com Josh Spiegel added a comment -

        Harsh - thanks for confirming! The way I patched it locally was to expose the schema on AvroKey. I think your configuration suggestion would work too but it requires serializing the schema and reparsing it in the mapper. My solution does not require a reparse but it is not perfect either because it might imply to a user that the schema can change per datum (which of course is not true).

        In any case, I am optimistic that Doug will know what to do.

        Thanks,
        Josh

        cutting Doug Cutting added a comment -

        It doesn't seem that the patch I provided is what is required.


          People

          • Assignee:
            cutting Doug Cutting
            Reporter:
            jojspieg@gmail.com Josh Spiegel
          • Votes:
            0
            Watchers:
            3
