AVRO-1262

Provide access to the writer schema from the mapper

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.7.4
    • Fix Version/s: None
    • Component/s: java
    • Labels:
      None

      Description

      When using an Avro InputFormat like AvroKeyInputFormat, the writer schema of the container file should be accessible from the mapper. This is useful in cases where a reader schema is not specified.

      A workaround is to use FileSplit#getPath() to access the container file and manually pull out the schema. This workaround is not ideal: the writer schema has already been read internally (see AvroRecordReaderBase#createAvroFileReader(...)), so it is awkward and inefficient for the user to repeat this work.
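
      As a rough illustration of what "manually pull out the schema" involves, the sketch below reads the writer schema JSON straight from a container file header using only the JDK. It relies on the container file layout from the Avro specification: the magic bytes 'O', 'b', 'j', 1, followed by a metadata map whose avro.schema entry holds the schema. The class and method names are hypothetical and not part of Avro. In a real mapper one would more likely open the split's path with Avro's own DataFileReader and call getSchema(); that route is left out here only to keep the example free of dependencies.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper (not part of Avro): pulls the writer schema out of a container file header. */
final class ContainerHeaderSchema {
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    /** Returns the JSON text stored under "avro.schema", or null if the header has no such entry. */
    static String readWriterSchemaJson(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        byte[] magic = new byte[MAGIC.length];
        in.readFully(magic);
        if (!Arrays.equals(magic, MAGIC)) {
            throw new IOException("Not an Avro container file");
        }
        String schemaJson = null;
        // The metadata is an Avro map: a series of blocks, each starting with an
        // entry count, terminated by a block whose count is zero.
        while (true) {
            long count = readLong(in);
            if (count == 0) {
                break;
            }
            if (count < 0) {
                count = -count;
                readLong(in); // a negative count is followed by the block size in bytes; unused here
            }
            for (long i = 0; i < count; i++) {
                String key = new String(readBytes(in), StandardCharsets.UTF_8);
                byte[] value = readBytes(in);
                if ("avro.schema".equals(key)) {
                    schemaJson = new String(value, StandardCharsets.UTF_8);
                }
            }
        }
        return schemaJson;
    }

    /** Reads a length-prefixed byte sequence (the encoding used for Avro strings and bytes). */
    private static byte[] readBytes(DataInputStream in) throws IOException {
        long length = readLong(in);
        if (length < 0 || length > Integer.MAX_VALUE) {
            throw new IOException("Invalid length: " + length);
        }
        byte[] buffer = new byte[(int) length];
        in.readFully(buffer);
        return buffer;
    }

    /** Decodes an Avro long: variable-length, zig-zag encoded. */
    private static long readLong(DataInputStream in) throws IOException {
        long raw = 0;
        int shift = 0;
        int b;
        do {
            if (shift > 63) {
                throw new IOException("Malformed variable-length long");
            }
            b = in.readUnsignedByte();
            raw |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (raw >>> 1) ^ -(raw & 1);
    }
}
```

      In a mapper, the InputStream would come from opening the path returned by FileSplit#getPath() through the Hadoop FileSystem API, and the returned JSON would still have to be parsed into a Schema. That repeated read and parse is exactly the duplicated work this issue asks to avoid.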

      See also:
      http://mail-archives.apache.org/mod_mbox/avro-user/201302.mbox/%3CCAOF3b61nFw4ztOo9Q5pHHtoUDFZ3sRrvEdRGbXGV_cscTqd5LA%40mail.gmail.com%3E

      1. AVRO-1262.patch
        3 kB
        Doug Cutting

        Activity

        Doug Cutting added a comment -

        Here's a patch that implements this.

        Josh Spiegel added a comment -

        Thanks! It looks like you exposed the writer schema on the RecordReader. Is the RecordReader accessible from the Mapper? I can see that the RecordReader is referenced in the MapContext (*.hadoop.mapreduce.MapContext) but access to it seems to be private. Am I missing something?

        Harsh J added a comment -

        Josh,

        You're correct. One does not get access to the RR via a Mapper in the new API, but one can get it by using the old API's MapRunner implementation.

        I guess one other way would be to have the RR set a defined config key during its initialization, which can then be fetched from the Mapper. This would be more "inelegant" (i.e. no API) but would work with both APIs.

        Josh Spiegel added a comment -

        Harsh - thanks for confirming! The way I patched it locally was to expose the schema on AvroKey. I think your configuration suggestion would work too but it requires serializing the schema and reparsing it in the mapper. My solution does not require a reparse but it is not perfect either because it might imply to a user that the schema can change per datum (which of course is not true).

        In any case, I am optimistic that Doug will know what to do.

        Thanks,
        Josh

        Doug Cutting added a comment -

        It doesn't seem that the patch I provided is what is required.


          People

          • Assignee: Doug Cutting
          • Reporter: Josh Spiegel
          • Votes: 0
          • Watchers: 3