AVRO-808

Add AvroAsTextInputFormat for turning Avro Data Files to text

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.1
    • Component/s: java
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      This is the analog of SequenceFileAsTextInputFormat for Avro Data Files. This would be useful for Streaming, as it converts each Avro datum to its JSON representation, or to the raw bytes in the case of a "bytes" schema.
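The conversion rule in the description can be sketched without any Avro dependency. This is a minimal model, not the code from the patch: the class `AvroAsTextSketch` and the method `toTextBytes` are hypothetical names, a `ByteBuffer` stands in for a datum read with a "bytes" schema, and any other object stands in for a datum whose `toString()` is assumed to yield its JSON form.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the conversion step; not the class from the patch.
class AvroAsTextSketch {

    // Returns the bytes that would become one line of text for a datum.
    static byte[] toTextBytes(Object datum) {
        if (datum instanceof ByteBuffer) {
            // "bytes" schema: pass the raw bytes through unchanged.
            // duplicate() keeps the caller's buffer position untouched.
            ByteBuffer view = ((ByteBuffer) datum).duplicate();
            byte[] raw = new byte[view.remaining()];
            view.get(raw);
            return raw;
        }
        // Any other datum: assume toString() renders it as JSON.
        return String.valueOf(datum).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] fromBytes = toTextBytes(ByteBuffer.wrap(new byte[] {104, 105}));
        // A plain String is used here in place of a record rendered as JSON.
        byte[] fromRecord = toTextBytes("{\"name\": \"a\"}");
        System.out.println(new String(fromBytes, StandardCharsets.UTF_8));
        System.out.println(new String(fromRecord, StandardCharsets.UTF_8));
    }
}
```

Keeping the "bytes" case as a pass-through means a Streaming job sees the payload as written, rather than a JSON-escaped string.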

      1. AVRO-808.patch
        6 kB
        Tom White
      2. AVRO-808.patch
        12 kB
        Tom White

        Activity

        Doug Cutting added a comment -

        I committed this. Thanks, Tom!

        Tom White added a comment -

        Here's an updated patch with a unit test and javadoc.

        There's some code duplication with AvroInputFormat and AvroRecordReader that would be good to eliminate, for example by introducing a common superclass. Do we need to do this?

        The eagle-eyed reviewer will notice that the test trims output lines, because the job appends a trailing tab character to each line. I couldn't find a way to avoid this; I considered three options:

        1. Changing the value type to NullWritable would fix the test, but it makes the input format less useful for Streaming: the input then appears with a trailing "(null)", which is the toString representation of NullWritable instances. (Arguably, Streaming should be fixed to special-case NullWritables and ignore them.)
        2. I thought setting "mapred.textoutputformat.separator" to the empty string would be a workaround, but I found that it is interpreted as null, so the default value (a tab) is used. (When a Configuration is written to a file and then read back, empty properties are read as null, not as empty strings. This is probably a bug; I haven't investigated further.)
        3. I thought changing the key to NullWritable and the value to Text might help, by using the ignore-key feature in Streaming (MAPREDUCE-1785). However, this is not desirable for a couple of reasons: it's not available before 0.22, and you lose the sort by key, which is generally expected when using the identity map and reduce.

        Thoughts?
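For readers following the trailing-tab discussion, the behaviour described in this comment can be modelled in plain Java. This is only a sketch under stated assumptions, not Hadoop code: `TrailingTabSketch`, `formatLine`, `separator`, and `roundTrip` are hypothetical names, `null` stands in for NullWritable, and the round trip simply drops empty values to mimic the reported "empty properties are read as null" behaviour.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the behaviour reported in the comment; not Hadoop code.
class TrailingTabSketch {
    static final String SEPARATOR_KEY = "mapred.textoutputformat.separator";

    // Separator lookup: a missing (null) property falls back to a tab.
    static String separator(Map<String, String> conf) {
        String configured = conf.get(SEPARATOR_KEY);
        return configured == null ? "\t" : configured;
    }

    // Line layout: key, separator, value. A null argument models NullWritable
    // and is assumed to be skipped along with the separator.
    static String formatLine(String key, String value, String sep) {
        StringBuilder line = new StringBuilder();
        if (key != null) line.append(key);
        if (key != null && value != null) line.append(sep);
        if (value != null) line.append(value);
        return line.toString();
    }

    // Models writing a configuration out and reading it back:
    // empty values are lost, so they later read as null.
    static Map<String, String> roundTrip(Map<String, String> conf) {
        Map<String, String> readBack = new HashMap<>();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            if (!e.getValue().isEmpty()) readBack.put(e.getKey(), e.getValue());
        }
        return readBack;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(SEPARATOR_KEY, ""); // attempted workaround from option 2
        String sep = separator(roundTrip(conf));
        // An empty (non-null) value still gets a separator: trailing tab.
        System.out.println("[" + formatLine("{\"a\": 1}", "", sep) + "]");
        // A null value (NullWritable in the model) gets no separator.
        System.out.println("[" + formatLine("{\"a\": 1}", null, sep) + "]");
    }
}
```

In this model the empty separator never reaches the writer, which is why option 2 has no effect, and only a null value suppresses the tab, which matches option 1 fixing the test.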

        Doug Cutting added a comment -

        Looks good, and will make a useful addition. Thanks! Javadoc would be good too before we commit it...

        Tom White added a comment -

        Here's a patch implementing this. Still needs unit tests.


          People

          • Assignee:
            Tom White
            Reporter:
            Tom White
          • Votes:
            0
            Watchers:
            0
