Avro
  1. Avro
  2. AVRO-986

Avro files generated from avro-c dont work with the Java mapred implementation.

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.2
    • Component/s: c, java
    • Environment:

      avro-c 1.6.2-SNAPSHOT
      avro-java 1.6.2-SNAPSHOT
      hadoop 0.20.2

    • Tags:
      mapreduce hadoop avro sync

      Description

      When a file generated from the Avro-C implementation is fed into Hadoop, it will fail with "Block size invalid or too large for this implementation: -49".

      This is caused by the sync marker, namely the one that Avro-C puts into the header...

      The org.apache.avro.mapred.AvroRecordReader uses a FileSplit object to work out where it should read from, but this class is not particularly smart, it just divides the file up into equal size chunks, the first being with position 0.

      So org.apache.avro.mapred.AvroRecordReader gets 0 as the start of its chunk, and calls

      AvroRecordReader.java
      reader.sync(split.getStart());   // sync to start

      Then the org.apache.avro.file.DataFileReader::seek() goes to 0, then searches for a sync marker....
      It encounters one at position 32, the one in the header metadata map, "avro.sync"

      No other implementations add the sync marker in the metadata map, and none read it from there, not even the C version.

      I suggest we remove this from the header as the simplest solution.
      Another solution would be to create an AvroFileSplit class in mapred that knows where the blocks are, and provides the correct locations in the first place.

      1. 0001-avromod-utility.patch
        7 kB
        Douglas Creager
      2. 0001-Remove-sync-marker-from-metadata-in-header.patch
        1 kB
        Michael Cooper
      3. AVRO-986-java.patch
        2 kB
        Doug Cutting
      4. AVRO-986-java.patch
        0.7 kB
        Doug Cutting
      5. quickstop.db
        22 kB
        Douglas Creager

        Activity

        Hide
        Michael Cooper added a comment -

        Attaching patch to remove sync marker from the metadata in avro-c.

        Show
        Michael Cooper added a comment - Attaching patch to remove sync marker from the metadata in avro-c.
        Hide
        Doug Cutting added a comment -

        +1 This patch sounds like the right way to fix this to me.

        If we were to instead fix this in Java then I don't think we should try to make the splitter smarter, since splitting is single-threaded and that's not scalable. Rather we should make sync(0) skip over the metadata. But there probably shouldn't be any sync markers in the metadata anyway...

        Show
        Doug Cutting added a comment - +1 This patch sounds like the right way to fix this to me. If we were to instead fix this in Java then I don't think we should try to make the splitter smarter, since splitting is single-threaded and that's not scalable. Rather we should make sync(0) skip over the metadata. But there probably shouldn't be any sync markers in the metadata anyway...
        Hide
        Douglas Creager added a comment -

        +1 Should we apply the patch to 1.5, too?

        Also, it'll still be the case that any existing C-produced files will be unreadable in Java mapred. Should we also provide a fixup script of some kind?

        Show
        Douglas Creager added a comment - +1 Should we apply the patch to 1.5, too? Also, it'll still be the case that any existing C-produced files will be unreadable in Java mapred. Should we also provide a fixup script of some kind?
        Hide
        Doug Cutting added a comment -

        Perhaps we should fix the Java code too.

        Here's a patch that should do the trick. To test this we should probably add a file to share/test/data that has "avro.sync" in its metadata and test that reads after a DataFileReader#sync(0) on this work correctly.

        Show
        Doug Cutting added a comment - Perhaps we should fix the Java code too. Here's a patch that should do the trick. To test this we should probably add a file to share/test/data that has "avro.sync" in its metadata and test that reads after a DataFileReader#sync(0) on this work correctly.
        Hide
        Douglas Creager added a comment -

        I've attached a copy of the quickstop.db Avro file; this is generated by one of the C test cases. It contains the avro.sync metadata field. I'm happy to add this to the share directory also, but unfortunately I don't know enough about the Java build scripts to write a test case for Doug's patch.

        Show
        Douglas Creager added a comment - I've attached a copy of the quickstop.db Avro file; this is generated by one of the C test cases. It contains the avro.sync metadata field. I'm happy to add this to the share directory also, but unfortunately I don't know enough about the Java build scripts to write a test case for Doug's patch.
        Hide
        Douglas Creager added a comment -

        Here's a patch that adds a new "avromod" command-line utility. It can be used as a fixup script to remove the avro.sync field from the header (once I commit Michael's patch). It's also useful in its own right since you can create copies of Avro files with different compression codecs and block sizes. Eventually, we can also add options for changing the schema of the data in the file.

        Show
        Douglas Creager added a comment - Here's a patch that adds a new "avromod" command-line utility. It can be used as a fixup script to remove the avro.sync field from the header (once I commit Michael's patch). It's also useful in its own right since you can create copies of Avro files with different compression codecs and block sizes. Eventually, we can also add options for changing the schema of the data in the file.
        Hide
        Doug Cutting added a comment -

        Here's a version of the java changes that includes a test.

        Show
        Doug Cutting added a comment - Here's a version of the java changes that includes a test.
        Hide
        Doug Cutting added a comment -

        It would be good to get this into 1.6.2. Douglas, do you want to commit it, or should I?

        Show
        Doug Cutting added a comment - It would be good to get this into 1.6.2. Douglas, do you want to commit it, or should I?
        Hide
        Douglas Creager added a comment -

        Sure, I'll commit the patches now.

        Show
        Douglas Creager added a comment - Sure, I'll commit the patches now.
        Hide
        Douglas Creager added a comment -

        Actually, it looks like the java patch file you uploaded doesn't contain the contents of the binary syncInMeta.avro file. Can you upload that separately?

        Show
        Douglas Creager added a comment - Actually, it looks like the java patch file you uploaded doesn't contain the contents of the binary syncInMeta.avro file. Can you upload that separately?
        Hide
        Doug Cutting added a comment -

        share/test/data/syncInMeta.avro was just where I placed the quickstop.db file already attached to this issue.

        Show
        Doug Cutting added a comment - share/test/data/syncInMeta.avro was just where I placed the quickstop.db file already attached to this issue.
        Hide
        Douglas Creager added a comment -

        Committed all of these to SVN.

        Show
        Douglas Creager added a comment - Committed all of these to SVN.

          People

          • Assignee:
            Unassigned
            Reporter:
            Michael Cooper
          • Votes:
            1 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development