Avro
  1. Avro
  2. AVRO-1234

Avro MapReduce jobs silently ignore input data without '.avro' extension

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.7.3
    • Fix Version/s: 1.7.6
    • Component/s: None
    • Labels:
      None

      Description

      The AvroInputFormat class explicitly checks each input path for a '.avro' extension.

      If only some of the input paths have the correct extension, the remainder are silently ignored and not included in the job. However, if none of the input paths have the extension, the job will continue and succeed even though no map tasks are allocated, and no work is done.

      This only happens using the old mapred API. The new mapreduce API version will happily read files regardless of extension.

      Is the check necessary?

      1. AVRO-1234.patch
        4 kB
        Dave Beech
      2. AVRO-1234-1.patch
        7 kB
        Sandy Ryza
      3. AVRO-1234-2.patch
        9 kB
        Sandy Ryza

        Activity

        Hide
        Dave Beech added a comment -

        Attaching patch to remove the extension check from AvroInputFormat, AvroAsTextInputFormat and TetherInputFormat.

        Show
        Dave Beech added a comment - Attaching patch to remove the extension check from AvroInputFormat, AvroAsTextInputFormat and TetherInputFormat.
        Hide
        Harsh J added a comment -

        The check helps [helped?] avoid pulling in files of other types in the input directory. Note that the new API was added after the old API, and hence this may have as well been a regression in the former?

        If the intention of the input format is to only pick up files it can process, than failing the job with some failed tasks afterwards, then an extension or magic-bytes check is good to have.

        Show
        Harsh J added a comment - The check helps [helped?] avoid pulling in files of other types in the input directory. Note that the new API was added after the old API, and hence this may have as well been a regression in the former? If the intention of the input format is to only pick up files it can process, than failing the job with some failed tasks afterwards, then an extension or magic-bytes check is good to have.
        Hide
        Dave Beech added a comment -

        Well, the fact the old and new APIs differ in their behaviour is clearly not ideal. But I consider this a bugfix that hasn't been backported rather than a regression

        If I had a directory containing a mixture of Avro and non-Avro files, and I gave that path to AvroInputFormat to process, I'd fully expect the job to die a horrible death. Silently discarding input feels wrong, especially so if it's valid Avro which just happens to be named a certain way. A magic bytes check would be better, but on the whole I'm just not sure a check is necessary. As far as I'm aware, none of the standard Hadoop input formats behave in this way (happy to be corrected as I haven't checked them all!).

        Show
        Dave Beech added a comment - Well, the fact the old and new APIs differ in their behaviour is clearly not ideal. But I consider this a bugfix that hasn't been backported rather than a regression If I had a directory containing a mixture of Avro and non-Avro files, and I gave that path to AvroInputFormat to process, I'd fully expect the job to die a horrible death. Silently discarding input feels wrong, especially so if it's valid Avro which just happens to be named a certain way. A magic bytes check would be better, but on the whole I'm just not sure a check is necessary. As far as I'm aware, none of the standard Hadoop input formats behave in this way (happy to be corrected as I haven't checked them all!).
        Hide
        Harsh J added a comment -

        I agree; have not seen any other format do this. +1 from me. Also guessing that no test covered this earlier.

        Show
        Harsh J added a comment - I agree; have not seen any other format do this. +1 from me. Also guessing that no test covered this earlier.
        Hide
        Doug Cutting added a comment -

        This would be an incompatible change so should go into 1.8.0 unless we make it optional.

        Show
        Doug Cutting added a comment - This would be an incompatible change so should go into 1.8.0 unless we make it optional.
        Hide
        Dave Beech added a comment -

        Thanks Harsh, Doug. I'm happy for this to go into 1.8.0

        Show
        Dave Beech added a comment - Thanks Harsh, Doug. I'm happy for this to go into 1.8.0
        Hide
        Harsh J added a comment -

        This also would come handy in Flume areas, where the files are prepared suffix-less sometimes. Painful to rename them in bulk before a job is to be run with factory classes of avro-mapred.

        Would a configurable patch for 1.7 be OK to go in as well, such that default behavior is preserved but carries a toggle as well, for those who search and discover?

        Show
        Harsh J added a comment - This also would come handy in Flume areas, where the files are prepared suffix-less sometimes. Painful to rename them in bulk before a job is to be run with factory classes of avro-mapred. Would a configurable patch for 1.7 be OK to go in as well, such that default behavior is preserved but carries a toggle as well, for those who search and discover?
        Hide
        Santhosh Srinivasan added a comment -

        Dave Beech, can you make this optional to preserve backward compatibility?

        Show
        Santhosh Srinivasan added a comment - Dave Beech , can you make this optional to preserve backward compatibility?
        Hide
        Doug Cutting added a comment -

        It seems folks would like to make this change opt-in so that it can be included in 1.7.6.

        Show
        Doug Cutting added a comment - It seems folks would like to make this change opt-in so that it can be included in 1.7.6.
        Hide
        Dave Beech added a comment -

        Sure - I will upload a new patch soon with the change made opt-in.

        Show
        Dave Beech added a comment - Sure - I will upload a new patch soon with the change made opt-in.
        Hide
        bc Wong added a comment -

        Thanks for the patch, Dave! Would you have time to add a flag for opt-in? (I know it's holidays season. So it's totally fine if you're busy. I'm happy to finish that part for you.)

        Show
        bc Wong added a comment - Thanks for the patch, Dave! Would you have time to add a flag for opt-in? (I know it's holidays season. So it's totally fine if you're busy. I'm happy to finish that part for you.)
        Hide
        Sandy Ryza added a comment -

        Attached a patch that adds a flag for opt-in and adds a test

        Show
        Sandy Ryza added a comment - Attached a patch that adds a flag for opt-in and adds a test
        Hide
        Doug Cutting added a comment -

        Patch looks good. We might add more javadoc, noting that the default is to ignore, plus add doc for the default constant. Other than that +1.

        Show
        Doug Cutting added a comment - Patch looks good. We might add more javadoc, noting that the default is to ignore, plus add doc for the default constant. Other than that +1.
        Hide
        Sandy Ryza added a comment -

        Thanks for instant review Doug. Uploading a patch with additional doc.

        Show
        Sandy Ryza added a comment - Thanks for instant review Doug. Uploading a patch with additional doc.
        Hide
        ASF subversion and git services added a comment -

        Commit 1550605 from Doug Cutting in branch 'avro/trunk'
        [ https://svn.apache.org/r1550605 ]

        AVRO-1234. Java: Permit AvroInputFormat to process files whose names don't end in .avro. Contributed by Dave Beech & Sandy Ryza.

        Show
        ASF subversion and git services added a comment - Commit 1550605 from Doug Cutting in branch 'avro/trunk' [ https://svn.apache.org/r1550605 ] AVRO-1234 . Java: Permit AvroInputFormat to process files whose names don't end in .avro. Contributed by Dave Beech & Sandy Ryza.
        Hide
        Doug Cutting added a comment -

        I committed this. Thanks Dave & Sandy!

        Show
        Doug Cutting added a comment - I committed this. Thanks Dave & Sandy!
        Hide
        Hudson added a comment -

        SUCCESS: Integrated in AvroJava #406 (See https://builds.apache.org/job/AvroJava/406/)
        AVRO-1234. Java: Permit AvroInputFormat to process files whose names don't end in .avro. Contributed by Dave Beech & Sandy Ryza. (cutting: rev 1550605)

        • /avro/trunk/CHANGES.txt
        • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/AvroAsTextInputFormat.java
        • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/AvroInputFormat.java
        • /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/tether/TetherInputFormat.java
        • /avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapred/TestAvroInputFormat.java
        Show
        Hudson added a comment - SUCCESS: Integrated in AvroJava #406 (See https://builds.apache.org/job/AvroJava/406/ ) AVRO-1234 . Java: Permit AvroInputFormat to process files whose names don't end in .avro. Contributed by Dave Beech & Sandy Ryza. (cutting: rev 1550605) /avro/trunk/CHANGES.txt /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/AvroAsTextInputFormat.java /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/AvroInputFormat.java /avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/tether/TetherInputFormat.java /avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapred/TestAvroInputFormat.java

          People

          • Assignee:
            Dave Beech
            Reporter:
            Dave Beech
          • Votes:
            1 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development